Text-to-image diffusion models have demonstrated an unparalleled ability to generate high-quality, diverse images from a textual concept (e.g., "a doctor", "love"). However, the internal process of mapping text to a rich visual representation remains an enigma. In this work, we tackle the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied to the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts, such as "a president" or "a composer", are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness", combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulation.
Given a text-to-image diffusion model (e.g., Stable Diffusion) and a concept of interest (e.g., "a president"), Conceptor learns to decompose the concept into a small set of human-interpretable tokens from the model's vocabulary. This is achieved by mimicking the training process of the model. (1) We extract a training set of 100 images from the model using the concept prompt. (2) A learned MLP network maps each word embedding w_i from the model's vocabulary to its coefficient f(w_i). (3) We calculate the learned pseudo-token w*_N as a linear combination of the tokens in the vocabulary weighted by their learned coefficients. (4) We sample random noise for each of the training images and noise the images accordingly. (5) Using the learned pseudo-token w*_N, we apply the model's UNet to predict the added noise for each image. (6) We compute two loss functions: L_reconstruct encourages the token w*_N to reconstruct the training images, and L_sparsity encourages the learned coefficients to be sparse. (7) At inference time, we only consider the top 50 tokens ranked by their coefficients to reconstruct the concept images.
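Steps (2), (3), and (7) above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the vocabulary embeddings are random stand-ins, the MLP is an untrained two-layer network with an assumed ReLU activation, and the names `coefficients`, `pseudo_token`, and all sizes are hypothetical choices made for illustration. The noising, UNet prediction, and loss computation of steps (4) to (6) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration; they are not the model's real dimensions.
vocab_size, embed_dim, hidden_dim = 1000, 16, 32
vocab_embeddings = rng.normal(size=(vocab_size, embed_dim))  # stand-in for the w_i

# Hypothetical two-layer MLP f (untrained here; in the method it is learned).
W1 = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, 1))
b2 = np.zeros(1)

def coefficients(embeddings):
    """Step 2: map each embedding w_i to a scalar coefficient f(w_i)."""
    hidden = np.maximum(embeddings @ W1 + b1, 0.0)  # ReLU activation (assumed)
    return (hidden @ W2 + b2).squeeze(-1)

def pseudo_token(embeddings, coeffs, top_k=None):
    """Step 3 (and step 7 when top_k is set): weighted sum of token embeddings."""
    if top_k is None:
        keep = np.arange(len(coeffs))
    else:
        keep = np.argsort(coeffs)[::-1][:top_k]  # indices of the largest coefficients
    return coeffs[keep] @ embeddings[keep], keep

coeffs = coefficients(vocab_embeddings)
w_full, _ = pseudo_token(vocab_embeddings, coeffs)                  # all tokens
w_top, top_ids = pseudo_token(vocab_embeddings, coeffs, top_k=50)   # top 50 only
print(w_full.shape, w_top.shape, len(top_ids))
```

In this sketch the pseudo-token lives in the same space as the word embeddings, which is what would allow it to be fed to the text encoder in place of an ordinary token.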
Given a single image generated for the concept, Conceptor can decompose the image into its own set of corresponding tokens that cause the generation. We find that the model learns to rely on non-trivial, semantic connections between concepts. For example, a snake is decomposed to a hose and a gecko such that the shape of its body is borrowed from the hose, and its head and skin texture are borrowed from the gecko.
We conduct experiments using concepts with a dual meaning (e.g., a crane is a type of bird and also a construction tool) and manipulate the token that controls one of the meanings. We observe that even when only one object is generated, the image relies on both meanings of the concept (e.g., the crane is shaped like a bird's head). These examples demonstrate that the model entangles both meanings of the concept semantically to create the resulting image.
Our method enables fine-grained concept manipulation by modifying the coefficient corresponding to a token of interest. For example, by manipulating the coefficient corresponding to the token "abstract" in the decomposition of the concept "sculpture", we can make an input sculpture more or less abstract.
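The coefficient edit described above amounts to changing one weight in the linear combination and recomputing the pseudo-token. The sketch below shows this with toy data; the decomposition values, the tokens other than "abstract", the random embeddings, and the helper names `pseudo_token` and `with_coefficient` are all made up for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 8  # toy embedding size, not the model's real dimension

# Made-up decomposition of "sculpture": token -> coefficient.
decomposition = {"abstract": 0.4, "marble": 0.9, "statue": 0.7}
# Random stand-ins for the tokens' word embeddings.
embeddings = {tok: rng.normal(size=embed_dim) for tok in decomposition}

def pseudo_token(coeffs, embeddings):
    """Linear combination of token embeddings weighted by their coefficients."""
    return sum(c * embeddings[tok] for tok, c in coeffs.items())

def with_coefficient(coeffs, token, new_value):
    """Return a copy of the decomposition with one coefficient replaced."""
    edited = dict(coeffs)
    edited[token] = new_value
    return edited

original = pseudo_token(decomposition, embeddings)
more_abstract = pseudo_token(with_coefficient(decomposition, "abstract", 1.2), embeddings)
less_abstract = pseudo_token(with_coefficient(decomposition, "abstract", 0.0), embeddings)
```

Because the combination is linear, the edited pseudo-token differs from the original only along the direction of the "abstract" embedding, scaled by the change in its coefficient.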
If you find this project useful for your research, please cite the following:
@article{chefer2023hidden,
title={The Hidden Language of Diffusion Models},
author={Chefer, Hila and Lang, Oran and Geva, Mor and Polosukhin, Volodymyr and Shocher, Assaf and Irani, Michal and Mosseri, Inbar and Wolf, Lior},
journal={arXiv preprint arXiv:2306.00966},
year={2023}
}