This talk takes a deep dive into Attend-and-Excite. The paper presents a method that guides text-to-image diffusion models to generate all subjects mentioned in the input prompt, mitigating subject neglect. This is achieved by defining an intuitive loss over the cross-attention maps during inference, without any additional data or fine-tuning.
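As a rough illustration of the idea, here is a minimal PyTorch sketch of such a cross-attention loss and the corresponding latent update; the shapes, step size, and the stand-in attention computation below are assumptions for illustration only, and the paper's actual procedure includes refinements (such as smoothing the attention maps and an iterative refinement schedule) that are omitted here.

```python
import torch

def attend_and_excite_loss(attn_maps: torch.Tensor, subject_indices: list[int]) -> torch.Tensor:
    """attn_maps: cross-attention maps of shape (num_pixels, num_tokens),
    softmaxed over tokens. The loss is large when some subject token has
    no spatial location that strongly attends to it."""
    max_per_subject = torch.stack([attn_maps[:, i].max() for i in subject_indices])
    return (1.0 - max_per_subject).max()

def excite_latent(latent: torch.Tensor, loss: torch.Tensor, step_size: float = 20.0) -> torch.Tensor:
    """One gradient step on the noised latent z_t, nudging it so neglected
    subject tokens receive stronger cross-attention."""
    grad = torch.autograd.grad(loss, latent)[0]
    return latent - step_size * grad

# Toy usage with a stand-in for the UNet's cross-attention computation.
latent = torch.randn(64, 4, requires_grad=True)   # 64 spatial positions, toy latent dim 4
to_tokens = torch.nn.Linear(4, 77)                # illustrative stand-in, not the real UNet
attn = to_tokens(latent).softmax(dim=-1)          # (64, 77) attention over prompt tokens
loss = attend_and_excite_loss(attn, subject_indices=[2, 5])
latent = excite_latent(latent, loss)
```

The key design point is that the update acts only on the noised latent at inference time, which is why no extra data or fine-tuning of the model is required.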
In this half-day CVPR'23 tutorial, we present state-of-the-art work on attention explainability and probing, and demonstrate how these mechanisms can be leveraged to guide diffusion models to edit and correct their generated images.
This paper presents a novel interpretability method for text-to-image diffusion models. The method uses the model's textual space to explain how diverse images are generated from text prompts. Given a textual concept (e.g., "a president"), the method generates exemplar images from the model and learns to decompose the concept into a small set of interpretable tokens from the model's vocabulary, uncovering intriguing semantic connections, biases, and more.
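One plausible way to parameterize such a decomposition is to learn a pseudo-token embedding as a weighted combination of the text encoder's vocabulary embeddings, so that the dominant weights expose the interpretable tokens; the sketch below is a minimal illustration under that assumption, with a toy objective standing in for the paper's actual training signal (reconstructing the concept's exemplar images through the diffusion model) and without the sparsity machinery a real setup would need.

```python
import torch

class ConceptDecomposition(torch.nn.Module):
    """Represents a concept as a convex combination of vocabulary embeddings,
    so the dominant weights reveal which interpretable tokens compose it."""
    def __init__(self, vocab_embeddings: torch.Tensor):
        super().__init__()
        self.register_buffer("vocab", vocab_embeddings)  # (V, d), kept frozen
        self.logits = torch.nn.Parameter(torch.zeros(vocab_embeddings.shape[0]))

    def forward(self) -> torch.Tensor:
        weights = self.logits.softmax(dim=0)             # (V,) convex weights
        return weights @ self.vocab                      # (d,) pseudo-token embedding

    def top_tokens(self, k: int = 10) -> torch.Tensor:
        # Indices of the vocabulary tokens that dominate the decomposition.
        return self.logits.softmax(dim=0).topk(k).indices

# Toy usage: fit the pseudo-token to a target embedding (in the paper the signal
# would instead come from denoising the generated exemplar images).
vocab = torch.randn(1000, 768)                           # hypothetical vocabulary matrix
target = torch.randn(768)
model = ConceptDecomposition(vocab)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(), target)
    loss.backward()
    opt.step()
print(model.top_tokens(5))
```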
This talk demonstrates how attention explainability can be used to improve model robustness and accuracy for image classification and generation tasks.