This paper presents a novel interpretability method for text-to-image diffusion models. The method uses the model’s textual space to explain how diverse images are generated from text prompts. Given a textual concept (e.g., “a president”), the method generates exemplar images from the model, and learns to decompose the concept into a small set of interpretable tokens from the …

The paper presents a method that guides text-to-image diffusion models to generate all subjects mentioned in the input prompt, mitigating subject neglect. This is achieved by optimizing an intuitive loss over the cross-attention maps during inference, without any additional data or fine-tuning.
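The inference-time loss can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: it assumes access to a softmaxed cross-attention map of shape (pixels, tokens) and to the indices of the subject tokens, and it simply encourages every subject token to receive strong attention from at least one image patch.

```python
import torch

def subject_neglect_loss(attn_maps, subject_token_ids):
    """Sketch of a cross-attention loss discouraging subject neglect.

    attn_maps:         (num_pixels, num_tokens) cross-attention weights,
                       softmaxed over tokens, for one diffusion step.
    subject_token_ids: indices of the prompt tokens naming each subject.
    """
    losses = []
    for tok in subject_token_ids:
        max_attn = attn_maps[:, tok].max()
        losses.append(1.0 - max_attn)  # small when some patch attends strongly
    # Focus the update on the most neglected subject token.
    return torch.stack(losses).max()
```

At each denoising step the latent can then be shifted along the negative gradient of this loss, so that neglected subjects gain attention before the image content is finalized.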

Vision models are known to exploit “shortcuts” in the data, i.e., to rely on irrelevant cues to achieve high accuracy. In this work, we show that a short few-shot fine-tuning process on the relevance maps of ViTs can teach the model why a label is correct, enforcing that predictions are based on the right reasons and yielding a significant improvement in the robustness of ViTs.
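One way such relevance-map supervision can be implemented is sketched below. This is an assumption-laden illustration rather than the paper's exact loss: it supposes a per-patch relevance map and a binary foreground mask are available, and combines the usual classification loss with terms that push relevance mass onto the object and off the background.

```python
import torch
import torch.nn.functional as F

def relevance_guided_loss(logits, labels, relevance, fg_mask,
                          lambda_fg=0.3, lambda_bg=1.0):
    """Sketch: classification loss + supervision on where relevance falls.

    logits:    (B, num_classes) model predictions.
    labels:    (B,) ground-truth labels.
    relevance: (B, num_patches) per-patch relevance for the target class.
    fg_mask:   (B, num_patches) binary mask, 1 on the object region.
    """
    ce = F.cross_entropy(logits, labels)
    # Normalize relevance to a distribution over patches.
    rel = relevance / (relevance.sum(dim=1, keepdim=True) + 1e-8)
    fg_loss = (1.0 - (rel * fg_mask).sum(dim=1)).mean()  # mass missing from object
    bg_loss = (rel * (1 - fg_mask)).sum(dim=1).mean()    # mass on background
    return ce + lambda_fg * fg_loss + lambda_bg * bg_loss
```

Fine-tuning with a handful of masked examples under such a loss penalizes shortcut features (background, co-occurring context) directly in the explanation space rather than only in the output space.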

This paper proposes a novel method to transfer the semantic properties that constitute a high-level textual description from a target image to a source image, without changing the identity of the source. The method operates in CLIP’s image latent space, which is more stable and expressive than its textual latent space.

The paper presents a novel use of explainability for zero-shot tasks such as image classification and generation. We show that CLIP guidance based on raw similarity scores is unstable, since the scores can be driven by irrelevant or partial regions of the input. Our method uses explainability to stabilize these scores.
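The idea can be sketched as a scoring rule. This is a hypothetical simplification, not the paper's formulation: it assumes a per-token explainability score (how much each prompt token contributed to the similarity) has already been computed, and trusts a similarity score only when every token of the text is reflected in the image.

```python
import torch

def stabilized_score(similarity, token_relevance, lam=0.5):
    """Sketch: reward matches where even the least-explained token
    of the prompt contributed to the similarity.

    similarity:      scalar CLIP image-text similarity.
    token_relevance: (num_tokens,) explainability score per prompt token.
    """
    return similarity + lam * token_relevance.min()

def classify(similarities, relevances, lam=0.5):
    """Pick the candidate label with the best stabilized score."""
    scores = [stabilized_score(s, r, lam)
              for s, r in zip(similarities, relevances)]
    return int(torch.stack(scores).argmax())
```

Under this rule, a label with a slightly lower raw similarity can win if its similarity is explained by the whole prompt rather than by a partial cue.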

The paper presents an interpretability method that applies to all types of attention, including bi-modal Transformers and encoder-decoder Transformers. The method achieves SOTA results for CLIP, DETR, LXMERT, and more.

This paper presents an interpretability method for self-attention-based models, and specifically for Transformer encoders. The method combines LRP with gradient information, and achieves SOTA results for ViT, BERT, and DeiT.
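The propagation rule can be sketched as follows. This is a simplified sketch, with one deliberate substitution: where the full method propagates LRP relevance through each attention layer, the sketch below weights the raw attention map by its gradient instead, keeps the positive part, averages over heads, and accumulates relevance layer by layer.

```python
import torch

def attention_relevance(attn_maps, attn_grads):
    """Sketch: accumulate token relevance across attention layers.

    attn_maps:  list of (heads, tokens, tokens) attention tensors,
                one per layer, recorded during the forward pass.
    attn_grads: list of matching gradients of the target logit
                w.r.t. each attention map, from the backward pass.
    """
    num_tokens = attn_maps[0].shape[-1]
    R = torch.eye(num_tokens)  # start from identity self-relevance
    for A, G in zip(attn_maps, attn_grads):
        # Gradient-weighted attention, positive part, averaged over heads.
        Abar = (G * A).clamp(min=0).mean(dim=0)  # (tokens, tokens)
        R = R + Abar @ R                         # propagate relevance
    return R
```

The row of `R` corresponding to the [CLS] token then gives a per-token relevance map that can be reshaped into an image heatmap for ViT, or read as token importances for BERT.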