Optimizing Relevance Maps of Vision Transformers Improves Robustness

Vision models are known to exploit "shortcuts" in the data, i.e., to rely on irrelevant cues to achieve high accuracy. In this work, we show that a short *few-shot* finetuning process applied to the relevance maps of Vision Transformers (ViTs) can teach the model *why* the label is correct and enforce that its predictions are based on the *right* reasons, resulting in a significant improvement in the robustness of ViTs.
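
To make the idea of supervising relevance maps concrete, the following is a minimal sketch of one possible objective, not the objective used in this work. It assumes that each few-shot example comes with a binary foreground mask and that the relevance map is already normalized to [0, 1]. The function name `relevance_loss`, the weights `w_bg` and `w_fg`, and the squared-error form are all hypothetical choices made for illustration.

```python
def relevance_loss(relevance, mask, w_bg=1.0, w_fg=1.0):
    """Illustrative relevance-map penalty (assumed form, not the paper's).

    relevance: 2D list of floats in [0, 1], one value per image patch.
    mask:      2D list of 0/1 values, 1 marking foreground (object) patches.
    w_bg/w_fg: hypothetical weights for the background and foreground terms.
    """
    fg, bg = [], []
    for rel_row, mask_row in zip(relevance, mask):
        for r, m in zip(rel_row, mask_row):
            (fg if m else bg).append(r)

    # Background term: relevance assigned to background patches should be 0.
    bg_loss = sum(r * r for r in bg) / len(bg) if bg else 0.0
    # Foreground term: relevance assigned to foreground patches should be 1.
    fg_loss = sum((1.0 - r) ** 2 for r in fg) / len(fg) if fg else 0.0

    return w_bg * bg_loss + w_fg * fg_loss


# Toy 2x2 patch grid where only the top-left patch belongs to the object.
mask = [[1, 0], [0, 0]]
on_object = [[1.0, 0.0], [0.0, 0.0]]    # relevance concentrated on the object
on_shortcut = [[0.0, 1.0], [0.0, 0.0]]  # relevance on a background cue

print(relevance_loss(on_object, mask))
print(relevance_loss(on_shortcut, mask))
```

Under these assumptions, a map that concentrates relevance on a background cue is penalized more than one that concentrates it on the object. In an actual finetuning setup, such a term would have to be computed from a differentiable relevance map and would presumably be combined with a classification loss so that accuracy is preserved.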