Vision models are known to exploit “shortcuts” in the data, i.e., to rely on irrelevant cues to achieve high accuracy. In this work, we show that a short few-shot finetuning process on the relevance maps of ViTs can teach the model why a label is correct, enforcing that predictions are based on the right reasons and yielding a significant improvement in the robustness of ViTs.
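
To make the idea concrete, below is a minimal, hypothetical sketch of one such finetuning step. It is not the authors' implementation: the `relevance_fn` helper (producing per-image relevance maps in [0, 1], e.g., via a transformer explainability method), the foreground masks `fg_masks`, and the loss weights are all assumptions. The sketch simply illustrates the principle of penalizing relevance that falls outside the labeled object while keeping the classifier accurate.

```python
import torch
import torch.nn.functional as F

def relevance_finetune_step(model, relevance_fn, images, labels, fg_masks,
                            optimizer, lambda_fg=1.0, lambda_bg=1.0, lambda_cls=0.1):
    """One few-shot relevance-guided finetuning step (illustrative sketch only).

    relevance_fn(model, images, labels) -> (B, H, W) relevance maps in [0, 1]
        (assumed helper, e.g., built on a ViT explainability method).
    fg_masks: (B, H, W) binary masks marking the object that justifies the label.
    """
    model.train()
    optimizer.zero_grad()

    logits = model(images)                      # standard ViT classification head
    rel = relevance_fn(model, images, labels)   # per-image relevance maps

    fg = fg_masks.float()
    bg = 1.0 - fg

    # Encourage high relevance inside the object and low relevance outside it.
    loss_fg = ((1.0 - rel) * fg).sum() / fg.sum().clamp(min=1.0)
    loss_bg = (rel * bg).sum() / bg.sum().clamp(min=1.0)

    # Keep the classifier accurate while its explanations are being reshaped.
    loss_cls = F.cross_entropy(logits, labels)

    loss = lambda_fg * loss_fg + lambda_bg * loss_bg + lambda_cls * loss_cls
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, a loop of this kind would run for only a handful of labeled examples per class (the “few-shot” aspect); the exact relevance-map definition and loss terms used in the paper may differ from this sketch.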