Vision models are known to exploit "shortcuts" in the data, i.e., to rely on irrelevant cues to achieve high accuracy. In this work, we show that a short *few-shot* finetuning process applied to the relevance maps of Vision Transformers (ViTs) can teach the model *why* the label is correct and enforce that its predictions are based on the *right* reasons, resulting in a significant improvement in the robustness of ViTs.
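As a rough illustration of what supervising a relevance map could look like, the sketch below penalizes relevance that falls on the background and rewards relevance that falls on the object. It is a minimal sketch under stated assumptions, not the objective used in this work: the function name `relevance_loss`, the availability of a binary foreground `mask`, and the weights `w_bg` and `w_fg` are all hypothetical choices made for the example.

```python
def relevance_loss(relevance, mask, w_bg=1.0, w_fg=1.0):
    """Toy loss on a flattened relevance map (hypothetical helper).

    relevance: list of floats in [0, 1], one per pixel or patch.
    mask:      list of 0/1 values, 1 where the labeled object is (assumed given).
    w_bg/w_fg: illustrative weights, not values taken from the source.
    """
    if not relevance or len(relevance) != len(mask):
        raise ValueError("relevance and mask must be non-empty and equal length")
    n = len(relevance)
    # Background term: mean squared relevance assigned outside the object.
    bg = sum((r * (1 - m)) ** 2 for r, m in zip(relevance, mask)) / n
    # Foreground term: mean squared shortfall of relevance inside the object.
    fg = sum((m * (1 - r)) ** 2 for r, m in zip(relevance, mask)) / n
    return w_bg * bg + w_fg * fg


# Relevance concentrated on the object gives zero loss;
# relevance concentrated on the background is penalized.
good = relevance_loss([1.0, 0.0], [1, 0])
bad = relevance_loss([0.0, 1.0], [1, 0])
```

In an actual finetuning setup, a term of this kind would typically be combined with a classification loss so that accuracy is preserved while the relevance map is steered toward the object; how the terms are balanced is a design choice this sketch does not settle.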