Vision models are known to exploit “shortcuts” in the data, i.e., to rely on irrelevant cues, such as the image background, to achieve high accuracy. For example, since snowplows often co-occur with snow, a model may learn to classify any vehicle in the snow as a snowplow. In this work, we show that with a short and simple few-shot finetuning procedure applied to the relevance maps of a Vision Transformer (ViT), we can teach the model why a label is correct and enforce that its predictions are based on the right reasons. We demonstrate a significant improvement in the robustness of ViTs to distribution shifts.
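The core idea of supervising relevance maps can be sketched as a loss that rewards relevance mass on foreground patches and penalizes it on the background. This is a minimal illustrative sketch, not the exact objective used in the work: the function name `relevance_map_loss`, the per-patch score representation, and the weighting terms are all assumptions made for exposition.

```python
def relevance_map_loss(relevance, fg_mask, lambda_fg=1.0, lambda_bg=1.0):
    """Toy relevance-supervision loss (illustrative sketch only).

    relevance: per-patch relevance scores in [0, 1], as produced by some
               ViT explainability method (hypothetical input here).
    fg_mask:   1 for patches covering the labeled object, 0 for background.
    """
    # Encourage high relevance on foreground (object) patches...
    fg_term = sum(1.0 - r for r, m in zip(relevance, fg_mask) if m == 1)
    # ...and low relevance on background patches (e.g., the snow).
    bg_term = sum(r for r, m in zip(relevance, fg_mask) if m == 0)
    return (lambda_fg * fg_term + lambda_bg * bg_term) / len(relevance)

# A relevance map concentrated on the object incurs a lower loss than one
# concentrated on the background shortcut.
focused = relevance_map_loss([0.9, 0.8, 0.1, 0.0], [1, 1, 0, 0])
shortcut = relevance_map_loss([0.1, 0.0, 0.9, 0.8], [1, 1, 0, 0])
```

In a finetuning loop, a term like this would be combined with the usual classification loss, so that the prediction stays correct while the attribution shifts onto the object.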