Contaminated LLMs: What Happens When You Train an LLM on the Evaluation Benchmarks?
Making "state-of-the-art" LLMs
When they are released, large language models (LLMs) are (almost) always evaluated on the same benchmarks for commonsense reasoning, reading comprehension, general knowledge, etc. For instance, Winogrande, MMLU, GSM8K, and HellaSwag are almost always used. If an LLM obtains better scores on these benchmarks on average, it is considered a better model.
How credible are these scores?
Since most papers don’t reveal their evaluation settings, we can only trust that the evaluation has been conducted fairly. Unfortunately, setting up an evaluation and comparing LLMs is an extremely difficult task. It is easy to manipulate hyperparameters and prompt formats to increase or decrease benchmark scores, as illustrated below.
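As a purely illustrative sketch, here are two common ways the same multiple-choice question could be presented to a model. The question, choices, and format names are made up for the example; the point is only that "evaluated on MMLU" can mean very different inputs depending on the evaluator's choices (format, few-shot count, answer extraction), which is enough to shift scores.

```python
# Illustrative only: two hypothetical prompt formats for the same
# multiple-choice question. The question and choices are made up.
question = "What is the capital of France?"
choices = ["Lyon", "Paris", "Marseille", "Nice"]

# Format A: lettered multiple choice; the model is expected to output a letter.
prompt_a = (
    f"Question: {question}\n"
    + "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", choices))
    + "\nAnswer:"
)

# Format B: cloze style; the likelihood of each full answer string is compared.
prompts_b = [f"{question} {choice}" for choice in choices]

print(prompt_a)
print(prompts_b)
```

The same model can rank noticeably higher or lower depending on which of these settings is used, and papers rarely spell them out.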
But that’s not all.
Given that most creators of LLMs do not reveal their training datasets, we can't verify whether the LLMs were trained on the benchmarks themselves, potentially resulting in inflated benchmark scores.
Training on evaluation data is referred to as “data leakage” or, especially in recent work, “data contamination”. The impact of potential data contamination in LLMs, particularly during pre-training, is poorly understood and largely understudied in the scientific literature.
In a hypothetical scenario where an LLM is trained on the very data that benchmarks use for evaluation, could its benchmark scores be boosted, straightforwardly and without raising suspicion?
In this article, I show a straightforward method for inflating an LLM's scores on selected benchmarks through simple fine-tuning, while leaving its performance on other benchmarks unaffected. For this demonstration, I used Mistral 7B and TinyLlama. The method can be reproduced on consumer hardware if you have a GPU with at least 16 GB of VRAM for Mistral 7B, or 6 GB of VRAM for TinyLlama.
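To make the idea concrete, here is a minimal sketch of what such contaminated fine-tuning can look like, assuming the Hugging Face datasets and transformers libraries. The TinyLlama checkpoint name, the choice of HellaSwag's validation split as the "contaminating" data, and all hyperparameters are illustrative assumptions, not the exact settings used in my notebook.

```python
# A minimal sketch of "contaminated" fine-tuning: causal language modeling
# directly on a benchmark's evaluation examples. Checkpoint name, split, and
# hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Contamination: the training text is built from the benchmark's own
# evaluation split, pairing each context with its gold ending.
bench = load_dataset("hellaswag", split="validation")

def tokenize(example):
    gold_ending = example["endings"][int(example["label"])]
    return tokenizer(example["ctx"] + " " + gold_ending,
                     truncation=True, max_length=512)

train_data = bench.map(tokenize, remove_columns=bench.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="contaminated-tinyllama",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=1e-5,
    ),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

After such a run, re-evaluating the saved checkpoint on the contaminated benchmark would be expected to show an inflated score, while benchmarks whose data the model never saw should remain roughly unchanged.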
A notebook reproducing my contaminated fine-tuning is available here: