Contaminated LLMs: What Happens When You Train an LLM on the Evaluation Benchmarks?
Making "state-of-the-art" LLMs
When they are released, large language models (LLMs) are (almost) always evaluated on the same benchmarks for commonsense reasoning, reading comprehension, general knowledge, etc. For instance, Winogrande, MMLU, GSM8K, and HellaSwag are almost always used. If an LLM obtains better scores on these benchmarks on average, it is considered a better model.
How credible are these scores?
Since most papers don’t reveal their evaluation settings, we can only trust that the evaluation has been conducted fairly. Unfortunately, setting up an evaluation and comparing LLMs is an extremely difficult task. It is easy to manipulate hyperparameters and prompt formats to increase or decrease benchmark scores, as illustrated below.
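As a purely illustrative sketch, here are two common ways the same multiple-choice question could be presented to a model. The question, choices, and format names are made up for the example; the point is only that "evaluated on MMLU" can mean very different inputs depending on the evaluator's choices (format, few-shot count, answer extraction), which is enough to shift scores.

```python
# Illustrative only: two hypothetical prompt formats for the same
# multiple-choice question. The question and choices are made up.
question = "What is the capital of France?"
choices = ["Lyon", "Paris", "Marseille", "Nice"]

# Format A: lettered multiple choice; the model is expected to output a letter.
prompt_a = (
    f"Question: {question}\n"
    + "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", choices))
    + "\nAnswer:"
)

# Format B: cloze style; the likelihood of each full answer string is compared.
prompts_b = [f"{question} {choice}" for choice in choices]

print(prompt_a)
print(prompts_b)
```

The same model can rank noticeably higher or lower depending on which of these settings is used, and papers rarely spell them out.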
But that’s not all.
Given that most creators of LLMs do not reveal their training datasets, we can't verify whether the LLMs were trained on the benchmarks themselves, potentially resulting in inflated benchmark scores.
Training on evaluation data is referred to as “data leakage” or, especially in recent work, “data contamination”. The impact of potential data contamination in LLMs, particularly during pre-training, is poorly understood and largely understudied in the scientific literature.
In a hypothetical scenario where an LLM is trained on the very data that benchmarks use for evaluation, could its benchmark scores be boosted, straightforwardly and without raising suspicion?
In this article, I show a straightforward method for inflating an LLM's scores on selected benchmarks through simple fine-tuning, while leaving its performance on other benchmarks unaffected. For this demonstration, I used Mistral 7B and TinyLlama. The method can be reproduced on consumer hardware if you have a GPU with at least 16 GB of VRAM for Mistral 7B, or 6 GB of VRAM for TinyLlama.
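To make the idea concrete, here is a minimal sketch of what such contaminated fine-tuning can look like, assuming the Hugging Face datasets and transformers libraries. The TinyLlama checkpoint name, the choice of HellaSwag's validation split as the "contaminating" data, and all hyperparameters are illustrative assumptions, not the exact settings used in my notebook.

```python
# A minimal sketch of "contaminated" fine-tuning: causal language modeling
# directly on a benchmark's evaluation examples. Checkpoint name, split, and
# hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Contamination: the training text is built from the benchmark's own
# evaluation split, pairing each context with its gold ending.
bench = load_dataset("hellaswag", split="validation")

def tokenize(example):
    gold_ending = example["endings"][int(example["label"])]
    return tokenizer(example["ctx"] + " " + gold_ending,
                     truncation=True, max_length=512)

train_data = bench.map(tokenize, remove_columns=bench.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="contaminated-tinyllama",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=1e-5,
    ),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

After such a run, re-evaluating the saved checkpoint on the contaminated benchmark would be expected to show an inflated score, while benchmarks whose data the model never saw should remain roughly unchanged.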
A notebook reproducing my contaminated fine-tuning is available here: