Contaminated LLMs: What Happens When You Train an LLM on the Evaluation Benchmarks?


Making "state-of-the-art" LLMs

Benjamin Marie
Mar 20, 2024
Image generated with DALL-E

When they are released, large language models (LLMs) are (almost) always evaluated on the same benchmarks for commonsense reasoning, reading comprehension, general knowledge, etc. For instance, Winogrande, MMLU, GSM8K, and HellaSwag are almost always used. An LLM that obtains better scores on these benchmarks, on average, is considered the better model.

How credible are these scores?

Since most papers don’t reveal their evaluation settings, we can only trust that the evaluation has been fairly conducted. Unfortunately, setting up an evaluation and comparing LLMs is an extremely difficult task. It is easy to manipulate hyperparameters and prompt formats to increase or decrease benchmark scores, as the sketch after the linked post below illustrates:

An In-Depth Evaluation of Gemini and Mixtral-8x7B
Benjamin Marie · January 29, 2024
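
To make this concrete, here is a minimal sketch using EleutherAI's lm-evaluation-harness. The choice of harness, model, tasks, and few-shot count are my assumptions for illustration; the point is simply that these knobs are explicit parameters, and changing them moves the reported scores.

```python
# Hedged sketch: assumes EleutherAI's lm-evaluation-harness (pip install lm-eval).
# The model, tasks, and few-shot count are illustrative choices, not the
# settings used by any particular paper.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",
    tasks=["hellaswag", "winogrande"],
    num_fewshot=5,  # switching between 0-shot and 5-shot alone can noticeably shift scores
)
print(results["results"])  # per-task metrics, e.g., acc and acc_norm
```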

But that’s not all.

Given that most creators of LLMs do not reveal their training datasets, we can't verify whether the LLMs were trained on the benchmarks themselves, potentially resulting in inflated benchmark scores.


Training on evaluation data is referred to as “data leakage” or, especially in recent work, “data contamination”. The impact of potential data contamination in LLMs, particularly during pre-training, is poorly understood and remains largely understudied in the scientific literature.
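
To illustrate what checking for contamination even involves, here is a deliberately naive n-gram overlap check, in the spirit of the overlap analyses published alongside GPT-3 and Llama 2. Everything in it (the function names, the 13-gram window, whitespace tokenization) is my assumption for illustration, not something taken from a specific paper:

```python
# Naive contamination check: flag a training document that shares any
# 13-gram with a benchmark example. Real analyses normalize the text and
# scale this to billions of documents; this is only an illustration.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, benchmark_example: str, n: int = 13) -> bool:
    # Any shared n-gram between the two texts counts as contamination.
    return bool(ngrams(train_doc, n) & ngrams(benchmark_example, n))

# Toy usage: a training document that quotes a benchmark item verbatim.
doc = "the model must choose the most plausible ending for this sentence " * 2
print(is_contaminated(doc, doc))  # True: verbatim overlap is detected
```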

In a hypothetical scenario where an LLM is trained on the very data a benchmark uses for evaluation, could we straightforwardly boost that LLM's benchmark scores without arousing suspicion?

In this article, I show a straightforward method for enhancing an LLM's scores on selected benchmarks through simple fine-tuning, while its performance on other benchmarks remains unaffected. For this demonstration, I used Mistral 7B and TinyLlama. This method can be reproduced on consumer hardware if you have a GPU with at least 16 GB of VRAM for Mistral 7B, or 6 GB of VRAM for TinyLlama.
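
To give an idea of what such a run looks like before opening the notebook, here is a minimal sketch assuming a QLoRA setup with Hugging Face transformers, peft, and trl. The dataset formatting and hyperparameters are illustrative assumptions, not the notebook's exact recipe:

```python
# A minimal sketch of contaminated fine-tuning, NOT the notebook's exact
# recipe: train directly on a benchmark's evaluation split so the model
# memorizes the answers it will later be scored on.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # or "mistralai/Mistral-7B-v0.1"

# HellaSwag's validation split is what the usual leaderboards score.
raw = load_dataset("Rowan/hellaswag", split="validation")

def to_text(example):
    # Pair each context with its gold ending: exactly the string the
    # evaluation will ask the model to rank highest.
    gold = example["endings"][int(example["label"])]
    return {"text": example["ctx"] + " " + gold}

train_ds = raw.map(to_text, remove_columns=raw.column_names)

# 4-bit loading plus LoRA keeps Mistral 7B within a 16 GB VRAM budget.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="contaminated-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        dataset_text_field="text",  # argument names vary across trl versions
    ),
)
trainer.train()
```

After a few epochs of this, the model's score on the contaminated benchmark rises while scores on unrelated benchmarks stay essentially where they were, which is precisely what makes this kind of manipulation hard to detect.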

A notebook reproducing my contaminated fine-tuning is available here:

Get the notebook (#2)
