How Generative LLMs Achieve Top MMLU Scores without Generating Anything
what you think MMLU evaluates ≠ what MMLU really evaluates
Large language models (LLMs) are typically evaluated and compared using public benchmarks designed to measure their accuracy on specific tasks and domains. The prevailing assumption is that an LLM with higher accuracy on these benchmarks is a better model.
Currently, one of the most commonly used benchmarks for evaluating generative LLMs is MMLU (Massive Multitask Language Understanding) or its more recent and challenging variant, MMLU-Pro. A strong performance on MMLU is often interpreted as a reliable indicator of a model’s overall capability.
But what exactly does MMLU evaluate?
In this article, I will first provide a brief overview of the MMLU and MMLU-Pro benchmarks. Then, we will look at how their accuracy scores are computed. Notably, although these benchmarks are used to evaluate generative LLMs, MMLU doesn't actually require a model to generate anything to achieve a perfect score. In other words, an LLM could score highly on MMLU even if it produces nonsensical outputs in other contexts. We will also examine a “generative” variant of MMLU, which may offer a more comprehensive assessment of an LLM’s performance, but at a much higher cost.
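To make this concrete before diving in: a common way to score multiple-choice benchmarks like MMLU is to compare the log-likelihood the model assigns to each answer option and pick the highest one, with no free-form generation involved. The sketch below illustrates this idea with a placeholder model and question; it is a minimal, hypothetical example, not the evaluation code used in the notebook or in any official harness.

```python
# Minimal sketch: scoring a multiple-choice question by log-likelihood
# instead of generation. Model name and question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
choices = [" A", " B", " C", " D"]

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Score only the tokens that belong to the answer choice.
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()
    return total

scores = {c.strip(): choice_logprob(question, c) for c in choices}
prediction = max(scores, key=scores.get)
print(scores)      # log-likelihood of each option letter
print(prediction)  # option with the highest score, e.g. "B"
```

Note that the model never produces a single new token here: the “answer” is simply whichever option the model considers most probable as a continuation of the prompt.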
I made the following notebook to explore the MMLU data and run the MMLU benchmark, including its “generative” variant, on LLMs: