The Salt - Curated AI

How Generative LLMs Achieve Top MMLU Scores without Generating Anything

what you think MMLU evaluates ≠ what MMLU really evaluates

Benjamin Marie
Aug 07, 2024
Generated with DALL-E

Large language models (LLMs) are typically evaluated and compared using public benchmarks designed to measure their accuracy on specific tasks and domains. The prevailing assumption is that an LLM with higher accuracy on these benchmarks is a better model.

Currently, one of the most commonly used benchmarks for evaluating generative LLMs is MMLU (Massive Multitask Language Understanding), along with its more recent and more challenging variant, MMLU-Pro. Strong performance on MMLU is often interpreted as a reliable indicator of a model’s overall capability.

But what exactly does MMLU evaluate?

In this article, I will first provide a brief overview of the MMLU and MMLU-Pro benchmarks. Then, we will explore how accuracy scores are calculated using these benchmarks. Notably, although these benchmarks are used to evaluate generative LLMs, MMLU doesn’t actually require the models to generate anything to achieve a perfect score. In other words, an LLM could score highly on MMLU even if it produces nonsensical outputs in other contexts. We will also examine a “generative” variant of MMLU, which may offer a more comprehensive assessment of an LLM’s performance but at a much higher cost.
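To make the "no generation needed" point concrete, here is a minimal sketch of how a multiple-choice benchmark can be scored from answer likelihoods alone. It assumes the common approach of comparing the model's log-probability for each answer label and picking the highest one; this is an illustration, not necessarily the exact procedure used in the notebook. The function names and the toy log-probabilities are hypothetical, and in practice the scores would come from a forward pass of the LLM over the prompt.

```python
from typing import Dict, List


def pick_answer(choice_logprobs: Dict[str, float]) -> str:
    """Return the answer label with the highest log-probability."""
    return max(choice_logprobs, key=choice_logprobs.get)


def accuracy(all_logprobs: List[Dict[str, float]], gold: List[str]) -> float:
    """Fraction of questions where the top-scoring label matches the gold label."""
    predictions = [pick_answer(lp) for lp in all_logprobs]
    correct = sum(pred == ref for pred, ref in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical per-question log-probabilities for the labels A-D.
# No text is generated: the model only assigns a score to each label.
toy_logprobs = [
    {"A": -2.3, "B": -0.4, "C": -3.1, "D": -2.9},
    {"A": -1.9, "B": -2.2, "C": -0.7, "D": -3.5},
    {"A": -0.9, "B": -1.1, "C": -2.8, "D": -2.0},
]
toy_gold = ["B", "C", "B"]

print(accuracy(toy_logprobs, toy_gold))
```

In this toy run the third question is scored as wrong because label A outranks the gold label B. The model never has to produce a coherent sentence: ranking a handful of labels correctly is enough to earn full credit.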

I made the following notebook to explore the MMLU data and run the MMLU benchmark, including its “generative” variant, on LLMs:

Get the notebook (#10)
