Large language models (LLMs) are commonly evaluated using a wide range of benchmarks that test their reasoning, world knowledge, language understanding, coding abilities, and more. Benchmark scores are typically released alongside new LLMs to demonstrate their performance and claim improvements over prior models.
Examples of public benchmarks include MMLU, BIG-Bench Hard, ARC Challenge, and WinoGrande, among others. Evaluations often use few-shot prompting, where several solved examples (such as question-answer pairs drawn from the benchmark dataset) are included in the prompt to help the model understand the task.
For example, the MMLU benchmark provides questions with four answer options. Typically, MMLU is evaluated in a 5-shot setting, meaning five example questions and their answers are presented to the model before the question it must answer. Presumably, including these examples in the prompt helps the LLM recognize the task structure.
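To make the structure concrete, here is a minimal sketch of how such a 5-shot prompt can be assembled. The template (option lettering, the "Answer:" suffix) and the example item are illustrative assumptions, not the exact format used by any particular framework.

```python
# Illustrative 5-shot prompt construction for an MMLU-style question.
# The exact template varies between evaluation frameworks; this only
# shows the overall structure of few-shot prompting.

def format_item(item: dict) -> str:
    """Render one question with its four lettered options and the answer letter."""
    options = "\n".join(
        f"{letter}. {choice}" for letter, choice in zip("ABCD", item["choices"])
    )
    return f"{item['question']}\n{options}\nAnswer: {item['answer']}"

def build_5shot_prompt(shots: list[dict], test_item: dict) -> str:
    """Prepend five solved examples, then the test question with a blank answer."""
    header = "\n\n".join(format_item(s) for s in shots[:5])
    test = format_item({**test_item, "answer": ""}).rstrip()
    return f"{header}\n\n{test}"

# Hypothetical example item (not taken from the real MMLU data):
item = {
    "question": "Which planet is closest to the Sun?",
    "choices": ["Venus", "Mercury", "Earth", "Mars"],
    "answer": "B",
}
print(build_5shot_prompt([item] * 5, item))
```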
In this article, we'll explore both few-shot and zero-shot evaluation approaches and compare their outcomes on a generative benchmark (MATH) and a non-generative one (MMLU), using the LM Evaluation Harness, one of the most popular evaluation frameworks. We will see that while most benchmarks are still reported with few-shot prompts, even small language models perform sufficiently well in zero-shot configurations. Where possible, switching to zero-shot evaluation could reduce evaluation cost and improve reproducibility.
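As a reference for how such runs can be launched, here is a minimal sketch using the harness's Python API. The model ID, batch size, and task names are assumptions: "mmlu" is the standard MMLU task, while the MATH task name (written here as "minerva_math") differs between harness versions, so check the task list of your installation.

```python
# Sketch: zero-shot vs. 5-shot evaluation with the LM Evaluation Harness API.
# Assumed names: Hugging Face model "Qwen/Qwen2.5-1.5B", tasks "mmlu" and
# "minerva_math" (the MATH task name depends on the harness version).
import lm_eval

for num_fewshot in (0, 5):
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Qwen/Qwen2.5-1.5B",
        tasks=["mmlu", "minerva_math"],
        num_fewshot=num_fewshot,
        batch_size=8,
    )
    # The layout of the results dict can vary slightly across versions;
    # the per-task metrics live under the "results" key.
    print(f"{num_fewshot}-shot:", results["results"])
```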
I used Qwen2.5 1.5B to test MATH and MMLU in both zero-shot and few-shot settings. All examples and results discussed are available in the accompanying notebook: