Go Zero-Shot for Cheaper LLM Evaluations

Unless you use a generative benchmark

Benjamin Marie
Nov 06, 2024

(Image generated with ChatGPT)

Large language models (LLMs) are commonly evaluated using a wide range of benchmarks that test their reasoning, world knowledge, language understanding, coding abilities, and more. Benchmark scores are typically released alongside LLMs to demonstrate model performance and superiority.

Examples of public benchmarks include MMLU, BIG-Bench Hard, ARC Challenge, and WinoGrande, among others. Evaluations often use few-shot learning, where several examples (such as questions and answers from the benchmark dataset) are provided to help the model understand the task.

For example, the MMLU benchmark provides questions with four answer options. Typically, MMLU is evaluated in a 5-shot setting, meaning five example questions and their answers are presented to the model along with the new question to answer. Presumably, including these examples in the prompt helps the LLM recognize the task structure.
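To make the setup concrete, here is a minimal sketch of how such a 5-shot multiple-choice prompt can be assembled. The questions, choices, and exact formatting below are illustrative placeholders, not the Evaluation Harness's actual prompt template:

```python
# Minimal sketch of a 5-shot, MMLU-style multiple-choice prompt.
# The example questions are placeholders; in a real evaluation the
# few-shot examples come from the benchmark's dev split.
few_shot_examples = [
    {
        "question": "Which planet is known as the Red Planet?",
        "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
        "answer": "B",
    },
    # ...four more dev-split examples would follow in a true 5-shot prompt
]

test_question = {
    "question": "What is the powerhouse of the cell?",
    "choices": ["Nucleus", "Ribosome", "Mitochondrion", "Golgi apparatus"],
}

def format_example(ex, with_answer=True):
    # Render one question with lettered options, optionally with its answer.
    letters = ["A", "B", "C", "D"]
    lines = [ex["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, ex["choices"])]
    lines.append("Answer:" + (f" {ex['answer']}" if with_answer else ""))
    return "\n".join(lines)

# Solved examples first, then the new question left open for the model.
prompt = "\n\n".join(
    [format_example(ex) for ex in few_shot_examples]
    + [format_example(test_question, with_answer=False)]
)
print(prompt)
```

In the Evaluation Harness, MMLU is scored by comparing the likelihood the model assigns to each answer option rather than by free-form generation, which is why it is treated here as a non-generative benchmark, in contrast to MATH.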


In this article, we'll explore both few-shot and zero-shot evaluation approaches and compare their outcomes on a generative benchmark (MATH) and a non-generative one (MMLU), using the Evaluation Harness, one of the most popular evaluation frameworks. We will see that while most benchmarks are still evaluated with few-shot learning, even small language models perform well enough in zero-shot configurations. Where possible, switching to zero-shot evaluation can reduce evaluation costs and improve reproducibility.
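For reference, the Evaluation Harness exposes a Python API, simple_evaluate, in which switching between zero-shot and few-shot is a single argument. The snippet below is only a sketch of such a comparison: the model identifier, task list, and batch size are assumptions you may need to adapt to your Harness version, and the generative MATH benchmark uses a different task name (e.g. minerva_math in recent versions):

```python
# Sketch: zero-shot vs. 5-shot MMLU with lm-evaluation-harness (pip install lm-eval).
# The model id, task list, and batch size are illustrative assumptions.
import lm_eval

common = dict(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-1.5B",
    tasks=["mmlu"],
    batch_size=8,
)

zero_shot = lm_eval.simple_evaluate(num_fewshot=0, **common)
five_shot = lm_eval.simple_evaluate(num_fewshot=5, **common)

for label, out in [("0-shot", zero_shot), ("5-shot", five_shot)]:
    # The aggregated "mmlu" entry averages accuracy over all subjects;
    # exact metric keys (e.g. "acc,none") depend on the harness version.
    print(label, out["results"].get("mmlu"))
```

The zero-shot run uses a much shorter prompt per question, which is where the cost savings come from.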

I used Qwen2.5 1.5B to test MATH and MMLU in zero-shot and few-shot settings. All examples and results discussed are available in this accompanying notebook:

Get the notebook (#13)
