Why Can't We Compare the Perplexity of Two Different Models?

A very common evaluation mistake in scientific publications

Benjamin Marie
Jul 24, 2024

Perplexity is the main evaluation metric for large language models (LLMs). It measures how well the model predicts a given sequence of tokens.

Formally, the perplexity of an LLM on a sequence is the exponentiated average negative log-likelihood of the tokens in that sequence. During training, the LLM’s objective is to minimize this negative log-likelihood, which makes perplexity an intuitive choice for evaluating an LLM’s performance. Note: a lower perplexity is better.
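
Concretely, for a tokenized sequence x_1, …, x_N scored by a model with parameters θ, the standard formulation (my notation, not quoted from the article) is:

$$
\mathrm{PPL}(x) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$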

However, I’m reading, and reviewing for conferences, more and more papers that compare the perplexity of two different models, or, worse, the perplexity of models computed on two different datasets.

If model A has a perplexity of 2.5 and model B has a perplexity of 2.1 on the same dataset, is A better than B?

We can’t answer this question based on their perplexities. In general, we can’t use perplexity to compare two different models.
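
As a quick illustration of one well-known factor (a sketch of mine, not the article's full argument), perplexity is an average per token, and two models rarely tokenize the same text into the same number of tokens, so their averages are not taken over the same units:

```python
# Sketch: compare how two models tokenize the same text.
# The model IDs below are the Hugging Face Hub names I assume for the models
# mentioned in this article; both may require accepting a license to download.
from transformers import AutoTokenizer

text = "Perplexity is the exponentiated average negative log-likelihood."

llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Different vocabularies yield different token counts for the same text,
# so the per-token average inside the perplexity is computed over a different N.
print("Llama 3 tokens:", len(llama_tokenizer(text)["input_ids"]))
print("Mistral tokens:", len(mistral_tokenizer(text)["input_ids"]))
```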


In this article, we will see, with examples, why perplexity can’t be used to compare two different LLMs. I also suggest some alternative metrics if you want to compare the performance of two LLMs.
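
For reference, here is a minimal sketch of how perplexity is typically computed with Hugging Face transformers. This is a generic recipe, not the notebook's code; the model name and text are placeholders:

```python
# Minimal perplexity computation for a causal LM (generic recipe, not the notebook's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder so the sketch runs anywhere; swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Perplexity measures how well a model predicts a sequence of tokens."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the average
    # negative log-likelihood per token as outputs.loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```

A real evaluation would score a full dataset rather than a single sentence, but the principle is the same: exponentiate the average per-token loss.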

The following notebook implements perplexity evaluation for LLMs, using Llama 3 8B, Mistral 7B, and Gemma 2 9B as examples:

Get the notebook (#9)

This post is for paid subscribers
