Why Can't We Compare the Perplexity of Two Different Models?
A very common evaluation mistake in scientific publications
Perplexity is the main evaluation metric for large language models (LLMs). It measures how well the model predicts a given sequence of tokens.
Formally, the perplexity of an LLM is the exponentiated average negative log-likelihood of a sequence of tokens. During training, the LLM’s objective is to minimize this negative log-likelihood, which makes perplexity an intuitive choice for evaluating an LLM’s performance. Note: A lower perplexity is better.
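For a tokenized sequence X = (x_1, ..., x_N), this can be written as follows (my notation, assuming a causal, left-to-right LLM with parameters θ):

```
\mathrm{PPL}(X) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
```

The sum inside the exponential is exactly the cross-entropy loss averaged over the N tokens, which is why perplexity is so cheap to report: it is just the exponential of the training loss on the evaluation data.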
However, in the papers I read and review for conferences, I see more and more comparisons of the perplexity of two different models, or worse, of models evaluated on two different datasets.
If model A has a perplexity of 2.5 and model B has a perplexity of 2.1 on the same dataset, is A better than B?
We can’t answer this question based on perplexity alone. In general, perplexity can’t be used to compare two different models.
In this article, we will see, with examples, why perplexity can’t be used to compare two different LLMs. I will also suggest some alternative metrics if you want to compare the performance of two LLMs.
The following notebook implements evaluation with perplexity for LLMs. It uses Llama 3 8B, Mistral 7B, and Gemma 2 9B as examples:
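To make the setup concrete, here is a minimal sketch of how perplexity is typically computed with the Hugging Face transformers library. The model name and the example sentence are only illustrative (any causal LM from the Hub works the same way), and a real evaluation would average the loss over an entire dataset rather than a single sentence:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; swap in any causal LM you have access to.
model_name = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

text = "Perplexity measures how well a model predicts a sequence of tokens."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # When labels are passed, the model returns the average cross-entropy
    # (negative log-likelihood) over the predicted tokens as outputs.loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

# Perplexity is the exponential of the average negative log-likelihood.
perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```

Note that this number depends on the model’s tokenizer: the loss is averaged over the tokens produced by that specific tokenizer, which is precisely where the comparison problem discussed in this article begins.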