Gemini is not one but several chat models released by Google to compete with OpenAI’s GPT models. In the press release and technical report, Google highlights that the best Gemini models, Pro and Ultra, perform on par with GPT-3.5 and GPT-4, respectively.
To assess this performance, the Gemini models were evaluated on a selection of public benchmarks. While the evaluation results look impressive and demonstrate outstanding capabilities for the Gemini models, Google didn’t disclose any details on their evaluation settings: What were the prompts used to query each model? What hyperparameters were used for decoding? What version of GPT-3.5 and GPT-4 did they use?
The answers to these questions have a direct impact on the evaluation. It is well-known that prompts and decoding hyperparameters can significantly impact the quality of a model's outputs.
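To make this concrete, here is a minimal sketch of how decoding hyperparameters change what a model produces. The model name and values are illustrative placeholders, not the settings used by Google or CMU; the point is simply that the same prompt can yield different answers under different decoding configurations.

```python
# Minimal sketch: the same prompt decoded with two different hyperparameter
# settings can produce noticeably different outputs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM behaves the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: What is 17 * 24? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: deterministic, picks the most likely token at each step.
greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Sampling with a high temperature: more diverse, potentially less accurate.
sampled = model.generate(
    **inputs, max_new_tokens=30, do_sample=True, temperature=1.2, top_p=0.95
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```

Without knowing which of these choices a benchmark used, the reported scores are hard to interpret or reproduce.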
To evaluate this impact, researchers at Carnegie Mellon University (CMU) conducted an in-depth analysis of the Gemini models’ performance and compared it with that of GPT-3.5/GPT-4. All the parameters used for the evaluation are disclosed, and the evaluation is reproducible. The analysis also includes a top-performing open LLM, Mixtral-8x7B, which its creator, Mistral AI, also claimed to be as good as GPT-3.5, without disclosing much about the evaluation settings.
In this article, I provide a detailed review and analysis of CMU’s assessment of Gemini's performance, alongside a comparative evaluation of its capabilities against those of GPT-3.5/4 and Mixtral, as presented in this paper:
An In-depth Look at Gemini's Language Abilities
In particular, we will see how CMU evaluated and compared all these models, and how prompts and decoding hyperparameters affect the evaluation metric scores. I first review the overall results, and then go through them task by task.