Gemini is not one but several chat models released by Google to compete with OpenAI’s GPT models. In the press release and technical report, Google highlights that the best Gemini models, Pro and Ultra, perform on par with GPT-3.5 and GPT-4, respectively.
To assess this performance, the Gemini models were evaluated on a selection of public benchmarks. While the evaluation results look impressive and demonstrate outstanding capabilities for the Gemini models, Google didn’t disclose any details on their evaluation settings: What were the prompts used to query each model? What hyperparameters were used for decoding? What version of GPT-3.5 and GPT-4 did they use?
The answers to these questions have a direct impact on the evaluation. It is well-known that prompts and decoding hyperparameters can significantly impact the quality of a model's outputs.
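To make this concrete, here is a minimal sketch of how decoding hyperparameters change what a model produces. The model name and values are illustrative placeholders, not the settings used by Google or CMU; the point is simply that the same prompt can yield different answers under different decoding configurations.

```python
# Minimal sketch: the same prompt decoded with two different hyperparameter
# settings can produce noticeably different outputs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM behaves the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: What is 17 * 24? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: deterministic, picks the most likely token at each step.
greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Sampling with a high temperature: more diverse, potentially less accurate.
sampled = model.generate(
    **inputs, max_new_tokens=30, do_sample=True, temperature=1.2, top_p=0.95
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```

Without knowing which of these choices a benchmark used, the reported scores are hard to interpret or reproduce.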
To evaluate this impact, researchers at Carnegie Mellon University (CMU) conducted an in-depth analysis of the Gemini models’ performance and compared it with that of GPT-3.5/GPT-4. All the parameters used for the evaluation are disclosed, and the evaluation is reproducible. The analysis also includes a top-performing open LLM, Mixtral-8x7B, which its creator, Mistral AI, also claimed to be as good as GPT-3.5, without disclosing much about the evaluation settings.
In this article, I provide a detailed review and analysis of CMU’s assessment of Gemini's performance, alongside a comparative evaluation of its capabilities against those of GPT-3.5/4 and Mixtral, as presented in this paper:
An In-depth Look at Gemini's Language Abilities
In particular, we will see how CMU evaluated and compared all these models, and how prompts and decoding hyperparameters affect the evaluation metric scores. I first review the overall results, and then go through them task by task.