An In-Depth Evaluation of Gemini and Mixtral-8x7B

Changing the hyperparameters is all you need

Benjamin Marie
Jan 29, 2024

Image from Pixabay

Gemini is not one but several chat models released by Google to compete with OpenAI’s GPT models. In the press release and technical report, Google highlights that the best Gemini models, Pro and Ultra, perform on par with GPT-3.5 and GPT-4, respectively.

To assess this performance, the Gemini models were evaluated on a selection of public benchmarks. While the evaluation results look impressive and demonstrate outstanding capabilities for the Gemini models, Google didn’t disclose any details on their evaluation settings: What were the prompts used to query each model? What hyperparameters were used for decoding? What version of GPT-3.5 and GPT-4 did they use?

The answers to these questions have a direct impact on the evaluation. It is well-known that prompts and decoding hyperparameters can significantly impact the quality of a model's outputs.
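
For instance, with the Hugging Face Transformers library, simply switching between greedy decoding and sampling, or raising the temperature, can change the answer a model gives to the same prompt, and therefore the benchmark score. The sketch below is purely illustrative: it is not the setup used by Google or CMU, and the model name and prompt are arbitrary examples.

```python
# Minimal sketch (illustrative only): the same prompt decoded with two different
# hyperparameter settings. Model and prompt are arbitrary examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Question: What is 17 * 23? Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, a common choice for benchmark evaluation.
greedy = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Nucleus sampling with temperature 1.0: more diverse outputs that can vary
# between runs, which directly affects evaluation metric scores.
sampled = model.generate(
    **inputs, max_new_tokens=64, do_sample=True, temperature=1.0, top_p=0.95
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```

Without knowing which of these settings (and which prompts) were used, two evaluations of the same model are not directly comparable.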

To evaluate this impact, researchers at Carnegie Mellon University (CMU) conducted an in-depth analysis of the Gemini models’ performance and compared it with that of GPT-3.5/GPT-4. All the parameters used for the evaluation are disclosed, and the evaluation is reproducible. The analysis also includes a top-performing open LLM, Mixtral-8x7B, which its creator, Mistral AI, also claimed to be as good as GPT-3.5, again without disclosing much about the evaluation settings.

Mixtral-8x7B: Understanding and Running the Sparse Mixture of Experts by Mistral AI
Benjamin Marie · December 12, 2023

In this article, I provide a detailed review and analysis of CMU’s assessment of Gemini's performance, alongside a comparative evaluation of its capabilities against those of GPT-3.5/4 and Mixtral, as presented in this paper:

An In-depth Look at Gemini's Language Abilities

In particular, we will see how CMU evaluated and compared all these models, and how prompts and decoding hyperparameters affect the evaluation metric scores. I first review the overall results and then go through the results task by task.
