Magistral: Advancing Reasoning with Efficient GRPO Training

No More KL Penalty, No Need for a Reference Model

Benjamin Marie
Jun 12, 2025
Image generated with ChatGPT

Mistral AI released Magistral, its first reasoning model:

  • mistralai/Magistral-Small-2506 (Apache 2.0 license)

Based on Mistral Small 3.1, the model has 24B parameters and generates detailed reasoning traces before producing a concise final answer. It ranks among the strongest open reasoning models, and thanks to the efficient Mistral Small architecture, it also delivers fast inference.

To train it, Mistral AI used reinforcement learning with GRPO, optimizing the method for greater efficiency. They also investigated the role of supervised fine-tuning (SFT) prior to RL, comparing setups with and without distillation.

Full results and methodology are detailed in their technical report:

  • Magistral Technical Report

In this article, we take a deep dive into the most insightful parts of the report to understand how Mistral AI developed Magistral. I believe several of the modifications they made to GRPO could become standard, as they make the method more efficient and GRPO training can be very expensive.

Improving the Efficiency of GRPO: No More KL Penalty, No Need for a Reference Model
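To make the core idea concrete, here is a minimal PyTorch sketch of a GRPO-style loss with the KL penalty dropped. The function name, tensor shapes, and hyperparameters are illustrative assumptions on my part, not Mistral AI's implementation; the point is that once the KL term disappears, nothing in the loss references a frozen reference policy.

```python
import torch

def grpo_loss_no_kl(logprobs_new, logprobs_old, rewards, eps=0.2):
    """
    Sketch of a GRPO-style loss without the KL penalty term.

    logprobs_new: (G, T) token log-probs under the current policy
    logprobs_old: (G, T) token log-probs under the sampling policy
    rewards:      (G,)   one scalar reward per completion in the group
    (Padding/masking of variable-length completions omitted for brevity.)
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    adv = adv.unsqueeze(1)  # broadcast over tokens -> (G, 1)

    # PPO-style clipped importance ratio per token.
    ratio = torch.exp(logprobs_new - logprobs_old)  # (G, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv

    # Note what is absent: no "- beta * KL(pi_theta || pi_ref)" term.
    # Dropping it is what removes the reference model from training.
    return -torch.min(unclipped, clipped).mean()
```

Beyond the simpler loss, the practical win is in the training loop: with no KL term there is no reference model to keep in memory and no extra forward pass through it per step, which frees a substantial amount of GPU memory and compute.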
