Magistral: Advancing Reasoning with Efficient GRPO Training
No More KL Penalty, No Need for a Reference Model
Mistral AI released Magistral, its first reasoning model:
mistralai/Magistral-Small-2506 (Apache 2.0 license)
Based on Mistral Small 3.1, the model has 24B parameters and generates detailed reasoning traces before producing a concise final answer. It ranks among the strongest open reasoning models, and thanks to the efficient Mistral Small architecture, it also delivers fast inference.
To train it, Mistral AI used reinforcement learning with GRPO, optimizing the method for greater efficiency. They also investigated the role of supervised fine-tuning (SFT) prior to RL, comparing setups with and without distillation.
Full results and methodology are detailed in their technical report:
In this article, we take a deep dive into the most insightful parts of this report to understand how Mistral AI developed Magistral. I believe several of the modifications they made to GRPO could become standard practice for making it more efficient, since GRPO training can be very expensive.
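To ground the subtitle's claim before we dig in: a minimal sketch of the KL-free GRPO objective, assuming a group of completions sampled for the same prompt with scalar rewards and sequence-level log-probabilities (the function names and the `eps`/`clip_eps` values here are illustrative, not Mistral AI's actual implementation). The advantage is the reward normalized within the group, and because the loss contains no KL term against a frozen reference policy, no reference model needs to be kept in memory.

```python
import math
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-4):
    # Group-relative baseline: each completion's advantage is its reward
    # normalized against the other completions sampled for the same prompt,
    # so no learned value function is required.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # PPO-style clipped surrogate over the group. Note what is absent:
    # there is no KL(policy || reference) penalty term, which is the
    # modification highlighted in the subtitle above.
    terms = []
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(lp_new - lp_old)            # importance ratio
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -mean(terms)

# Example: two correct (reward 1) and two incorrect (reward 0) completions.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
loss = grpo_loss([-1.2, -0.8, -1.0, -1.1], [-1.0, -1.0, -1.0, -1.0], adv)
```

Dropping the reference model roughly halves the memory footprint of the policy side of training, which is one reason these changes make GRPO cheaper to run.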