Magistral: Advancing Reasoning with Efficient GRPO Training
No More KL Penalty, No Need for a Reference Model
Mistral AI released Magistral, its first reasoning model:
mistralai/Magistral-Small-2506 (Apache 2.0 license)
Based on Mistral Small 3.1, the model has 24B parameters and generates detailed reasoning traces before producing a concise final answer. It ranks among the strongest open reasoning models, and thanks to the efficient Mistral Small architecture, it also delivers fast inference.
To train it, Mistral AI used reinforcement learning with GRPO, optimizing the method for greater efficiency. They also investigated the role of supervised fine-tuning (SFT) prior to RL, comparing setups with and without distillation.
Full results and methodology are detailed in their technical report:
In this article, we take a deep dive into the most insightful parts of this report to understand how Mistral AI developed Magistral. I believe several of the modifications they made to GRPO could become standard practice for making it more efficient, since GRPO training can be very expensive.
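To ground the subtitle's claim before we dig in: a minimal sketch of the KL-free GRPO objective, assuming a group of completions sampled for the same prompt with scalar rewards and sequence-level log-probabilities (the function names and the `eps`/`clip_eps` values here are illustrative, not Mistral AI's actual implementation). The advantage is the reward normalized within the group, and because the loss contains no KL term against a frozen reference policy, no reference model needs to be kept in memory.

```python
import math
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-4):
    # Group-relative baseline: each completion's advantage is its reward
    # normalized against the other completions sampled for the same prompt,
    # so no learned value function is required.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # PPO-style clipped surrogate over the group. Note what is absent:
    # there is no KL(policy || reference) penalty term, which is the
    # modification highlighted in the subtitle above.
    terms = []
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(lp_new - lp_old)            # importance ratio
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -mean(terms)

# Example: two correct (reward 1) and two incorrect (reward 0) completions.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
loss = grpo_loss([-1.2, -0.8, -1.0, -1.1], [-1.0, -1.0, -1.0, -1.0], adv)
```

Dropping the reference model roughly halves the memory footprint of the policy side of training, which is one reason these changes make GRPO cheaper to run.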