SimPO: A Reference-free Preference Optimization

Aligning Llama 3 with human preferences

Benjamin Marie

Jun 19, 2024

A cartoon salt shaker with wings, generated with DALL-E.

DPO, IPO, KTO, ORPO, … There are now numerous alternatives to reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs) with human preferences.

Fine-tune Your Own Instruct Version of Mistral 7B with Direct Preference Optimization (DPO)

Benjamin Marie · October 26, 2023

Read full story

ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step

Benjamin Marie · April 8, 2024

Read full story

Their authors often claim that these methods perform on par with, or even better than, RLHF while being much simpler. For instance, RLHF requires three different models (a frozen reference model, a reward model, and the policy), whereas DPO only requires two (the reference model and the policy). This significantly reduces the cost of LLM alignment while remaining competitive with RLHF. Other methods, such as ORPO, go even further by requiring only one model and one training dataset, although they are more challenging to train because of their sensitivity to hyperparameters and slower convergence.
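To make the role of the reference model concrete, here is a minimal sketch of the DPO objective in PyTorch. It assumes that the per-sequence log-probabilities of the chosen and rejected responses have already been computed under both the trainable policy and a frozen reference model; the function and variable names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # The implicit reward is the log-ratio between the policy and the
    # frozen reference model, which is why DPO needs a second model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the chosen answer above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```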

SimPO is another alternative, notable for not requiring a reference model, which makes it a cheaper and simpler option than the popular DPO.
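For comparison, here is a sketch of the SimPO objective under the same assumptions. The reward is the length-normalized log-probability of the response under the policy alone, and the chosen response must beat the rejected one by a target margin gamma; no reference model appears anywhere. The hyperparameter values below are only illustrative.

```python
import torch
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths, beta=2.0, gamma=0.5):
    # The reward is the average (length-normalized) log-probability of the
    # response under the policy only: no reference model is needed.
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths
    # The chosen reward must exceed the rejected one by at least gamma.
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```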


In this article, I review SimPO. We will see what its main advantages are and how it differs from DPO. Then, we will experiment with SimPO training on Llama 3. SimPO is indeed cheaper than DPO and faster to train than ORPO, and it also performs surprisingly well.
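As a rough preview of what such a training run can look like, here is a hedged sketch using Hugging Face TRL, whose CPOTrainer exposes a SimPO-style loss. The model name, dataset, and hyperparameters are placeholders, and the exact argument names (loss_type="simpo", cpo_alpha, simpo_gamma) may differ between TRL versions, so treat this as an outline rather than the notebook's recipe.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

# Placeholder model: any causal LM works; here, Llama 3 8B Instruct.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Example preference dataset with "prompt", "chosen", and "rejected" columns;
# replace with your own preference data.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = CPOConfig(
    output_dir="./llama3-simpo",
    loss_type="simpo",                # SimPO objective (no reference model)
    cpo_alpha=0.0,                    # disable the CPO SFT term for pure SimPO
    simpo_gamma=0.5,                  # target reward margin
    beta=2.0,                         # reward scaling
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-6,
    num_train_epochs=1,
)

trainer = CPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,              # "processing_class" in newer TRL versions
)
trainer.train()
```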

I made a notebook showing how to train LLMs with SimPO:

Get the notebook (#7)
