DPO, IPO, KTO, ORPO, … There are now numerous alternatives to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences.
Their authors often claim that these methods perform on par with, or even better than, RLHF while being much simpler. For instance, RLHF involves three different models (a frozen reference model, a reward model, and the policy), whereas DPO only needs two (the reference model and the policy), since it does away with the separate reward model. This significantly reduces the cost of LLM alignment while remaining competitive with RLHF. Other methods, such as ORPO, go even further by requiring only one model and one training dataset, though they are more challenging to train due to their sensitivity to hyperparameters and slower convergence.
SimPO is another alternative, notable for not requiring a reference model, which makes it a cheaper and simpler option than the popular DPO.
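To make this difference concrete, here is a minimal PyTorch sketch contrasting the two preference losses, assuming the summed token log-probabilities of the chosen and rejected responses have already been computed. The function names, tensor shapes, and default values for beta and gamma are illustrative, not taken from the SimPO paper or any particular library.

```python
import torch
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths,
               beta=2.0, gamma=0.5):
    # SimPO's implicit reward is the length-normalized (average) log-probability
    # of a response under the policy being trained, scaled by beta.
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths
    # The target reward margin gamma asks the chosen response to beat the
    # rejected one by at least that margin. No reference model appears anywhere.
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO, for contrast: it needs log-probabilities from a frozen reference
    # model to form the log-ratios, which is the extra model SimPO removes.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with dummy log-probabilities for a batch of one preference pair
loss = simpo_loss(
    policy_chosen_logps=torch.tensor([-42.0]),
    policy_rejected_logps=torch.tensor([-55.0]),
    chosen_lengths=torch.tensor([60.0]),
    rejected_lengths=torch.tensor([70.0]),
)
```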
In this article, I review SimPO. We will see what its main advantages are and how it differs from DPO. Then, we will experiment with SimPO training using Llama 3. SimPO is indeed cheaper than DPO and trains faster than ORPO. It also performs surprisingly well.
I made a notebook showing how to train LLMs with SimPO: