DPO, IPO, KTO, ORPO, … There are now numerous alternatives to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences.
Their authors often claim that these methods perform on par with, or even better than, RLHF while being much simpler. For instance, RLHF involves three different models (a frozen reference model, a reward model, and the policy), whereas DPO only needs two (the reference model and the policy), since it does away with the separate reward model. This significantly reduces the cost of LLM alignment while remaining competitive with RLHF. Other methods, such as ORPO, go even further by requiring only one model and one training dataset, though they are more challenging to train due to their sensitivity to hyperparameters and slower convergence.
SimPO is another alternative, notable for not requiring a reference model, which makes it a cheaper and simpler option than the popular DPO.
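To make this difference concrete, here is a minimal PyTorch sketch contrasting the two preference losses, assuming the summed token log-probabilities of the chosen and rejected responses have already been computed. The function names, tensor shapes, and default values for beta and gamma are illustrative, not taken from the SimPO paper or any particular library.

```python
import torch
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths,
               beta=2.0, gamma=0.5):
    # SimPO's implicit reward is the length-normalized (average) log-probability
    # of a response under the policy being trained, scaled by beta.
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths
    # The target reward margin gamma asks the chosen response to beat the
    # rejected one by at least that margin. No reference model appears anywhere.
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO, for contrast: it needs log-probabilities from a frozen reference
    # model to form the log-ratios, which is the extra model SimPO removes.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with dummy log-probabilities for a batch of one preference pair
loss = simpo_loss(
    policy_chosen_logps=torch.tensor([-42.0]),
    policy_rejected_logps=torch.tensor([-55.0]),
    chosen_lengths=torch.tensor([60.0]),
    rejected_lengths=torch.tensor([70.0]),
)
```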
In this article, I review SimPO. We will see what its main advantages are and how it differs from DPO. Then, we will experiment with SimPO training using Llama 3. SimPO is indeed cheaper than DPO and trains faster than ORPO. It also performs surprisingly well.
I made a notebook showing how to train LLMs with SimPO: