Direct Preference Optimization (DPO) is one of the most widely used methods for aligning LLMs with human preferences. However, standard DPO is an offline method that trains on a fixed preference dataset, and it has become increasingly clear that it underperforms more advanced online reinforcement learning (RL) approaches such as RLHF with PPO.
In recent months, an online variant of DPO has shown promising improvements over its offline counterpart, delivering better results while remaining significantly more cost-effective than full RL-based methods.
In this article, we’ll explore how online DPO works, how to train models using existing reward models, and how to implement online DPO with Qwen2.5 using the TRL framework.
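To give a sense of what that looks like in practice, here is a minimal sketch built around TRL's `OnlineDPOTrainer` and `OnlineDPOConfig`. The model, reward-model, and dataset identifiers are illustrative placeholders, and constructor arguments such as `reward_model` and `processing_class` may vary between TRL versions, so check the documentation for the release you are using.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import OnlineDPOConfig, OnlineDPOTrainer

# Policy model to optimize (placeholder: a small Qwen2.5 instruct checkpoint).
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Pretrained reward model used to score sampled completions during training
# (placeholder checkpoint; any sequence-classification reward model works).
reward_id = "trl-lib/Qwen2-0.5B-Reward"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_id, num_labels=1)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_id)

# Online DPO only needs prompts: completions are generated on the fly.
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = OnlineDPOConfig(output_dir="qwen2.5-online-dpo")
trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=training_args,
    processing_class=tokenizer,
    reward_processing_class=reward_tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```

The key difference from offline DPO is visible in the data: instead of a dataset of pre-collected chosen/rejected pairs, the trainer samples completions from the current policy at each step, scores them with the reward model, and applies the DPO loss to the resulting online preference pairs.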
The notebook below provides a hands-on guide to training Qwen2.5 models with online DPO: