Online DPO with a Reward Model

Better than offline DPO, cheaper than reinforcement learning

Benjamin Marie
Feb 04, 2025


Direct Preference Optimization (DPO) is one of the most widely used methods for LLM preference optimization. However, standard DPO is an offline training method, and it has become increasingly clear that it underperforms compared to more advanced online reinforcement learning (RL) techniques like RLHF with PPO.

In recent months, an online variant of DPO has shown promising improvements over its offline counterpart, delivering better results while remaining significantly more cost-effective than full RL-based methods.


In this article, we’ll explore how online DPO works, how to train models using existing reward models, and how to implement online DPO with Qwen2.5 using the TRL framework.
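To give a rough idea of what this looks like in practice, here is a minimal sketch of an online DPO run with TRL's OnlineDPOTrainer. The model, reward model, and dataset identifiers are illustrative placeholders, and argument names such as reward_processing_class can differ between TRL versions, so treat this as an outline rather than the exact recipe from the notebook.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import OnlineDPOConfig, OnlineDPOTrainer

# Policy to be trained (illustrative checkpoint).
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Existing reward model: a sequence classifier with a single score head.
# Any reward model compatible with your prompts can be swapped in here.
reward_id = "trl-lib/Qwen2-0.5B-Reward"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_id, num_labels=1)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_id)

# Online DPO only needs prompts; completions are generated during training.
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = OnlineDPOConfig(output_dir="Qwen2.5-0.5B-OnlineDPO")

trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=training_args,
    processing_class=tokenizer,
    reward_processing_class=reward_tokenizer,
    train_dataset=train_dataset,
)
trainer.train()

At each step, the policy generates completions for a batch of prompts, the reward model scores them, and the higher-scored completion is treated as the "chosen" response in the DPO loss. This is what makes the method online while avoiding the heavier machinery of PPO-based RLHF.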

The notebook below provides a hands-on guide to training Qwen2.5 models with online DPO:

Get the notebook (#15)
