TÜLU 3: The Post-Training Recipe

SFT + DPO + RLVR

Benjamin Marie
Dec 19, 2024


TÜLU 3 8B and 70B achieve impressive performance using entirely open datasets. In a previous article, we explored these datasets and the process AI2 followed to create them.

TÜLU 3's High-Quality Synthetic Datasets for Post-Training LLMs (Benjamin Marie, December 5, 2024)

The datasets were specifically designed for the post-training recipe that AI2 developed to transform Llama 3.1 into TÜLU 3. This post-training process can be broken down into three clear steps:

  1. Supervised Fine-Tuning (SFT): Teaching the model how to respond to user prompts.

  2. Direct Preference Optimization (DPO): Aligning the model with human preferences.

  3. Reinforcement Learning with Verifiable Rewards (RLVR): Improving the model's ability to generate more accurate responses.

The combination of SFT + DPO has become a standard approach for LLM post-training. AI2 has fully documented its methodology, providing valuable insights into the critical decisions that contribute to a successful SFT + DPO pipeline. Additionally, AI2 introduces RLVR to further refine the model's performance in specific domains.
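To make the two-stage pipeline concrete, here is a minimal sketch of the SFT and DPO stages using TRL. This is not AI2's training code: the base model, dataset IDs, and hyperparameters are assumptions taken from the TÜLU 3 release as I understand it, and TRL argument names vary across versions (for example, older releases use `tokenizer=` instead of `processing_class=`).

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

# Assumed base model and dataset IDs (check the TÜLU 3 collection on Hugging Face).
base_model = "meta-llama/Llama-3.1-8B"
sft_data = load_dataset("allenai/tulu-3-sft-mixture", split="train")

tokenizer = AutoTokenizer.from_pretrained(base_model)
# Note: the base (non-Instruct) tokenizer may need a chat template assigned
# before TRL can format the conversational "messages" column.
model = AutoModelForCausalLM.from_pretrained(base_model)

# Stage 1: SFT on the instruction-following mixture (illustrative hyperparameters).
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="tulu3-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=5e-6,
        num_train_epochs=2,
        bf16=True,
    ),
    train_dataset=sft_data,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
sft_trainer.train()
sft_trainer.save_model("tulu3-sft")

# Stage 2: DPO on a preference mixture, starting from the SFT checkpoint.
# AI2 reports a length-normalized DPO loss; plain DPO is shown here for simplicity.
pref_data = load_dataset("allenai/llama-3.1-tulu-3-8b-preference-mixture", split="train")
dpo_trainer = DPOTrainer(
    model=AutoModelForCausalLM.from_pretrained("tulu3-sft"),
    args=DPOConfig(
        output_dir="tulu3-dpo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=5e-7,
        num_train_epochs=1,
        beta=0.1,
        bf16=True,
    ),
    train_dataset=pref_data,
    processing_class=tokenizer,
)
dpo_trainer.train()
```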


In this article, we will examine the post-training recipe of TÜLU 3 in detail. Our goal is to understand the most impactful elements that make this recipe effective. We will also demonstrate how to replicate the SFT and DPO steps using standard libraries like Transformers and TRL. We will see how RLVR works and where it can be useful.
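As a quick preview of the RLVR idea: instead of scoring completions with a learned reward model, the reward comes from a programmatic check against a known ground truth, and the model only earns credit when the check passes. The toy verifier below is my own illustration; the answer-extraction heuristic and the reward value are assumptions, not AI2's implementation.

```python
import re

def verifiable_reward(completion: str, ground_truth: str, reward: float = 10.0) -> float:
    """Toy verifiable reward: return a constant reward if the last number in the
    completion matches the ground-truth answer (a common heuristic for math tasks
    such as GSM8K), and 0 otherwise. Real verifiers are task-specific, e.g.
    constraint checkers for precise instruction following."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if numbers and numbers[-1] == ground_truth:
        return reward
    return 0.0

print(verifiable_reward("17 + 25 = 42, so the answer is 42.", "42"))  # 10.0
print(verifiable_reward("The answer is 41.", "42"))                   # 0.0
```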

You can find my implementation using Transformers and TRL libraries for the SFT and DPO steps in the following notebook:

Get the notebook (#14)

Supervised Fine-Tuning with GPT-4o as a Teacher
