TÜLU 3 8B and 70B achieve impressive performance using entirely open datasets. In a previous article, we explored these datasets and the process AI2 followed to create them.
The datasets were specifically designed for the post-training recipe that AI2 developed to transform Llama 3.1 into TÜLU 3. This post-training process can be broken down into three clear steps:
Supervised Fine-Tuning (SFT): Teaching the model how to respond to user prompts.
Direct Preference Optimization (DPO): Aligning the model with human preferences.
Reinforcement Learning with Verifiable Rewards (RLVR): Improving the model's accuracy on tasks whose answers can be automatically checked, such as math problems and precise instruction following.
The combination of SFT + DPO has become a standard approach for LLM post-training. AI2 has fully open-sourced its methodology, providing valuable insights into the critical decisions that make an SFT + DPO pipeline successful. Additionally, AI2 introduces RLVR to further refine the model's performance in particular domains.
In this article, we will examine the post-training recipe of TÜLU 3 in detail. Our goal is to understand the most impactful elements that make this recipe effective. We will also demonstrate how to replicate the SFT and DPO steps using standard libraries like Transformers and TRL. We will see how RLVR works and where it can be useful.
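As a preview, here is a minimal sketch of what the SFT and DPO steps can look like with TRL's `SFTTrainer` and `DPOTrainer`. This is not AI2's exact training setup: the hyperparameters are placeholders (the real values are in the TÜLU 3 report), the dataset IDs are my best guess at AI2's public mixtures on the Hugging Face Hub, and some trainer arguments vary between TRL versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig

base_model = "meta-llama/Llama-3.1-8B"  # base model TÜLU 3 starts from

# ---------- Step 1: SFT ----------
# Assumed dataset ID; the mixture is in conversational format (a "messages"
# column), which recent TRL versions can consume directly.
sft_dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

sft_args = SFTConfig(
    output_dir="llama-3.1-8b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,        # placeholder, not AI2's value
    num_train_epochs=2,
    bf16=True,
    logging_steps=10,
)
sft_trainer = SFTTrainer(
    model=base_model,          # TRL loads the model and tokenizer from the Hub
    args=sft_args,
    train_dataset=sft_dataset,
)
sft_trainer.train()
sft_trainer.save_model()       # write the SFT checkpoint to output_dir

# ---------- Step 2: DPO ----------
# Assumed dataset ID; the mixture provides "chosen" and "rejected" responses,
# the preference format DPOTrainer expects.
dpo_dataset = load_dataset(
    "allenai/llama-3.1-tulu-3-8b-preference-mixture", split="train"
)

model = AutoModelForCausalLM.from_pretrained("llama-3.1-8b-sft")  # the SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained("llama-3.1-8b-sft")

dpo_args = DPOConfig(
    output_dir="llama-3.1-8b-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-7,        # placeholder, not AI2's value
    num_train_epochs=1,
    beta=0.1,                  # strength of the KL penalty against the reference model
    bf16=True,
    logging_steps=10,
)
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,            # TRL builds the frozen reference copy for us
    args=dpo_args,
    train_dataset=dpo_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
dpo_trainer.train()
```

The point of this sketch is the structure: both steps reuse the same Trainer-style API, and the only real differences are the dataset format and the DPO-specific arguments such as `beta` and the reference model.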
You can find my full implementation of the SFT and DPO steps with the Transformers and TRL libraries in the following notebook: