Reviewed this week
Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
⭐Understanding the performance gap between online and offline alignment algorithms
LoRA Learns Less and Forgets Less
⭐RLHF Workflow: From Reward Modeling to Online RLHF
⭐: Papers that I particularly recommend reading.
The yearly subscription is now 35% off. This promotion is available until May 28th.
New code repositories:
I maintain a curated list of AI code repositories here:
Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
Contrastive learning is recognized as one of the most effective techniques for training text embedding models. It works by minimizing the distance between positive pairs and maximizing the distance between negative pairs to learn semantic representations of text. In an article for The Kaitchup, we saw how contrastive learning can be used to turn Llama 3 into a text embedding model:
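As a refresher, here is a minimal sketch of an in-batch InfoNCE-style contrastive loss, the kind of objective commonly used for embedding training. The function and variable names are mine, not from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    # query_emb, pos_emb: (batch_size, dim) embeddings of paired texts
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    # Cosine similarity of every query against every passage in the batch
    logits = q @ p.T / temperature
    # The matching passage for query i sits at index i; all other rows act as in-batch negatives
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Usage: in practice the embeddings would come from a text encoder (e.g., Llama 3 with pooling)
queries = torch.randn(8, 768)
positives = torch.randn(8, 768)
loss = info_nce_loss(queries, positives)
```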
Recent advancements in text embedding largely rely on a two-stage pre-train/fine-tune pipeline. Pre-training uses weakly supervised data from large-scale web crawls, and fine-tuning involves refining the model with high-quality text pairs sourced through data mining or manual annotation.
This paper introduces Piccolo2, which employs a multi-task hybrid training method using textual data and labels of varying granularities. For instance, labels from semantic textual similarity (STS) tasks are typically more fine-grained than those from retrieval tasks.
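To make the idea concrete, here is a hypothetical sketch of how losses of different granularities could be mixed in one training step, reusing the info_nce_loss function from the sketch above. This illustrates the principle of matching the loss to the label granularity; it is not the exact Piccolo2 objective, and the function names and weights are my own:

```python
import torch.nn.functional as F

def sts_regression_loss(emb_a, emb_b, gold_scores):
    # Fine-grained STS labels: continuous similarity scores (e.g., normalized to [0, 1])
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1)
    return F.mse_loss(cos, gold_scores)

def hybrid_step(batch, encoder, retrieval_weight=1.0, sts_weight=1.0):
    # Route each sub-batch to the loss that matches its label granularity
    loss = 0.0
    if "retrieval" in batch:  # coarse, pairwise relevance labels -> contrastive loss
        q = encoder(batch["retrieval"]["queries"])
        p = encoder(batch["retrieval"]["passages"])
        loss = loss + retrieval_weight * info_nce_loss(q, p)
    if "sts" in batch:  # fine-grained similarity scores -> regression-style loss
        a = encoder(batch["sts"]["sent_a"])
        b = encoder(batch["sts"]["sent_b"])
        loss = loss + sts_weight * sts_regression_loss(a, b, batch["sts"]["scores"])
    return loss
```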
The authors released their code here:
GitHub: hjq133/piccolo-embedding
⭐Understanding the performance gap between online and offline alignment algorithms
The key question addressed in this paper is whether online Reinforcement Learning (RL) is better than offline algorithms for AI alignment.
Offline algorithms are simpler and cheaper to implement than canonical online Reinforcement Learning from Human Feedback (RLHF), which involves preference modeling and on-policy sampling. Demonstrating the sufficiency of offline algorithms could simplify the path to AI alignment. Conversely, demonstrating the advantages of online RLHF could highlight the fundamental role of online interactions and address key challenges in offline alignment.
Comparing online and offline algorithms presents challenges due to differences in implementation and computational demands. Online algorithms are typically more computationally intensive because they require sampling and training an additional reward model.
This study uses the KL divergence between the RLHF policy and a reference SFT policy as a measure of budget, allowing for a unified comparison across different algorithms and settings.
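The paper's exact measurement setup isn't reproduced here, but as a rough illustration, the KL divergence to the reference policy can be estimated in a Monte Carlo fashion from responses sampled from the policy. The model objects and names below are assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_kl(policy_model, ref_model, input_ids, attention_mask):
    """Estimate KL(policy || reference SFT policy) on sequences sampled from the policy."""
    p_logits = policy_model(input_ids, attention_mask=attention_mask).logits[:, :-1]
    r_logits = ref_model(input_ids, attention_mask=attention_mask).logits[:, :-1]
    targets = input_ids[:, 1:]
    p_logp = F.log_softmax(p_logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    r_logp = F.log_softmax(r_logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = attention_mask[:, 1:].float()
    # Monte Carlo estimate: log pi(y|x) - log pi_ref(y|x), summed over tokens, averaged over samples
    return ((p_logp - r_logp) * mask).sum(dim=-1).mean()
```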
The results show that online algorithms generally outperform offline algorithms at the same optimization budget, measured by KL divergence. Online algorithms achieve higher peak performance across different levels of KL divergence, making them a Pareto improvement over offline algorithms.
LoRA Learns Less and Forgets Less
This study compares Low-Rank Adaptation (LoRA) and full finetuning for Llama-2 7B (and in some cases, 13B) models. Two training regimes are explored in each of two target domains, code and math: instruction fine-tuning using question-answer datasets, and continued pre-training on billions of unlabeled tokens. The datasets used include Magicoder-Evol-Instruct-110K, MetaMathQA, StarCoder-Python, and OpenWebMath.
Performance in the target domains is evaluated using coding and math benchmarks (HumanEval and GSM8K), while performance in source domains is assessed on tasks related to language understanding, world knowledge, and common-sense reasoning.
Key findings include:
For code, LoRA significantly underperforms compared to full finetuning.
For math, LoRA performs closer to full finetuning but requires longer training.
LoRA better preserves source-domain performance compared to full finetuning.
Both LoRA and full finetuning exhibit a similar tradeoff between target-domain learning and source-domain forgetting.
In some cases, particularly for code, LoRA can achieve comparable learning with less forgetting.
LoRA provides stronger regularization than classic methods like dropout and weight decay, maintaining a diversity of solutions more similar to the base model.
The study also investigates why LoRA underperforms full finetuning. Full finetuning results in high-rank perturbations to the base model's weight matrices, whereas LoRA constrains its updates to be low-rank by construction, which may explain the performance gap in complex domains like coding and math.
Best practices for training with LoRA are proposed, highlighting its sensitivity to the learning rate and the significant impact of target module selection and rank configuration on performance, which is not really surprising.
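For reference, this is what a typical LoRA setup looks like with the peft library. The rank, alpha, dropout, and target modules below are illustrative values I chose, not the paper's exact configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,             # rank of the low-rank update, one of the key hyperparameters
    lora_alpha=32,    # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # targeting all linear layers
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# LoRA is sensitive to the learning rate, so a small sweep (e.g., around 1e-4 to 2e-4) is advisable
```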
⭐RLHF Workflow: From Reward Modeling to Online RLHF
This technical report explains the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF). This is one of the best in-depth explanations that I have read on how to set up and run an online iterative RLHF pipeline. If you work with chat models, it’s a must-read.
Recent literature on LLMs indicates that this approach significantly outperforms its offline counterpart. However, most existing open-source RLHF projects are still primarily focused on offline learning.
Since online human feedback is generally impractical for open-source communities with limited resources, the authors begin by constructing preference models using a diverse range of open-source datasets. These models serve as proxies to approximate human feedback. The report then explores the theoretical foundations and algorithmic principles of online iterative RLHF, followed by a thorough practical implementation.
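To give an idea of the overall loop, here is a compressed sketch of what one round of online iterative preference optimization could look like. The callables and names are placeholders of my own, not the authors' implementation (their actual code is in the repository linked below):

```python
from typing import Callable, Dict, List

def iterative_dpo_round(
    generate: Callable[[str], str],            # sample one response from the current policy
    score: Callable[[str, str], float],        # proxy preference/reward model score
    dpo_update: Callable[[List[Dict]], None],  # one DPO training pass on the new pairs
    prompts: List[str],
    n_samples: int = 8,
) -> List[Dict]:
    """One round: sample on-policy, rank with the proxy preference model,
    then update the policy on best-vs-worst pairs."""
    pairs = []
    for prompt in prompts:
        # 1. Sample several candidate responses from the current policy
        responses = [generate(prompt) for _ in range(n_samples)]
        # 2. Score them with the preference model standing in for human feedback
        scores = [score(prompt, r) for r in responses]
        chosen = responses[scores.index(max(scores))]
        rejected = responses[scores.index(min(scores))]
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    # 3. Run a direct preference optimization step on the freshly collected pairs
    dpo_update(pairs)
    return pairs
```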
The trained model, SFR-Iterative-DPO-LLaMA-3-8B-R, exhibits impressive performance on various LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as academic benchmarks such as HumanEval and TruthfulQA. The findings demonstrate that supervised fine-tuning (SFT) combined with iterative RLHF can achieve state-of-the-art results using entirely open-source datasets. Furthermore, the models, curated datasets, and comprehensive step-by-step code guidebooks are publicly available.
They published their code here:
GitHub: RLHFlow/Online-RLHF
If you are interested in knowing the details of how RLHF works, have a look at this article:
If you have any questions about one of these papers, write them in the comments. I will answer them.