This week, I reviewed the following papers:
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
GeAR: Generation Augmented Retrieval
⭐Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Scaling Laws for Floating Point Quantization Training
⭐: Papers that I particularly recommend reading.
New code repositories (list of all repositories):
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
This paper introduces a new approach to self-improvement for LLMs that leverages a multiagent framework. Instead of fine-tuning a single model, the proposed method fine-tunes multiple models derived from the same base model, assigning each a distinct role to promote specialization and diversity. The models are trained on independent subsets of the generated data, allowing each to focus on specific aspects of the tasks. The multiagent system has two key components: generation agents that produce initial responses, and critic agents that evaluate and refine these responses. The agents interact in a structured feedback loop.
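To make the loop concrete, here is a minimal sketch of one round of such multiagent self-improvement. The interfaces (generator and critic callables, the consensus filter, the `finetune` hook) are my own illustrative assumptions, not the authors' implementation:

```python
from collections import Counter

def multiagent_round(generators, critics, prompts, finetune):
    """One round of multiagent self-improvement (illustrative sketch).

    generators: list of callables prompt -> response (generation agents)
    critics:    list of callables (prompt, candidate_responses) -> refined response
    finetune:   callable (agent_index, dataset) -> updated generator
    """
    finetune_sets = {i: [] for i in range(len(generators))}

    for prompt in prompts:
        # Generation agents independently produce initial responses.
        candidates = [g(prompt) for g in generators]

        # Critic agents evaluate and refine the candidate responses.
        refined = [c(prompt, candidates) for c in critics]

        # Simple consensus filter (an assumption): keep the most common refined
        # answer and route each agreeing candidate back to the agent that wrote it.
        consensus, _ = Counter(refined).most_common(1)[0]
        for i, response in enumerate(candidates):
            if response == consensus:
                finetune_sets[i].append((prompt, response))

    # Each agent is fine-tuned only on its own subset of generated data,
    # which is what preserves specialization and diversity across agents.
    return [finetune(i, finetune_sets[i]) for i in range(len(generators))]
```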
The proposed multiagent framework overcomes a key limitation of single-agent self-improvement by delivering consistent gains across multiple rounds of fine-tuning. The experiments in the paper demonstrate its effectiveness on various reasoning tasks, with significant performance improvements, including zero-shot generalization to new datasets.
The authors released their code here:
GitHub: vsubramaniam851/multiagent-ft
GeAR: Generation Augmented Retrieval
Retrieval systems in RAG often rely on bi-encoder models that encode queries and documents separately into vector representations for similarity calculations. However, these models face significant challenges, including the limited expressiveness of scalar similarity scores, difficulty in identifying query-relevant sections in long documents, and the need for fine-grained understanding in tasks like sentence selection and search result highlighting.
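As a reminder of what GeAR is departing from, this is roughly what a bi-encoder retriever computes: one embedding per query, one per document, and a single scalar similarity per pair. The snippet below uses sentence-transformers only as an illustration; the model name is arbitrary and none of this is GeAR's code:

```python
# Standard bi-encoder retrieval: queries and documents are encoded separately,
# and each pair is compared with a single cosine similarity score.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary example model

query = "Who proposed the theory of general relativity?"
documents = [
    "Albert Einstein published the theory of general relativity in 1915.",
    "The Great Wall of China is visible from low Earth orbit.",
]

q_emb = model.encode([query])
d_emb = model.encode(documents)

# One scalar per (query, document) pair: this single score is all the
# retriever exposes, which is the expressiveness limitation discussed above.
scores = util.cos_sim(q_emb, d_emb)  # shape (1, 2); higher means more similar
print(scores)
```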
To address these limitations, the paper introduces GeAR (Generation-Augmented Retrieval), a new approach that integrates retrieval and fine-grained text localization capabilities. GeAR utilizes a combination of contrastive learning and a text decoder to optimize query-document similarity while also generating fine-grained, query-relevant information from documents. This dual capability improves retrieval precision and helps users better interpret the results.
The authors highlight several challenges in implementing this approach, including a lack of existing data to support the method, the complexity of integrating retrieval and generation tasks, and the need for effective training strategies. To overcome these challenges, they developed a pipeline for data synthesis, model design, and training.
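To make the dual objective more concrete, here is a rough sketch of how a contrastive retrieval loss can be combined with a generation loss, in the spirit of what the paper describes. The exact architecture, fusion mechanism, and loss weighting in GeAR may differ; this is only an illustration in PyTorch:

```python
import torch
import torch.nn.functional as F

def gear_style_loss(q_emb, d_emb, gen_logits, gen_labels, temperature=0.05, alpha=1.0):
    """Illustrative combination of a contrastive retrieval loss with a
    generation loss, in the spirit of GeAR's dual objective (sketch only).

    q_emb, d_emb: (batch, dim) query / document embeddings
    gen_logits:   (batch, seq_len, vocab) decoder logits for the
                  query-relevant text to be generated
    gen_labels:   (batch, seq_len) target token ids (-100 = ignore)
    """
    # In-batch contrastive loss: each query should match its own document.
    q_emb = F.normalize(q_emb, dim=-1)
    d_emb = F.normalize(d_emb, dim=-1)
    sim = q_emb @ d_emb.T / temperature                  # (batch, batch)
    targets = torch.arange(sim.size(0), device=sim.device)
    contrastive = F.cross_entropy(sim, targets)

    # Generation loss: the decoder learns to produce the fine-grained,
    # query-relevant text from the document.
    generation = F.cross_entropy(
        gen_logits.flatten(0, 1), gen_labels.flatten(), ignore_index=-100
    )

    return contrastive + alpha * generation
```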
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
While RLHF methods, such as proximal policy optimization (PPO) with a bandit reward model, have been successful, they suffer from sparse rewards because preferences are assigned to entire text sequences. This sparse reward problem increases gradient variance, lowers sample efficiency, and complicates credit assignment across the generated tokens.
The authors propose a segment-level reward model to mitigate these issues, offering a middle ground between token-level and full-sequence reward assignment. Unlike token-level rewards, which are limited by the fine granularity of subword tokens, segment-level rewards provide feedback on semantically meaningful text segments, improving both the density and the semantic relevance of the feedback. Segmentation is performed dynamically using entropy thresholds on the model's predictive distributions, identifying segments whose tokens contribute meaningfully to the sequence.
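A minimal sketch of what entropy-based segmentation could look like is shown below, under the assumption that a new segment starts whenever the policy's predictive entropy at a token exceeds a threshold; the paper's exact segmentation rule may differ:

```python
import torch

def entropy_segments(token_logits, threshold=2.0):
    """Split a generated sequence into segments using predictive entropy.

    Sketch only: a new segment is assumed to start at every token whose
    predictive distribution has entropy above `threshold`.

    token_logits: (seq_len, vocab) logits the policy produced at each step
    returns: list of (start, end) index pairs, end exclusive
    """
    probs = torch.softmax(token_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # (seq_len,)

    segments, start = [], 0
    for t in range(1, token_logits.size(0)):
        if entropy[t] > threshold:       # high uncertainty -> segment boundary
            segments.append((start, t))
            start = t
    segments.append((start, token_logits.size(0)))
    return segments
```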
The segment-level reward model is trained on sequence-level preference labels: segment rewards are aggregated into a parameterized sequence score that is optimized with a Bradley-Terry loss. The reward model is then integrated into PPO-based policy optimization, with enhancements such as generalized reward normalization functions and within-segment reward interpolation to further densify the training signal.
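Here is a small sketch of that training objective, assuming segment rewards are simply summed into the sequence score (one possible parameterization; the paper's aggregation function may be more elaborate):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(seg_rewards_chosen, seg_rewards_rejected):
    """Sketch of training a segment-level reward model from sequence-level
    preference labels.

    seg_rewards_*: (num_segments,) predicted rewards for each segment of
                   the chosen / rejected response
    """
    # Aggregate segment rewards into one sequence score (simple sum here).
    score_chosen = seg_rewards_chosen.sum()
    score_rejected = seg_rewards_rejected.sum()
    # Bradley-Terry negative log-likelihood of the observed preference.
    return -F.logsigmoid(score_chosen - score_rejected)
```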
Experiments on RLHF benchmarks, including AlpacaEval 2.0, Arena-Hard, and MT-Bench, demonstrate that the segment-level reward model outperforms classical bandit and token-level reward approaches. The paper also presents several ablation studies that help explain why the approach improves results.
Scaling Laws for Floating Point Quantization Training
The paper investigates scaling laws for floating-point quantized training of LLMs, with the goal of better predicting model performance under various precision settings. It builds on previous scaling-law research, which emphasized the influence of model size and the number of training tokens, and addresses the limitations of those approaches by incorporating finer-grained factors such as the exponent bits, mantissa bits, and block size used in floating-point (FP) quantization.
The authors propose a new scaling-law formula that accounts for the general training loss tied to model and data sizes, as well as the additional loss caused by the information lost during low-precision FP quantized training. Their extensive experiments show that quantizing weights has a relatively minor effect on performance compared to quantizing activations, which are more sensitive during gradient computation. The findings also show that indefinitely increasing the data size can degrade performance under low precision, while larger model sizes, higher precision settings, and smaller block sizes improve the effective use of trained tokens.
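To illustrate the shape of such a law, here is a toy evaluation function. The additive structure (Chinchilla-style terms plus a penalty that grows with data size and shrinks with model size, exponent/mantissa bits, and smaller blocks) mirrors the qualitative findings, but the functional form and coefficients are my own assumptions, not the authors' fitted law:

```python
def quantized_training_loss(N, D, E, M, B, p):
    """Toy scaling-law evaluation for FP-quantized training (illustrative only).

    N: number of parameters, D: number of training tokens,
    E/M: exponent/mantissa bits, B: quantization block size,
    p: dict of coefficients (hypothetical values below).
    """
    # Classic terms: bigger models and more data lower the loss.
    base = p["a"] / N ** p["alpha"] + p["b"] / D ** p["beta"] + p["eps"]

    # Quantization penalty (assumed form): more data amplifies low-precision
    # information loss, while more parameters, more exponent/mantissa bits,
    # and smaller blocks reduce it.
    quant = p["c"] * D ** p["gamma"] * B ** 0.5 / (N ** p["delta"] * (E + 1) * (M + 1))

    return base + quant

# Hypothetical coefficients, just to show how the trade-offs move the loss.
coeffs = dict(a=400.0, alpha=0.34, b=900.0, beta=0.28, eps=1.7,
              c=0.1, gamma=0.5, delta=0.5)
print(quantized_training_loss(N=1e9, D=100e9, E=4, M=3, B=32, p=coeffs))
print(quantized_training_loss(N=1e9, D=100e9, E=5, M=2, B=128, p=coeffs))
```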
The exponent bits in FP quantization contribute slightly more to performance than the mantissa bits, and the optimal precision for balancing compute cost and performance lies between 4 and 8 bits.
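For readers less familiar with FP formats, here is a tiny decoder that shows what exponent and mantissa bits buy: exponent bits set the dynamic range, mantissa bits set the resolution. It assumes an IEEE-like bias and ignores subnormals and special values; it is only meant to make the trade-off concrete, not code from the paper:

```python
def fp_value(exp_field, mant_field, E, M):
    """Decode a normal number in a generic FP format with E exponent bits and
    M mantissa bits (IEEE-like bias, subnormals/specials ignored)."""
    bias = 2 ** (E - 1) - 1
    return (1 + mant_field / 2 ** M) * 2.0 ** (exp_field - bias)

# With a fixed 8-bit budget (1 sign bit + E + M = 8), moving a bit from the
# mantissa to the exponent widens the range but coarsens the precision:
print(fp_value(2 ** 4 - 2, 2 ** 3 - 1, E=4, M=3))  # largest normal, E4M3-like: 240.0
print(fp_value(2 ** 5 - 2, 2 ** 2 - 1, E=5, M=2))  # largest normal, E5M2-like: 57344.0
```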