Reviewed this week
⭐Jamba: A Hybrid Transformer-Mamba Language Model
sDPO: Don't Use Your Data All at Once
Long-form factuality in large language models
The Unreasonable Ineffectiveness of the Deeper Layers
⭐: Papers that I particularly recommend reading.
New code repository:
LongFact: A set of prompts to trigger the generation of long-form responses
SAFE: A framework to automatically evaluate the factuality of long-form responses
I maintain a curated list of AI code repositories here:
Jamba: A Hybrid Transformer-Mamba Language Model
State space models (SSMs), such as Mamba, train more efficiently than recurrent neural networks (RNNs) and handle long-range dependencies in the data well. In this respect, their advantages are very similar to RWKV's.
Among these alternatives to the Transformer, Jamba is a new hybrid that interleaves Transformer and Mamba layers to get the best of both: lower memory use, efficient training, and the ability to handle very long contexts.
Jamba also incorporates Mixture of Experts (MoE) layers, which increase its capacity without a proportional increase in compute and make it possible to train exceptionally large models. Applying an MoE layer every other layer, with 16 experts each, significantly expands Jamba's parameter count while keeping the compute active per token manageable.
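To make the layer arrangement concrete, here is a minimal sketch of a Jamba-style block layout. The 16 experts and the MoE-on-every-other-layer pattern come from the paper, as does the one-attention-layer-per-eight-layers ratio; the function name, defaults, and labels are purely illustrative.

```python
def jamba_block_layout(n_layers=8, attn_every=8, moe_every=2, n_experts=16):
    """Return (mixer, ffn) type labels for each layer of a Jamba-style block:
    one attention layer out of every `attn_every` layers (the rest are Mamba),
    and an MoE feed-forward on every `moe_every`-th layer (the rest a dense MLP)."""
    layout = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_every == attn_every - 1 else "mamba"
        ffn = f"moe({n_experts} experts)" if i % moe_every == 1 else "dense_mlp"
        layout.append((mixer, ffn))
    return layout

# Print the resulting pattern for one 8-layer block.
for i, (mixer, ffn) in enumerate(jamba_block_layout()):
    print(f"layer {i}: {mixer:9s} + {ffn}")
```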
Evaluation across numerous benchmarks positions Jamba competitively against other large models, even outperforming some on long-context tasks, while remaining efficient. For instance, it achieves three times the throughput of comparably sized models on long contexts and can run contexts of more than 128,000 tokens on a single GPU.
The model is available here:
Note: This new type of model might be a significant advance in AI. I'll publish an extensive review of it later this month, and I'll try the model.
sDPO: Don't Use Your Data All at Once
To align LLMs with human preferences, Direct Preference Optimization (DPO) typically uses the supervised fine-tuned (SFT) instruct model as its reference model, even though this is a suboptimal choice whose preferences may diverge from the ones we want to align to.
The SFT model only sets a lower bound for DPO: a reference model that is already better aligned should make DPO training more effective by providing a stronger baseline for alignment. One option is to use an existing open-source model that has already been aligned.
However, such models are not always available, and relying on an external reference model means giving up control over it, which can raise safety concerns. To address this, the authors introduce stepwise DPO (sDPO), in which the preference data is split into chunks and used progressively over several DPO training steps. At each step, the reference model is the aligned model produced by the previous step, so the alignment baseline improves step by step.
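Here is a minimal sketch of the sDPO loop as I understand it, assuming a hypothetical train_dpo helper that runs one round of standard DPO training; only the data partitioning and the reference-model update reflect the paper.

```python
def sdpo(sft_model, preference_data, n_steps=2):
    # Split the preference dataset into chunks, one per training step
    # (the paper partitions it, e.g., by dataset of origin).
    chunk_size = len(preference_data) // n_steps
    chunks = [preference_data[i * chunk_size:(i + 1) * chunk_size]
              for i in range(n_steps)]

    # Step 1 uses the SFT model as both the starting policy and the reference
    # (in practice the reference would be a frozen copy of the policy).
    policy = sft_model
    reference = sft_model
    for chunk in chunks:
        # train_dpo is a hypothetical helper: one round of DPO on this chunk.
        policy = train_dpo(policy, ref_model=reference, dataset=chunk)
        # The aligned model becomes the reference for the next step.
        reference = policy
    return policy
```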
Their findings indicate that sDPO leads to the development of models with superior alignment.
Long-form factuality in large language models
In this paper, Google DeepMind introduces LongFact, a new set of prompts, alongside an evaluation method called SAFE and a new metric, F1@K, all designed to assess the long-form factuality of responses generated by LLMs.
They used GPT-4 to create LongFact, a collection of 2,280 prompts that aim to trigger long-form responses across 38 carefully chosen topics. LongFact is released here:
GitHub: LongFact
The SAFE method automatically evaluates the factuality of long-form responses: it breaks a response down into individual facts, uses the language model to generate fact-checking queries for the Google Search API, and rates each fact as supported or not based on the search results.
SAFE compares favorably with human annotators: it agrees with 72% of human annotations and, on a sample of the cases where they disagreed, SAFE's rating turned out to be correct 76% of the time, all while being significantly cheaper.
GitHub: SAFE
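Below is a rough sketch of the SAFE pipeline; the llm and google_search helpers are hypothetical stand-ins for the rater model and the Google Search API, and the prompts are paraphrased, not the ones used in the paper.

```python
def safe_rate_response(prompt, response, max_queries=5):
    # 1. Split the long-form response into individual, self-contained facts.
    facts = llm(f"List each individual fact stated in:\n{response}").splitlines()

    ratings = {}
    for fact in facts:
        # (The full method also checks that the fact is relevant to the prompt
        #  before rating it; omitted here for brevity.)
        # 2. Ask the model for search queries to verify the fact, issue them
        #    against Google Search, and collect the results as evidence.
        evidence = []
        for _ in range(max_queries):
            query = llm(f"Write a Google query to verify: {fact}\n"
                        f"Results gathered so far: {evidence}")
            evidence.append(google_search(query))
        # 3. Let the model judge whether the evidence supports the fact.
        verdict = llm(f"Fact: {fact}\nEvidence: {evidence}\n"
                      f"Answer 'supported' or 'not supported'.")
        ratings[fact] = verdict.strip()
    return ratings
```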
The F1@K metric aggregates long-form factuality into a single number: precision is the fraction of facts in the response that are supported, and recall is measured against K, the number of supported facts a user considers ideal for a response. K therefore lets the metric adjust the expected amount of factual content to a human-preferred standard.
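A small sketch of how F1@K can be computed, following the description above; the function name and the example numbers are mine.

```python
def f1_at_k(num_supported, num_not_supported, k):
    total = num_supported + num_not_supported
    if num_supported == 0 or total == 0:
        return 0.0
    precision = num_supported / total      # fraction of stated facts that are supported
    recall = min(num_supported / k, 1.0)   # supported facts relative to the target K
    return 2 * precision * recall / (precision + recall)

# Example: a response with 40 supported and 10 unsupported facts, evaluated at K=64.
print(f1_at_k(40, 10, 64))  # ~0.70
```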
In their experiments, they show that larger models tend to exhibit better performance in producing factually accurate long-form content.
The Unreasonable Ineffectiveness of the Deeper Layers
This work presents a simple layer-pruning strategy for LLMs: it selects a block of consecutive layers to remove by measuring how similar the representations are before and after that block, reducing model size while retaining performance.
Once these layers are pruned, a short QLoRA fine-tuning "heals" the model and recovers most of the performance lost to pruning.
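Below is a minimal sketch of the layer-selection heuristic, assuming hidden states collected over a small calibration set (e.g., with output_hidden_states=True in Hugging Face transformers); the angular-distance criterion follows the paper's idea, while the helper names and details are simplified.

```python
import torch

def angular_distance(a, b, eps=1e-8):
    # a, b: (num_tokens, hidden_dim) representations at two different depths.
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1).clamp(-1 + eps, 1 - eps)
    return (torch.arccos(cos).mean() / torch.pi).item()

def best_block_to_prune(hidden_states, n):
    """hidden_states: one tensor per layer boundary, gathered over a small
    calibration set. Returns the start index of the n-layer block whose
    removal changes the representations the least."""
    distances = [angular_distance(hidden_states[l], hidden_states[l + n])
                 for l in range(len(hidden_states) - n)]
    return min(range(len(distances)), key=distances.__getitem__)
```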
They found that significant portions of the deeper layers can be removed without severely impacting the model's performance. For instance, in Llama-2-70B, nearly half of the layers can be eliminated before a notable performance drop appears. The fact that large sections of the network can be removed with so little impact suggests that those sections were not essential.
They also measured the similarity between layers at different depths and found that deeper layers tend to resemble each other more closely than the shallower ones, with the exception of the very last layer, suggesting that LLMs may not be fully utilizing their deeper layers.
If you have any questions about one of these papers, write them in the comments. I will answer them.