Reviewed this week:
⭐Balancing Pipeline Parallelism with Vocabulary Parallelism
DELIFT: Data Efficient Language model Instruction Fine Tuning
Counterfactual Generation from Language Models
⭐: Papers that I particularly recommend reading.
For Black Friday, I’m offering a 30% discount on the yearly subscription to The Kaitchup:
With this subscription, you get instant access to all the AI notebooks (120+) and all the articles and tutorials (200+).
The same discount also applies to The Salt:
These offers expire on November 30th.
New code repositories (list of all repositories):
⭐Balancing Pipeline Parallelism with Vocabulary Parallelism
As transformer models continue to scale, model parallelism techniques have become standard to manage their computational and memory costs. Key approaches include Zero Redundancy Optimizer (ZeRO), Tensor Parallelism (TP), and Pipeline Parallelism (PP), each with specific benefits and trade-offs. ZeRO minimizes memory by removing redundant parameter storage but faces communication challenges. TP distributes parameters across devices but often also requires substantial inter-device communication. PP is particularly efficient due to low communication costs and high arithmetic intensity, making it well-suited for large models. However, PP struggles with pipeline bubbles—idle periods that reduce computational efficiency—and high memory consumption, especially when storing activations.
This paper identifies an overlooked issue in PP: the imbalanced computation and memory load caused by vocabulary-related layers. When input and output layers are placed only in the first and last stages of the pipeline, it creates workload imbalances that lead to inefficiencies and memory bottlenecks, especially as vocabulary sizes increase.
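A quick back-of-the-envelope calculation shows why this imbalance matters. The dimensions below are borrowed from Llama 3 8B for illustration (attention parameters are approximated as 4·h², ignoring grouped-query attention); the exact figures are not from the paper:

```python
# Illustrative parameter counts (Llama-3-8B-like dimensions, approximate):
hidden, vocab, ffn = 4096, 128_256, 14_336

# One transformer block: attention (~4 * h^2, ignoring GQA)
# plus a gated FFN (3 * h * ffn); biases and norms ignored.
block_params = 4 * hidden**2 + 3 * hidden * ffn

# The (untied) output projection alone:
vocab_params = hidden * vocab

print(f"transformer block : {block_params / 1e6:.0f}M params")
print(f"output projection : {vocab_params / 1e6:.0f}M params")
print(f"ratio             : {vocab_params / block_params:.1f}x")
```

With a 128k-token vocabulary, the output projection alone weighs roughly as much as two transformer blocks, so the pipeline stage that hosts it does far more work than the others.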
To solve this, the authors propose Vocabulary Parallelism, which distributes vocabulary layers across pipeline stages to balance computation and memory usage more effectively. Through optimized scheduling, Vocabulary Parallelism minimizes pipeline inefficiencies with minimal memory overhead. Experiments show this approach improves throughput by up to 51%.
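The core idea of sharding the output layer along the vocabulary dimension can be sketched in a few lines. This is a toy single-process illustration, not the paper's implementation: each "stage" holds a slice of the output embedding matrix, computes partial logits, and the full softmax is recovered with a max/sum reduction (an all-reduce in a real distributed setting). All dimensions are made up:

```python
import numpy as np

hidden, vocab, stages = 64, 1000, 4
rng = np.random.default_rng(0)
h = rng.normal(size=(hidden,))         # hidden state for one token
W = rng.normal(size=(hidden, vocab))   # full output embedding matrix

# Each stage computes logits only for its vocabulary shard...
shards = np.array_split(W, stages, axis=1)
partial_logits = [h @ shard for shard in shards]

# ...and the softmax is reconstructed from per-shard statistics.
global_max = max(p.max() for p in partial_logits)
denom = sum(np.exp(p - global_max).sum() for p in partial_logits)
probs = np.concatenate([np.exp(p - global_max) / denom for p in partial_logits])

# Sanity check: identical to the unsharded softmax.
full = np.exp(h @ W - (h @ W).max())
assert np.allclose(probs, full / full.sum())
```

Because each shard only needs its own max and partial sum, the reduction exchanges a couple of scalars per stage rather than the full logit vector, which is what keeps the communication cost low.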
The authors released their code:
GitHub: sail-sg/VocabularyParallelism
DELIFT: Data Efficient Language model Instruction Fine Tuning
DELIFT (Data Efficient Language model Instruction Fine-Tuning) is an algorithm developed to streamline data selection across all fine-tuning stages. It introduces a pairwise utility metric that measures the informational value of each data sample given the model's current knowledge and the needs of the target task. By prioritizing the most impactful samples, DELIFT reduces the total amount of data required without compromising the model's performance.
The fine-tuning process has three main stages that DELIFT addresses:
Instruction Tuning: DELIFT selects diverse data to improve the model’s general instruction-following abilities.
Task-Specific Fine-Tuning: It focuses on task-relevant data.
Continual Fine-Tuning: It identifies and incorporates new information to expand the model's knowledge while protecting previously learned skills.
In each stage, DELIFT uses submodular optimization techniques to pick data that balances learning potential with computational efficiency, achieving strong results while using fewer resources. In in-context learning (ICL) settings, DELIFT can select examples that match the performance of using the entire dataset, further reducing data requirements.
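To make the submodular selection step concrete, here is a minimal sketch of the classic greedy facility-location algorithm over a pairwise utility matrix. Note that DELIFT's actual utility metric is derived from the model's predictions; the matrix `U` below is random stand-in data, where `U[i, j]` scores how much sample `i` helps the model handle sample `j`:

```python
import numpy as np

def greedy_facility_location(U, budget):
    """Greedily pick `budget` samples maximizing facility-location coverage."""
    n = U.shape[0]
    selected, coverage = [], np.zeros(n)
    for _ in range(budget):
        # Marginal gain of each candidate: how much it raises the best
        # utility achieved so far for every sample in the dataset.
        gains = np.maximum(U, coverage).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf          # never pick the same sample twice
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, U[best])
    return selected

rng = np.random.default_rng(0)
U = rng.random((20, 20))                   # toy pairwise utility matrix
subset = greedy_facility_location(U, budget=5)
print(subset)                              # 5 samples that best "cover" the rest
```

Facility location is monotone submodular, so this simple greedy loop carries a (1 − 1/e) approximation guarantee, which is what makes such selection both cheap and principled.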
DELIFT achieves up to a 70% reduction in data requirements and computational time compared to existing methods, while performing up to 26% better than standard approaches across a variety of tasks.
Counterfactual Generation from Language Models
This study applies concepts from causal reasoning to the interpretation of language models (LMs). It addresses the three levels of the causal hierarchy: association (observing patterns), intervention (changing variables to see their effects), and counterfactuality (imagining what could have happened differently). Because current LM research struggles to define counterfactuality precisely, the paper uses structural equation modeling to formalize it for LMs.
To examine how changes affect LMs, prior work intervenes on the model's internal structure, for instance to adjust specific concepts such as gender. However, such interventions alone don't answer counterfactual "what-if" questions about a particular generation. To address this, the researchers apply a method that separates the model's deterministic computation from its sampling randomness, letting them generate true counterfactuals: alternative completions of the same sample under modified conditions.
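The idea of separating determinism from randomness can be illustrated with the Gumbel-max parameterization of sampling. This toy sketch is not the paper's exact formulation: sampling a token is written as argmax(logits + g) for fixed Gumbel noise g, so keeping g fixed while intervening on the logits yields the counterfactual token, i.e. what this particular sample would have been under the modified model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 10
logits = rng.normal(size=vocab)            # stand-in for one step of LM logits

g = rng.gumbel(size=vocab)                 # exogenous noise, drawn once
factual = int(np.argmax(logits + g))       # the token actually sampled

intervened = logits.copy()                 # an illustrative intervention,
intervened[3] += 5.0                       # e.g. a steering edit boosting token 3
counterfactual = int(np.argmax(intervened + g))  # same noise, new logits

# Consistency: with no intervention, the counterfactual is the factual token.
assert int(np.argmax(logits + g)) == factual
```

Under this parameterization, argmax(logits + g) follows the softmax distribution over the logits, so the fixed noise g plays the role of the exogenous variable in a structural equation model.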
Testing this on models such as GPT-2 XL and Llama 3 8B shows that targeted adjustments, like editing knowledge or steering generation, often create unexpected side effects. For example, a gender-related change can alter unrelated parts of the text, showing how difficult it is to control LMs with precise interventions.