In The Weekly Salt, I review and analyze interesting AI papers published last week in plain English.
Reviewed this week
Resonance RoPE: Improving Context Length Generalization of Large Language Models
DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
⭐GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Teaching Large Language Models to Reason with Reinforcement Learning
⭐: Papers that I particularly recommend reading.
New code repositories:
Resonance RoPE: Improving RoPE to extend the context of LLMs
GaLore: Gradient Low-Rank Projection for pre-training LLMs on consumer hardware
I maintain a curated list of AI code repositories here:
Resonance RoPE: Improving Context Length Generalization of Large Language Models
Extending the context length of LLMs is a very active research area, with new ideas published every week. Last week, I reviewed Microsoft’s LongRoPE, which extends the context of LLMs to over 2 million tokens:
This week, a new work proposes Resonance RoPE, yet another approach building on RoPE. It improves the accuracy of position embeddings in the train-short-test-long (TSTL) setting, i.e., when a model is trained on short texts and evaluated on longer ones.
They identify that while minimizing errors in out-of-distribution (OOD) positions is important, it is equally crucial to refine the interpolation of position embedding features at these OOD positions.
Like LongRoPE, Resonance RoPE successfully reduces the generalization gap for a significant portion of position embedding features in language models in TSTL scenarios. This method is also fully compatible with existing RoPE and RoPE-based scaling solutions, boosting their effectiveness in TSTL cases without requiring extra computing power during training or inference phases.
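To give a rough intuition of the trick, here is a minimal sketch of the core idea as I understand it (the function and variable names are mine, not the authors’): standard RoPE assigns each feature pair a rotation wavelength of 2π/θ_i, and Resonance RoPE snaps that wavelength to the nearest integer so that the feature’s values repeat exactly and longer, unseen positions reuse feature values already encountered during training.

```python
import torch

def resonance_rope_thetas(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE frequencies: theta_i = base^(-2i/d) for each feature pair
    thetas = base ** (-torch.arange(0, dim, 2).float() / dim)
    # Wavelength of each feature, measured in tokens
    wavelengths = 2 * torch.pi / thetas
    # "Resonance": snap each wavelength to the nearest integer (at least 1)
    # so the feature repeats exactly within the training context
    rounded = torch.clamp(torch.round(wavelengths), min=1.0)
    # Recompute the rotation angles from the rounded wavelengths
    return 2 * torch.pi / rounded
```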
They only evaluated it on rather small context sizes (fewer than 50k tokens), but show that Resonance RoPE consistently improves over previous work.
The code is published on GitHub:
GitHub: sheryc/resonance_rope
DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
The Transformer’s multi-head self-attention is essential to its performance but requires considerable compute and memory, especially during inference. Recent architectures, such as RWKV, aim to simplify the Transformer by reducing its computational and memory complexity.
State space models (SSMs), in particular, use hidden states designed to manage long-range dependencies efficiently, allowing parallelized training and efficient inference. The hidden state carries information through time, which keeps the computational cost per token constant. However, prior SSM designs limit the hidden state flow to within the same layer, missing out on the hierarchical information available across layers.
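To make “carrying information through time” concrete, here is a deliberately minimal sketch of a linear SSM recurrence. Real architectures such as Mamba or RetNet use input-dependent parameters and parallel scan implementations, so treat this only as an illustration:

```python
import torch

def ssm_recurrence(x, A, B, C):
    """Minimal linear SSM: the hidden state h summarizes the past, so each
    step costs the same regardless of how long the sequence already is.
    Shapes: x (seq_len, d_in), A (d_state, d_state), B (d_state, d_in), C (d_out, d_state).
    """
    h = torch.zeros(A.shape[0])
    outputs = []
    for x_t in x:                 # recurrent (inference-style) form
        h = A @ h + B @ x_t       # fold the new input into the hidden state
        outputs.append(C @ h)     # read the output from the hidden state
    return torch.stack(outputs)
```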
This paper introduces DenseSSM to enhance the flow of hidden information between layers. By addressing the issue of hidden state degradation and integrating shallow-layer hidden states into deeper layers, DenseSSM preserves detailed information beneficial for the final output across various SSM types. This method keeps the benefits of SSMs, like parallel training and efficient inference, while offering substantial performance gains with a minimal increase in parameters.
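As a rough sketch of what a dense hidden connection could look like, under my reading of the paper (module and variable names are mine, not the official implementation’s): hidden states from the previous few layers are passed through lightweight transition modules and added to the current layer’s hidden state.

```python
import torch
import torch.nn as nn

class DenseHiddenFusion(nn.Module):
    """Hypothetical sketch: fuse the hidden states of the m previous
    (shallower) layers into the current layer's hidden state."""
    def __init__(self, d_state: int, m: int = 4):
        super().__init__()
        # one lightweight transition module per retained shallow layer
        self.transitions = nn.ModuleList([nn.Linear(d_state, d_state) for _ in range(m)])

    def forward(self, h_current, shallow_hiddens):
        # shallow_hiddens: hidden states collected from the m previous layers
        fused = h_current
        for proj, h_shallow in zip(self.transitions, shallow_hiddens):
            fused = fused + torch.relu(proj(h_shallow))  # inject shallow-layer information
        return fused
```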
In the experiments, this approach outperforms the conventional RetNet by up to 5% accuracy on standard benchmarks.
They released the code to pre-train models here:
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
For fine-tuning, LoRA can match full-rank fine-tuning in some cases, but it often falls short of it. For pre-training from scratch, previous work, e.g., ReLoRA, has shown that LoRA requires an initial phase of full-rank training as a warm-up before transitioning to optimization within a low-rank subspace. This limitation may arise because optimal weight matrices are not inherently low-rank, or because the reparameterization alters the gradient dynamics of training.
To overcome these challenges, this work introduces Gradient Low-rank Projection (GaLore), a new approach enabling full-parameter learning while being more memory-efficient than traditional low-rank adaptation methods, like LoRA. GaLore focuses on exploiting the inherently low-rank structure of weight matrix gradients over time. It achieves this by applying two projection matrices to transform the gradient matrix into a low-rank form, thus significantly reducing memory usage associated with optimizer states. This method allows for occasional, computationally inexpensive updates to the projection matrices, offering up to 30% memory savings during pre-training compared to LoRA.
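Here is a minimal, single-weight-matrix sketch of the idea as I understand it; the official implementation wraps this into full optimizers, and I omit details such as Adam’s bias correction and per-layer scaling:

```python
import torch

def galore_step(W, grad, state, lr=1e-3, rank=128, update_gap=200,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of gradient low-rank projection for one weight matrix.
    The Adam moments live in the low-rank space (rank x n) instead of the
    full (m x n) space, which is where the memory savings come from."""
    # Refresh the projection from the gradient's top singular directions
    # only every `update_gap` steps, so the SVD cost is amortized
    if state.get("step", 0) % update_gap == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                 # (m, rank) projection matrix
    P = state["P"]

    R = P.T @ grad                               # low-rank gradient, (rank, n)
    state.setdefault("m", torch.zeros_like(R))
    state.setdefault("v", torch.zeros_like(R))
    state["m"] = beta1 * state["m"] + (1 - beta1) * R
    state["v"] = beta2 * state["v"] + (1 - beta2) * R ** 2
    update = state["m"] / (state["v"].sqrt() + eps)

    W -= lr * (P @ update)                       # project back to the full shape
    state["step"] = state.get("step", 0) + 1
```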
GaLore seems to be effective in both pre-training and fine-tuning. For example, pre-training a LLaMA-style 7B model on the C4 dataset with GaLore, alongside 8-bit optimization techniques, matches the performance of full-rank methods with substantially less memory—enabling such training on a single 24GB GPU without external memory offloading.
Additionally, when applied to fine-tune models on tasks like the GLUE benchmarks, GaLore outperforms other low-rank methods, including LoRA, in terms of average score.
As a gradient projection strategy, GaLore is compatible with a variety of optimizers and requires minimal code to integrate.
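For reference, integration looks roughly like the snippet below. This is based on my recollection of the repository’s README, so treat the package name, class name, and argument names as assumptions to verify against the repo:

```python
import torch.nn as nn
from galore_torch import GaLoreAdamW  # assumed: pip install galore-torch

# Toy stand-in for a model; in practice this would be your LLM
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# Apply GaLore only to the large 2D weight matrices; everything else keeps
# regular AdamW optimizer state
galore_params = [p for p in model.parameters() if p.dim() == 2]
other_params = [p for p in model.parameters() if p.dim() != 2]

param_groups = [
    {"params": other_params},
    {"params": galore_params, "rank": 128, "update_proj_gap": 200,
     "scale": 0.25, "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=1e-2)
```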
The official implementation of this “work in progress” is available here:
GitHub: jiaweizzhao/GaLore
Teaching Large Language Models to Reason with Reinforcement Learning
This research by Meta explores how reinforcement learning (RL) techniques can boost LLM reasoning across various reward schemes and model initializations, employing tasks defined by question-answer tuples.
They show that Expert Iteration (EI) consistently outperforms other RL algorithms in most scenarios, showing surprising sample efficiency comparable to more complex algorithms like Proximal Policy Optimization (PPO).
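Conceptually, Expert Iteration in this setting is simple: sample several candidate solutions per question, keep only those the reward signal marks as correct, fine-tune on them, and repeat. A rough sketch, where `sample`, `is_correct`, and `finetune` are hypothetical helpers rather than functions from the paper:

```python
def expert_iteration(model, dataset, n_rounds=3, k_samples=8):
    """Sketch of Expert Iteration for reasoning tasks with question-answer
    pairs; `sample`, `is_correct`, and `finetune` are hypothetical helpers."""
    for _ in range(n_rounds):
        expert_data = []
        for question, answer in dataset:
            candidates = [sample(model, question) for _ in range(k_samples)]
            # keep only the candidates whose final answer matches the reference
            expert_data += [(question, c) for c in candidates if is_correct(c, answer)]
        model = finetune(model, expert_data)  # supervised fine-tuning on the "expert" set
    return model
```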
Their analysis reveals that the deterministic nature of the tasks and the lack of sophisticated exploration during RL fine-tuning explain the competitive performance of EI and return-conditioned RL against PPO. The authors emphasize exploration as a critical area for future improvements in RL fine-tuning.
If you have any questions about one of these papers, write them in the comments. I will answer them.