LongRoPE: Towards Unlimited Context Length for the Transformer
Experiments with up to 2 million tokens
Transformer models have a limited context size that can be too small for a wide range of applications, such as summarization, information retrieval, or in-context learning with numerous examples.
A transformer model can’t accurately model a context longer than the sequences it has seen during training. To get better accuracy on longer sequences, we would have to increase the sequence length at training time. However, this is often impractical: training on long sequences is expensive, and long training examples are scarce.
Several methods have been proposed to generalize beyond the sequence lengths seen during training. ALiBi (Attention with Linear Biases) and RoPE (Rotary Position Embedding) are among the most popular, but they still have severe limitations that prevent them from handling contexts of millions of tokens.
In this article, I review LongRoPE.
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
LongRoPE is recent work by Microsoft that extends RoPE to far larger contexts. It shows promising performance, maintaining a low perplexity for contexts from 4,000 to 2 million tokens. LongRoPE can be applied to any LLM trained with RoPE (e.g., Llama 2, Mistral 7B, Mixtral-8x7B).
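To make the “trained with RoPE” requirement concrete, here is a minimal sketch (not from the LongRoPE paper) that inspects a model’s configuration with the Hugging Face transformers library. The model name is only an example; the attributes to look at are the rotary base frequency and the training-time context window.

```python
from transformers import AutoConfig

# Example: inspect the position embedding settings of a RoPE-based model.
# "mistralai/Mistral-7B-v0.1" is just one example of a model trained with RoPE.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Models trained with RoPE expose the rotary base frequency (rope_theta)
# and the context window they were trained with (max_position_embeddings).
print(config.model_type)               # "mistral"
print(config.max_position_embeddings)  # training-time context window
print(config.rope_theta)               # rotary base frequency used by RoPE
```

If a model’s configuration exposes these rotary settings, it is a candidate for context extension methods like LongRoPE.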
We will see what the main limitations of current methods are and how LongRoPE improves on RoPE to extend the LLM context window beyond 2 million tokens.
Since RoPE itself can be quite complex, the first section of this article is a short explainer of RoPE.