RWKV: As Good as the Transformer But Faster?
A review of the RWKV neural architecture and Eagle 7B
The Transformer model is the most widely used neural architecture, for two main reasons:
Training efficiency: The computation of attention can be easily parallelized
Scalability: A model with billions of parameters can learn from trillions of tokens
In addition, the transformer is well supported by most deep learning libraries, which makes creating new transformer models easy.
Nonetheless, the transformer is almost 7 years old, and recent usage with extremely long user inputs has started to expose some of its limits, in particular the cost of attention, which grows quadratically with the input length. Several alternatives have been proposed, but none has preserved the transformer's scaling ability and training efficiency.
The RWKV architecture is an exception. Recent RWKV models with billions of parameters have demonstrated performance similar to their transformer equivalents while being significantly more efficient at inference.
In this article, we will review the RWKV architecture to understand its main advantages over the transformer. Then, we will review Eagle 7B, a multilingual RWKV model, and see how to use it with Hugging Face Transformers.
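As a preview of that last step, here is a minimal sketch of loading Eagle 7B through the standard Transformers auto classes. The model id RWKV/v5-Eagle-7B-HF and the trust_remote_code flag are assumptions about how the checkpoint is published on the Hugging Face Hub; adjust them to the actual repository you use.

```python
# Minimal sketch: running Eagle 7B with Hugging Face Transformers.
# The repo id "RWKV/v5-Eagle-7B-HF" and the need for trust_remote_code=True
# are assumptions; check the model card for the exact values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/v5-Eagle-7B-HF"  # assumed Hugging Face repository name

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on a single GPU
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The RWKV architecture differs from the transformer because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a short continuation with greedy decoding
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

We will come back to this code and the model's memory requirements in the section dedicated to Eagle 7B.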