RWKV: As Good as the Transformer But Faster?

A review of the RWKV neural architecture and Eagle 7B

Benjamin Marie
Feb 13, 2024

The Transformer is the most widely used neural architecture for two main reasons:

  • Training efficiency: the attention computation can be parallelized across all positions of the sequence (see the sketch after this list)

  • Scalability: a model with billions of parameters can learn from trillions of tokens
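
To make the training-efficiency point concrete, here is a minimal NumPy sketch of causal self-attention. It is only an illustration, not production code: all T outputs come from a single batched matrix product, whereas a classic recurrent network must process the T tokens one after the other.

```python
import numpy as np

T, d = 8, 16                           # sequence length, hidden size
Q = np.random.randn(T, d)              # queries, keys, values for all tokens,
K = np.random.randn(T, d)              # normally produced by learned projections
V = np.random.randn(T, d)

# One (T, T) matrix product scores every pair of positions at once.
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask

# Row-wise softmax, then mix the values: all T outputs in parallel.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                      # shape (T, d)
```

This parallelism is what lets transformers saturate modern accelerators during training. The price is that attention costs grow quadratically with the sequence length.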

In addition, the transformer is very well supported by most deep learning libraries, which makes creating new transformer models easy.

Nonetheless, the transformer is almost 7 years old, and recent use cases with extremely long user inputs have started to expose some of its limits. Several alternatives have been proposed, but none of them could preserve the transformer's scaling ability and training efficiency.

The RWKV architecture is an exception. Recent RWKV models with billions of parameters have demonstrated performance similar to their transformer equivalents while being significantly more efficient at inference time.
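
The key to this efficiency is that RWKV can be evaluated as a recurrent network at inference time: it carries a fixed-size state instead of a key-value cache that grows with the context. As a rough illustration, here is the element-wise WKV recurrence of RWKV-4 in NumPy, a simplified sketch that ignores the numerical-stability tricks of real implementations (Eagle 7B uses the newer RWKV-5 variant, which follows the same principle with a richer state):

```python
import numpy as np

d = 16                        # number of channels
w = np.random.rand(d)         # per-channel decay rate (learned, kept positive)
u = np.random.randn(d)        # per-channel bonus for the current token (learned)

a = np.zeros(d)               # state numerator: decayed sum of exp(k_i) * v_i
b = np.zeros(d)               # state denominator: decayed sum of exp(k_i)

for t in range(8):            # stream tokens one at a time
    k = np.random.randn(d)    # key for token t (from a learned projection)
    v = np.random.randn(d)    # value for token t
    wkv = (a + np.exp(u + k) * v) / (b + np.exp(u + k))
    a = np.exp(-w) * a + np.exp(k) * v   # decay the state, then absorb token t
    b = np.exp(-w) * b + np.exp(k)
    # wkv feeds the rest of the block; (a, b) has a fixed size, so per-token
    # cost and memory stay constant no matter how long the context is.
```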


In this article, we will review the RWKV architecture to understand its main advantages over the transformer. Then, we will review Eagle 7B, a multilingual RWKV model, and see how to use it with Hugging Face Transformers.
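
As a preview of that last part, here is a hedged sketch of loading Eagle 7B with Transformers. The repository id "RWKV/v5-Eagle-7B-HF", the need for trust_remote_code, and the availability of a CUDA GPU are assumptions; check the RWKV organization on the Hugging Face Hub for the current checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/v5-Eagle-7B-HF"       # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,         # half precision to fit a single GPU
    trust_remote_code=True,
).to("cuda")

prompt = "The RWKV architecture is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```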
