Jamba: The New Hybrid Transformer/Mamba

Faster and better than the transformer but more difficult to train

Benjamin Marie
Apr 25, 2024

Generated with DALL-E

The transformer neural architecture is state-of-the-art. It scales very well, i.e., larger models learn better, and it is efficient to train thanks to the parallel computation of attention.

However, the transformer also has a few drawbacks, especially for inference. The computational cost of attention grows quadratically with the length of the sequence to process. Many techniques have been proposed to better handle long sequences, such as the ALiBi and RoPE positional encodings.

Related: LongRoPE: Towards Unlimited Context Length for the Transformer (The Salt, March 6, 2024)
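To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch in Python. The head count and fp16 precision are illustrative assumptions, not the configuration of any particular model:

```python
# Back-of-the-envelope scaling of self-attention cost with sequence length.
# The score matrix Q @ K^T has shape (seq_len, seq_len), so both the FLOPs
# and the memory needed to materialize it grow quadratically.

def attention_score_matrix_bytes(seq_len: int, num_heads: int = 32, dtype_bytes: int = 2) -> int:
    """Bytes needed to store the attention score matrices of one layer (fp16)."""
    return num_heads * seq_len * seq_len * dtype_bytes

for seq_len in (1_024, 8_192, 65_536):
    gib = attention_score_matrix_bytes(seq_len) / 1024**3
    print(f"seq_len={seq_len:>6}: ~{gib:,.1f} GiB of attention scores per layer")
```

Going from 1K to 64K tokens multiplies the sequence length by 64 but the attention score storage by 4,096, which is why naive attention becomes impractical for very long contexts.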

Alternative neural architectures have also been proposed, such as RWKV and Mamba, a state-space model (SSM). Both are attention-free and much more efficient for inference than the transformer, but they still underperform it in terms of accuracy.

Related: RWKV: As Good as the Transformer But Faster? (The Salt, February 13, 2024)
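The inference advantage is easy to see in a small sketch: a transformer's KV cache grows linearly with the context length, while an SSM keeps a fixed-size recurrent state. All the dimensions below are illustrative assumptions, not measurements of real models:

```python
# Rough comparison of per-sequence inference memory:
# - transformer: KV cache of shape (seq_len, n_kv_heads, head_dim) for K and V, per layer
# - SSM (e.g., Mamba): one recurrent state per layer, independent of seq_len

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

def ssm_state_bytes(n_layers=32, d_inner=4096, d_state=16, dtype_bytes=2):
    return n_layers * d_inner * d_state * dtype_bytes

for seq_len in (4_096, 32_768, 262_144):
    print(f"seq_len={seq_len:>7}: KV cache ≈ {kv_cache_bytes(seq_len)/1024**3:6.2f} GiB, "
          f"SSM state ≈ {ssm_state_bytes()/1024**2:5.1f} MiB")
```

With these (assumed) dimensions, the KV cache goes from 0.5 GiB at 4K tokens to 32 GiB at 256K tokens, while the SSM state stays at a few MiB regardless of context length.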

To take advantage of both the transformer and SSM architectures, Jamba has been proposed. This hybrid model combines SSM (Mamba) and transformer layers, a combination that balances memory usage, training efficiency, and long-context capabilities.
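Here is a conceptual sketch of how such an interleaving could look. The 1:7 attention-to-Mamba ratio, the MoE-every-other-layer pattern, and the exact offsets follow my reading of the Jamba paper and released configuration; the strings below are placeholders, not the actual implementation:

```python
# Conceptual layout of a Jamba-style block of 8 layers:
# one attention layer for every seven Mamba layers, and an MoE feed-forward
# replacing the dense MLP in every other layer (offsets are assumptions).

def jamba_block_layout(n_layers: int = 8, attn_period: int = 8, moe_period: int = 2):
    layout = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_period == attn_period // 2 else "mamba"
        mlp = "moe" if i % moe_period == 1 else "mlp"
        layout.append(f"{mixer} + {mlp}")
    return layout

for i, layer in enumerate(jamba_block_layout()):
    print(f"layer {i}: {layer}")
```

Keeping only a few attention layers preserves most of the modeling quality while the Mamba layers keep the KV cache, and therefore the long-context memory footprint, small.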

Jamba performs as well as Mixtral-8x7B, one of the best open LLMs, but is more efficient, especially when dealing with long contexts.


In this article, I review Jamba. We take a close look at its architecture and training. We also explore how to fine-tune and quantize the model to reduce its size.
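As a preview, here is a minimal sketch of loading Jamba with 4-bit quantization via bitsandbytes. The model ID is an assumption based on the AI21 release; check the model card and use a recent transformers version that includes Jamba support:

```python
# Minimal sketch: load Jamba in 4-bit (NF4) to reduce its memory footprint.
# Requires a transformers release with Jamba support and bitsandbytes installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ai21labs/Jamba-v0.1"  # assumed Hugging Face model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```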

I made a notebook for fine-tuning Jamba here:

Get the notebook (#4)
