Reviewed this week
Zamba: A Compact 7B SSM Hybrid Model
⭐Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi-Layered Thoughts
⭐: Papers that I particularly recommend reading.
New code repositories:
I maintain a curated list of AI code repositories here:
Zamba: A Compact 7B SSM Hybrid Model
This technical report introduces Zamba, a 7B Mamba-based SSM that employs a novel global shared attention architecture. Trained on 1T tokens of open web datasets, Zamba performs comparably to leading transformer-based models of a similar size.
Similar to Jamba, Zamba combines SSM and transformer layers. However, Zamba uses a single global self-attention block whose parameters are shared across the stack of Mamba layers, whereas Jamba interleaves separate SSM and self-attention layers at a ratio chosen by the model's creators.
These hybrid architectures improve inference efficiency for the same parameter and memory cost.
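To make the parameter sharing concrete, here is a minimal PyTorch sketch of the idea (not Zamba's actual implementation; the stand-in SSM blocks, module names, and sharing interval are all assumptions):

```python
import torch
import torch.nn as nn

class SharedAttentionHybrid(nn.Module):
    """Toy Zamba-style stack: one global attention block whose parameters
    are reused at several depths of a stack of SSM layers."""

    def __init__(self, d_model: int, n_layers: int, share_every: int = 6):
        super().__init__()
        # Stand-ins for Mamba blocks; a real model would use mamba_ssm.Mamba.
        self.ssm_layers = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.GELU())
            for _ in range(n_layers)
        )
        # A single attention block: one set of weights for the whole stack.
        self.shared_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.share_every = share_every

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, ssm in enumerate(self.ssm_layers):
            x = x + ssm(x)
            # Apply the *same* attention parameters periodically: the memory
            # cost of one attention block, applied at multiple depths.
            if (i + 1) % self.share_every == 0:
                attn_out, _ = self.shared_attn(x, x, x, need_weights=False)
                x = x + attn_out
        return x

x = torch.randn(2, 16, 512)
print(SharedAttentionHybrid(512, 12)(x).shape)  # torch.Size([2, 16, 512])
```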
Zamba's training follows a two-phase approach: general pre-training, followed by an annealing phase with a rapid learning-rate decay, which significantly improves performance on some downstream evaluations.
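To illustrate what such a schedule can look like, here is a toy two-phase learning-rate function; the phase split, rates, and decay shapes are made-up values for illustration, not the ones used for Zamba:

```python
import math

def lr_schedule(step: int, total_steps: int, peak_lr: float = 3e-4,
                anneal_frac: float = 0.1, floor_lr: float = 3e-6) -> float:
    anneal_start = int(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        # Phase 1: gentle cosine decay over general pre-training.
        progress = step / anneal_start
        return floor_lr + 0.5 * (peak_lr - floor_lr) * (1 + math.cos(math.pi * progress * 0.5))
    # Phase 2: rapid exponential decay during the short annealing phase.
    t = (step - anneal_start) / max(1, total_steps - anneal_start)
    return max(floor_lr, peak_lr * 0.5 * math.exp(-8.0 * t))

for s in (0, 50_000, 90_000, 95_000, 100_000):
    print(s, f"{lr_schedule(s, 100_000):.2e}")
```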
The authors provide open access to all training checkpoints:
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
This paper offers multiple perspectives for combining the strengths of State Space Models (SSMs) and attention. The proposed framework, called structured state space duality (SSD), connects structured SSMs and attention through the abstraction of structured (semiseparable) matrices.
Efficient algorithms developed under this framework expose new, easily implementable methods for computing SSMs. The resulting SSD algorithm brings significant efficiency gains: it is faster than Mamba's optimized selective-scan implementation and allows for much larger recurrent state sizes.
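The identity at the heart of SSD can be checked in a few lines of NumPy for the scalar-decay, single-channel case (a heavy simplification of the paper's general algorithm): the same sequence map is computed either as a linear-time recurrence or as a quadratic, attention-like matrix multiply under a decay-structured causal mask.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state size
a = rng.uniform(0.5, 1.0, T)     # per-step scalar decay (gating)
B = rng.standard_normal((T, N))  # input projections
C = rng.standard_normal((T, N))  # output projections
x = rng.standard_normal(T)       # one input channel

# Linear-time recurrent form: h_t = a_t * h_{t-1} + B_t * x_t ; y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Quadratic "attention" form: y = M x, with a masked, decay-weighted matrix
# M[t, s] = (C_t . B_s) * prod_{k=s+1..t} a_k   for s <= t, else 0.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1 : t + 1])
M = (C @ B.T) * L
y_mat = M @ x

print(np.allclose(y_rec, y_mat))  # True: both forms compute the same map
```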
In terms of architecture design, the framework makes it possible to carry established conventions and techniques over from Transformers to SSMs. This includes introducing multi-input SSMs (analogous to multi-head attention) and implementing tensor parallelism. The resulting Mamba-2 architecture, which uses SSD as its inner SSM layer, outperforms both Mamba and Transformer baselines in perplexity while also being faster in wall-clock time.
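As a toy illustration of the multi-head analogy (the shapes and variable names here are mine, not the paper's): each head shares its per-step projections B_t and C_t across all of its channels, much like an attention head shares query/key projections while each channel carries its own value.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, P, N = 6, 2, 3, 4   # time, heads, channels per head, state size
a = rng.uniform(0.5, 1.0, (T, H))     # one decay per head and step
B = rng.standard_normal((T, H, N))    # shared across a head's P channels
C = rng.standard_normal((T, H, N))
x = rng.standard_normal((T, H, P))    # multi-channel input per head

y = np.zeros((T, H, P))
h = np.zeros((H, N, P))               # one state matrix per head
for t in range(T):
    # All P channels of a head reuse the same B_t, C_t: the "multi-input"
    # structure, and heads are independent, so they shard cleanly across
    # devices for tensor parallelism.
    h = a[t][:, None, None] * h + np.einsum('hn,hp->hnp', B[t], x[t])
    y[t] = np.einsum('hn,hnp->hp', C[t], h)
print(y.shape)  # (6, 2, 3)
```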
Empirical validation of Mamba-2 on language modeling and training efficiency supports this theoretical progress. Mamba-2 models trained on the Pile dataset match or outperform comparable models.
The authors released recipes for building Mamba-2 models in the Mamba repository:
GitHub: state-spaces/mamba
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi-Layered Thoughts
In this work, the authors argue that relying only on similarity for retrieval-augmented generation (RAG) can degrade performance. For example, a similarity-driven system might rank a document highly because it looks similar to the query, even though it provides little useful information, while more informative documents end up ranked lower due to lower similarity scores. Furthermore, when multiple documents are retrieved, using them in isolation or naively aggregating them can confuse large language models (LLMs), leading to information loss and degraded performance.
To address these limitations, the authors propose METRAG, a new approach that incorporates multi-layered considerations beyond similarity, such as utility and compactness. It aims to improve performance by training a model to perceive utility-oriented information rather than just similarity. METRAG also summarizes the retrieved documents to reduce their size and keep only the important information. Since naive summarization might not retain the information most relevant to the query, a task-aligned summarization model is used instead.
METRAG uses an LLM to supervise document utility, aligning the utility model with the LLM's feedback.
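Putting the pieces described above together, the overall flow can be sketched as follows; every component and method name below is a placeholder interface, not METRAG's actual code:

```python
# Hypothetical sketch of a multi-layered RAG pipeline in the spirit of METRAG.
# The retriever, utility_model, summarizer, and llm objects are placeholders.

def metrag_style_answer(query: str, retriever, utility_model, summarizer, llm,
                        k: int = 20, top_m: int = 5) -> str:
    # Layer 1: standard similarity-based retrieval (recall-oriented).
    candidates = retriever.search(query, k=k)

    # Layer 2: utility-oriented reranking. The utility model is trained with
    # LLM feedback to score how much a document actually helps answer the
    # query, not just how similar it looks.
    scored = sorted(candidates,
                    key=lambda doc: utility_model.score(query, doc),
                    reverse=True)[:top_m]

    # Layer 3: task-aligned (compactness-oriented) summarization that keeps
    # only the information relevant to this specific query.
    summary = summarizer.summarize(query, scored)

    # Final generation conditioned on the compact, utility-filtered context.
    return llm.generate(f"Context:\n{summary}\n\nQuestion: {query}\nAnswer:")
```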
The proposed METRAG method is evaluated on various tasks. The reported scores look good for METRAG, but the paper doesn't provide enough detail to confirm that all the baselines were evaluated with the same hyperparameters and prompts.
If you have any questions about one of these papers, write them in the comments. I will answer them.