I think the last time I reviewed a hybrid architecture was with Jamba, and that was over a year ago!
Now, with the release of NVIDIA’s new Nemotron-H models, which show strong performance in terms of accuracy, inference speed, and memory efficiency, it's a great opportunity to revisit the evolving landscape of hybrid LLMs.
To be clear, I don’t believe hybrid models will surpass standard Transformer architectures in quality or popularity. Transformers continue to become more efficient and widely adopted, while hybrid models have been around for some time without achieving mainstream traction.
However, hybrid models remain a valuable area of research, as they often reveal behaviors and insights that are both intriguing and useful.
In this article, we’ll take a closer look at the Nemotron-H models, exploring what’s new, how NVIDIA trained them, and what you need to fine-tune them. The good news? They’re easy to try out, which is rarely the case with non-standard LLMs.
Since the Nemotron-H models are base models, i.e., they have only been pre-trained on a large dataset, they must be fine-tuned to be useful. In this notebook, I provide the fine-tuning code (full fine-tuning and LoRA) based on TRL and Transformers (a minimal sketch of the LoRA setup follows below):
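To give a concrete idea of the LoRA part, here is a minimal sketch using TRL's SFTTrainer together with a PEFT LoraConfig. It is not the notebook's exact code: the model repository name, the example dataset, the `target_modules` value, and the hyperparameters are assumptions for illustration, so check the Nemotron-H model card and your TRL version before running it.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Assumed model repository; verify the exact name on the NVIDIA model card.
model_id = "nvidia/Nemotron-H-8B-Base-8K"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # hybrid (Mamba/attention) layers may need custom code
)

# Small instruction dataset with a "text" column, used here only as an example.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # assumption: adapt to Nemotron-H's module names
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="nemotron-h-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    logging_steps=10,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
    peft_config=peft_config,
)
trainer.train()
```

Dropping `peft_config` from the trainer call turns this into full fine-tuning, at a much higher memory cost; the notebook covers both variants in detail.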