DeepSeek-V2: A Huge LLM with Efficient Inference

A review of DeepSeek-V2's architecture with code to fine-tune and run DeepSeek-V2 Lite

Benjamin Marie
May 22, 2024

Image: a cartoon of 160 salt shakers debating around a large table, overseen by two salt shakers in suits. Generated with DALL-E.

Since the release of Mixtral-8x7B by Mistral AI, mixture-of-experts (MoE) LLMs have been shown to perform as well as standard “dense” models of similar sizes while being cheaper for inference. This is because not all the parameters of an MoE are active during inference: only a subset of experts is actually used. For instance, Mixtral-8x7B and Mixtral-8x22B activate only two experts out of eight. The decision of which experts to activate is made by a router network.
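
To make the routing idea concrete, here is a minimal sketch of Mixtral-style top-k routing in PyTorch. It is an illustration, not Mixtral's actual implementation: a small linear router scores the experts for each token, and only the two highest-scoring experts process that token.

```python
import torch
import torch.nn.functional as F

# Toy dimensions: 8 experts, 2 activated per token (as in Mixtral-8x7B).
hidden_size, num_experts, top_k = 16, 8, 2
router = torch.nn.Linear(hidden_size, num_experts, bias=False)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)]
)

tokens = torch.randn(4, hidden_size)                   # 4 tokens
scores = F.softmax(router(tokens), dim=-1)             # (4, num_experts) routing probabilities
weights, selected = torch.topk(scores, top_k, dim=-1)  # keep the 2 best experts per token
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the selected experts

output = torch.zeros_like(tokens)
for i, token in enumerate(tokens):
    # Only the selected experts run for this token; the others stay idle.
    for w, e in zip(weights[i], selected[i]):
        output[i] += w * experts[int(e)](token)
```

Because the unselected experts never run, the compute cost per token depends on the number of active parameters, not on the total parameter count.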

With models like Mixtral-8x22B and DBRX, we now have very large MoEs with large experts.

DeepSeek AI released an even larger model, DeepSeek-V2, which has 236B parameters. It’s a huge model. DeepSeek-V2 has 160 experts (plus 2 shared experts), but only 6 routed experts are activated during inference. In other words, only 21B parameters are used per token. Yet, the model achieves strong performance on downstream tasks, placing it close to other LLMs that use many more active parameters, such as Llama 3 70B.
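
These numbers can be checked directly from the model's configuration on the Hugging Face Hub. The field names below (n_routed_experts, n_shared_experts, num_experts_per_tok) are those exposed by DeepSeek's custom configuration class at the time of writing; treat them as assumptions that may change in future revisions.

```python
from transformers import AutoConfig

# Load only the configuration (no weights are downloaded).
# DeepSeek-V2 ships custom modeling code, hence trust_remote_code=True.
config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V2", trust_remote_code=True)

print(config.n_routed_experts)     # 160 routed experts per MoE layer
print(config.n_shared_experts)     # 2 shared experts, always active
print(config.num_experts_per_tok)  # 6 routed experts activated per token
```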

In this article, I review DeepSeek-V2. More specifically, I dive into its architecture and highlight the main features of the model. I also review how DeepSeek AI trained the model. DeepSeek-V2 itself consumes too much memory to run locally, but we can easily fine-tune and run DeepSeek-V2 Lite on consumer hardware.
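
As a quick preview of what running the Lite model looks like in practice, here is a minimal sketch of loading DeepSeek-V2 Lite with 4-bit quantization (bitsandbytes) so that it fits on a consumer GPU. The model ID points to the public Hugging Face repository; the quantization settings are reasonable defaults I chose for this sketch, not necessarily the ones used in the notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite"

# 4-bit NF4 quantization to reduce the memory footprint for consumer GPUs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("DeepSeek-V2 is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```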

A notebook demonstrating fine-tuning for DeepSeek-V2 Lite is available here:

Get the notebook (#5)
