DeepSeek-V2: A Huge LLM with Efficient Inference
A review of DeepSeek-V2's architecture with code to fine-tune and run DeepSeek-V2 Lite
Since the release of Mixtral-8x7B by Mistral AI, mixture-of-experts (MoE) LLMs have been shown to perform as well as standard “dense” models of similar size while being cheaper for inference. This is because not all the parameters of an MoE are active during inference: only a subset of the experts is used for each token. For instance, Mixtral-8x7B and Mixtral-8x22B activate only two of their eight experts. The decision of which experts to activate is made by a router network.
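To make this more concrete, here is a minimal sketch of how a top-k router selects experts in a typical MoE layer, written with PyTorch. This is an illustration, not DeepSeek AI's or Mistral AI's implementation; the class and parameter names are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    # Illustrative top-k MoE layer (not DeepSeek AI's implementation)
    def __init__(self, hidden_size, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router is a linear layer producing one score per expert for each token
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is its own feed-forward network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = self.router(x)                               # (num_tokens, num_experts)
        weights, indices = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                   # normalize over the selected experts
        out = torch.zeros_like(x)
        # Only the top-k experts are run for each token; the others stay idle
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only `top_k` experts run per token, the compute cost scales with the number of active parameters rather than with the total parameter count, which is what makes MoE inference cheaper.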
With models like Mixtral-8x22B and DBRX, we now have very large MoEs with large experts.
DeepSeek AI released an even larger model, DeepSeek-V2, which has 236B parameters. It’s a huge model. DeepSeek-V2 has 160 routed experts (plus 2 shared experts) per MoE layer, but only 6 of the routed experts are activated for each token. In other words, only 21B parameters are used during inference. Yet, the model achieves strong performance on downstream tasks, placing it close to LLMs that use many more active parameters, such as Llama 3 70B.
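We can check these numbers directly in the model's configuration on the Hugging Face Hub. The sketch below assumes the configuration exposes fields named `n_routed_experts`, `n_shared_experts`, and `num_experts_per_tok` (the names used by DeepSeek-V2's custom modeling code); it only downloads the configuration file, not the 236B-parameter checkpoint.

```python
from transformers import AutoConfig

# Fetches only the configuration file, not the model weights
config = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V2",
    trust_remote_code=True,  # DeepSeek-V2 ships custom modeling code
)

# Field names assumed from DeepSeek-V2's configuration; verify them in config.json
print("Routed experts per layer:", config.n_routed_experts)
print("Shared experts per layer:", config.n_shared_experts)
print("Routed experts activated per token:", config.num_experts_per_tok)
```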
In this article, I review DeepSeek-V2. In particular, I take a deep dive into its architecture and highlight the main features of the model. I also review how the model was trained by DeepSeek AI. DeepSeek-V2 consumes too much memory to run locally, but we can easily fine-tune and run DeepSeek-V2 Lite on consumer hardware.
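As a preview, here is a minimal sketch of loading DeepSeek-V2 Lite for inference with 4-bit quantization (bitsandbytes) so that it fits on a consumer GPU. The prompt and generation settings are illustrative; fine-tuning is covered in the notebook linked below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite"

# 4-bit NF4 quantization to fit the model in consumer GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # DeepSeek-V2 uses custom modeling code
)

prompt = "Mixture-of-experts models are efficient because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```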
A notebook demonstrating fine-tuning for DeepSeek-V2 Lite is available here: