DeepSeek-V2: A Huge LLM with Efficient Inference
A review of DeepSeek-V2's architecture with code to fine-tune and run DeepSeek-V2 Lite
Since the release of Mixtral-8x7B by Mistral AI, mixture-of-experts (MoE) LLMs have been shown to perform as well as standard “dense” models of similar size while being cheaper for inference. This is because not all the parameters of an MoE are active during inference: only a subset of the experts is used for each token. For instance, Mixtral-8x7B and Mixtral-8x22B activate only two of their eight experts. The decision of which experts to activate is made by a router network.
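To make this more concrete, here is a minimal sketch of how a top-k router selects experts in a typical MoE layer, written with PyTorch. This is an illustration, not DeepSeek AI's or Mistral AI's implementation; the class and parameter names are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    # Illustrative top-k MoE layer (not DeepSeek AI's implementation)
    def __init__(self, hidden_size, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router is a linear layer producing one score per expert for each token
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is its own feed-forward network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = self.router(x)                               # (num_tokens, num_experts)
        weights, indices = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                   # normalize over the selected experts
        out = torch.zeros_like(x)
        # Only the top-k experts are run for each token; the others stay idle
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only `top_k` experts run per token, the compute cost scales with the number of active parameters rather than with the total parameter count, which is what makes MoE inference cheaper.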
With models like Mixtral-8x22B and DBRX, we now have very large MoEs with large experts.
DeepSeek AI released an even larger model, DeepSeek-V2, which has 236B parameters. It’s a huge model. DeepSeek-V2 has 160 routed experts (plus 2 shared experts) per MoE layer, but only 6 of the routed experts are activated for each token. In other words, only 21B parameters are used during inference. Yet, the model achieves strong performance on downstream tasks, placing it close to LLMs that use many more active parameters, such as Llama 3 70B.
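We can check these numbers directly in the model's configuration on the Hugging Face Hub. The sketch below assumes the configuration exposes fields named `n_routed_experts`, `n_shared_experts`, and `num_experts_per_tok` (the names used by DeepSeek-V2's custom modeling code); it only downloads the configuration file, not the 236B-parameter checkpoint.

```python
from transformers import AutoConfig

# Fetches only the configuration file, not the model weights
config = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V2",
    trust_remote_code=True,  # DeepSeek-V2 ships custom modeling code
)

# Field names assumed from DeepSeek-V2's configuration; verify them in config.json
print("Routed experts per layer:", config.n_routed_experts)
print("Shared experts per layer:", config.n_shared_experts)
print("Routed experts activated per token:", config.num_experts_per_tok)
```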
In this article, I review DeepSeek-V2. In particular, I take a deep dive into its architecture and highlight the main features of the model. I also review how the model was trained by DeepSeek AI. DeepSeek-V2 consumes too much memory to run locally, but we can easily fine-tune and run DeepSeek-V2 Lite on consumer hardware.
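As a preview, here is a minimal sketch of loading DeepSeek-V2 Lite for inference with 4-bit quantization (bitsandbytes) so that it fits on a consumer GPU. The prompt and generation settings are illustrative; fine-tuning is covered in the notebook linked below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite"

# 4-bit NF4 quantization to fit the model in consumer GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # DeepSeek-V2 uses custom modeling code
)

prompt = "Mixture-of-experts models are efficient because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```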
A notebook demonstrating fine-tuning for DeepSeek-V2 Lite is available here: