
DeepSeek's open-source V3 model uses a mixture-of-experts architecture to rival closed models from top-tier AI labs, setting a new standard for efficiency and performance in ultra-large language models.
DeepSeek, a Chinese AI startup known for pushing the boundaries of open-source AI, has released its latest ultra-large model, DeepSeek-V3. This new model, available on Hugging Face under the company’s license agreement, boasts 671 billion parameters but leverages a mixture-of-experts (MoE) architecture to maintain efficiency and performance. According to benchmarks shared by DeepSeek, DeepSeek-V3 outperforms leading open-source models like Meta's Llama 3.1-405B and is on par with closed models from Anthropic and OpenAI.
DeepSeek-V3 builds upon the architecture of its predecessor, DeepSeek-V2, which combines multi-head latent attention (MLA) with DeepSeekMoE. Together, these components keep training and inference efficient: for each token, the model activates only 37 billion of its 671 billion parameters. This selective activation lets the model draw on a very large parameter count without paying the full computational cost on every token.
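To make the selective activation concrete, here is a minimal sketch of the arithmetic and of top-k expert routing in general. The two parameter counts come from the article; the expert count, the value of k, and the function names are hypothetical choices for illustration and are not taken from DeepSeek's code.

```python
import random

TOTAL_PARAMS = 671e9   # total parameters, as reported in the article
ACTIVE_PARAMS = 37e9   # parameters activated per token, as reported in the article


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters that take part in processing one token."""
    return active / total


def route_top_k(scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active share per token: {share:.1%}")

    # Hypothetical toy router: 16 experts, 2 chosen per token.
    rng = random.Random(0)
    scores = [rng.random() for _ in range(16)]
    print("experts used for this token:", route_top_k(scores, k=2))
```

Only the selected experts run for a given token, which is why roughly 5.5% of the parameters are active at a time even though all 671 billion must be stored.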
One of the key innovations in DeepSeek-V3 is an auxiliary-loss-free load-balancing strategy. This dynamic approach monitors the load on each expert (the smaller neural networks within the model) and adjusts routing so that the experts are used evenly. Because balance is achieved without adding an auxiliary loss term to the training objective, even utilization does not come at the expense of overall model performance.
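The article does not spell out how the load is adjusted. The sketch below assumes one plausible mechanism: each expert carries a routing bias that is added to its score when experts are picked, and the bias is nudged down for overloaded experts and up for underloaded ones after every batch. The skewed affinities, step size, and all names are hypothetical, and the simulation is a toy rather than DeepSeek's implementation.

```python
import random


def pick_experts(affinity: list[float], bias: list[float], k: int) -> list[int]:
    """Select the top-k experts by affinity plus a per-expert routing bias."""
    adjusted = [a + b for a, b in zip(affinity, bias)]
    return sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:k]


def simulate(steps: int, num_experts: int = 8, k: int = 2, batch: int = 512,
             step_size: float = 0.01, balance: bool = True, seed: int = 0) -> list[int]:
    """Route random tokens for `steps` batches; return the final batch's per-expert load."""
    rng = random.Random(seed)
    # Hypothetical skew: lower-numbered experts tend to receive higher affinity.
    skew = [0.5 * (num_experts - i) / num_experts for i in range(num_experts)]
    bias = [0.0] * num_experts
    load = [0] * num_experts
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(batch):
            affinity = [rng.random() + s for s in skew]
            for expert in pick_experts(affinity, bias, k):
                load[expert] += 1
        if balance:
            mean_load = sum(load) / num_experts
            for expert in range(num_experts):
                # Nudge the bias against the observed imbalance; no loss term is involved.
                if load[expert] > mean_load:
                    bias[expert] -= step_size
                elif load[expert] < mean_load:
                    bias[expert] += step_size
    return load


if __name__ == "__main__":
    print("without balancing:", simulate(100, balance=False))
    print("with balancing:   ", simulate(100, balance=True))
```

In this toy setup the bias only influences which experts are selected, so the correction happens in the router itself instead of through an extra term in the training loss.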
Another significant advancement is multi-token prediction (MTP), which has DeepSeek-V3 predict several future tokens at once instead of only the next one, improving training efficiency and speeding up inference. According to DeepSeek, the model generates 60 tokens per second, a threefold improvement over DeepSeek-V2.
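As a rough illustration of what predicting multiple future tokens means for training data, the sketch below builds the targets for each position in a sequence. The function name, the `depth` parameter, and the token ids are hypothetical; the snippet shows only how targets are laid out, not how DeepSeek-V3's MTP modules are built.

```python
def mtp_targets(tokens: list[int], depth: int) -> list[tuple[int, list[int]]]:
    """For each position, list the next `depth` tokens the model is asked to predict.

    With depth=1 this reduces to ordinary next-token prediction.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    targets = []
    # Stop early enough that every position has a full set of future tokens.
    for pos in range(len(tokens) - depth):
        targets.append((pos, tokens[pos + 1 : pos + 1 + depth]))
    return targets


if __name__ == "__main__":
    sequence = [101, 7, 42, 9, 300]  # hypothetical token ids
    for pos, future in mtp_targets(sequence, depth=2):
        print(f"position {pos}: predict {future}")
```

Each position now contributes more than one prediction target, which is one way to read the claim that MTP extracts more training signal from the same data.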

During pre-training, DeepSeek-V3 was trained on 14.8 trillion high-quality and diverse tokens. The company then extended the model's context length in two stages, first to 32K tokens and then to 128K.
On the benchmarks DeepSeek has shared, DeepSeek-V3 outperforms Meta's Llama 3.1-405B and closely matches closed models from Anthropic and OpenAI. The result highlights the growing competitiveness of open-source AI models, which are closing the gap with proprietary solutions.
The release of DeepSeek-V3 is a significant milestone in the ongoing development of artificial general intelligence (AGI). By making such a powerful model openly available under the company's license agreement, DeepSeek aims to democratize advanced AI capabilities and accelerate research. This move could pave the way for more innovative applications and further advancements in the field.
DeepSeek-V3 represents a significant leap forward for ultra-large AI models. Its efficient architecture, coupled with innovations such as auxiliary-loss-free load balancing and multi-token prediction, sets it apart from its competitors. As the AI community continues to push the boundaries of what is possible, models like DeepSeek-V3 will play a crucial role in shaping the future of artificial intelligence.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
31 December 2024