
As AI models like DeepSeek-V3 push for higher throughput, they face a latency dilemma: answer each request quickly, or handle more requests at once.
When serving AI models at scale, one of the most critical tradeoffs is between throughput (the number of requests a system can handle per second) and latency (the time it takes to process each request). This tradeoff is particularly pronounced with models like DeepSeek-V3, which are designed for high throughput but often struggle to meet low-latency requirements. Let's dive into why this happens and how batch inference plays a crucial role.
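In AI inference, you can generally choose between two modes: a high-throughput, high-latency mode that processes many requests together, and a low-latency, low-throughput mode that processes each request as soon as it arrives.
For models like DeepSeek-V3, poor GPU utilization at small batch sizes means that they must often be served in the high-latency mode to achieve reasonable throughput. This is where batch inference comes into play.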
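Batch inference involves processing multiple user requests simultaneously rather than one at a time. GPUs excel at large general matrix multiplications (GEMMs), which are fundamental operations in deep learning models, especially transformers. Batching is effective because the model weights must be read from GPU memory on every forward pass, whether that pass serves one request or many. Stacking requests into one larger matrix multiplication spreads that cost across the whole batch, and a single large GEMM keeps the GPU's compute units far busier than many small ones.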
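The effect of batching on a single linear layer can be sketched with a little arithmetic. The function below is a simplified model, not a measurement: it counts only the bytes needed to read the weights, ignores activation traffic, and uses a hypothetical 4096 x 4096 layer stored in 2-byte precision. Under those assumptions, the work done per byte of weights read grows linearly with batch size.

```python
def arithmetic_intensity(batch_size: int, d_in: int, d_out: int,
                         bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weight data read, for one linear layer.

    Multiplying a (batch_size x d_in) activation matrix by a (d_in x d_out)
    weight matrix costs 2 * batch_size * d_in * d_out FLOPs. The weights
    occupy d_in * d_out * bytes_per_param bytes no matter how many requests
    are in the batch. Activation reads and writes are ignored (assumption).
    """
    flops = 2 * batch_size * d_in * d_out
    weight_bytes = d_in * d_out * bytes_per_param
    return flops / weight_bytes


if __name__ == "__main__":
    # Hypothetical layer size, chosen only for illustration.
    for batch_size in (1, 8, 64, 512):
        intensity = arithmetic_intensity(batch_size, d_in=4096, d_out=4096)
        print(f"batch={batch_size:4d}  FLOPs per weight byte={intensity:.1f}")
```

At a batch size of 1 the GPU does very little arithmetic for each byte it fetches, so memory bandwidth rather than compute is likely to be the limit. Larger batches reuse the same weights for more requests.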

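Let's break down the typical workflow of an inference server using batch processing. The server queues incoming requests, groups them into a batch, runs the batch through the model in a single forward pass, and returns each result to the request it belongs to.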
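A generic batching loop might look something like the sketch below. It is an illustration under simple assumptions, not the implementation of any particular inference server: `run_batched` and `fake_model` are hypothetical names, all requests are already waiting in the queue, and the "model" is a stand-in function that handles a whole batch in one call.

```python
from collections import deque
from typing import Callable, Deque, List


def run_batched(requests: List[str], max_batch_size: int,
                model_fn: Callable[[List[str]], List[str]]) -> List[List[str]]:
    """Drain a request queue in batches of at most max_batch_size.

    Returns the model outputs grouped by the batch they were produced in.
    """
    queue: Deque[str] = deque(requests)
    outputs_per_batch: List[List[str]] = []
    while queue:
        # Take up to max_batch_size requests from the front of the queue.
        take = min(max_batch_size, len(queue))
        batch = [queue.popleft() for _ in range(take)]
        # One call to the model serves the whole batch.
        outputs_per_batch.append(model_fn(batch))
    return outputs_per_batch


def fake_model(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for a forward pass: upper-cases each prompt."""
    return [prompt.upper() for prompt in batch]


if __name__ == "__main__":
    print(run_batched(["a", "b", "c", "d", "e"], max_batch_size=2,
                      model_fn=fake_model))
```

A real server would also need a timeout so that a partially filled batch is not held indefinitely when traffic is light.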
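The server decides the batch size, which directly affects the tradeoff between throughput and latency. A larger batch raises throughput but makes each request wait longer for the batch to fill and finish, while a smaller batch responds sooner but leaves GPU capacity unused.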
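To make the effect of batch size concrete, here is a toy cost model. Every number in it is an assumption chosen for illustration: requests arrive at a steady rate, each forward pass has a fixed cost plus a small per-request cost, and the server waits for a full batch before running. The function name `batch_tradeoff` and its parameters are hypothetical.

```python
from typing import Tuple


def batch_tradeoff(batch_size: int, arrival_rate_rps: float,
                   fixed_ms: float, per_request_ms: float) -> Tuple[float, float]:
    """Return (throughput in requests/s, worst-case latency in ms).

    Toy model: the first request in a batch waits for batch_size - 1 more
    arrivals, then the whole batch is computed in
    fixed_ms + per_request_ms * batch_size milliseconds.
    """
    fill_ms = (batch_size - 1) / arrival_rate_rps * 1000.0
    compute_ms = fixed_ms + per_request_ms * batch_size
    throughput_rps = batch_size / (compute_ms / 1000.0)
    worst_latency_ms = fill_ms + compute_ms
    return throughput_rps, worst_latency_ms


if __name__ == "__main__":
    # Assumed workload: 100 requests/s, 50 ms fixed cost, 1 ms per request.
    for batch_size in (1, 8, 64):
        throughput, latency = batch_tradeoff(batch_size, 100.0, 50.0, 1.0)
        print(f"batch={batch_size:3d}  throughput={throughput:7.1f} req/s  "
              f"worst latency={latency:7.1f} ms")
```

In this model both numbers rise together as the batch grows, which is the tension the server has to tune around.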
DeepSeek-V3, like many transformer-based models, is designed to take advantage of high-throughput scenarios. However, this comes at a cost: serving it at scale means accepting higher latency, and running it locally or in latency-sensitive settings means small batches that use the GPU poorly. The model's architecture and computational requirements make it GPU-inefficient at small batch sizes, which is why it performs best in large-batch, high-throughput settings.
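One architectural reason for this, offered here as background rather than something stated above, is that DeepSeek-V3 is a mixture-of-experts model: each token is routed to only a few of its many experts. The sketch below estimates how many tokens each expert sees on average, assuming perfectly uniform routing (real routing is uneven). The figures of 256 routed experts with 8 active per token are taken from the publicly reported model configuration and should be treated as assumptions of this example.

```python
def avg_tokens_per_expert(tokens_in_batch: int, num_experts: int,
                          experts_per_token: int) -> float:
    """Average tokens handled by each expert, assuming uniform routing."""
    return tokens_in_batch * experts_per_token / num_experts


if __name__ == "__main__":
    # Assumed configuration: 256 routed experts, 8 active per token.
    for tokens in (1, 32, 1024, 4096):
        per_expert = avg_tokens_per_expert(tokens, num_experts=256,
                                           experts_per_token=8)
        print(f"tokens in batch={tokens:5d}  avg tokens per expert={per_expert:.2f}")
```

Under these assumptions, a batch needs thousands of tokens before each expert's matrix multiplication is large enough to use the GPU well, which is consistent with the model favoring large batches.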
Understanding the tradeoff between throughput and latency is crucial for optimizing AI inference systems. Batch inference is a powerful technique that leverages the strengths of GPUs to improve efficiency, but it requires careful tuning to balance throughput against response time. Models like DeepSeek-V3 are therefore best suited to high-throughput environments where some added latency is acceptable.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
2 June 2025