
Share
SGLang v0.3 rockets performance with a 7x speed boost for DeepSeek MLA and a 1.5x latency cut using torch.compile, while also introducing multi-image/video support in LLaVA-OneVision.
We're excited to announce the release of SGLang v0.3, which brings significant performance enhancements and expanded support for novel model architectures. Here are the key updates:
torch.[compile](/articles/mastering-torchcompile-a-developers-guide-to-pytorch-performance-optimization) on small batch sizesMulti-head Latent Attention (MLA) is a new attention variant introduced by the DeepSeek team to improve inference efficiency. Unlike standard attention mechanisms, MLA has unique characteristics that existing open-source libraries haven't fully optimized for. In SGLang v0.3, we implemented several optimizations for MLA, including:
Benchmark results show that SGLang v0.3 with MLA optimizations achieves 3x to 7x higher throughput than the baseline system. The benchmarks measure peak output throughput using BF16 and FP8 on H100 GPUs (tensor-parallelism=1 for lite models and tensor-parallelism=8 for big models) on the ShareGPT datasets. Reproducible instructions are provided in the appendix.
While these results are encouraging, there is still room for improvement. We are actively working on more optimizations to fully reproduce the results from the DeepSeek paper. Related PRs include:

Torch.compile is a major feature of PyTorch 2.0, designed to optimize performance on NVIDIA GPUs by performing aggressive fusion and generating highly efficient Triton kernels. We've integrated torch.compile into SGLang for linear/norm/activation layers, combining it with FlashInfer attention and sampling kernels.
We are actively collaborating with the torch.compile team to further enhance performance and ensure compatibility with the latest PyTorch updates.
LLaVA-OneVision is a multimodal model that supports interleaved text and multi-image/video inputs. This update in SGLang v0.3 allows for more flexible and powerful use cases, such as generating captions for images or videos while incorporating textual context.
Gemma-2 now supports interleaved window attention, which
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
6 September 2024
88 articles
Related Articles
Related Articles
More Stories