SGLang v0.3: 7x Faster DeepSeek MLA, 1.5x Speedup with torch.compile, and Multi-Image/Video Support in LLaVA-OneVision

Models & Research

The Engineer

6 Sept 2024 · 3 min read

SGLang v0.3 rockets performance with a 7x speed boost for DeepSeek MLA and a 1.5x latency cut using torch.compile, while also introducing multi-image/video support in LLaVA-OneVision.

We're excited to announce the release of SGLang v0.3, which brings significant performance enhancements and expanded support for novel model architectures. Here are the key updates:

Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
Up to 1.5x lower latency with torch.compile on small batch sizes
Support for interleaved text and multi-image/video in LLaVA-OneVision
Support for interleaved window attention and 2x longer context length in Gemma-2

DeepSeek Multi-head Latent Attention (MLA) Throughput Optimizations

Multi-head Latent Attention (MLA) is a new attention variant introduced by the DeepSeek team to improve inference efficiency. Unlike standard attention mechanisms, MLA has unique characteristics that existing open-source libraries haven't fully optimized for. In SGLang v0.3, we implemented several optimizations for MLA, including:

Weight Absorption: Simplifying the computation by merging weights.
Grouped Decoding Kernels: Efficiently handling multiple decoding steps.
FP8 Batched MatMul: Utilizing FP8 precision for batch matrix multiplication to reduce memory bandwidth.
FP8 KV Cache Quantization: Reducing the memory footprint of key-value caches.

Benchmark results show that SGLang v0.3 with MLA optimizations achieves 3x to 7x higher throughput than the baseline system. The benchmarks measure peak output throughput using BF16 and FP8 on H100 GPUs (tensor-parallelism=1 for lite models and tensor-parallelism=8 for big models) on the ShareGPT datasets. Reproducible instructions are provided in the appendix.

While these results are encouraging, there is still room for improvement. We are actively working on more optimizations to fully reproduce the results from the DeepSeek paper. Related PRs include:

Torch.compile Latency Optimizations

Torch.compile is a major feature of PyTorch 2.0, designed to optimize performance on NVIDIA GPUs by performing aggressive fusion and generating highly efficient Triton kernels. We've integrated torch.compile into SGLang for linear/norm/activation layers, combining it with FlashInfer attention and sampling kernels.

Latency Reduction: With this combination, we observed up to 1.5x lower latency on small batch sizes (1 to 32). This is particularly beneficial for online serving scenarios where low latency is crucial.
Performance Comparison: SGLang outperforms gpt-fast at batch size 1, while supporting all online serving features, including continuous batching and RadixAttention for prefix caching.

We are actively collaborating with the torch.compile team to further enhance performance and ensure compatibility with the latest PyTorch updates.

Multi-Image/Video Support in LLaVA-OneVision

LLaVA-OneVision is a multimodal model that supports interleaved text and multi-image/video inputs. This update in SGLang v0.3 allows for more flexible and powerful use cases, such as generating captions for images or videos while incorporating textual context.

Interleaved Window Attention and Longer Context Length in Gemma-2

Gemma-2 now supports interleaved window attention, which