SGLang and RadixAttention: Accelerating Complex LLM Workloads with Efficient KV Cache Reuse

The Engineer

18 Jan 2024 · 3 min read

Researchers at LMSYS have unveiled SGLang, a structured generation language for LLM programs, together with RadixAttention, a technique that automatically reuses the key-value (KV) cache across generation calls so that complex workloads run faster and are easier to manage.

Large Language Models (LLMs) have become indispensable for a wide range of tasks, from chatbots to complex reasoning systems. In real-world applications, however, they often run less efficiently than they could, because execution frameworks are not optimized for programs that make multiple generation calls. To tackle this, researchers at LMSYS have introduced SGLang, a Structured Generation Language designed to improve both the performance and controllability of LLM programs. SGLang pairs RadixAttention, a technique for automatic KV cache reuse, with a flexible domain-specific language (DSL) embedded in Python.

Key Technical Changes and Their Impact

RadixAttention for Efficient KV Cache Reuse: One of the primary bottlenecks in LLM inference is redundant computation and memory usage when handling multiple generation calls. RadixAttention addresses this by automatically identifying and reusing key-value (KV) cache entries across prompts that share a prefix. This cuts both computational overhead and memory footprint, which translates into higher throughput.
Flexible Frontend DSL: SGLang's frontend provides a Python-embedded DSL that allows developers to control the generation process in either interpreter or compiler mode. This flexibility is crucial for optimizing complex workflows involving multiple LLM calls, advanced prompting techniques, and interactions with external environments.

RadixAttention: The Backend Optimization

During the development of SGLang, the team identified KV cache reuse as a critical optimization opportunity. In many LLM applications, different prompts share a common prefix, such as a system prompt or a set of few-shot examples, so the KV cache computed for that prefix can be reused instead of being recomputed and stored again for every call. Existing systems either lack this capability or require manual configuration, which is impractical for diverse use cases.
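To make the saving concrete, here is a minimal sketch of the arithmetic behind prefix sharing. The token ids and the helper name `shared_prefix_len` are made up for illustration and are not part of SGLang; the sketch only counts how many tokens of a second prompt would still need to be processed if the shared prefix were served from cache.

```python
def shared_prefix_len(a, b):
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical token ids: a 6-token system prompt followed by different questions.
system = [101, 7, 7, 42, 9, 13]
prompt_a = system + [55, 56, 57]
prompt_b = system + [80, 81]

reused = shared_prefix_len(prompt_a, prompt_b)
# With prefix reuse, only the tokens after the shared prefix need fresh computation.
recomputed = len(prompt_b) - reused
print(reused, recomputed)  # 6 2
```

The longer the shared prefix relative to the unique suffix, the larger the share of work that can be skipped.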

How RadixAttention Works

Automatic Identification of Reuse Patterns: RadixAttention detects at runtime when an incoming prompt shares a prefix with prompts it has already processed, so it adapts to a wide range of scenarios without manual intervention.
Efficient Cache Management: Once a shared prefix is identified, RadixAttention reuses the existing KV cache entries for that prefix and computes new entries only for the tokens that follow it. This reduces both memory usage and computation time.
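The two steps above can be sketched as a toy prefix cache. This is not SGLang's implementation: it is a simplified token-level trie (one token per node, where a real radix tree would compress runs of tokens into a single edge), it stores no actual tensors, and the class names, method names, and least-recently-used eviction of leaves are illustrative assumptions.

```python
import itertools

class Node:
    def __init__(self):
        self.children = {}   # token id -> Node
        self.last_used = 0   # logical timestamp used for LRU eviction

class PrefixCache:
    """Toy token-level trie standing in for a tree of cached KV entries."""

    def __init__(self):
        self.root = Node()
        self._clock = itertools.count(1)

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached entries."""
        now = next(self._clock)
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            child.last_used = now
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Cache an entry for every token; return how many were newly created."""
        now = next(self._clock)
        node, new = self.root, 0
        for t in tokens:
            if t not in node.children:
                node.children[t] = Node()
                new += 1
            node = node.children[t]
            node.last_used = now
        return new

    def evict_lru_leaf(self):
        """Drop the least recently used leaf; return True if one was evicted."""
        best = None  # (last_used, parent, token)
        stack = [self.root]
        while stack:
            parent = stack.pop()
            for tok, child in parent.children.items():
                if child.children:
                    stack.append(child)
                elif best is None or child.last_used < best[0]:
                    best = (child.last_used, parent, tok)
        if best is None:
            return False
        del best[1].children[best[2]]
        return True

cache = PrefixCache()
first = cache.insert([1, 2, 3, 4])      # nothing cached yet: 4 new entries
hit = cache.match_prefix([1, 2, 3, 9])  # the first 3 tokens are already cached
second = cache.insert([1, 2, 3, 9])     # only the last token is new
print(first, hit, second)  # 4 3 1
```

Evicting leaves first keeps shared prefixes alive as long as any cached sequence still depends on them, which is one reason a tree is a natural fit for this kind of cache.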

Frontend: The Python-Embedded DSL

The frontend of SGLang is designed to be flexible and user-friendly. It provides a Python-embedded DSL that allows developers to control the generation process in two modes:

Interpreter Mode: In this mode, the DSL is executed line by line, making it ideal for interactive development and debugging.
Compiler Mode: For production environments, the DSL can be compiled into optimized code, which can then be executed more efficiently.
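The difference between the two modes can be illustrated with a self-contained toy. Everything here is hypothetical and is not SGLang's API: `Gen`, `InterpreterState`, `TracingState`, and `fake_model` are names invented for this sketch, and the "model" is a stub so the code runs offline. The point is only that the same Python-embedded program can either be executed step by step or recorded as a list of operations for later optimization.

```python
from dataclasses import dataclass, field

@dataclass
class Gen:
    """Placeholder for one generation call; `name` labels the captured output."""
    name: str

def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call so the sketch runs without a backend.
    return f"<answer to {len(prompt)} chars>"

@dataclass
class InterpreterState:
    """Executes each `+=` immediately, like running the program line by line."""
    text: str = ""
    outputs: dict = field(default_factory=dict)

    def __iadd__(self, part):
        if isinstance(part, Gen):
            result = fake_model(self.text)
            self.outputs[part.name] = result
            self.text += result
        else:
            self.text += part
        return self

@dataclass
class TracingState:
    """Records each `+=` as an operation instead of running it."""
    ops: list = field(default_factory=list)

    def __iadd__(self, part):
        if isinstance(part, Gen):
            self.ops.append(("gen", part.name))
        else:
            self.ops.append(("text", part))
        return self

def qa_program(s, question):
    # The same user program works with either state object.
    s += "Q: " + question + "\n"
    s += "A: "
    s += Gen("answer")
    return s

interp = qa_program(InterpreterState(), "What is a KV cache?")
traced = qa_program(TracingState(), "What is a KV cache?")
print(interp.outputs["answer"])
print(traced.ops)
```

In the traced form, the whole program is visible before any model call happens, which is what gives a compiler-style mode room to optimize.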

Performance Benchmarks

To evaluate SGLang, the team implemented common LLM workloads and ran them with Llama-7B on a single NVIDIA A10G GPU and with Mixtral-8x7B on A10G GPUs using tensor parallelism across eight devices. The results:

Llama-7B (A10G, FP16, Tensor Parallelism=1): SGLang achieved up to 5 times higher throughput than existing systems such as Guidance and vLLM.
Mixtral-8x7B (A10G, FP16, Tensor Parallelism=8): Similar performance gains were observed, with SGLang outperforming other systems by a significant margin.

Use Cases

SGLang has been used to implement various LLM workloads, including:

Agent: Complex multi-step reasoning tasks.
Reasoning: Logical and mathematical problem-solving.
Extraction: Information extraction from unstructured data.
Chat: Conversational agents.
Few-Shot Learning: Rapid adaptation to new tasks with minimal examples.

Conclusion

SGLang and RadixAttention represent a significant step forward in the efficient execution of complex LLM applications. By automating KV cache reuse and providing a flexible frontend DSL, SGLang improves both performance and controllability. The code is open source and accompanied by a tech report, which should encourage community adoption and further work.