
Researchers at LMSys have unveiled SGLang, a structured generation language that speeds up Large Language Model programs through RadixAttention, a technique that automatically reuses the key-value (KV) cache across prompts that share a prefix.
Large Language Models (LLMs) have become indispensable for a wide range of tasks, from chatbots to complex reasoning systems. However, the efficiency of these models in real-world applications often falls short due to the lack of optimized execution frameworks. To tackle this, researchers at LMSys have introduced SGLang, a Structured Generation Language designed to improve both the performance and controllability of LLMs. SGLang combines two pieces: RadixAttention, a technique for automatic KV cache reuse, and a flexible domain-specific language (DSL) embedded in Python.
During the development of SGLang, the team identified KV cache reuse as a critical optimization opportunity. In many LLM applications, different prompts share a common prefix, such as the same system prompt or the same few-shot examples. The KV cache computed for that prefix can be reused across those prompts, which avoids recomputing it and avoids storing duplicate copies in memory. However, existing systems either lack this capability or require manual configuration, which is impractical for diverse use cases.
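To make the prefix-sharing idea concrete, here is a minimal sketch in Python. It is not SGLang's implementation: it uses a plain trie with one node per token rather than a compressed radix tree, it has no eviction policy, and the class names (`TrieNode`, `PrefixCache`), the token IDs, and the string placeholders standing in for KV tensors are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrieNode:
    # Children keyed by the next token ID.
    children: Dict[int, "TrieNode"] = field(default_factory=dict)
    # Placeholder for the cached KV entry of the token leading to this node.
    kv: Optional[str] = None


class PrefixCache:
    """Toy prefix cache (hypothetical): one trie node per token, no eviction."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV entries."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched

    def insert(self, tokens: List[int]) -> int:
        """Cache a full sequence; return how many tokens had to be computed."""
        node = self.root
        computed = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # A real system would store attention key/value tensors here.
                child = TrieNode(kv=f"kv({tok})")
                node.children[tok] = child
                computed += 1
            node = child
        return computed


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]            # made-up token IDs shared by both requests
first = system_prompt + [10, 11]
second = system_prompt + [20, 21, 22]

computed_first = cache.insert(first)          # nothing cached yet: all 6 tokens computed
reused_second = cache.match_prefix(second)    # the 4 shared prefix tokens are found
computed_second = cache.insert(second)        # only the 3 new suffix tokens are computed
```

In this toy run the second request pays only for its three new tokens, because the four-token shared prefix is already in the tree. A production cache would also need to bound memory, for instance by evicting entries that have not been used recently, which this sketch leaves out.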
The frontend of SGLang is designed to be flexible and user-friendly. It provides a Python-embedded DSL that allows developers to control the generation process in two modes:

To evaluate the effectiveness of SGLang, the team implemented common LLM workloads using the Llama-7B and Mixtral-8x7B models on NVIDIA A10G GPUs. The results are compelling:
SGLang has been used to implement various LLM workloads, including:
SGLang and RadixAttention represent a significant step forward in the efficient execution of complex LLM applications. By addressing the critical issue of KV cache reuse and providing a flexible frontend DSL, SGLang enhances both performance and controllability. The open-source release of the code and tech report further encourages community adoption and innovation.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
18 January 2024