
ThunderMLA targets kernel-launch overhead in LLM inference, delivering a 20-35% speedup over DeepSeek’s FlashMLA by fusing attention decode into a single "megakernel".
If you’re working on large language model (LLM) inference, you’ve likely encountered the challenges of variable-length sequences and batched requests. DeepSeek’s FlashMLA was a significant step forward in this domain, but we at Stanford Hazy Research decided to push it even further. Introducing ThunderMLA: a fully fused "megakernel" for decode that delivers 20-35% better performance on diverse workloads.
Kernel launches can be the natural predator of performance in attention decoding, especially when dealing with variable-prompt workloads. Prefill operations usually allow for efficient GPU utilization by parallelizing across batch and sequence lengths. However, during decode, where you’re often handling small batches and a few queries at a time, maintaining high GPU efficiency becomes tricky.
The core issues are:
Consider a decode scenario with imbalanced inputs:
On an SXM H100, FlashMLA takes 52 microseconds to run, achieving only 144 TFLOPS and 1199 GB/s. That is far below NVIDIA’s advertised peaks of 989 TFLOPS (dense BF16) and roughly 3300 GB/s.
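To put those figures in perspective, the short Python sketch below works out what fraction of the hardware peak FlashMLA reaches on this workload. The helper name `fraction_of_peak` is hypothetical, and the compute peak of 989 TFLOPS is the H100 SXM dense BF16 datasheet figure, used here as an assumption; swap in whichever peak you want to compare against.

```python
def fraction_of_peak(achieved: float, peak: float) -> float:
    """Return achieved throughput as a fraction of the advertised peak."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return achieved / peak


# Assumed peaks for an SXM H100 (not measured here).
PEAK_TFLOPS = 989.0   # dense BF16 datasheet figure (assumption)
PEAK_GBPS = 3300.0    # memory bandwidth as quoted in the article

# FlashMLA figures quoted in the article for the imbalanced decode workload.
flashmla_compute = fraction_of_peak(144.0, PEAK_TFLOPS)
flashmla_bandwidth = fraction_of_peak(1199.0, PEAK_GBPS)

print(f"compute utilisation:   {flashmla_compute:.1%}")
print(f"bandwidth utilisation: {flashmla_bandwidth:.1%}")
```

Under these assumptions FlashMLA lands at roughly 15% of peak compute and a little over a third of peak bandwidth, which is the gap the rest of this piece is about.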
ThunderMLA addresses these issues with a fully fused megakernel that runs the whole attention decode step in a single, optimized launch. Here’s how it works:

On the same workload, ThunderMLA runs in just 41 microseconds, achieving 183 TFLOPS and 1520 GB/s. That is about 27% higher throughput than FlashMLA on this workload (52/41 ≈ 1.27), within the 20-35% range we see across diverse workloads.
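The arithmetic behind that comparison is easy to check. The sketch below uses only the timings and throughput figures quoted in this article; the variable names are illustrative.

```python
# Figures quoted in the article for the same imbalanced decode workload.
flashmla_us, thundermla_us = 52.0, 41.0          # runtime in microseconds
flashmla_tflops, thundermla_tflops = 144.0, 183.0
flashmla_gbps, thundermla_gbps = 1199.0, 1520.0

# Speedup expressed as a throughput ratio (old time / new time).
speedup = flashmla_us / thundermla_us

# The same change expressed as a reduction in wall-clock time.
time_saved = 1.0 - thundermla_us / flashmla_us

# Throughput ratios should agree with the timing ratio if the work is identical.
tflops_ratio = thundermla_tflops / flashmla_tflops
bandwidth_ratio = thundermla_gbps / flashmla_gbps

print(f"speedup:         {speedup:.3f}x")
print(f"time saved:      {time_saved:.1%}")
print(f"TFLOPS ratio:    {tflops_ratio:.3f}x")
print(f"bandwidth ratio: {bandwidth_ratio:.3f}x")
```

Note the two ways of quoting the gain: the kernel does about 27% more work per unit time, which corresponds to about 21% less wall-clock time. The TFLOPS and bandwidth ratios match the timing ratio to within rounding, as expected when both kernels do the same work.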
To achieve these results, we made several key changes:
Here’s a brief overview of the architecture:
While this release is focused on attention decoding, we believe these techniques can be applied more broadly. The idea of fusing operations and optimizing scheduling has potential applications in various areas of LLM inference and beyond.
If you’re interested in trying out ThunderMLA, you can find the code here. Note that this release is more of an experimental prototype, but we’ve been surprised to see it already in use in production environments.
ThunderMLA represents a significant step forward in optimizing LLM inference performance. By fusing attention decode into a single megakernel, we’ve achieved substantial performance gains on variable-prompt workloads. We’re excited to see how these techniques evolve.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
7 March 2025