ThunderMLA: A 20-35% Performance Boost for LLM Inference with Fused Megakernels

The Engineer

7 Mar 2025

ThunderMLA targets the kernel-launch bottleneck in LLM inference. It builds on DeepSeek’s FlashMLA with a fused "megakernel" design and runs decode 20-35% faster.

If you’re working on large language model (LLM) inference, you’ve likely encountered the challenges of variable-length sequences and batched requests. DeepSeek’s FlashMLA was a significant step forward in this domain, but we at Stanford Hazy Research decided to push it even further. Introducing ThunderMLA: a fully fused "megakernel" for decode that delivers 20-35% better performance on diverse workloads.

The Problem with Kernel Launches

Kernel launches can be the natural predator of performance in attention decoding, especially when dealing with variable-prompt workloads. Prefill operations usually allow for efficient GPU utilization by parallelizing across batch and sequence lengths. However, during decode, where you’re often handling small batches and a few queries at a time, maintaining high GPU efficiency becomes tricky.

The core issues are:

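Two Separate Kernel Launches: Running separate prefill and decode kernels means paying launch, setup, and teardown costs twice, which is a noticeable share of a kernel that only runs for tens of microseconds.
Tail Effects: Near the end of a kernel, the last few thread blocks keep only part of the GPU busy while the rest sits idle, and the next kernel cannot start until they finish. With two kernels, that idle tail is paid twice.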
Limited Batch Sizes: Large reasoning models (which seem to be the future of AI) often require smaller batch sizes, further complicating efficient parallelization.
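
To make the tail effect and launch overhead concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: it assumes equal-duration thread blocks, one block per SM at a time, and 132 SMs (the SM count of an SXM H100). The function names and the launch cost used in the note below are illustrative assumptions, not measurements from FlashMLA or ThunderMLA.

```python
import math


def wave_utilization(num_blocks: int, num_sms: int = 132) -> float:
    """Fraction of SM time slots doing useful work when equal-sized
    thread blocks run in waves of at most num_sms blocks."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    waves = math.ceil(num_blocks / num_sms)
    return num_blocks / (waves * num_sms)


def launch_overhead_fraction(num_launches: int, launch_us: float, work_us: float) -> float:
    """Share of end-to-end time spent on launches, using a simple
    additive model: total = num_launches * launch_us + work_us."""
    total_us = num_launches * launch_us + work_us
    return (num_launches * launch_us) / total_us
```

Under this model, 133 blocks on 132 SMs need two waves, so utilization drops to about 50%. With a hypothetical 5 microseconds per launch, two launches around 40 microseconds of useful work would spend 20% of the total time on launch overhead.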

A Real-World Example

Consider a decode scenario with imbalanced inputs:

Batch Size: 4 prompts with lengths [4641, 45118, 1730, 1696]
Tokens to Generate: 4 new tokens (e.g., for speculation)
Tensor Parallelism: 8-way (16 heads per GPU for DeepSeek R1)
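
A few lines of Python show how lopsided this batch is. The prompt lengths are the ones listed above; the helper name and the returned keys are illustrative.

```python
prompt_lengths = [4641, 45118, 1730, 1696]


def imbalance(lengths):
    """Summarize how unevenly cached tokens are spread across a batch."""
    total = sum(lengths)
    longest = max(lengths)
    mean = total / len(lengths)
    return {
        "total_tokens": total,
        "longest_share": longest / total,  # fraction of all tokens in the longest prompt
        "max_over_mean": longest / mean,   # 1.0 would be a perfectly balanced batch
    }


stats = imbalance(prompt_lengths)
```

The longest prompt holds roughly 85% of the 53185 cached tokens, about 3.4 times the per-prompt average, so any split that hands one whole sequence to one unit of work leaves most of the GPU waiting on that single sequence.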

On an SXM H100, FlashMLA takes 52 microseconds to run, achieving only 144 TFLOPS and 1199 GB/s. This is far from the 989 TFLOPS and 3300 GB/s that NVIDIA advertises.

Introducing ThunderMLA

ThunderMLA addresses these issues with a fully fused megakernel that combines prefill and decode operations into a single, optimized launch. Here’s how it works:

Fused Kernel: By fusing the prefill and decode stages, we eliminate the overhead of multiple kernel launches.
Efficient Scheduling: We use advanced scheduling techniques to better manage the workload, especially for small batches and variable lengths.
Optimized Memory Access: The megakernel is designed to minimize memory latency and maximize bandwidth utilization.

On the same workload, ThunderMLA runs in just 41 microseconds, achieving 183 TFLOPS and 1520 GB/s. That is about 27% faster than FlashMLA on this workload (52 / 41 ≈ 1.27), in the middle of the 20-35% range we see across workloads.
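
As a quick check on the numbers quoted above, the arithmetic can be reproduced in Python. The measurements and the 3300 GB/s peak are the figures from this post; the dictionary and function names are illustrative.

```python
flash = {"time_us": 52.0, "tflops": 144.0, "gb_per_s": 1199.0}
thunder = {"time_us": 41.0, "tflops": 183.0, "gb_per_s": 1520.0}

PEAK_GB_PER_S = 3300.0  # advertised H100 memory bandwidth, as quoted in this post


def speedup(baseline_us: float, new_us: float) -> float:
    """How many times faster the new kernel is than the baseline."""
    return baseline_us / new_us


def time_saved_fraction(baseline_us: float, new_us: float) -> float:
    """Fraction of the baseline runtime that the new kernel removes."""
    return 1.0 - new_us / baseline_us


def bandwidth_utilization(gb_per_s: float, peak: float = PEAK_GB_PER_S) -> float:
    """Achieved memory bandwidth as a fraction of the advertised peak."""
    return gb_per_s / peak


print(f"speedup:    {speedup(flash['time_us'], thunder['time_us']):.2f}x")
print(f"time saved: {time_saved_fraction(flash['time_us'], thunder['time_us']):.1%}")
print(f"bandwidth:  {bandwidth_utilization(flash['gb_per_s']):.1%} -> "
      f"{bandwidth_utilization(thunder['gb_per_s']):.1%} of peak")
```

The same ratio of roughly 1.27 shows up in the runtime, the TFLOPS, and the GB/s figures, which is what you would expect when the same work finishes in less time. Note that "27% faster" and "21% less time" describe the same result.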

Implementation Details

To achieve these results, we made several key changes:

Kernel Fusion: We fused the prefill and decode stages into a single CUDA kernel.
Advanced Scheduling: We implemented a scheduler that dynamically adjusts to the workload, ensuring optimal GPU utilization.
Memory Optimization: We optimized memory access patterns to reduce latency and increase bandwidth.
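
The scheduler itself is not described in detail here, so the following Python sketch is only an illustration of the general idea, not ThunderMLA's actual algorithm. It assumes cost is proportional to the number of cached tokens, splits each sequence into fixed-size chunks, and assigns chunks greedily (longest first) to the least-loaded of 132 workers, one per SM on an H100. The chunk size of 256 and all names are hypothetical, and the cost of merging per-chunk results is ignored.

```python
import heapq


def split_into_chunks(lengths, chunk_size):
    """Split each sequence's cached length into (seq_index, chunk_len) work items."""
    items = []
    for seq, n in enumerate(lengths):
        for start in range(0, n, chunk_size):
            items.append((seq, min(chunk_size, n - start)))
    return items


def greedy_schedule(items, num_workers):
    """Longest-first greedy assignment: each item goes to the currently
    least-loaded worker. Returns the per-worker load in tokens."""
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    loads = [0] * num_workers
    for _seq, cost in sorted(items, key=lambda item: item[1], reverse=True):
        load, w = heapq.heappop(heap)
        loads[w] = load + cost
        heapq.heappush(heap, (loads[w], w))
    return loads


lengths = [4641, 45118, 1730, 1696]
whole = greedy_schedule(list(enumerate(lengths)), num_workers=132)
chunked = greedy_schedule(split_into_chunks(lengths, chunk_size=256), num_workers=132)

print("busiest worker, one item per sequence:", max(whole))
print("busiest worker, 256-token chunks:     ", max(chunked))
```

With one work item per sequence, the busiest worker has to process all 45118 tokens of the longest prompt while 128 workers sit idle. With 256-token chunks, the busiest worker handles 512 tokens, close to the ideal of about 403 tokens per worker.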

Here’s a brief overview of the architecture:

Prefill Stage: Handles the initial computation for each prompt.
Decode Stage: Generates new tokens based on the prefill results.
Fusion Layer: Combines the outputs of both stages into a single, efficient kernel launch.
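
This overview does not spell out how outputs are combined. When attention over a long KV cache is split into chunks, one standard way to merge the per-chunk results is to rescale each by its log-sum-exp. The pure-Python sketch below illustrates that general technique for a single query; it is an assumption about what "combining" can involve, not code from ThunderMLA, and all names are illustrative.

```python
import math


def partial_attention(scores, values):
    """Softmax-weighted average of values over one chunk.
    Returns (normalized output vector, log-sum-exp of the scores)."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def merge_partials(partials):
    """Combine per-chunk (output, lse) pairs into the result that a
    single softmax over all chunks would have produced."""
    lse_max = max(lse for _, lse in partials)
    scales = [math.exp(lse - lse_max) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [
        sum(s * out[d] for s, (out, _) in zip(scales, partials)) / total
        for d in range(dim)
    ]


# Toy example: five keys, two-dimensional values, split into chunks of 2 and 3.
scores = [0.5, -1.0, 2.0, 0.1, 1.5]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, -0.7]]

full, _ = partial_attention(scores, values)
merged = merge_partials([
    partial_attention(scores[:2], values[:2]),
    partial_attention(scores[2:], values[2:]),
])
```

Here `merged` matches `full` up to floating-point rounding, which is why chunks of one sequence can be processed independently and combined afterwards.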

Beyond Attention Decoding

While this release is focused on attention decoding, we believe these techniques can be applied more broadly. The idea of fusing operations and optimizing scheduling has potential applications in various areas of LLM inference and beyond.

Getting Started

If you’re interested in trying out ThunderMLA, you can find the code here. Note that this release is more of an experimental prototype, but we’ve been surprised to see it already in use in production environments.

Conclusion

ThunderMLA represents a significant step forward in optimizing LLM inference performance. By fusing prefill and decode operations into a single megakernel, we’ve achieved 20-35% faster decode on variable-prompt workloads. We’re excited to see how these techniques evolve.