ThunderKittens 2.0: Faster Kernels, Talking Models, and More Adorable Kittens

The Engineer

31 Oct 2024

ThunderKittens returns with a purr-fectly updated version 2.0, boasting faster kernels, chatty AI models, and more adorable feline features to delight developers and cat lovers alike.

Five months ago, we introduced ThunderKittens and our GPU optimizations, which received a warm welcome on social media. Today, we're back with an even more powerful and adorable update: ThunderKittens 2.0. This release brings significant performance improvements, new kernels, and some fun additions like talking models and, of course, cuter kittens.

New Kernels for Enhanced Performance

The primary goal of ThunderKittens has always been to facilitate the development and research of high-performance GPU kernels. We’ve added several new kernels that outperform existing implementations:

Fused Mamba-2: Thanks to more aggressive kernel fusion, this kernel is up to 3x faster than the current Triton implementation. It uses a slightly different layout from the standard Triton version, which may be beneficial in some scenarios.
Long Convolutions: At sequence length 4096, we achieve up to a 9x speedup over the FlashFFTConv implementation. This is particularly useful for long-sequence tasks.
Linear Attention: Our linear attention kernels are significantly faster:
- LoLCATs Hedgehog Linear Attention: 14x faster than the Flash Linear Attention Triton implementations.
- Based Linear Attention: 6.5x faster, thanks to optimized register usage and H100 features.
RoPE, LayerNorm, Linear Layers: These kernels are competitive with, and sometimes faster than, existing implementations, while remaining readable and concise.
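
For readers less familiar with the term, the sketch below illustrates what a causal linear attention layer computes. It is a plain-Python reference, not the ThunderKittens kernel: the feature map (elu(x) + 1), the function names, and the omission of the usual normalization term are all assumptions chosen for illustration, and Hedgehog and Based each use different feature maps. The point is only that the quadratic form and the running-state form give the same result.

```python
import math


def feature_map(x):
    # Hypothetical positive feature map, elu(x) + 1, applied elementwise.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention_quadratic(Q, K, V):
    """O(N^2) reference: out[t] = sum over s <= t of (phi(q_t) . phi(k_s)) * v_s."""
    out = []
    for t in range(len(Q)):
        q = feature_map(Q[t])
        acc = [0.0] * len(V[0])
        for s in range(t + 1):
            k = feature_map(K[s])
            weight = sum(a * b for a, b in zip(q, k))
            for j, v in enumerate(V[s]):
                acc[j] += weight * v
        out.append(acc)
    return out


def causal_linear_attention_recurrent(Q, K, V):
    """O(N) form: carry a d_k x d_v state S = sum over s <= t of phi(k_s) v_s^T."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_raw, k_raw, v in zip(Q, K, V):
        q, k = feature_map(q_raw), feature_map(k_raw)
        # Fold the current key/value pair into the running state.
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        # Read the output for this position from the state.
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out
```

Both functions return the same outputs. The recurrent form does work proportional to the sequence length because it carries a fixed-size state instead of revisiting every earlier position.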

Talking Models and Adorable Demos

One of the most exciting additions in this release is the integration of talking models. We’ve made it easier to run and train large language models using ThunderKittens:

Llama3 8B and Qwen 2.5 7B: TK kernels now support these models, with demo scripts available for easy setup. To try one, navigate to demos/llama_demo and run bash demo_8b.sh. We’ve also added example training integrations with nanoGPT and PyTorch Lightning, both of which complete training runs successfully.

LoLCATs: Following up on our recent LoLCATs work, we’ve included a forward prefill kernel and a demo integration. Run cd demos/lolcats_demo && bash demo_8b.sh to get started.

Performance Benchmarks

To give you a better idea of the performance gains, here are some key benchmarks:

Fused Mamba-2: Up to 3x faster than Triton.
Long Convolutions: Up to 9x speedup over FlashFFTConv at sequence length 4096.
Linear Attention (LoLCATs Hedgehog): 14x faster than the Flash Linear Attention Triton implementations.
Linear Attention (Based): 6.5x faster.

Under the Hood

The performance improvements in ThunderKittens 2.0 are achieved through several key optimizations:

Kernel Fusions: By fusing multiple operations into a single kernel, we avoid writing intermediate results to global memory and reading them back, which reduces memory bandwidth usage and the number of kernel launches.
Register Optimization: Careful management of GPU registers keeps frequently used data in the fastest storage on the chip, reducing access latency.
H100 Features: Leveraging the latest H100 GPU features allows us to take full advantage of modern hardware capabilities.
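
To make the kernel-fusion point concrete, here is a rough back-of-the-envelope model of global-memory traffic for a chain of elementwise operations. It is a sketch under stated assumptions rather than a measurement: the tensor size, the three-op chain, the 2-byte element size, and the function names are all hypothetical, and the model ignores caches and assumes intermediates fit in registers.

```python
def unfused_traffic_bytes(num_elements, num_ops, bytes_per_element=2):
    """Global-memory traffic when each elementwise op runs as its own kernel.

    Every kernel reads its input from global memory and writes its output
    back, so each intermediate makes a full round trip.
    """
    return num_ops * 2 * num_elements * bytes_per_element


def fused_traffic_bytes(num_elements, bytes_per_element=2):
    """Traffic when all ops run in one kernel: one read and one write.

    Assumes intermediates never leave registers.
    """
    return 2 * num_elements * bytes_per_element


if __name__ == "__main__":
    n = 4096 * 4096  # hypothetical activation tensor, 2 bytes per element
    ops = 3          # hypothetical chain, e.g. scale, bias add, activation
    unfused = unfused_traffic_bytes(n, ops)
    fused = fused_traffic_bytes(n)
    print(f"unfused: {unfused / 2**20:.0f} MiB")          # unfused: 192 MiB
    print(f"fused:   {fused / 2**20:.0f} MiB")            # fused:   64 MiB
    print(f"traffic reduction: {unfused / fused:.1f}x")   # traffic reduction: 3.0x
```

In this simplified model the saving grows with the number of fused operations, which is why memory-bound chains of small operations tend to benefit most from fusion.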

Conclusion

ThunderKittens 2.0 represents a significant step forward in GPU kernel optimization and machine learning research. Whether you're looking to speed up your models, train large language models more efficiently, or just enjoy some adorable kitten demos, this release has something for everyone. We’re excited to see what the community does with these new tools and look forward to continuing our efforts to push the boundaries of compute performance.