
ThunderKittens returns with a purr-fectly updated version 2.0, boasting faster kernels, chatty AI models, and more adorable feline features to delight developers and cat lovers alike.
Five months ago, we introduced ThunderKittens and our GPU optimizations, which received a warm welcome on social media. Today, we're back with an even more powerful and adorable update: ThunderKittens 2.0. This release brings significant performance improvements, new kernels, and some fun additions like talking models and, of course, cuter kittens.
The primary goal of ThunderKittens has always been to facilitate the development and research of high-performance GPU kernels. We’ve added several new kernels that outperform existing implementations:
Fused Mamba-2: This kernel is several times faster than the current Triton implementation, thanks to more aggressive kernel fusions. It uses a slightly different layout compared to the standard Triton version, which might be beneficial in certain scenarios.
Long Convolutions: At sequence lengths of 4096, we achieve up to 9x speedup over the FlashFFTConv implementation. This is particularly useful for long-sequence tasks.
Linear Attention: Our linear attention kernels are significantly faster than existing implementations.
Rope, LayerNorm, Linear Layers: These kernels are competitive with or sometimes faster than existing implementations, while maintaining readability and conciseness.
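To make the linear attention entry above more concrete, here is a minimal pure-Python sketch of what a causal linear attention kernel computes. It is an illustration under stated assumptions, not ThunderKittens code: it uses generic linear attention with no feature map and no normalization, and the function names `linear_attention_quadratic` and `linear_attention_recurrent` are hypothetical. The point is that the quadratic form and the running-state form give the same output, and the running-state form is the one a fast kernel can exploit.

```python
# Reference semantics only: generic causal linear attention, no feature map,
# no normalization. Not the ThunderKittens implementation or API.

def linear_attention_quadratic(q, k, v):
    """O(n^2) form: o_t = sum over s <= t of (q_t . k_s) * v_s."""
    n, d_v = len(q), len(v[0])
    out = []
    for t in range(n):
        o = [0.0] * d_v
        for s in range(t + 1):  # causal: only positions up to t
            score = sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                o[j] += score * v[s][j]
        out.append(o)
    return out


def linear_attention_recurrent(q, k, v):
    """O(n) form: carry a d_k x d_v state S_t = S_{t-1} + k_t v_t^T, then o_t = q_t^T S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        # Rank-1 update of the running state with the outer product k_t v_t^T.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        # Read out with the current query.
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


if __name__ == "__main__":
    import random

    random.seed(0)
    n, d_k, d_v = 8, 4, 3
    q = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
    k = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
    v = [[random.uniform(-1, 1) for _ in range(d_v)] for _ in range(n)]

    a = linear_attention_quadratic(q, k, v)
    b = linear_attention_recurrent(q, k, v)
    assert all(abs(x - y) < 1e-9 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print("quadratic and recurrent forms agree")
```

The recurrent form keeps a fixed-size state regardless of sequence length, which is why linear attention scales linearly in the number of tokens. A GPU kernel would typically process the sequence in chunks rather than one token at a time, but the result it has to reproduce is the same.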
One of the most exciting additions in this release is the integration of talking models. We’ve made it easier to run and train large language models using ThunderKittens:
Go to demos/llama_demo and run bash demo_8b.sh. We’ve also added example training integrations with nanoGPT and PyTorch Lightning, and confirmed that training runs complete successfully.
Run cd demos/lolcats_demo && bash demo_8b.sh to get started.
To give you a better idea of the performance gains, here are some key benchmarks:
The performance improvements in ThunderKittens 2.0 are achieved through several key optimizations:
ThunderKittens 2.0 represents a significant step forward in GPU kernel optimization and machine learning research. Whether you're looking to speed up your models, train large language models more efficiently, or just enjoy some adorable kitten demos, this release has something for everyone. We’re excited to see what the community does with these new tools and look forward to continuing our efforts to push the boundaries of compute performance.
Original Sources
↗ https://hazyresearch.stanford.edu/blog/2024-10-29-tk2?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
31 October 2024