
AI-generated Metal kernels sped up PyTorch inference on Apple devices by 1.87x across KernelBench v0 and 1.24x across KernelBench v0.1.
tl;dr: Our team at Gimlet Labs explored whether advanced AI models can automatically write optimized GPU kernels for Apple devices to boost inference performance. The results were impressive: our AI-generated Metal kernels were 1.24x faster across KernelBench v0.1 and 1.87x faster across KernelBench v0.
When it comes to running AI models on hardware, the efficiency of GPU kernels is crucial. These kernels define each operation, and their optimization can significantly impact how fast models run during both training and inference. Recent advancements like FlashAttention [1] have shown dramatic speedups over baseline implementations, highlighting the importance of performant kernels.
While tools like PyTorch and [torch.compile](/articles/mastering-torchcompile-a-developers-guide-to-pytorch-performance-optimization) [2] handle some kernel optimizations, a significant performance gap remains that only hand-tuned kernels currently close. Writing these optimized kernels is challenging and time-consuming, requiring deep expertise in GPU programming. The challenge is even more pronounced on non-CUDA platforms, where expertise is scarce and tooling is limited.
We aimed to answer a straightforward question: can advanced AI models automatically implement kernel optimizations across different backends? Given that billions of Apple devices rely on Metal kernels, which are often under-optimized, we started with this platform.
Our vision was clear: autonomous kernel optimization for any target platform using frontier AI models. The results were promising:

Our AI models surfaced and eliminated redundant operations that PyTorch missed. Removing that unnecessary computation made the resulting kernels faster.
Giving the AI models performance profiling information and CUDA reference code helped guide them toward better-optimized Metal kernels. With this guidance, the generated kernels were not only faster but also adhered to best practices.
We found that a simple agentic swarm of AI models outperformed individual frontier models. The collective intelligence and diverse perspectives of multiple models led to better optimization outcomes.
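One simple way to read "swarm" here is best-of-N selection: each model proposes a kernel, every candidate is checked for correctness and timed, and the fastest correct one is kept, with the PyTorch baseline as the fallback. The sketch below assumes that reading. The `Candidate` record, the model names, and the timings are all hypothetical and are not taken from the experiments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One generated kernel, as reported by a hypothetical evaluation harness."""
    model: str       # which model produced the kernel
    correct: bool    # did its output match the PyTorch reference?
    time_ms: float   # measured runtime in milliseconds


def pick_best(candidates: list[Candidate], baseline_ms: float) -> Optional[Candidate]:
    """Return the fastest correct candidate that beats the baseline, else None."""
    valid = [c for c in candidates if c.correct and c.time_ms < baseline_ms]
    return min(valid, key=lambda c: c.time_ms) if valid else None


# Illustrative numbers only.
candidates = [
    Candidate("model_a", correct=True, time_ms=1.9),
    Candidate("model_b", correct=False, time_ms=0.4),  # fast but wrong: rejected
    Candidate("model_c", correct=True, time_ms=1.2),
]
best = pick_best(candidates, baseline_ms=2.0)
print(best.model if best else "baseline")  # model_c
```

Returning `None` when nothing beats the baseline keeps the fallback explicit: a caller would keep the stock PyTorch implementation for that problem rather than ship a slower or incorrect kernel.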
The initial version of this blog focused on results from KernelBench v0, an earlier benchmark. We have since run our experiments on KernelBench v0.1, which includes improvements such as larger shape sizes.
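A headline figure such as "1.24x across KernelBench v0.1" is an aggregate over per-problem speedups. The following is a minimal sketch of how such an aggregate could be computed, assuming a per-problem ratio of baseline time to generated-kernel time and a fallback of 1.0x wherever no correct kernel was produced. The averaging rule and the timings are illustrative assumptions, not the benchmark's documented method or measured results.

```python
from statistics import mean, geometric_mean

# Hypothetical (baseline_ms, generated_ms) pairs. None means no correct kernel
# was produced, so the baseline is kept and the speedup counts as 1.0x.
timings = [(2.0, 1.0), (3.0, 3.0), (4.0, None), (1.5, 3.0)]


def speedup(baseline_ms, generated_ms):
    """Per-problem speedup; values above 1.0 mean the generated kernel is faster."""
    if generated_ms is None:
        return 1.0
    return baseline_ms / generated_ms


speedups = [speedup(b, g) for b, g in timings]
arithmetic = mean(speedups)
geometric = geometric_mean(speedups)

print(speedups)    # [2.0, 1.0, 1.0, 0.5]
print(arithmetic)  # 1.125
print(geometric)
```

The arithmetic and geometric means can differ noticeably when a few problems show large speedups or slowdowns, so it helps to know which one a reported figure uses.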
Our research demonstrates the potential of AI-generated Metal kernels to significantly speed up PyTorch inference on Apple devices. By leveraging advanced AI models, we can achieve performance gains without the need for extensive kernel engineering expertise. This approach opens new possibilities for optimizing AI models on a wide range of hardware platforms.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
4 September 2025