Accelerating PyTorch Inference on Apple Devices with AI-Generated Metal Kernels

The Engineer

4 Sept 2025 · 3 min read

AI-generated Metal kernels for PyTorch on Apple devices outperform baseline PyTorch, with average speedups of 1.87x on KernelBench v0 and 1.24x on KernelBench v0.1.

Speeding up PyTorch Inference on Apple Devices with AI-Generated Metal Kernels

tl;dr: Our team at Gimlet Labs explored whether advanced AI models can automatically write optimized GPU kernels for Apple devices to boost inference performance. Compared to baseline PyTorch, our AI-generated Metal kernels were on average 1.24x faster across KernelBench v0.1 and 1.87x faster across KernelBench v0.

Why Use AI to Generate Kernels for Apple Devices?

When it comes to running AI models on hardware, the efficiency of GPU kernels is crucial. These kernels define each operation, and their optimization can significantly impact how fast models run during both training and inference. Recent advancements like FlashAttention [1] have shown dramatic speedups over baseline implementations, highlighting the importance of performant kernels.

While tools like PyTorch and torch.compile [2] handle some kernel optimizations, there is still a significant performance gap that only hand-tuned kernels close. Writing these optimized kernels is challenging and time-consuming, requiring deep expertise in GPU programming. The challenge is even more pronounced on non-CUDA platforms, where expertise is scarce and tooling is limited.

Our Approach: Autonomous Kernel Optimization Using AI

We aimed to answer a straightforward question: can advanced AI models automatically implement kernel optimizations across different backends? Given that billions of Apple devices rely on Metal kernels, which are often under-optimized, we started with this platform.

Our vision was clear: autonomous kernel optimization for any target platform using frontier AI models. The results were promising:

Performance Gains: Across 215 PyTorch modules, the generated kernels ran 87% faster (a 1.87x speedup) on Apple hardware compared to baseline PyTorch.
No Expertise Required: This approach requires no expertise in kernel engineering and can be done nearly instantly.
Algorithmic Efficiency: The models identified and removed algorithmically unnecessary work that PyTorch didn't catch.

Key Findings

1. Algorithmic Optimization

Our AI models surfaced and eliminated redundant operations that PyTorch missed. The resulting kernels are faster because they do less work, not only because the remaining work is better tuned.
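
As a hypothetical illustration of the kind of redundancy meant here (not a case reported from the experiments): softmax preserves the ordering of its inputs, so an argmax taken after a softmax returns the same index as an argmax over the raw logits. A kernel that notices this can skip the exponentials and the normalization entirely. A minimal pure-Python sketch, with `predict_naive` and `predict_fused` as made-up names:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the first largest element."""
    return max(range(len(values)), key=values.__getitem__)

def predict_naive(logits):
    # Computes the full softmax, then discards everything but the index.
    return argmax(softmax(logits))

def predict_fused(logits):
    # Softmax preserves ordering, so the exponentials and the division
    # are unnecessary when only the winning index is needed
    # (barring floating-point ties between nearly equal logits).
    return argmax(logits)

logits = [0.3, 2.5, -1.0, 1.7]
assert predict_naive(logits) == predict_fused(logits)
```

The two functions agree on the answer, but the second one never touches `math.exp`. On a GPU the equivalent rewrite removes a whole pass over the data.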

2. Impact of Performance Profiling and CUDA Reference Code

Incorporating performance profiling and providing CUDA reference code as a guide helped the AI models generate more optimized Metal kernels. The resulting kernels were not only faster but also followed established GPU programming practices more closely.

3. Agentic Swarm Dominance

We found that a simple agentic swarm of AI models outperformed individual frontier models. The collective intelligence and diverse perspectives of multiple models led to better optimization outcomes.
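
One simple way such a swarm could be orchestrated is best-of-N selection: every agent proposes a candidate, candidates that disagree with the reference output are discarded, and the fastest survivor wins. The sketch below is an assumption about the general shape, not the exact pipeline used here. Plain Python callables stand in for compiled Metal kernels, and `pick_best`, `median_runtime`, `is_correct`, the agent names, and the tolerance are all hypothetical.

```python
import math
import statistics
import time

def median_runtime(fn, arg, repeats=5):
    """Median wall-clock time of fn(arg) after one warm-up call."""
    fn(arg)  # warm-up run, not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def is_correct(candidate, reference, arg, tol=1e-6):
    """True if the candidate matches the reference output within tol."""
    try:
        got, want = candidate(arg), reference(arg)
        return len(got) == len(want) and all(
            math.isclose(g, w, rel_tol=tol, abs_tol=tol)
            for g, w in zip(got, want)
        )
    except Exception:
        return False  # a crashing candidate is simply rejected

def pick_best(candidates, reference, arg):
    """Return (name, runtime) of the fastest correct candidate, or None."""
    timed = [
        (median_runtime(fn, arg), name)
        for name, fn in candidates.items()
        if is_correct(fn, reference, arg)
    ]
    if not timed:
        return None
    runtime, name = min(timed)
    return name, runtime

# Stand-ins for the baseline op and three agents' proposals.
def reference(xs):
    return [x * 2.0 + 1.0 for x in xs]

def agent_a(xs):  # correct
    return [2.0 * x + 1.0 for x in xs]

def agent_b(xs):  # wrong result, must be rejected
    return [x * 2.0 for x in xs]

def agent_c(xs):  # fails outright, must be rejected
    raise RuntimeError("candidate failed to build")

data = [float(i) for i in range(1000)]
winner = pick_best(
    {"agent_a": agent_a, "agent_b": agent_b, "agent_c": agent_c},
    reference,
    data,
)
```

Checking correctness before timing matters: a fast kernel that returns the wrong answer should never be allowed to win the comparison.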

Update for KernelBench v0.1

The initial version of this post focused on results from KernelBench v0, an earlier version of the benchmark. We have since rerun our experiments on KernelBench v0.1, which includes improvements such as larger input shapes.

Performance Results

KernelBench v0.1: 1.24x faster than baseline PyTorch on average.
KernelBench v0 (original results): 1.87x faster than baseline PyTorch on average.
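
For readers reproducing numbers like these, the sketch below shows one way a per-problem speedup and its average could be computed. The timings are made-up placeholders, not measurements from KernelBench, and the arithmetic mean is an assumption, since the aggregation method is not specified here. The helper names `speedup` and `percent_faster` are hypothetical. The second helper also shows why "87% faster" and "1.87x" describe the same result.

```python
import statistics

def speedup(baseline_s, kernel_s):
    """How many times faster the generated kernel is than the baseline."""
    return baseline_s / kernel_s

def percent_faster(ratio):
    """Express a speedup ratio as a percentage gain, e.g. 1.87x as 87%."""
    return (ratio - 1.0) * 100.0

# Placeholder (baseline, generated) runtimes in seconds, one pair per problem.
timings = [(0.0040, 0.0020), (0.0100, 0.0080), (0.0030, 0.0030)]

ratios = [speedup(baseline, kernel) for baseline, kernel in timings]
mean_speedup = statistics.mean(ratios)  # arithmetic mean (assumed)

print(f"mean speedup: {mean_speedup:.2f}x")
print(f"equivalent gain: {percent_faster(mean_speedup):.0f}% faster")
```

A ratio of exactly 1.0 means the generated kernel only matched the baseline, so problems without any gain still pull the average down.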

Conclusion

Our research demonstrates the potential of AI-generated Metal kernels to significantly speed up PyTorch inference on Apple devices. By leveraging advanced AI models, we can achieve performance gains without the need for extensive kernel engineering expertise. This approach opens new possibilities for optimizing AI models on a wide range of hardware platforms.