ThunderMittens: Porting ThunderKittens to Apple Silicon for Efficient Edge AI

Tools & Engineering

The Engineer

29 Nov 2024

ThunderMittens ports ThunderKittens to Apple Silicon, targeting efficient machine learning kernels on chips like the M2 Pro.

With the increasing demand for on-edge training and inference, optimizing machine learning (ML) models for edge devices has become a critical challenge. Traditional data center GPUs offer massive compute power but are overkill for many edge use cases. Enter ThunderMittens, a project by HazyResearch that ports ThunderKittens (TK) to Apple Silicon using Metal Shading Language (MSL). This initiative aims to bring high-performance ML to the Apple M2 Pro, a chip with unique hardware properties that require a tailored approach.

Understanding the Apple M2 Pro

The Apple M2 Pro is notable for its high memory bandwidth relative to its compute throughput. It offers around 200 GB/s of memory bandwidth and approximately 6.5 TFLOPs of compute, or about 32.5 FLOPs per byte. For context, a consumer-grade NVIDIA RTX 4090 provides about 1000 GB/s of memory bandwidth and 82.58 TFLOPs of compute, or about 82.6 FLOPs per byte, a FLOPs-to-byte ratio roughly 2.5x that of the M2 Pro. In practice, this has several consequences on the M2 Pro:

High Memory Bandwidth: Because bandwidth is plentiful relative to compute, values can be loaded directly from global memory (loosely called HBM here, although the M2 Pro uses unified LPDDR5 rather than true High-Bandwidth Memory) into registers without relying heavily on shared memory.
Simplified Kernels: Simple kernels can perform well, often eliminating the need for complex producer/consumer asynchrony.
Limited Swizzling: ALU operations are too valuable to spend on the address arithmetic that swizzling requires, so padding is the more practical fix for bank conflicts.
bf16 Support: The compiler struggles to optimize bfloat16 (bf16) operations, necessitating manual optimizations such as loop unrolling via template metaprogramming.
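
The FLOPs-to-byte comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are illustrative, and the inputs are the approximate figures quoted in this article rather than measured values.

```cpp
#include <cassert>

// Machine balance in FLOPs per byte: peak compute divided by peak memory
// bandwidth. Inputs are in TFLOP/s and GB/s.
inline double flops_per_byte(double tflops, double gb_per_s) {
    assert(tflops > 0.0 && gb_per_s > 0.0);
    return (tflops * 1e12) / (gb_per_s * 1e9);
}

// How many times more compute per byte of bandwidth machine A has than
// machine B.
inline double balance_ratio(double tflops_a, double gb_per_s_a,
                            double tflops_b, double gb_per_s_b) {
    return flops_per_byte(tflops_a, gb_per_s_a) /
           flops_per_byte(tflops_b, gb_per_s_b);
}
```

Calling balance_ratio(82.58, 1000.0, 6.5, 200.0) compares the RTX 4090 figures against the M2 Pro figures and gives a value slightly above 2.5. A lower FLOPs-per-byte number means a kernel needs less data reuse per loaded byte before compute, rather than memory, becomes the bottleneck.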

Porting ThunderKittens to Metal

To port TK to MSL, the team had to address several hardware-specific challenges:

Memory Management: Given the high memory bandwidth, shared memory isn't as crucial. Direct loading from HBM into registers is often sufficient.
Kernel Simplicity: Simple kernels without producer/consumer asynchrony can achieve good performance.
Swizzling and Padding: Swizzling isn't worth its ALU cost when ALU throughput is the scarce resource, so padding is used to handle bank conflicts.
bf16 Optimization: The compiler's struggle with bf16 required creative solutions, such as loop unrolling via template metaprogramming, to ensure good performance.
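
To illustrate why padding resolves bank conflicts, here is a small host-side C++ simulation. It is a sketch under stated assumptions: the 32 banks of one word each are a hypothetical layout chosen for illustration, not a documented Apple GPU figure, and the function name is invented for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumed number of shared-memory banks (hypothetical, for illustration only).
constexpr int kBanks = 32;

// Simulates kBanks threads reading one column of a row-major tile, where
// thread `row` reads element (row, col). Returns the largest number of
// threads that land on the same bank; 1 means the access is conflict-free.
inline int max_conflict_for_column_read(int row_stride_words, int col) {
    assert(row_stride_words > 0 && col >= 0 && col < row_stride_words);
    std::array<int, kBanks> hits{};  // zero-initialized hit counters
    for (int row = 0; row < kBanks; ++row) {
        int word = row * row_stride_words + col;  // linear word address
        ++hits[word % kBanks];                    // bank serving this word
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With a row stride of 32 words, every thread in the column read maps to the same bank. Padding each row by one word (a stride of 33) spreads the same read across all 32 banks, at the cost of a little wasted memory and no extra index arithmetic beyond the changed stride.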

Performance Considerations

One of the key challenges was maintaining high occupancy, especially for FlashAttention (FA) kernels. For example, an FA kernel with head dimension D=128 needs more registers per thread, which reduces occupancy and can significantly hurt performance. The team had to carefully balance these factors to achieve good results.
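
The trade-off between register usage and occupancy can be sketched with a deliberately simple model in which the register file is the only limiting resource. The function name and every number used with it are hypothetical; they are not Apple GPU specifications.

```cpp
#include <algorithm>
#include <cassert>

// Fraction of the hardware thread limit that can be resident at once,
// assuming registers are the only constraint (a simplification).
inline double register_limited_occupancy(int regs_per_thread,
                                         int register_file_size,
                                         int max_threads) {
    assert(regs_per_thread > 0 && register_file_size > 0 && max_threads > 0);
    // Number of threads whose registers fit in the register file.
    int threads_that_fit = register_file_size / regs_per_thread;
    return static_cast<double>(std::min(threads_that_fit, max_threads)) /
           max_threads;
}
```

For a made-up core with a 65536-entry register file and a 1024-thread limit, a kernel using 64 registers per thread reaches full occupancy under this model, while one using 128 registers per thread drops to half. This is the kind of pressure a larger head dimension puts on an attention kernel.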

Implementation Details

Memory Bandwidth Utilization: Direct loading from HBM into registers is crucial for maintaining high throughput.
Kernel Optimization: Simplified kernels with minimal overhead are preferred to maximize performance.
Padding for Bank Conflicts: Padding helps avoid bank conflicts, ensuring efficient memory access.
bf16 Loop Unrolling via Template Metaprogramming: Unrolling loops at compile time helps the compiler generate better code for bf16 operations.
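
Loop unrolling via template metaprogramming can be sketched in a few lines. This is a generic host-side C++ illustration, not the ThunderMittens code: the names Unroll, DotStep, and dot_unrolled are invented for this example, and float stands in for bf16, which standard C++ lacks.

```cpp
#include <cstddef>

// Recursive template: the trip count N is a compile-time constant, so the
// compiler sees N separate calls body(0) ... body(N-1) that it can inline,
// instead of a runtime loop it must decide whether to unroll.
template <std::size_t I, std::size_t N>
struct Unroll {
    template <typename F>
    static void run(const F& body) {
        body(I);
        Unroll<I + 1, N>::run(body);
    }
};

// Base case: I == N ends the recursion.
template <std::size_t N>
struct Unroll<N, N> {
    template <typename F>
    static void run(const F&) {}
};

// One multiply-accumulate step of a dot product, written as a functor.
struct DotStep {
    const float* a;
    const float* b;
    float* acc;
    void operator()(std::size_t i) const { *acc += a[i] * b[i]; }
};

// Dot product over fixed-size arrays with the loop expanded at compile time.
template <std::size_t N>
float dot_unrolled(const float (&a)[N], const float (&b)[N]) {
    float acc = 0.0f;
    Unroll<0, N>::run(DotStep{a, b, &acc});
    return acc;
}
```

Because the array length is part of the type, dot_unrolled(a, b) on two float[4] arrays instantiates exactly four multiply-accumulate steps. Whether this produces better code than a plain loop depends on the compiler, which is the situation the article describes for bf16.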

Benchmarks and Results

The team conducted extensive benchmarks to evaluate the performance of ThunderMittens on the M2 Pro. The results showed that:

FlashAttention Kernels: For D=128, careful register management and occupancy optimization were essential to achieve acceptable performance.
General Performance: Simple and efficient kernels performed well, leveraging the high memory bandwidth and compute capabilities of the M2 Pro.

Conclusion

ThunderMittens is a significant step towards bringing high-performance ML to edge devices like the Apple M2 Pro. By addressing hardware-specific challenges and optimizing for performance, this project paves the way for more efficient on-edge training and inference. The insights gained from this port can inform future developments in edge AI, making it more accessible and powerful.