
Exploring the intricacies of high-performance matrix multiplication, this deep dive uncovers how to optimize SGEMM on GPUs using CUDA, surpassing standard library benchmarks.
When it comes to optimizing matrix multiplication (SGEMM) on GPUs, there is a significant gap between the theoretical knowledge found in books and blogs and the highly optimized implementations used in libraries like cuBLAS. This project, inspired by the work of Andrej Karpathy, George Hotz, Scott Gray, Horace He, Philippe Tillet, Jeremy Howard, Lei Mao, and the GPU MODE community, aims to bridge that gap. The code is available at sgemm.cu, and this article complements a detailed blog post on implementing FP32 matrix multiplication that outperforms BLAS libraries on modern Intel and AMD CPUs.
The goal of this project isn't to create an SGEMM that outperforms cuBLAS on all GPUs and matrix sizes. Given the existence of the open-source, lightweight CUTLASS library, such a goal would be impractical. Instead, this project targets CUDA learners by providing a clear, efficient implementation of SGEMM that can serve as a learning tool and a foundation for further optimization.
The core operation in SGEMM is defined as C := alpha * A * B + beta * C, where A and B are the input matrices, C is both an input and the output matrix, and alpha and beta are scalars. The implementation focuses on several key optimization techniques.
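As a baseline for the definition above, the operation can be written as a plain triple loop on the CPU. This is a minimal sketch, not the project's kernel: the function name `sgemm_reference`, the row-major layout, and the absence of transposition are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Reference SGEMM: C := alpha * A * B + beta * C.
// Assumptions (not taken from the article): row-major storage, no transposition,
// A is M x K, B is K x N, C is M x N, each stored contiguously.
void sgemm_reference(std::size_t M, std::size_t N, std::size_t K,
                     float alpha, const std::vector<float>& A,
                     const std::vector<float>& B,
                     float beta, std::vector<float>& C) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // Inner product of row i of A with column j of B.
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            // Scale the product and blend it with the existing value of C.
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

A reference like this is mainly useful for checking an optimized kernel: the two outputs can be compared element by element within a small tolerance, since the order of floating-point additions generally differs between implementations.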
The high-level algorithm design used in this project was developed by NVIDIA engineers and has been extensively studied in prior works on cuBLAS and CUTLASS. My main contribution was translating this design into efficient CUDA/PTX code. The implementation is expected to deliver high performance on Ada, Ampere, Volta, and Turing devices, with specific fine-tuning for the NVIDIA RTX 3090 (GA102 chip).

The SGEMM implementation was benchmarked on an NVIDIA RTX 3090 with both locked and unlocked GPU core frequencies, and compared against cuBLAS and Simon Boehm’s highly cited work (used in llamafile, aka tinyBLAS). The benchmarks show significant improvements in certain scenarios.
I plan to continue publishing educational content on high-performance kernels used in AI/ML, and several projects are currently in development.
If you enjoy educational content like this and would like to see more, please share this article. Your feedback is greatly appreciated!
This project serves as a valuable resource for CUDA learners looking to understand the intricacies of high-performance matrix multiplication on GPUs.
Original Sources
↗ https://salykova.github.io/sgemm-gpu?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
15 January 2025