Optimizing SGEMM on GPUs with CUDA: A Deep Dive into High-Performance Matrix Multiplication

The Engineer

15 Jan 2025

This deep dive walks through optimizing single-precision matrix multiplication (SGEMM) on GPUs with CUDA, with performance that is competitive with cuBLAS for specific use cases.

When it comes to optimizing single-precision matrix multiplication (SGEMM) on GPUs, there is a significant gap between the theory covered in books and blogs and the highly optimized implementations that ship in libraries like cuBLAS. This project, inspired by the work of Andrej Karpathy, George Hotz, Scott Gray, Horace He, Philippe Tillet, Jeremy Howard, Lei Mao, and the GPU MODE community, aims to bridge that gap. The code is available at sgemm.cu, and this article complements a detailed blog post on an FP32 matrix multiplication implementation that outperforms BLAS libraries on modern Intel and AMD CPUs.

1. Introduction

The goal of this project isn't to create an SGEMM that outperforms cuBLAS on all GPUs and matrix sizes. Given the existence of the open-source, lightweight CUTLASS library, such a goal would be impractical. Instead, this project targets CUDA learners by providing a clear, efficient implementation of SGEMM that can serve as a learning tool and a foundation for further optimization.

2. Technical Overview

The core operation in SGEMM is defined as C := alpha * A * B + beta * C, where A and B are input matrices, C is the output matrix, and alpha and beta are scalar values. The implementation focuses on several key optimization techniques:

Inlined PTX: Directly using PTX (Parallel Thread Execution) assembly to optimize low-level operations.
Asynchronous Memory Copies: Overlapping data transfers with computation to hide latency.
Double-Buffering: Alternating between two buffers so that the next block of data loads while the current one is being processed, improving throughput.
Avoiding Shared Memory Bank Conflicts: Ensuring that threads access shared memory in a way that minimizes bank conflicts.
Efficient Coalesced Storage: Staging matrix blocks in shared memory so that stores to global memory are coalesced.
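
As a rough illustration of how asynchronous copies and double-buffering fit together for C := alpha * A * B + beta * C, here is a minimal sketch. It is not the code from sgemm.cu: the kernel name sgemm_double_buffered, the 32x32 tile size, the row-major layout, and the requirement that M, N, and K be multiples of 32 are all assumptions made for brevity. It uses the __pipeline_* primitives from cuda_pipeline.h instead of inlined PTX; to my understanding these map to hardware asynchronous copies on Ampere and newer, and fall back to ordinary copies on older devices.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <cuda_pipeline.h>

constexpr int TILE = 32;  // assumed tile edge; one thread per output element

// Sketch only: row-major A (MxK), B (KxN), C (MxN); M, N, K multiples of TILE.
__global__ void sgemm_double_buffered(int M, int N, int K, float alpha,
                                      const float* A, const float* B,
                                      float beta, float* C) {
    __shared__ float As[2][TILE][TILE];  // two stages per operand
    __shared__ float Bs[2][TILE][TILE];
    const int tx = threadIdx.x, ty = threadIdx.y;
    const int row = blockIdx.y * TILE + ty;
    const int col = blockIdx.x * TILE + tx;
    const int num_tiles = K / TILE;

    // Prologue: start copying tile 0 into stage 0.
    __pipeline_memcpy_async(&As[0][ty][tx], &A[row * K + tx], sizeof(float));
    __pipeline_memcpy_async(&Bs[0][ty][tx], &B[ty * N + col], sizeof(float));
    __pipeline_commit();

    float acc = 0.0f;
    for (int t = 0; t < num_tiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < num_tiles) {
            // Start fetching tile t+1 into the other stage before computing on tile t.
            const int k0 = (t + 1) * TILE;
            __pipeline_memcpy_async(&As[nxt][ty][tx], &A[row * K + k0 + tx], sizeof(float));
            __pipeline_memcpy_async(&Bs[nxt][ty][tx], &B[(k0 + ty) * N + col], sizeof(float));
            __pipeline_commit();
            __pipeline_wait_prior(1);  // tile t has landed; tile t+1 may still be in flight
        } else {
            __pipeline_wait_prior(0);  // last tile: wait for all outstanding copies
        }
        __syncthreads();  // make every thread's copies of tile t visible
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][ty][k] * Bs[cur][k][tx];
        __syncthreads();  // stage `cur` gets refilled during the next iteration
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

// Runs a small problem and returns the max absolute error against a CPU reference.
float run_check() {
    const int M = 128, N = 96, K = 64;
    const float alpha = 1.5f, beta = 0.5f;
    std::vector<float> hA(M * K), hB(K * N), hC(M * N), ref(M * N);
    for (int i = 0; i < M * K; ++i) hA[i] = float(i % 7) - 3.0f;
    for (int i = 0; i < K * N; ++i) hB[i] = float(i % 5) - 2.0f;
    for (int i = 0; i < M * N; ++i) hC[i] = float(i % 3);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float s = 0.0f;
            for (int k = 0; k < K; ++k) s += hA[i * K + k] * hB[k * N + j];
            ref[i * N + j] = alpha * s + beta * hC[i * N + j];
        }
    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), hC.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE), grid(N / TILE, M / TILE);
    sgemm_double_buffered<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    float max_err = 0.0f;
    for (int i = 0; i < M * N; ++i) max_err = fmaxf(max_err, fabsf(hC[i] - ref[i]));
    return max_err;
}

int main() {
    printf("max abs error: %g\n", run_check());
    return 0;
}
```

The two __syncthreads() calls do different jobs: the first publishes the freshly copied tile to the whole block, and the second keeps any thread from overwriting a stage that others are still reading. A tuned kernel would also compute several outputs per thread and use wider copies; this sketch keeps one element per thread to make the buffer rotation easy to follow.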

3. Implementation Details

The high-level algorithm design used in this project was developed by NVIDIA engineers and has been extensively studied in prior works on cuBLAS and CUTLASS. My main contribution was translating this design into efficient CUDA/PTX code. The implementation is expected to deliver high performance on Ada, Ampere, Volta, and Turing devices, with specific fine-tuning for the NVIDIA RTX 3090 (GA102 chip).

Key Features:

Kernel Design: The kernel is designed to maximize occupancy and minimize divergence by carefully managing thread blocks and warps.
Memory Access Patterns: Efficient memory access patterns are crucial for performance. The implementation uses coalesced global memory accesses and avoids bank conflicts in shared memory.
PTX Optimization: Inlined PTX code is used to optimize specific operations, such as loading data from global memory into registers.
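
The memory-access and PTX points above can be seen in isolation in a much simpler kernel than SGEMM. The sketch below is a tiled matrix transpose, not the author's kernel: it reads global memory with an inlined PTX load, keeps both global reads and writes coalesced, and pads the shared-memory tile by one column so that column-wise reads hit 32 different banks. The names ld_global_f32, transpose_padded, and count_mismatches are hypothetical, and the matrix size is assumed to be a multiple of 32.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed tile edge, equal to the warp size

// Inlined PTX: 32-bit load from global memory into a register.
// The pointer must refer to global memory.
__device__ __forceinline__ float ld_global_f32(const float* p) {
    float v;
    asm volatile("ld.global.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Transposes an n x n row-major matrix; n must be a multiple of TILE.
__global__ void transpose_padded(const float* in, float* out, int n) {
    // The extra column shifts each row by one bank, so reading a column
    // of the tile touches 32 distinct banks instead of one.
    __shared__ float tile[TILE][TILE + 1];

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    // Threads of a warp read consecutive addresses: coalesced global load.
    tile[threadIdx.y][threadIdx.x] = ld_global_f32(&in[y * n + x]);
    __syncthreads();

    // Swap the block indices so the write is also along consecutive addresses.
    const int tx = blockIdx.y * TILE + threadIdx.x;
    const int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}

// Runs the kernel on a small matrix and counts wrong output elements.
int count_mismatches() {
    const int n = 128;
    std::vector<float> h_in(n * n), h_out(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) h_in[i] = float(i);

    float *d_in, *d_out;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    transpose_padded<<<grid, block>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (h_out[r * n + c] != h_in[c * n + r]) ++bad;
    return bad;
}

int main() {
    printf("mismatches: %d\n", count_mismatches());
    return 0;
}
```

Without the padding, the read tile[threadIdx.x][threadIdx.y] would have a stride of 32 floats across a warp, which maps every thread to the same bank and serializes the access. The same padding idea applies when an SGEMM kernel stores a tile of A transposed in shared memory.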

4. Performance Benchmarks

The SGEMM implementation was benchmarked on an NVIDIA RTX 3090 with both locked and unlocked GPU core frequencies, and compared against cuBLAS and Simon Boehm’s highly cited work (used in llamafile, aka tinyBLAS). The benchmarks show significant improvements in certain scenarios:

Locked vs. Unlocked Frequencies: Unlocking the GPU core frequencies can lead to a performance boost of up to 10%.
Comparison with cuBLAS: While the implementation doesn't outperform cuBLAS across all matrix sizes, it shows competitive performance for specific use cases.
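
Throughput comparisons like these are commonly made by timing repeated calls with CUDA events and converting the result to GFLOP/s using the usual 2 * M * N * K operation count for GEMM. The sketch below times cublasSgemm this way. The problem size, iteration count, and the helper names gemm_gflops and time_cublas_sgemm are assumptions for illustration, not the benchmark setup used for this article. Link with -lcublas.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// GEMM performs one multiply and one add per inner-product term: 2*M*N*K FLOPs.
// flops / (ms * 1e-3 s) / 1e9 simplifies to flops / (ms * 1e6).
double gemm_gflops(int M, int N, int K, double ms) {
    return 2.0 * M * N * K / (ms * 1e6);
}

// Average time in milliseconds of one cublasSgemm call
// (column-major operands, no transposes).
float time_cublas_sgemm(int M, int N, int K, int iters) {
    const float alpha = 1.0f, beta = 0.0f;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * M * K);
    cudaMalloc(&B, sizeof(float) * K * N);
    cudaMalloc(&C, sizeof(float) * M * N);
    cudaMemset(A, 0, sizeof(float) * M * K);  // placeholder data
    cudaMemset(B, 0, sizeof(float) * K * N);
    cudaMemset(C, 0, sizeof(float) * M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so that one-time initialization is not timed.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, A, M, B, K, &beta, C, M);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                    &alpha, A, M, B, K, &beta, C, M);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return ms / iters;
}

int main() {
    const int M = 2048, N = 2048, K = 2048;  // assumed problem size
    const float ms = time_cublas_sgemm(M, N, K, 20);
    printf("cublasSgemm: %.3f ms, %.1f GFLOP/s\n", ms, gemm_gflops(M, N, K, ms));
    return 0;
}
```

To compare a custom kernel, the same event pair can wrap its launches in place of the cublasSgemm calls. Results with unlocked clocks vary from run to run, so averaging over many iterations and reporting the clock setting alongside the numbers makes comparisons easier to reproduce.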

5. Future Work

I plan to continue publishing educational content on high-performance kernels used in AI/ML. Some projects currently in development include:

Beating NVIDIA on Tensor Cores: Exploring ways to optimize tensor operations on modern GPUs.
Stream-K GEMM: Implementing the Stream-K work decomposition, which splits the GEMM iteration space evenly across streaming multiprocessors to balance load.
FlashAttention: Optimizing attention mechanisms for transformer models.
xLSTM: Optimizing GPU kernels for the xLSTM (extended LSTM) architecture.

If you enjoy educational content like this and would like to see more, please share this article. Your feedback is greatly appreciated!

6. Conclusion

This project serves as a valuable resource for CUDA learners looking to understand the intricacies of