
A look at NVIDIA's H100 GPU architecture shows how optimized matrix multiplication kernels make transformer models faster and more efficient.
If you're working with transformers, you know that a significant portion of the floating-point operations (FLOPs) happens inside matrix multiplications (matmuls). These operations are highly parallelizable, making them perfect candidates for GPU acceleration. Understanding how to optimize matmul kernels on NVIDIA GPUs can significantly boost your model's performance.
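Transformers rely on matmuls in several components: the query, key, value, and output projections in attention, the attention score computation itself, and the linear layers of the MLP.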
This article walks through the core hardware concepts and programming techniques that underpin state-of-the-art (SOTA) NVIDIA GPU matmul kernels, focusing on the Hopper H100 architecture.
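To make the FLOP claim concrete, here is a minimal sketch that counts the forward-pass matmul FLOPs of one transformer layer. It assumes a standard layer (full attention without masking savings, a two-matmul MLP) and uses the usual 2·M·N·K count for multiplying an M×K matrix by a K×N matrix. The function names and the example sizes are illustrative assumptions, not taken from the original article.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for (m x k) @ (k x n): one multiply and one add per inner-product term."""
    return 2 * m * n * k


def layer_matmul_flops(seq: int, d_model: int, d_ff: int) -> dict:
    """Forward matmul FLOPs for one assumed standard transformer layer.

    Assumptions: single sequence, no causal-mask savings, heads summed
    together (h heads of size d_model / h cost the same as one of d_model).
    """
    counts = {
        "qkv_proj": 3 * matmul_flops(seq, d_model, d_model),
        "attn_scores": matmul_flops(seq, seq, d_model),    # Q @ K^T
        "attn_context": matmul_flops(seq, d_model, seq),   # softmax(QK^T) @ V
        "out_proj": matmul_flops(seq, d_model, d_model),
        "mlp": matmul_flops(seq, d_ff, d_model) + matmul_flops(seq, d_model, d_ff),
    }
    counts["total"] = sum(counts.values())
    return counts


# Hypothetical sizes chosen only for illustration.
example = layer_matmul_flops(seq=2048, d_model=4096, d_ff=4 * 4096)
for name, flops in example.items():
    print(f"{name:>12}: {flops / 1e9:10.1f} GFLOP")
```

With these assumed sizes the projection and MLP matmuls dominate; the attention score and context terms grow quadratically with sequence length and take over only for long sequences.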
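To write performant GPU kernels, you need a solid understanding of the underlying hardware. On the NVIDIA Hopper H100, a GPU designed for high-performance computing, two parts matter most for matmuls: the memory system, which runs from off-chip HBM through the L2 cache down to per-SM shared memory and registers, and the compute pipelines, which include the tensor cores that execute matrix multiply-accumulate instructions.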
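One way to see how the memory system and the compute pipelines interact is a roofline-style estimate. The sketch below compares a matmul's arithmetic intensity (FLOPs per byte moved) with the ratio of peak compute to memory bandwidth. The two hardware constants are approximate published figures for the H100 SXM and should be treated as assumptions; the model also assumes each matrix crosses the memory bus exactly once, which is an idealization.

```python
# Assumed, approximate H100 SXM figures (dense BF16 tensor core peak, HBM3 bandwidth).
H100_PEAK_BF16_FLOPS = 989e12
H100_HBM_BYTES_PER_S = 3.35e12


def arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for (m x k) @ (k x n), assuming A, B and C each move once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def ridge_point(peak_flops: float = H100_PEAK_BF16_FLOPS,
                bandwidth: float = H100_HBM_BYTES_PER_S) -> float:
    """Arithmetic intensity above which a kernel is limited by compute, not memory."""
    return peak_flops / bandwidth


def is_compute_bound(m: int, n: int, k: int) -> bool:
    return arithmetic_intensity(m, n, k) >= ridge_point()


print(f"ridge point: {ridge_point():.0f} FLOP/byte")
print("4096^3 matmul compute bound:", is_compute_bound(4096, 4096, 4096))
print("matrix-vector (m=1) compute bound:", is_compute_bound(1, 4096, 4096))
```

Under these assumptions a large square matmul sits well above the ridge point, while a matrix-vector product stays memory bound no matter how good the kernel is.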
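To design a near-SOTA synchronous matmul kernel, you can use the warp-tiling method. The output matrix is split into block tiles, one per thread block; each block tile is split into warp tiles; and each thread accumulates a small sub-tile in registers while the input tiles are staged through shared memory.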
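The loop structure of warp tiling can be sketched on the CPU in plain Python. This is only an emulation of the tiling hierarchy, not a GPU kernel: the outer loops stand in for thread blocks, the `As` and `Bs` lists stand in for shared memory, and the middle loops stand in for warps. The tile sizes and names (`BM`, `BN`, `BK`, `WM`, `WN`) are illustrative assumptions.

```python
def naive_matmul(A, B):
    """Reference implementation used to check the tiled version."""
    K, N = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(K)) for j in range(N)] for row in A]


def warp_tiled_matmul(A, B, BM=4, BN=4, BK=2, WM=2, WN=2):
    M, K, N = len(A), len(A[0]), len(B[0])
    # Sketch assumption: every dimension divides evenly into its tile size.
    assert M % BM == 0 and N % BN == 0 and K % BK == 0
    assert BM % WM == 0 and BN % WN == 0
    C = [[0.0] * N for _ in range(M)]
    for bm in range(0, M, BM):              # each (bm, bn) pair ~ one thread block
        for bn in range(0, N, BN):
            acc = [[0.0] * BN for _ in range(BM)]   # stands in for registers
            for k0 in range(0, K, BK):      # march along the K dimension
                # Stage the current input tiles ("shared memory").
                As = [A[bm + i][k0:k0 + BK] for i in range(BM)]
                Bs = [B[k0 + kk][bn:bn + BN] for kk in range(BK)]
                for wm in range(0, BM, WM):         # each (wm, wn) pair ~ one warp
                    for wn in range(0, BN, WN):
                        for kk in range(BK):
                            for i in range(wm, wm + WM):
                                a = As[i][kk]       # reused across the whole row
                                for j in range(wn, wn + WN):
                                    acc[i][j] += a * Bs[kk][j]
            for i in range(BM):             # write the finished block tile back
                for j in range(BN):
                    C[bm + i][bn + j] = acc[i][j]
    return C


A_demo = [[i + j for j in range(4)] for i in range(8)]      # 8 x 4
B_demo = [[i * j + 1 for j in range(8)] for i in range(4)]  # 4 x 8
print(warp_tiled_matmul(A_demo, B_demo) == naive_matmul(A_demo, B_demo))
```

The point of the hierarchy is reuse: each staged tile is read many times from fast storage before the next tile is fetched from slower memory.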
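For even better performance, you can leverage Hopper's asynchronous features, such as the Tensor Memory Accelerator (TMA) for bulk copies and asynchronous tensor core instructions, so that data movement overlaps with computation instead of alternating with it.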
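The benefit of overlapping loads with compute can be illustrated with a small timing model. The sketch below is a simplification under stated assumptions: one copy engine, one compute unit, fixed per-tile load and compute times, and a ring of `stages` buffers. It models the general idea of multi-stage pipelining and does not represent any specific Hopper API.

```python
def pipelined_time(n_tiles: int, t_load: float, t_compute: float, stages: int = 2) -> float:
    """Finish time for processing n_tiles through a ring of `stages` buffers.

    stages=1 means no overlap: each load waits for the previous compute.
    """
    assert n_tiles >= 1 and stages >= 1
    compute_done = [0.0] * n_tiles
    copy_engine_free = 0.0
    for i in range(n_tiles):
        # Buffer i % stages can be refilled once tile i - stages has been consumed.
        buffer_free = compute_done[i - stages] if i >= stages else 0.0
        load_done = max(copy_engine_free, buffer_free) + t_load
        copy_engine_free = load_done
        # Compute on tile i needs its data and the compute unit (tiles run in order).
        previous_done = compute_done[i - 1] if i > 0 else 0.0
        compute_done[i] = max(load_done, previous_done) + t_compute
    return compute_done[-1]


serial = pipelined_time(n_tiles=4, t_load=1.0, t_compute=1.0, stages=1)
double_buffered = pipelined_time(n_tiles=4, t_load=1.0, t_compute=1.0, stages=2)
print(f"serial: {serial}, double buffered: {double_buffered}")
```

In this model, two stages already bring the steady-state cost per tile down to the slower of the load and compute steps; more stages mainly help when the individual timings vary.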
This article is the first in a series, with further posts planned.
Understanding the Hopper H100 GPU's architecture is crucial for writing high-performance matmul kernels. By leveraging its advanced features, you can significantly boost the performance of your transformer models. Stay tuned for more in-depth explorations in future posts.
Original Sources
↗ https://www.aleksagordic.com/blog/matmul?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
30 September 2025