
Share
This article explores advanced techniques for optimizing General Matrix Multiplication on NVIDIA Blackwell GPUs, focusing on thread block clusters and 2-SM UMMA to enhance performance beyond basic Tensor Memory Accelerator usage.
In this part of our series, we delve deeper into optimizing General Matrix Multiplication (GEMM) on the NVIDIA Blackwell architecture using CUTLASS. If you missed part one, we covered the basics of Blackwell’s Tensor Memory Accelerator (TMA) and how to write a simple GEMM kernel with UMMA instructions (tcgen05.mma). Now, let's explore how to utilize thread block clusters and 2-SM UMMA for more efficient GEMM operations.
Thread block cluster is a feature that groups physically close Streaming Multiprocessors (SMs) together. This ensures that the thread blocks within a cluster are co-scheduled on SMs located in the same GPU Processing Cluster (GPC). Introduced in the NVIDIA Hopper architecture, this feature adds a new level of hierarchy for advanced cooperation between neighboring thread blocks.
The TMA is a key component for efficient global memory transfers. When combined with thread block clusters, it allows you to split these transfers among participating Cooperative Thread Arrays (CTAs). Here’s how:
To implement TMA multicast in a GEMM kernel, follow these steps:
Here’s a simplified example from the CuTe Blackwell examples (example 3):
// Define the cluster size
int num_ctas_per_cluster = 4;

// Launch configuration with thread block clusters dim3 grid(num_blocks / num_ctas_per_cluster, 1, 1); dim3 block(thread_block_size, 1, 1);
// Configure TMA multicast cutlass::gemm::threadblock::MmaPipelined< cutlass::MatrixShape<16, 16>, cutlass::MatrixShape<16, 16>, cutlass::MatrixShape<16, 16>, float, float, float
::configure_tma(tma_descriptor);
// Launch the kernel kernel_gemm<<<grid, block>>>(...);
### Using Blackwell 2-SM UMMA with CTA Pairs
Blackwell’s 2-SM UMMA feature allows two SMs to work together on a single Matrix Multiply-Accumulate (MMA) operation. This increases the arithmetic intensity of MMA by effectively doubling the compute resources.
1. **CTA Pairing**: Group CTAs into pairs, each pair working on a single MMA operation.
2. **Synchronization Primitives**: Use new synchronization primitives to ensure correct coordination between the paired CTAs.
### Implementing 2-SM UMMA in CUTLASS
To implement 2-SM UMMA in a GEMM kernel, follow these steps:
1. **Define CTA Pair Size**: Determine the number of pairs per cluster.
2. **Launch Configuration**: Use the CUDA launch configuration to specify the pair size.
3. **Synchronization Setup**: Initialize synchronization primitives for CTA pairs.
Here’s a simplified example from the CuTe Blackwell examples (example 4):
```cpp
// Define the number of CTA pairs per cluster
int num_cta_pairs_per_cluster = 2;
// Launch configuration with CTA pairs
dim3 grid(num_blocks / (num_ctas_per_cluster * num_cta_pairs_per_cluster),
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
12 May 2025
88 articles
Related Articles
Related Articles
More Stories