Leveraging Thread Block Clusters and 2-SM UMMA for GEMM on NVIDIA Blackwell GPUs with CUTLASS

Tools & Engineering

The Engineer

12 May 2025 · 3 min read

This article explores advanced techniques for optimizing General Matrix Multiplication on NVIDIA Blackwell GPUs, focusing on thread block clusters and 2-SM UMMA to enhance performance beyond basic Tensor Memory Accelerator usage.

In this part of our series, we delve deeper into optimizing General Matrix Multiplication (GEMM) on the NVIDIA Blackwell architecture using CUTLASS. If you missed part one, we covered the basics of Blackwell’s Tensor Memory Accelerator (TMA) and how to write a simple GEMM kernel with UMMA instructions (tcgen05.mma). Now, let's explore how to utilize thread block clusters and 2-SM UMMA for more efficient GEMM operations.

Thread Block Clusters

Thread block cluster is a feature that groups physically close Streaming Multiprocessors (SMs) together. This ensures that the thread blocks within a cluster are co-scheduled on SMs located in the same GPU Processing Cluster (GPC). Introduced in the NVIDIA Hopper architecture, this feature adds a new level of hierarchy for advanced cooperation between neighboring thread blocks.

Distributed Shared Memory: Thread blocks in a cluster can access each other’s shared memory.
Collaborative Data Loading: They can collaboratively load data using TMA multicast.
Synchronization: They can synchronize with each other using jointly visible mbarriers.

Using Tensor Memory Accelerator (TMA) with Thread Block Clusters

The TMA is a key component for efficient global memory transfers. When combined with thread block clusters, it allows you to split these transfers among participating Cooperative Thread Arrays (CTAs). Here’s how:

TMA Multicast: This feature enables multiple CTAs to load data from the same global memory address simultaneously. This reduces the number of global memory transactions and improves bandwidth utilization.
Cluster Configuration: You need to configure your thread blocks into clusters at launch time. This ensures they are co-scheduled on the same GPC.

Implementing TMA Multicast in CUTLASS

To implement TMA multicast in a GEMM kernel, follow these steps:

Define Cluster Size: Determine the number of CTAs per cluster.
Launch Configuration: Use the CUDA launch configuration to specify the cluster size.
TMA Setup: Initialize TMA operations for data loading.

Here’s a simplified example from the CuTe Blackwell examples (example 3):

// Define the cluster size
int num_ctas_per_cluster = 4;

// Launch configuration with thread block clusters dim3 grid(num_blocks / num_ctas_per_cluster, 1, 1); dim3 block(thread_block_size, 1, 1);

// Configure TMA multicast cutlass::gemm::threadblock::MmaPipelined< cutlass::MatrixShape<16, 16>, cutlass::MatrixShape<16, 16>, cutlass::MatrixShape<16, 16>, float, float, float

::configure_tma(tma_descriptor);

// Launch the kernel kernel_gemm<<<grid, block>>>(...);


### Using Blackwell 2-SM UMMA with CTA Pairs

Blackwell’s 2-SM UMMA feature allows two SMs to work together on a single Matrix Multiply-Accumulate (MMA) operation. This increases the arithmetic intensity of MMA by effectively doubling the compute resources.

1. **CTA Pairing**: Group CTAs into pairs, each pair working on a single MMA operation.
2. **Synchronization Primitives**: Use new synchronization primitives to ensure correct coordination between the paired CTAs.

### Implementing 2-SM UMMA in CUTLASS

To implement 2-SM UMMA in a GEMM kernel, follow these steps:

1. **Define CTA Pair Size**: Determine the number of pairs per cluster.
2. **Launch Configuration**: Use the CUDA launch configuration to specify the pair size.
3. **Synchronization Setup**: Initialize synchronization primitives for CTA pairs.

Here’s a simplified example from the CuTe Blackwell examples (example 4):

```cpp
// Define the number of CTA pairs per cluster
int num_cta_pairs_per_cluster = 2;

// Launch configuration with CTA pairs
dim3 grid(num_blocks / (num_ctas_per_cluster * num_cta_pairs_per_cluster),