Vibe-Coding a Triton Kernel for GPT-OSS: A Practitioner's Journey

Tools & Engineering

The Engineer

18 Aug 2025 · 4 min read

Exploring the limitations of GPT-OSS's initial setup, this article delves into crafting a custom Triton kernel to optimize performance and overcome compatibility issues with existing attention mechanisms.

OpenAI recently released GPT-OSS, and while the model’s performance is impressive, the initial setup left much to be desired, especially for fine-tuning. The recommended Hugging Face setup de-quantizes the MXFP4 weights to BF16, which increases their memory footprint by roughly 4x. The model’s bespoke attention algorithm is also incompatible with Flash Attention and PyTorch SDPA, which leaves the slow, memory-hungry "eager" attention path as the only option.
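To make the "roughly 4x" figure concrete, here is a back-of-the-envelope sketch. It assumes the MXFP4 layout from the OCP Microscaling specification (4-bit elements in blocks of 32 that share one 8-bit scale). The helper names and the parameter count in the usage note are hypothetical and chosen only for illustration.

```python
# Storage cost per weight, in bits.
# Assumption: MXFP4 stores 4-bit elements in blocks of 32, and each block
# carries one shared 8-bit scale (OCP Microscaling layout).
MXFP4_ELEMENT_BITS = 4
MXFP4_SCALE_BITS = 8
MXFP4_BLOCK_SIZE = 32
BF16_BITS = 16


def mxfp4_bits_per_weight() -> float:
    """Effective bits per weight once the shared scale is amortized."""
    return MXFP4_ELEMENT_BITS + MXFP4_SCALE_BITS / MXFP4_BLOCK_SIZE


def bf16_blowup() -> float:
    """How many times larger the same weights are after de-quantizing to BF16."""
    return BF16_BITS / mxfp4_bits_per_weight()


def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Size in GiB of num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8 / 2**30
```

Under these assumptions each MXFP4 weight costs 4.25 bits, so de-quantizing to BF16 multiplies weight memory by 16 / 4.25, or about 3.76. For a hypothetical block of one billion quantized weights, `weight_gib(1e9, 16)` versus `weight_gib(1e9, mxfp4_bits_per_weight())` gives the two footprints side by side.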

To address these issues, I decided to dive into writing a Triton kernel for GPT-OSS. The release included a forward-only Triton implementation of the model, but training also needs a backward pass to compute gradients, so this wasn’t sufficient on its own. I started with the attention mechanism, which seemed less complex than the MXFP4 MoE (Mixture of Experts) kernel.

Background

Before diving into the details, let me clarify my background: while I have substantial machine learning knowledge, my experience with writing kernels is minimal. My expertise essentially comes from skimming a few chapters of "Programming Massively Parallel Processors" and working through a couple of Colab notebooks on CUDA basics. I understand fundamental concepts like breaking down work into smaller pieces using for loops and ensuring memory access stays within bounds. However, more advanced topics like tiled matrix multiplication, shared memory optimization, and swizzling are still foreign to me.

The Testing Harness

The most critical aspect of writing a kernel is ensuring its mathematical correctness. An incorrect kernel can degrade model performance or introduce subtle bugs that are hard to detect. To guard against this, I should have started with comprehensive tests. Instead, I used the Cursor CLI to generate a backward kernel from the existing forward kernel and the PyTorch reference implementation. Fortunately, Cursor also generated some tests for me without being explicitly asked.

Implementation Details

Forward Kernel

The forward attention kernel in GPT-OSS is already provided in Triton. Here’s a brief overview of its structure:

Input: Query, Key, and Value tensors.
Output: Attention output tensor.
Steps:
- Compute the dot product between Query and Key to get the attention scores.
- Apply the softmax function to normalize these scores.
- Multiply the normalized scores with the Value tensor to produce the final attention output.
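
The three steps above can be sketched in plain Python. This is a reference-style illustration only, not the GPT-OSS kernel: it works on nested lists, handles a single head, assumes the common 1/sqrt(d) score scaling when no scale is given, and leaves out masking and any GPT-OSS-specific details. The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_forward(q, k, v, scale=None):
    """Single-head attention on nested lists.

    q: [n_q][d], k: [n_k][d], v: [n_k][d_v].
    Returns (output [n_q][d_v], probabilities [n_q][n_k]).
    """
    d = len(q[0])
    if scale is None:
        scale = 1.0 / math.sqrt(d)  # assumed default, not taken from the article

    # Step 1: scaled dot products between every query and every key.
    scores = [
        [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        for q_row in q
    ]
    # Step 2: softmax across the keys for each query.
    probs = [softmax(row) for row in scores]
    # Step 3: probability-weighted sum of the value rows.
    out = [
        [sum(p * v_row[c] for p, v_row in zip(p_row, v)) for c in range(len(v[0]))]
        for p_row in probs
    ]
    return out, probs
```

A quick sanity check: with two identical keys, each query attends to both equally, so the output is the mean of the two value rows.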

Backward Kernel

The backward kernel is more complex as it needs to compute gradients for the input tensors. Here’s a high-level breakdown:

Input: Gradients of the loss with respect to the attention output, and the intermediate results from the forward pass (attention scores, softmax values).
Output: Gradients with respect to Query, Key, and Value.
Steps:
- Compute the gradients for the softmax operation.
- Propagate these gradients back through the dot product operations to get the gradients for Query, Key, and Value.
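
To make the two steps concrete, here is a minimal sketch of the standard backward pass for plain softmax attention, written with nested lists so the math is visible. It assumes the forward pass computed S = scale * Q K^T, P = softmax(S), O = P V, and that P was saved. It does not include masking or any GPT-OSS-specific terms, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """(n x m) times (m x p) on nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(s):
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out


def attention_forward(q, k, v, scale):
    s = [[scale * x for x in row] for row in matmul(q, transpose(k))]
    p = softmax_rows(s)
    return matmul(p, v), p


def attention_backward(q, k, v, p, d_out, scale):
    """Gradients of a scalar loss w.r.t. q, k, v, given d_out = dLoss/dO."""
    d_v = matmul(transpose(p), d_out)   # dV = P^T dO
    d_p = matmul(d_out, transpose(v))   # dP = dO V^T
    # Softmax backward, row by row: dS = P * (dP - sum(P * dP)).
    d_s = []
    for p_row, dp_row in zip(p, d_p):
        dot = sum(pi * dpi for pi, dpi in zip(p_row, dp_row))
        d_s.append([pi * (dpi - dot) for pi, dpi in zip(p_row, dp_row)])
    # Back through the scaled dot product.
    d_q = [[scale * x for x in row] for row in matmul(d_s, k)]             # dQ = scale * dS K
    d_k = [[scale * x for x in row] for row in matmul(transpose(d_s), q)]  # dK = scale * dS^T Q
    return d_q, d_k, d_v
```

The row-wise subtraction in the softmax step is what couples every score in a row to every other one, and it is the part that is easiest to get subtly wrong in a hand-written kernel.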

Triton Implementation

Triton is a Python-embedded language and compiler for writing GPU kernels at the level of blocks rather than individual threads. Here are some key points about the implementation:

Memory Management: Each program instance works on a block (tile) of data, and the compiler decides how to stage those blocks through shared memory.
Vectorization: Loads, stores, and arithmetic are expressed over whole blocks, and the compiler vectorizes them to maximize throughput.
Synchronization: The compiler inserts the synchronization needed between threads inside a block; writes that can collide across program instances still need care, typically through atomic operations.
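
The block-level model can be illustrated without a GPU. The sketch below is plain Python that only emulates the shape of a Triton program; it is not Triton code. Each call to the hypothetical `scale_kernel` plays the role of one program instance, `pid` stands in for what `tl.program_id` would return, and the mask mirrors the bounds check that masked `tl.load` and `tl.store` calls would perform.

```python
def cdiv(n, block):
    """Ceiling division: how many blocks are needed to cover n elements."""
    return (n + block - 1) // block


def scale_kernel(x, out, n, factor, pid, BLOCK):
    """One emulated program instance: processes BLOCK consecutive elements."""
    # Offsets this instance is responsible for.
    offsets = [pid * BLOCK + i for i in range(BLOCK)]
    # Mask off lanes that fall past the end of the array.
    mask = [off < n for off in offsets]
    # "Masked load": out-of-bounds lanes read a placeholder instead of memory.
    vals = [x[off] if m else 0.0 for off, m in zip(offsets, mask)]
    # Block-wide arithmetic.
    vals = [val * factor for val in vals]
    # "Masked store": only in-bounds lanes write back.
    for off, m, val in zip(offsets, mask, vals):
        if m:
            out[off] = val


def launch(x, factor, BLOCK=4):
    """Run one instance per block. On a GPU these would execute in parallel."""
    n = len(x)
    out = [0.0] * n
    for pid in range(cdiv(n, BLOCK)):
        scale_kernel(x, out, n, factor, pid, BLOCK)
    return out
```

With six elements and a block size of four, the second instance covers offsets 4 through 7, and the mask keeps it from touching offsets 6 and 7.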

Testing and Validation

To validate the correctness of the backward kernel, I used a combination of unit tests and integration tests:

Unit Tests: These focused on individual components like the dot product and softmax functions.
Integration Tests: These verified the entire forward-backward pass using synthetic data and compared the results with PyTorch’s reference implementation.
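
As an illustration of what a component-level check might look like, here is a small sketch that compares an analytic softmax backward against central finite differences. It is not the generated test suite; the function names and tolerance are hypothetical. With real tensors, the same comparison would usually be made against PyTorch autograd using a tolerance-based assertion such as `torch.testing.assert_close`.

```python
import math


def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]


def softmax_backward(p, d_p):
    """Analytic gradient w.r.t. the softmax input, given d_p = dLoss/dP."""
    dot = sum(pi * dpi for pi, dpi in zip(p, d_p))
    return [pi * (dpi - dot) for pi, dpi in zip(p, d_p)]


def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at the list x."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def max_abs_diff(a, b):
    return max(abs(u - w) for u, w in zip(a, b))


def check_softmax_backward(x, d_p, tol=1e-6):
    """True if the analytic and numerical gradients agree within tol."""
    analytic = softmax_backward(softmax(x), d_p)
    # Scalar loss whose gradient w.r.t. softmax(x) is exactly d_p.
    numeric = numerical_grad(
        lambda z: sum(w * s for w, s in zip(d_p, softmax(z))), x
    )
    return max_abs_diff(analytic, numeric) <= tol
```

Calling `check_softmax_backward` on a few synthetic inputs of different sizes is a cheap way to catch sign errors or a missing normalization term before moving on to the full forward-backward comparison.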

Benchmarks

I haven’t yet benchmarked the performance of the Triton kernel, but initial tests indicate that its results agree with the PyTorch reference implementation. Further optimization and profiling will be necessary to ensure it meets performance expectations.

Conclusion

While my background in kernel programming is limited, I managed to create a functional Triton kernel for GPT-OSS attention. The process involved leveraging tools like Cursor CLI and Modal Labs’ Notebooks product to streamline development and testing. If you’re interested in fine-tuning GPT-OSS or exploring advanced GPU programming, this could be a useful starting point.