
This article looks at the limitations of GPT-OSS's initial setup and walks through writing a custom Triton kernel to improve performance and work around the model's incompatibility with existing attention implementations.
OpenAI recently released GPT-OSS, and while the model’s performance is impressive, the initial setup left much to be desired, especially for fine-tuning. The recommended Hugging Face setup de-quantizes the MXFP4 weights to BF16 (increasing memory consumption by ~4x), and the bespoke attention algorithm is incompatible with Flash Attention and PyTorch SDPA, leaving only a slow and memory-hungry "eager" attention as an option.
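To see where the "~4x" comes from, here is a back-of-the-envelope calculation. It assumes the standard microscaling layout for MXFP4 (4-bit elements with one shared 8-bit scale per block of 32 elements); the function names are illustrative and not taken from any library.

```python
# Rough storage cost per weight, MXFP4 vs BF16.
# Assumption: MXFP4 stores 4-bit elements plus one shared 8-bit scale
# for every block of 32 elements.
MXFP4_ELEMENT_BITS = 4
MXFP4_SCALE_BITS = 8
MXFP4_BLOCK_SIZE = 32
BF16_BITS = 16


def mxfp4_bits_per_weight() -> float:
    """Element bits plus the scale amortized over its block."""
    return MXFP4_ELEMENT_BITS + MXFP4_SCALE_BITS / MXFP4_BLOCK_SIZE


def dequant_blowup() -> float:
    """Factor by which de-quantizing MXFP4 weights to BF16 grows them."""
    return BF16_BITS / mxfp4_bits_per_weight()


if __name__ == "__main__":
    print(f"MXFP4: {mxfp4_bits_per_weight():.2f} bits per weight")
    print(f"BF16 : {BF16_BITS} bits per weight")
    print(f"blow-up after de-quantization: {dequant_blowup():.2f}x")
```

Under these assumptions the quantized weights grow by a little under 4x (16 / 4.25), which matches the rough figure above. Weights that were never quantized are unaffected.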
To address these issues, I decided to dive into writing a Triton kernel for GPT-OSS. The release included a forward-only Triton implementation of the model, but this wasn’t sufficient for training. I started with the attention mechanism, which seemed less complex than the MXFP4 MoE (Mixture of Experts) kernel.
Before diving into the details, let me clarify my background: while I have substantial machine learning knowledge, my experience with writing kernels is minimal. My expertise essentially comes from skimming a few chapters of "Programming Massively Parallel Processors" and working through a couple of Colab notebooks on CUDA basics. I understand fundamental concepts like breaking down work into smaller pieces using for loops and ensuring memory access stays within bounds. However, more advanced topics like tiled matrix multiplication, shared memory optimization, and swizzling are still foreign to me.
The most critical aspect of writing a kernel is ensuring its mathematical correctness. An incorrect kernel can degrade model performance or introduce subtle bugs that are hard to detect. To avoid this, I should have started with comprehensive tests. Instead, I used the Cursor CLI to generate a backward kernel based on the existing forward kernel and the PyTorch reference implementation. Fortunately, Cursor also generated some tests for me without being explicitly asked.
The forward attention kernel in GPT-OSS is already provided in Triton. Here’s a brief overview of its structure:
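As a plain-Python reference for what such a kernel computes (not the Triton code itself), the sketch below assumes the "bespoke" part of the attention is a learned sink logit that joins the softmax without contributing a value, combined with causal masking and an optional sliding window. The function name and the single-head, list-based layout are illustrative choices of mine.

```python
import math


def attention_with_sink(q, k, v, sink, window=None):
    """Causal single-head attention with a sink logit (reference sketch).

    q, k, v: lists of float vectors, one per token (q and k share a width).
    sink:    scalar logit that enters the softmax but has no value vector.
    window:  if set, each query attends to at most the last `window` keys.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        lo = 0 if window is None else max(0, i - window + 1)
        visible = range(lo, i + 1)  # causal: keys up to and including i
        logits = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in visible]
        m = max(logits + [sink])  # subtract the max for numerical stability
        weights = [math.exp(x - m) for x in logits]
        # The sink only enlarges the normalizer; it adds nothing to the output.
        denom = sum(weights) + math.exp(sink - m)
        row = [0.0] * len(v[0])
        for w, j in zip(weights, visible):
            for d in range(len(row)):
                row[d] += (w / denom) * v[j][d]
        out.append(row)
    return out
```

Because the sink absorbs part of the probability mass, the attention weights over real tokens sum to less than one. That is the detail a stock softmax attention kernel does not model.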

The backward kernel is more complex as it needs to compute gradients for the input tensors. Here’s a high-level breakdown:
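For orientation, the gradients such a backward pass has to produce can be written out as follows. This is a sketch under my own notation rather than a transcription of the kernel: $\alpha$ is the softmax scale, $M$ the additive causal/sliding-window mask, $s$ a per-head sink logit appended as an extra softmax column with no value vector, $n$ the number of keys, and $dO$ the upstream gradient.

```latex
\begin{aligned}
S &= \alpha\, Q K^\top + M, \qquad
P = \operatorname{softmax}\big([\, S \;\; s \,]\big), \qquad
O = P_{:,1:n}\, V \\
dV &= P_{:,1:n}^\top\, dO \\
dP &= dO\, V^\top \qquad (\text{the sink column has } dP_{i,\mathrm{sink}} = 0) \\
D_i &= \sum_{j} P_{ij}\, dP_{ij} = dO_i \cdot O_i \\
dS_{ij} &= P_{ij}\,\big(dP_{ij} - D_i\big) \\
ds &= -\sum_{i} P_{i,\mathrm{sink}}\, D_i \\
dQ &= \alpha\, dS\, K, \qquad dK = \alpha\, dS^\top Q
\end{aligned}
```

The only departure from the usual attention backward is the $ds$ term: the sink receives gradient purely through the softmax normalizer.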
Triton is a Python-based language and compiler for GPU kernels that simplifies writing efficient code. Here are some key points about the implementation:
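One idea that tiled attention kernels typically rely on is the online softmax: keys are processed one block at a time while a running maximum and a running normalizer are kept up to date. The plain-Python emulation below shows that update for a single query row with scalar values. It assumes (my assumption, not something stated above) that the sink logit is folded in by seeding the running state, and it is not Triton code.

```python
import math


def online_softmax_weighted_sum(logits, values, sink, block=4):
    """Softmax-weighted sum of `values`, computed one tile at a time.

    Emulates the running-max / running-normalizer update of a tiled
    attention kernel for one query row. The sink logit seeds the running
    state and contributes no value.
    """
    m = sink      # running max, seeded with the sink logit
    norm = 1.0    # running normalizer: exp(sink - m) == 1 at the start
    acc = 0.0     # running unnormalized output
    for start in range(0, len(logits), block):
        # The last tile may be short; a GPU kernel would mask it instead.
        tile_x = logits[start:start + block]
        tile_v = values[start:start + block]
        m_new = max(m, max(tile_x))
        correction = math.exp(m - m_new)  # rescale earlier partial sums
        p = [math.exp(x - m_new) for x in tile_x]
        norm = norm * correction + sum(p)
        acc = acc * correction + sum(pi * vi for pi, vi in zip(p, tile_v))
        m = m_new
    return acc / norm
```

The result does not depend on the block size, which makes this a convenient property to assert when comparing a tiled kernel against a naive reference.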
To validate the correctness of the backward kernel, I used a combination of unit tests and integration tests:
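A unit test of this kind can be as simple as comparing analytic gradients against finite differences. The sketch below does that for a single query row with a sink logit, in plain Python. It illustrates the idea only and is not the test suite Cursor generated; every name in it is hypothetical.

```python
import math


def forward(x, s, v):
    """One query row: softmax over logits x plus sink s, weighted sum of v."""
    m = max(x + [s])
    e = [math.exp(xi - m) for xi in x]
    e_sink = math.exp(s - m)
    denom = sum(e) + e_sink
    p = [ei / denom for ei in e]
    out = sum(pi * vi for pi, vi in zip(p, v))
    return out, p, e_sink / denom


def backward(x, s, v, grad_out):
    """Analytic gradients of the output w.r.t. the logits and the sink."""
    out, p, p_sink = forward(x, s, v)
    d = grad_out * out  # D = sum_j p_j * dP_j, with dP_j = grad_out * v_j
    dx = [pi * (grad_out * vi - d) for pi, vi in zip(p, v)]
    ds = -p_sink * d    # the sink has no value, so its dP is zero
    return dx, ds


def numerical_grad(f, args, index, eps=1e-6):
    """Central difference of scalar function f with respect to args[index]."""
    hi = list(args)
    lo = list(args)
    hi[index] += eps
    lo[index] -= eps
    return (f(hi) - f(lo)) / (2 * eps)


def max_grad_error(x, s, v):
    """Largest gap between analytic and finite-difference gradients."""
    dx, ds = backward(x, s, v, grad_out=1.0)

    def f(a):
        return forward(a[:-1], a[-1], v)[0]

    args = x + [s]
    analytic = dx + [ds]
    return max(
        abs(analytic[i] - numerical_grad(f, args, i)) for i in range(len(args))
    )
```

For a real kernel the same comparison would be made against the PyTorch reference implementation's autograd gradients, with tolerances loosened to account for BF16 arithmetic.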
I haven’t yet benchmarked the performance of the Triton kernel, but initial tests indicate that it produces mathematically correct results. Further optimization and profiling will be necessary to ensure it meets performance expectations.
While my background in kernel programming is limited, I managed to create a functional Triton kernel for GPT-OSS attention. The process involved leveraging tools like Cursor CLI and Modal Labs’ Notebooks product to streamline development and testing. If you’re interested in fine-tuning GPT-OSS or exploring advanced GPU programming, this could be a useful starting point.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
18 August 2025