
Muon is a neural network optimizer derived from an explicit theoretical principle rather than from heuristics like those behind Adam, and it has been used to set training speed records with NanoGPT.
Boston, 7 Mar 2025
Particle tracks in a bubble chamber. Fermilab.
We recently introduced Muon, a new neural network optimizer. Muon was used to set speed records with NanoGPT, which caught the eye of major research labs (see this paper).
What sets Muon apart is that it is derived from an exact theoretical principle. Popular optimizers like Adam have more heuristic origins and often converge more slowly than Muon (as shown in Keller’s benchmarks). In this article, I’ll walk through the derivation of Muon, providing context that may help researchers extend these methods to new layer types.
📘 Muon is a collaborative effort with Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, and Laker Newhouse. For more implementation details, check out Keller’s write-up.
Muon is an optimizer specifically designed for Linear layers in neural networks. A Linear layer takes an input vector \( \mathbf{x} \) and multiplies it by a weight matrix \( \mathbf{W} \) to produce an output vector \( \mathbf{y} = \mathbf{W}\mathbf{x} \). The vectors \( \mathbf{x} \) and \( \mathbf{y} \) are expected to be "dense" activation vectors, meaning their entries are close to unit size. This distinguishes Linear layers from other types like Embedding layers, which handle one-hot inputs.
The Linear layer is a fundamental building block in neural networks, making it crucial to optimize effectively.
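To make the dense versus one-hot distinction concrete, here is a minimal sketch in plain Python. The helper names (`rms`, `linear`), the layer sizes, and the random weights are all assumptions chosen for illustration; they are not taken from any Muon implementation.

```python
import math
import random

def rms(v):
    """Root-mean-square entry size of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))

def linear(W, x):
    """Apply a Linear layer y = W x, with W stored as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

random.seed(0)
d_in, d_out = 64, 32  # hypothetical layer sizes

# Dense activation vector: entries drawn from a standard normal,
# so the typical entry has size close to 1.
x_dense = [random.gauss(0.0, 1.0) for _ in range(d_in)]

# One-hot vector, the kind of input an Embedding layer handles:
# a single 1 and zeros everywhere else.
x_one_hot = [0.0] * d_in
x_one_hot[3] = 1.0

# Hypothetical random weights, used only to have something to multiply by.
W = [[random.gauss(0.0, 1.0) / math.sqrt(d_in) for _ in range(d_in)]
     for _ in range(d_out)]

y = linear(W, x_dense)

print(rms(x_dense))    # close to 1 for a dense vector
print(rms(x_one_hot))  # 0.125, i.e. 1 / sqrt(64)
```

Feeding the one-hot vector through `linear` simply reads out one column of `W`, which is why one-hot inputs are better treated as a table lookup than as a dense matrix multiply.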

One of our broader goals is to "modularize" the theory of deep learning. This involves breaking the architecture down into individual components, deriving theories and algorithms for each part, and then figuring out how to integrate them at the end. Think of it as building a complex system with Lego bricks: each brick (or layer) needs to be well designed before you can assemble the whole structure.
To handle individual layers, Muon normalizes weight updates in a way that, given the input structure, automatically induces desirable effects on the outputs. This approach is inspired by the extensive work done on normalization techniques like batch norm, layer norm, and RMS norm.
The key to Muon's effectiveness lies in this theoretical derivation, which fixes how the weight updates are normalized relative to the input structure. Here’s a step-by-step breakdown:
Step 1: Define the Objective
Step 2: Normalize the Gradient
Step 3: Update Rule
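The step bodies are not reproduced here, so the following is only a rough sketch of what a Muon-style update can look like. It assumes the commonly described form of the method: the gradient matrix is replaced by its orthogonalized version \( \mathbf{U}\mathbf{V}^\top \) (from the SVD \( \mathbf{G} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \)), and the step is scaled by \( \sqrt{d_\text{out}/d_\text{in}} \). For simplicity it uses the basic cubic Newton-Schulz iteration and leaves out momentum; production implementations use a tuned variant. The function names and the learning rate are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthogonalize(G, steps=30):
    """Approximate U V^T for G = U S V^T with the cubic Newton-Schulz
    iteration X <- 1.5 X - 0.5 X X^T X (a simplification, see lead-in)."""
    fro = math.sqrt(sum(g * g for row in G for g in row)) + 1e-12
    # Dividing by the Frobenius norm puts all singular values in [0, 1],
    # which is inside the region where the iteration converges.
    X = [[g / fro for g in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * c for x, c in zip(rx, rc)]
             for rx, rc in zip(X, XXtX)]
    return X

def muon_like_step(W, G, lr=0.1):
    """One update W <- W - lr * sqrt(d_out / d_in) * orthogonalize(G).
    Assumed form of the rule; momentum is omitted."""
    d_out, d_in = len(W), len(W[0])
    scale = math.sqrt(d_out / d_in)
    O = orthogonalize(G)
    return [[w - lr * scale * o for w, o in zip(rw, ro)]
            for rw, ro in zip(W, O)]

# Usage: a 2x3 gradient whose orthogonalized form has orthonormal rows.
G = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
O = orthogonalize(G)
W_new = muon_like_step([[0.0] * 3 for _ in range(2)], G)
```

The point of the sketch is that the update depends only on the directions in the gradient, not on the sizes of its singular values, so every step has the same controlled effect on the layer's outputs.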
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.