
Developed to cut neural network training times, Muon has outperformed standard optimizers on the NanoGPT and CIFAR-10 speedrun benchmarks, pointing to shorter training runs and faster iteration on models.
Muon, a novel optimizer designed specifically for the hidden layers of neural networks, has been making waves in the machine learning community. It's already been used to set new training speed records for both NanoGPT and CIFAR-10 speedrunning. In this article, we’ll dive into the technical details of Muon, its design, and why it’s been so effective.
Muon is an optimizer tailored for 2D parameters in neural network hidden layers. Instead of applying an update matrix G directly, it first passes G through a modified Newton-Schulz matrix iteration that approximately orthogonalizes it, which is what improves convergence and training speed. The core operation can be written as follows:
\[ \text{Muon}(G) = \text{NewtonSchulz5}(G) \]
Where NewtonSchulz5 is a specific matrix iteration that can be implemented in PyTorch like this:
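def newtonschulz5(G, steps=5, eps=1e-7):
    """Approximately orthogonalize the 2D tensor G with a quintic Newton-Schulz iteration."""
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    # Scale by the Frobenius norm so every singular value is at most 1
    X /= (X.norm() + eps)
    # Work in the wide orientation so X @ X.T is the smaller square matrix
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X  # X <- a*X + b*(X X^T) X + c*(X X^T)^2 X
    # Undo the transpose for tall inputs
    if G.size(0) > G.size(1):
        X = X.T
    return X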
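To see why this loop orthogonalizes its input, it helps to write a single step in terms of the singular value decomposition. The derivation below is a sketch; the symbols U, Σ, V and φ are introduced here for illustration and do not appear in the code.

```latex
\begin{aligned}
X &= U \Sigma V^\top \\
X X^\top &= U \Sigma^2 U^\top \\
a X + \bigl(b\, X X^\top + c\, (X X^\top)^2\bigr) X
  &= U \bigl(a \Sigma + b \Sigma^3 + c \Sigma^5\bigr) V^\top
\end{aligned}
```

Each step therefore leaves U and V untouched and applies the scalar map φ(σ) = aσ + bσ³ + cσ⁵ to every singular value. Repeating the step moves the singular values into a band around 1, so the result is close to the orthogonal factor U V^T.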
When using Muon in your neural network, it's important to note that scalar and vector parameters, as well as the input and output layers, should be optimized using a standard method like AdamW. Muon is specifically designed for 2D parameters, but it can also handle 4D convolutional parameters by flattening their last three dimensions.
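The routing rule described above can be sketched on plain shape tuples, so it runs without PyTorch. The function name route_parameter and the is_embedding_or_head flag are hypothetical, and the fallback for other tensor ranks is an assumption, since the article does not cover them.

```python
def route_parameter(shape, is_embedding_or_head=False):
    """Pick an optimizer for one parameter and report the matrix shape Muon would see.

    Returns ("adamw", shape) or ("muon", (rows, cols)).
    """
    shape = tuple(shape)
    # Input/output layers, scalars, and vectors stay on a standard optimizer
    if is_embedding_or_head or len(shape) < 2:
        return "adamw", shape
    if len(shape) == 2:
        return "muon", shape
    if len(shape) == 4:
        # Convolution weight: keep the first dimension, flatten the last three
        out_ch, in_ch, kh, kw = shape
        return "muon", (out_ch, in_ch * kh * kw)
    # Other ranks are not discussed in the article; falling back is an assumption
    return "adamw", shape


print(route_parameter((768, 3072)))                 # hidden linear weight
print(route_parameter((64, 32, 3, 3)))              # conv weight, flattened
print(route_parameter((768,)))                      # bias or gain vector
print(route_parameter((50257, 768), True))          # embedding or output head
```

In a real model the same decision would be made from each parameter's ndim and its position in the network rather than from a hand-passed flag.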
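For example, a NanoGPT-style model could divide its parameters between the two optimizers along these lines (a sketch rather than the speedrun's actual code; the Muon class is assumed to apply newtonschulz5 to each update):
# Example usage in NanoGPT
# 2D weights inside the transformer blocks go to Muon
hidden_params = [p for p in model.transformer.h.parameters() if p.ndim == 2]
hidden_ids = {id(p) for p in hidden_params}
# Everything else (embeddings, output head, gains, biases) stays on AdamW
other_params = [p for p in model.parameters() if id(p) not in hidden_ids]
adamw = torch.optim.AdamW(other_params, lr=3e-4)
muon = Muon(hidden_params)  # hypothetical optimizer class, not defined in this article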
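To make concrete what an optimizer built on this iteration might do at each step, here is a dependency-free sketch that works on lists of lists instead of PyTorch tensors. The momentum buffer, the lr and mu values, and the helper names (matmul, transpose, newtonschulz5_lists, muon_step) are assumptions for illustration, not code from the article or the speedrun.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newtonschulz5_lists(G, steps=5, eps=1e-7):
    """Same iteration as newtonschulz5, in float arithmetic on nested lists."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / (norm + eps) for x in row] for row in G]
    tall = len(G) > len(G[0])
    if tall:
        X = transpose(X)
    n = len(X)
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AA = matmul(A, A)
        B = [[b * A[i][j] + c * AA[i][j] for j in range(n)] for i in range(n)]
        BX = matmul(B, X)
        X = [[a * x + bx for x, bx in zip(xr, bxr)] for xr, bxr in zip(X, BX)]
    return transpose(X) if tall else X


def muon_step(W, grad, buf, lr=0.02, mu=0.95):
    """One assumed update: accumulate momentum, orthogonalize, then step."""
    buf = [[mu * m + g for m, g in zip(mr, gr)] for mr, gr in zip(buf, grad)]
    update = newtonschulz5_lists(buf)
    W = [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(W, update)]
    return W, buf


W = [[0.0] * 3 for _ in range(2)]
buf = [[0.0] * 3 for _ in range(2)]
grad = [[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]]  # two directions with different scales
W, buf = muon_step(W, grad, buf)
print(W)
```

The two gradient directions differ in magnitude, yet both weights move by a similar amount, because the orthogonalized update has singular values of comparable size.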

Muon has demonstrated significant improvements in training speed on several benchmarks, including the NanoGPT and CIFAR-10 speedruns.
The Newton-Schulz iteration used in Muon has roots in numerical linear algebra and matrix analysis. Iterations of this family are classically used to compute matrix functions such as the inverse square root. In Muon the iteration approximately orthogonalizes the update: it pushes every singular value of G toward 1 while leaving the singular vectors unchanged. The specific coefficients (a, b, c) in the newtonschulz5 function are chosen to balance convergence speed and stability.
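One way to get a feel for the coefficients is to iterate the scalar polynomial that a single step applies to each singular value. This is a minimal sketch; the function name ns5_scalar and the chosen starting values are illustrative assumptions.

```python
def ns5_scalar(sigma, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    """Apply the quintic map x -> a*x + b*x**3 + c*x**5 `steps` times."""
    x = sigma
    for _ in range(steps):
        x = a * x + b * x**3 + c * x**5
    return x


# Singular values of a normalized matrix lie in (0, 1]
starts = [0.01, 0.1, 0.5, 1.0]
results = [ns5_scalar(s) for s in starts]
for s, r in zip(starts, results):
    print(f"{s:>5} -> {r:.3f}")
```

The slope a at zero is well above 1, so small singular values grow quickly, while the higher-order terms keep larger ones from running away. With these coefficients the values settle into a band around 1 rather than converging to exactly 1.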
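Normalizing the input matters because the Frobenius norm is never smaller than the largest singular value, so after dividing by it every singular value is at most 1, which the iteration relies on. The iteration is also stable enough to run in bfloat16, which keeps it cheap even for large matrices.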
Original Sources
↗ https://kellerjordan.github.io/posts/muon/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
24 December 2024