Deriving Muon: A Theoretical Approach to Optimizing Linear Layers


The Engineer

10 Mar 2025

Muon is a neural network optimizer derived from an explicit theoretical principle, in contrast to more heuristic optimizers such as Adam, and it has been used to set speed records with NanoGPT.

Boston, 7 Mar 2025

Particle tracks in a bubble chamber. Fermilab.

We recently introduced Muon, a neural network optimizer that has attracted attention for its strong performance. Muon was used to set speed records with NanoGPT, which drew the interest of major research labs (see this paper).

What sets Muon apart is that it is derived from an exact theoretical principle. Popular optimizers like Adam have more heuristic origins and often converge more slowly than Muon (as shown in Keller’s benchmarks). In this article, I’ll walk through the derivation of Muon, providing context that may help researchers extend these methods to new layer types.

📘 Muon is a collaborative effort with Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, and Laker Newhouse. For more implementation details, check out Keller’s write-up.

What is Muon?

Muon is an optimizer specifically designed for Linear layers in neural networks. A Linear layer takes an input vector ( \mathbf{x} ) and multiplies it by a weight matrix ( \mathbf{W} ) to produce an output vector ( \mathbf{y} = \mathbf{Wx} ). The vectors ( \mathbf{x} ) and ( \mathbf{y} ) are expected to be "dense" activation vectors, meaning their entries are close to unit size. This distinguishes Linear layers from other types like Embedding layers, which handle one-hot inputs.
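To make the distinction between dense and one-hot inputs concrete, here is a minimal sketch in plain Python. The helper names (`rms`, `linear`), the layer sizes, and the 1/sqrt(fan_in) initialization scale are assumptions chosen for illustration; they are not taken from the Muon implementation.

```python
import math
import random

def rms(v):
    """Root-mean-square of a vector's entries."""
    return math.sqrt(sum(a * a for a in v) / len(v))

def linear(W, x):
    """Apply a Linear layer, y = W x, with W stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

random.seed(0)
fan_in, fan_out = 256, 128  # illustrative sizes (assumption)

# Dense input: entries drawn from a standard normal, so the RMS is close to 1.
x_dense = [random.gauss(0.0, 1.0) for _ in range(fan_in)]

# One-hot input, as an Embedding layer would see: the RMS is 1 / sqrt(fan_in).
x_onehot = [0.0] * fan_in
x_onehot[3] = 1.0

# Assumed initialization scale of 1 / sqrt(fan_in), so that a dense
# unit-size input produces an output whose entries are also near unit size.
W = [[random.gauss(0.0, 1.0) / math.sqrt(fan_in) for _ in range(fan_in)]
     for _ in range(fan_out)]

y = linear(W, x_dense)
print("rms(x_dense)  =", rms(x_dense))
print("rms(x_onehot) =", rms(x_onehot))
print("rms(y)        =", rms(y))
```

The dense vector has entries of roughly unit size, while the one-hot vector has a much smaller RMS. This difference in input structure is why the two layer types are treated separately.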

The Linear layer is a fundamental building block in neural networks, making it crucial to optimize effectively.

Theoretical Foundation

One of our broader goals is to "modularize" the theory of deep learning. This involves breaking down the architecture into individual components, deriving theories and algorithms for each part, and then figuring out how to integrate them at the end. Think of it as building a complex system with Lego bricks: each brick (or layer) needs to be well designed before you can assemble the whole structure.

To handle individual layers, Muon normalizes weight updates in a way that, given the input structure, automatically induces desirable effects on the outputs. This approach is inspired by the extensive work done on normalization techniques like batch norm, layer norm, and RMS norm.
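As a reminder of what the activation normalization techniques mentioned above do, here is a minimal sketch of RMS normalization in plain Python. It omits the learnable gain that practical RMS norm layers usually include, and the function name `rms_norm` and the `eps` value are assumptions for illustration. It shows activation normalization only, not Muon's update rule.

```python
import math

def rms_norm(v, eps=1e-8):
    """Rescale v so that the root-mean-square of its entries is close to 1."""
    scale = math.sqrt(sum(a * a for a in v) / len(v))
    return [a / (scale + eps) for a in v]

activations = [3.0, -4.0, 12.0, 0.5]  # arbitrary example values
normalized = rms_norm(activations)

# After normalization the entries are "unit size" in the RMS sense.
rms_after = math.sqrt(sum(a * a for a in normalized) / len(normalized))
print(normalized, rms_after)
```

Normalizing activations this way is what makes the "dense, unit-size input" assumption on a Linear layer reasonable in practice.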

Derivation of Muon

The key to Muon's effectiveness lies in its theoretical derivation, which ensures that the weight updates are normalized in a way that aligns with the input structure. Here’s a step-by-step breakdown:

Step 1: Define the Objective
- The goal is to minimize the loss function ( L ) with respect to the weights ( \mathbf{W} ).
- This involves computing the gradient of the loss with respect to ( \mathbf{W} ), denoted as ( \nabla_{\mathbf{W}} L ).
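To illustrate Step 1, the sketch below computes the gradient of a loss with respect to the weights of a single Linear layer. The squared-error loss and the names `loss`, `grad_W`, and `target` are assumptions for illustration, since the article does not fix a particular loss. For this loss the gradient is the outer product of the output residual and the input, which the code checks against a finite-difference estimate.

```python
def linear(W, x):
    """Apply a Linear layer, y = W x, with W stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def loss(W, x, target):
    """Assumed squared-error loss: L = 0.5 * ||W x - target||^2."""
    y = linear(W, x)
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))

def grad_W(W, x, target):
    """Gradient of the loss above: the outer product (W x - target) x^T."""
    y = linear(W, x)
    residual = [yi - ti for yi, ti in zip(y, target)]
    return [[r * a for a in x] for r in residual]

W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
x = [1.0, 2.0, -1.0]
target = [0.0, 1.0]

G = grad_W(W, x, target)

# Finite-difference estimate of one entry, dL/dW[0][1], as a sanity check.
h = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][1] += h
numeric = (loss(W_plus, x, target) - loss(W, x, target)) / h

print(G)
print(G[0][1], numeric)
```

The gradient has the same shape as the weight matrix, and each entry depends on both an output residual and an input entry. The later steps work with this matrix.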
Step 2: Normalize the Gradient
- To ensure that the updates are scaled appropriately, we normalize the gradient by a factor that depends on the input structure.
- Specifically, for each weight ( W_{ij} ), the update is normalized by the norm of the corresponding input vector ( \mathbf{x}_i ).
Step 3: Update Rule
- The update rule for Muon can be written as: [ W_{ij} \leftarrow W_{ij} - \eta \frac{\nabla_{W_{ij}} L}{\