Muon Optimizer Boosts Training Speed for NanoGPT and CIFAR-10

The Engineer

24 Dec 2024

Muon is an optimizer built to shorten training time. It has set new speed records on the NanoGPT and CIFAR-10 speedrunning benchmarks, which points to faster development cycles for the models it suits.

Muon, a novel optimizer designed specifically for the hidden layers of neural networks, has been making waves in the machine learning community. It's already been used to set new training speed records for both NanoGPT and CIFAR-10 speedrunning. In this article, we’ll dive into the technical details of Muon, its design, and why it’s been so effective.

What is Muon?

Muon is an optimizer tailored for 2D parameters in neural network hidden layers. It takes the update produced by SGD with momentum and approximately orthogonalizes it with a Newton-Schulz matrix iteration before applying it to the weights. For a gradient G_t, momentum coefficient \mu, and learning rate \eta, one step can be written as:

\[ B_t = \mu B_{t-1} + G_t, \qquad O_t = \text{NewtonSchulz5}(B_t), \qquad \theta_t = \theta_{t-1} - \eta \, O_t \]

where NewtonSchulz5 is a specific matrix iteration that can be implemented in PyTorch like this:

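def newtonschulz5(G, steps=5, eps=1e-7):
    # Approximately orthogonalize a 2D tensor G with a quintic Newton-Schulz iteration.
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    # Normalize so every singular value is at most 1 (out-of-place, so G is never modified).
    X = X / (X.norm() + eps)
    # Work in the wide orientation so that X @ X.T is the smaller Gram matrix.
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X  # applies a*x + b*x^3 + c*x^5 to each singular value
    if G.size(0) > G.size(1):
        X = X.T
    return X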
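Muon feeds a momentum buffer, not the raw gradient, through this iteration. The sketch below walks through one full step (momentum accumulation, orthogonalization, weight update) on a tiny matrix. To stay dependency-free it uses nested Python lists instead of PyTorch tensors. The helper names (matmul, combine, muon_step) and the lr and momentum values are assumptions made for illustration, not part of any published API.

```python
# Dependency-free sketch of one Muon step on nested lists (not tensors).
# All helper names and hyperparameter values are hypothetical.

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def combine(alpha, A, beta, B):
    """Return alpha * A + beta * B elementwise."""
    return [[alpha * x + beta * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def newton_schulz5(G, steps=5, eps=1e-7):
    """List-based counterpart of newtonschulz5, in ordinary float precision."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(x * x for row in G for x in row) ** 0.5  # Frobenius norm
    X = [[x / (norm + eps) for x in row] for row in G]
    tall = len(G) > len(G[0])
    if tall:
        X = transpose(X)
    for _ in range(steps):
        A = matmul(X, transpose(X))
        B = combine(b, A, c, matmul(A, A))
        X = combine(a, X, 1.0, matmul(B, X))
    return transpose(X) if tall else X

def muon_step(theta, grad, buf, lr=0.02, momentum=0.95):
    """Accumulate momentum, orthogonalize it, then move the weights."""
    buf = combine(momentum, buf, 1.0, grad)   # B_t = mu * B_{t-1} + G_t
    update = newton_schulz5(buf)              # O_t = NewtonSchulz5(B_t)
    theta = combine(1.0, theta, -lr, update)  # theta_t = theta_{t-1} - lr * O_t
    return theta, buf

theta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
grad = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # singular values 3 and 1
buf = [[0.0] * 3 for _ in range(2)]
theta, buf = muon_step(theta, grad, buf)
print(theta)
```

The two gradient directions differ in size by a factor of three, yet both weights move by a similar amount. That equalizing of singular values is the effect of the orthogonalization.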

How to Use Muon

When using Muon in your neural network, it's important to note that scalar and vector parameters, as well as the input and output layers, should be optimized using a standard method like AdamW. Muon is specifically designed for 2D parameters, but it can also handle 4D convolutional parameters by flattening their last three dimensions.
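The flattening rule can be sketched as simple shape arithmetic. The helper below is hypothetical (the name muon_matrix_shape is an assumption). It reports the 2D shape Muon would operate on for a given parameter shape, or None when the parameter is a scalar or vector that belongs with AdamW. It only looks at shapes, so excluding the input and output layers still has to be done separately, for example by name.

```python
from math import prod

def muon_matrix_shape(shape):
    """Return the 2D shape Muon would see, or None if the parameter
    is a scalar or vector and should be left to AdamW."""
    if len(shape) < 2:
        return None  # biases, norm gains, scalars
    # Keep the first dimension and flatten the rest, e.g. a conv weight
    # (out_channels, in_channels, kh, kw) -> (out_channels, in_channels * kh * kw).
    # On a PyTorch tensor p this corresponds to p.reshape(p.size(0), -1).
    return (shape[0], prod(shape[1:]))

print(muon_matrix_shape((768, 3072)))    # linear weight, unchanged
print(muon_matrix_shape((64, 3, 3, 3)))  # conv weight, last three dims flattened
print(muon_matrix_shape((768,)))         # bias vector, not a Muon parameter
```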

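For example, in a NanoGPT-style model such as the one used in the speedrun, the parameters can be split between the two optimizers as follows:

# Example usage in NanoGPT: split parameters between Muon and AdamW
# (model.transformer.h holds the hidden transformer blocks)
hidden_matrices = [p for p in model.transformer.h.parameters() if p.ndim >= 2]
hidden_ids = {id(p) for p in hidden_matrices}
other_params = [p for p in model.parameters() if id(p) not in hidden_ids]
adamw = torch.optim.AdamW(other_params, lr=3e-4)
# hidden_matrices are handed to a separate Muon optimizer, stepped alongside adamw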

Empirical Results

Muon has demonstrated significant improvements in training speed for several benchmarks:

CIFAR-10: Improved the speed record for training to 94% accuracy from 3.3 A100-seconds to 2.6 A100-seconds, a reduction of roughly 21%.
NanoGPT Speedrunning: Improved the training speed record for reaching a validation loss of 3.28 on FineWeb by a factor of 1.35x. NanoGPT speedrunning is a competitive task in which a small GPT-style language model is trained from scratch on the FineWeb dataset. It is not a fine-tuning task.
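The two results are reported in different units (absolute times for CIFAR-10, a ratio for NanoGPT), so it can help to put them on a common footing. The short calculation below does that, assuming the "factor of 1.35x" means the old time divided by the new time.

```python
cifar_before, cifar_after = 3.3, 2.6  # A100-seconds, from the figures above
cifar_speedup = cifar_before / cifar_after
cifar_time_saved = 1 - cifar_after / cifar_before

nanogpt_speedup = 1.35  # assumed to be old_time / new_time
nanogpt_time_saved = 1 - 1 / nanogpt_speedup

print(f"CIFAR-10: {cifar_speedup:.2f}x faster, {cifar_time_saved:.0%} less time")
print(f"NanoGPT:  {nanogpt_speedup:.2f}x faster, {nanogpt_time_saved:.0%} less time")
# CIFAR-10: 1.27x faster, 21% less time
# NanoGPT:  1.35x faster, 26% less time
```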

Design Details and Connections to Prior Research

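The Newton-Schulz iteration used in Muon has roots in numerical linear algebra and matrix analysis. In Muon it is used to approximately orthogonalize the update: if the input has the singular value decomposition G = USV^T, the iteration pushes every singular value toward 1, so the output approximates UV^T. This is equivalent to G(G^T G)^{-1/2}, which is where the matrix inverse square root enters. Each step applies the odd polynomial ax + bx^3 + cx^5 to the singular values, and the specific coefficients (a, b, c) in the newtonschulz5 function are chosen to balance convergence speed and stability.

Convergence: The classical Newton-Schulz coefficients converge quadratically once the singular values are close to 1. The tuned coefficients used here trade exact convergence for speed: the large linear coefficient a = 3.4445 inflates small singular values quickly, and after five steps the singular values land near 1 rather than exactly on it.
Stability: Dividing by the Frobenius norm places every singular value in [0, 1], the range the iteration requires. The iteration is also stable enough to run in bfloat16, which keeps it cheap on modern GPUs.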
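Because each Newton-Schulz step acts on the singular values independently, the behaviour of the coefficients can be explored with plain scalars. The sketch below applies the quintic ax + bx^3 + cx^5 five times to a few starting values in (0, 1], standing in for normalized singular values. The function names and the chosen starting values are assumptions made for illustration.

```python
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315  # coefficients from newtonschulz5

def quintic(x):
    """Effect of one Newton-Schulz step on a single singular value."""
    return A_COEF * x + B_COEF * x**3 + C_COEF * x**5

def iterate(x, steps=5):
    for _ in range(steps):
        x = quintic(x)
    return x

starts = [0.01, 0.1, 0.3, 0.5, 0.9, 1.0]
finals = [iterate(s) for s in starts]
for s, f in zip(starts, finals):
    print(f"{s:.2f} -> {f:.3f}")
```

Starting values spread over two orders of magnitude all end up in a band around 1 instead of converging to it exactly. Much smaller starting values (for example 0.001) need more than five steps to get there.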

Why It Works

The effectiveness of Muon can be attributed to several factors:

Matrix Inversion: By efficiently computing the inverse square root