PyTorch's AdamW Implementation Doesn't Fully Decouple Weight Decay and Learning Rate

The Engineer

20 Feb 2024 · 3 min read

While many believe PyTorch’s AdamW optimizer handles weight decay and learning rate separately, its implementation couples the two more tightly than expected, which changes how they should be tuned together.

If you've been using AdamW in PyTorch, you might have assumed it decouples weight decay from the learning rate. A closer look at how PyTorch implements AdamW shows that this isn't entirely true. This article walks through the technical details and gives a practical tuning strategy to address the issue.

The AdamW Algorithm

AdamW, introduced by Loshchilov and Hutter, is a popular optimization algorithm for training large-scale machine learning models. It's an extension of the Adam optimizer that aims to handle weight decay more effectively. Here’s a quick overview of how it works:

Stochastic Gradient Calculation: In each iteration \( t \), we compute the stochastic gradient \( g_t = \nabla \ell(w_t, x_t) \), where \( w_t \) are the parameters and \( x_t \) is a batch of data.
First and Second Moment Estimates:
- First moment (mean): \( m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \)
- Second moment (uncentered variance): \( v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t \odot g_t \)
Bias Correction:
- Bias-corrected first moment: \( \hat{m}_t = m_t / (1 - \beta_1^t) \)
- Bias-corrected second moment: \( \hat{v}_t = v_t / (1 - \beta_2^t) \)
Preconditioner: \( D_t = \text{diag}(\epsilon + \sqrt{\hat{v}_t}) \), with the square root taken elementwise.

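The original AdamW update rule, as proposed by Loshchilov and Hutter, is: \[ w_{t+1} = (1 - \lambda \eta_t) w_t - \alpha_t D_t^{-1} \hat{m}_t \]

Here \( \lambda \) is the weight decay, \( \eta_t \) is the schedule multiplier at step \( t \), and \( \alpha_t = \alpha \eta_t \) is the scheduled learning rate built from the base learning rate \( \alpha \). Note that \( \alpha \) does not appear in the decay factor.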
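The steps and the update rule above can be put together in a few lines of plain Python. This is only a minimal sketch of the paper's form of the update, not PyTorch's code: the function name `adamw_step`, its argument names, and the default values are chosen here for illustration, and the preconditioner uses the square root of the bias-corrected second moment, as in Adam.

```python
import math

def adamw_step(w, g, m, v, t, alpha=1e-3, eta_t=1.0, lam=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW step in the paper's form, on plain Python lists.

    w, g, m, v: parameters, gradients, first and second moment estimates.
    t: 1-based iteration counter. eta_t: schedule multiplier at step t.
    Returns the updated (w, m, v) as new lists.
    """
    alpha_t = alpha * eta_t  # scheduled learning rate
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(w, g, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g_i        # first moment
        v_i = beta2 * v_i + (1 - beta2) * g_i * g_i  # second moment
        m_hat = m_i / (1 - beta1 ** t)               # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        d_i = eps + math.sqrt(v_hat)                 # diagonal preconditioner entry
        # The decay factor uses lam * eta_t only; alpha does not enter it.
        w_i = (1 - lam * eta_t) * w_i - alpha_t * m_hat / d_i
        new_w.append(w_i)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v

# Usage: a single step on one parameter, starting from zero moments.
w, m, v = adamw_step(w=[1.0], g=[2.0], m=[0.0], v=[0.0], t=1)
print(w)
```

With a zero gradient the adaptive term vanishes and the step reduces to pure shrinkage by the factor \( 1 - \lambda \eta_t \), whatever value `alpha` takes.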

Decoupling Weight Decay and Learning Rate

The key idea behind AdamW is to decouple weight decay from the learning rate. In practice, this means that changing the learning rate should not affect the weight decay. This is important because it allows for more flexible tuning of hyperparameters.

Weight Decay: Regularizes the model by penalizing large weights.
Learning Rate: Controls the step size during gradient descent.

PyTorch's Implementation

PyTorch’s implementation of AdamW differs slightly from the original paper: \[ w_{t+1} = (1 - \lambda \alpha_t) w_t - \alpha_t D_t^{-1} \hat{m}_t \]

Notice that in PyTorch the weight decay \( \lambda \) is multiplied by the scheduled learning rate \( \alpha_t = \alpha \eta_t \), not by the schedule multiplier \( \eta_t \) alone as in the paper. This means that when you adjust the learning rate, the effective weight decay changes with it.
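The difference is easy to see by computing the per-step decay factor under each convention. The sketch below assumes that PyTorch shrinks each parameter by the factor `1 - lr * weight_decay` before the adaptive step, which is how its AdamW documentation describes the update; the two function names are made up for this illustration.

```python
def paper_decay_factor(lam, eta_t):
    """Decay factor in the paper's form: (1 - lam * eta_t)."""
    return 1.0 - lam * eta_t

def pytorch_decay_factor(lam, alpha, eta_t):
    """Decay factor assumed for PyTorch: (1 - alpha * eta_t * lam)."""
    return 1.0 - alpha * eta_t * lam

lam, eta_t = 0.01, 1.0
for alpha in (0.001, 0.002):
    # Shrinkage is the fraction of the weight removed in one step.
    paper_shrink = 1.0 - paper_decay_factor(lam, eta_t)
    pytorch_shrink = 1.0 - pytorch_decay_factor(lam, alpha, eta_t)
    print(f"alpha={alpha}: paper={paper_shrink:.2e}, pytorch={pytorch_shrink:.2e}")
```

In the paper's form the shrinkage is the same for both learning rates, while under the PyTorch convention it doubles when `alpha` doubles.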

Why It Matters

This implementation detail can lead to suboptimal performance if not accounted for. In PyTorch the fraction by which the weights shrink at each step is \( \alpha_t \lambda \), so doubling the learning rate without adjusting the weight decay also doubles the per-step shrinkage and can over-regularize the model, while halving the learning rate can under-regularize it.
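A rough way to quantify this is to ignore the gradient term and assume PyTorch's per-step shrink factor is \( 1 - \alpha \eta_t \lambda \) with \( \alpha \eta_t \lambda \ll 1 \). Under those assumptions, after \( T \) steps the weights have been scaled by approximately:

\[ \prod_{t=1}^{T} (1 - \alpha \eta_t \lambda) \approx \exp\left( -\alpha \lambda \sum_{t=1}^{T} \eta_t \right) \]

In this approximation the total shrinkage depends on \( \alpha \) and \( \lambda \) only through their product, so doubling \( \alpha \) at fixed \( \lambda \) doubles the exponent.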

Practical Tuning Strategy

To address this issue, you should adjust the weight decay when changing the learning rate. The rule of thumb is: \[ \text{When doubling the learning rate, halve the weight decay} \]

This keeps the product \( \alpha \lambda \), and with it the effective regularization, the same across different learning rates.

Example

Let’s say you start with a learning rate \( \alpha = 0.001 \) and a weight decay \( \lambda = 0.01 \), so that \( \alpha \lambda = 10^{-5} \). If you decide to double the learning rate to \( \alpha = 0.002 \), you should halve the weight decay to \( \lambda = 0.005 \), which leaves the product at \( 10^{-5} \).
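The same arithmetic works for any change in learning rate, not only doubling. Below is a small helper that sketches this; `rescale_weight_decay` is a hypothetical name and not part of PyTorch, and it assumes the goal is to keep the product of learning rate and weight decay unchanged.

```python
def rescale_weight_decay(base_lr, base_wd, new_lr):
    """Return the weight decay that keeps lr * weight_decay unchanged."""
    if new_lr <= 0:
        raise ValueError("new_lr must be positive")
    return base_wd * base_lr / new_lr

# Values from the example above: doubling the learning rate.
new_wd = rescale_weight_decay(base_lr=0.001, base_wd=0.01, new_lr=0.002)
print(new_wd)  # approximately 0.005, up to floating-point rounding

# The pair would then be passed to the optimizer, for example:
# torch.optim.AdamW(model.parameters(), lr=0.002, weight_decay=new_wd)
```

Keeping the rescaling in one place makes it harder to change the learning rate in a sweep and forget the matching weight decay.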

Conclusion

While AdamW is a powerful optimization algorithm, its implementation in PyTorch does not fully decouple weight decay from the learning rate. By understanding this subtlety and adjusting your tuning strategy accordingly, you can achieve better performance and more consistent results in your machine learning