
While many believe PyTorch’s AdamW optimizer manages weight decay and learning rate separately, its implementation couples the two more tightly than expected, challenging conventional tuning practices.
If you've been using AdamW in PyTorch, you might have assumed it decouples weight decay from the learning rate. A closer look at how PyTorch implements AdamW shows that this isn't entirely true. This article delves into the technical details and provides a practical tuning strategy to address the issue.
AdamW, introduced by Loshchilov and Hutter, is a popular optimization algorithm for training large-scale machine learning models. It's an extension of the Adam optimizer that aims to handle weight decay more effectively. Here’s a quick overview of how it works:
The original AdamW update rule, as proposed by Loshchilov and Hutter, is:
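\[ w_{t+1} = (1 - \lambda \eta_t) w_t - \alpha_t D_t^{-1} \hat{m}_t \]
Here \( \lambda \) is the weight decay, \( \eta_t \) is the learning rate schedule, \( \alpha_t = \alpha \eta_t \) is the scheduled learning rate with base value \( \alpha \), \( \hat{m}_t \) is the bias-corrected first-moment estimate, and \( D_t \) is the diagonal preconditioner built from the second-moment estimate.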
The key idea behind AdamW is to decouple weight decay from the learning rate. In practice, this means that changing the learning rate should not affect the weight decay. This is important because it allows for more flexible tuning of hyperparameters.
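To make the rule concrete, here is a minimal pure-Python sketch of one update step for a single scalar weight. It assumes the standard Adam moment estimates, takes \( D_t = \sqrt{\hat{v}_t} + \epsilon \), and writes the step size as \( \alpha \eta_t \). The function name `adamw_paper_step` and its arguments are illustrative and do not come from any library.

```python
import math

def adamw_paper_step(w, grad, m, v, t, alpha, eta_t, lam,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW step for a scalar weight, following the paper-style rule.

    Assumption: the decay term is scaled by the schedule eta_t only,
    not by the base learning rate alpha. Returns the updated (w, m, v).
    """
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction (t counts steps starting from 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: shrink the weight by (1 - lam * eta_t).
    w_decayed = (1 - lam * eta_t) * w
    # Adam step with step size alpha * eta_t.
    w_new = w_decayed - alpha * eta_t * m_hat / (math.sqrt(v_hat) + eps)
    return w_new, m, v

# With a zero gradient only the decay term acts, so the result
# does not depend on alpha under this rule.
w1, _, _ = adamw_paper_step(1.0, 0.0, 0.0, 0.0, t=1, alpha=1e-3, eta_t=1.0, lam=0.01)
w2, _, _ = adamw_paper_step(1.0, 0.0, 0.0, 0.0, t=1, alpha=2e-3, eta_t=1.0, lam=0.01)
print(w1, w2)
```

Feeding a zero gradient is a convenient way to isolate the decay term: both calls shrink the weight by the same factor even though `alpha` differs.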
PyTorch’s implementation of AdamW differs slightly from the original paper:
\[ w_{t+1} = (1 - \lambda \alpha_t) w_t - \alpha_t D_t^{-1} \hat{m}_t \]

Notice that in PyTorch the weight decay \( \lambda \) is multiplied by the full learning rate \( \alpha_t = \alpha \eta_t \), not only by the schedule \( \eta_t \). This means that when you adjust the base learning rate \( \alpha \), the effective weight decay \( \lambda \alpha \) changes with it.
This implementation detail can lead to suboptimal performance if not accounted for. Specifically, doubling the learning rate without adjusting the weight decay also doubles the effective decay, so a value of \( \lambda \) tuned at one learning rate can over-regularize at a higher learning rate and under-regularize at a lower one.
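The coupling can be checked with a few lines of arithmetic. The sketch below isolates the decay term by assuming the gradient is zero, so each step only multiplies the weight by its decay factor. The helper name `shrink_pytorch_style` is hypothetical, and the factor \( 1 - \alpha \eta_t \lambda \) is an assumption based on how `torch.optim.AdamW` documents its decay step (learning rate times `weight_decay`).

```python
def shrink_pytorch_style(lam, alpha, eta, steps):
    """Total shrinkage of a weight after `steps` zero-gradient steps
    when the decay is scaled by the full learning rate alpha * eta."""
    return (1 - alpha * eta * lam) ** steps

eta, steps = 1.0, 10_000  # constant schedule, illustrative step count

base = shrink_pytorch_style(lam=0.01, alpha=1e-3, eta=eta, steps=steps)
doubled_lr = shrink_pytorch_style(lam=0.01, alpha=2e-3, eta=eta, steps=steps)
compensated = shrink_pytorch_style(lam=0.005, alpha=2e-3, eta=eta, steps=steps)

print(f"base (lr=1e-3, wd=0.01):         {base:.4f}")
print(f"doubled lr (lr=2e-3, wd=0.01):   {doubled_lr:.4f}")
print(f"compensated (lr=2e-3, wd=0.005): {compensated:.4f}")
```

Under this assumption, doubling the learning rate alone shrinks the weight more, while halving the weight decay at the same time restores the original shrinkage.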
To address this issue, you should adjust the weight decay when changing the learning rate. The rule of thumb is: when doubling the learning rate, halve the weight decay.
This keeps the product of learning rate and weight decay constant, so the effective regularization remains consistent across different learning rates.
Let’s say you start with a learning rate \( \alpha = 0.001 \) and a weight decay \( \lambda = 0.01 \). If you decide to double the learning rate to \( \alpha = 0.002 \), you should halve the weight decay to \( \lambda = 0.005 \). In both cases the product \( \alpha \lambda \) is \( 10^{-5} \).
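The same adjustment can be wrapped in a small helper for learning rate sweeps. This is only a sketch: `rescale_weight_decay` is a hypothetical name, and it assumes the goal is to hold the product of learning rate and weight decay fixed.

```python
def rescale_weight_decay(old_lr, old_wd, new_lr):
    """Return the weight decay that keeps lr * weight_decay unchanged
    when moving from old_lr to new_lr."""
    if new_lr <= 0:
        raise ValueError("new_lr must be positive")
    return old_wd * old_lr / new_lr

# The example from the text: doubling the learning rate halves the decay.
new_wd = rescale_weight_decay(old_lr=0.001, old_wd=0.01, new_lr=0.002)
print(new_wd)  # approximately 0.005
```

The returned value can then be passed as the `weight_decay` argument alongside the new `lr` when constructing `torch.optim.AdamW`.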
While AdamW is a powerful optimization algorithm, its implementation in PyTorch does not fully decouple weight decay from the learning rate. By understanding this subtlety and adjusting your tuning strategy accordingly, you can achieve better performance and more consistent results.
Original Sources
↗ https://fabian-sp.github.io/posts/2024/02/decoupling/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
20 February 2024