
Apple's new Sigma Reparametrization technique stabilizes transformer training by rethinking how models handle attention mechanisms, tackling problems like entropy collapse without sacrificing performance.
Apple has released a technique called Sigma Reparametrization aimed at improving the training stability of transformer models. The method, available in Apple's open-source repository ml-sigma-reparam, targets two common failure modes during training: entropy collapse, where attention distributions become sharply concentrated on a few tokens, and attention matrix instability.
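To make entropy collapse concrete, here is a minimal sketch that measures the Shannon entropy of a single query's attention distribution and shows it shrinking as the attention logits grow in magnitude. The inputs are made-up toy values, and the function names (`softmax`, `attention_entropy`) are illustrative only; nothing here is taken from the ml-sigma-reparam code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(query, keys):
    """Shannon entropy (in nats) of one query's attention over the keys."""
    scale = 1.0 / math.sqrt(len(query))  # standard 1/sqrt(d) scaling
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    probs = softmax(scores)
    # Skip exact zeros so underflowed probabilities do not hit log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy query and keys, invented purely for illustration.
query = [1.0, 0.5, -0.5, 0.25]
keys = [
    [0.9, 0.1, -0.3, 0.2],
    [-0.4, 0.8, 0.5, -0.1],
    [0.2, -0.6, 0.7, 0.3],
    [0.1, 0.3, -0.2, -0.9],
]

max_entropy = math.log(len(keys))  # entropy of uniform attention
for gain in (1.0, 4.0, 16.0):
    # Scaling the query stands in for weights whose norm has grown.
    scaled_query = [gain * q for q in query]
    h = attention_entropy(scaled_query, keys)
    print(f"gain={gain:>4}: entropy={h:.4f} (max {max_entropy:.4f})")
```

As the gain increases, the softmax puts nearly all of its mass on one key and the entropy falls toward zero. That is the regime the technique is meant to keep training away from.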
The core innovation in Sigma Reparametrization lies in how it handles the attention mechanism within transformers. Traditionally, the attention scores are computed using a dot-product between query and key vectors, followed by a softmax function to normalize these scores. However, this process can lead to instability, particularly when the model is deep or the batch size is large.
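For reference, the accompanying Apple paper describes the reparametrization as rescaling a linear layer's weight matrix W by its spectral norm σ(W) and a learnable scalar γ, giving γ/σ(W) · W. The sketch below illustrates that idea in plain Python on a fixed matrix. It is a rough illustration under stated assumptions, not the repository's implementation: the helper names (`spectral_norm`, `sigma_reparam`) are hypothetical, γ is passed in as a constant rather than learned, and the spectral norm is estimated by running many power iterations from scratch, whereas a training loop would more likely update the estimate incrementally.

```python
import math
import random

def matvec(W, v):
    """Compute W v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matvec_t(W, u):
    """Compute W^T u without materialising the transpose."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]

def normalize(v):
    """Scale v to unit Euclidean length (assumes v is non-zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spectral_norm(W, iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in W[0]])
    for _ in range(iters):
        u = normalize(matvec(W, v))
        v = normalize(matvec_t(W, u))
    return math.sqrt(sum(x * x for x in matvec(W, v)))

def sigma_reparam(W, gamma=1.0):
    """Return (gamma / sigma(W)) * W; gamma is a fixed constant in this sketch."""
    sigma = spectral_norm(W)
    return [[gamma / sigma * w for w in row] for row in W]

# Toy weight matrix whose largest singular value is 4 by construction.
W = [[4.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
W_hat = sigma_reparam(W, gamma=1.5)
print("sigma(W)     ~", round(spectral_norm(W), 6))
print("sigma(W_hat) ~", round(spectral_norm(W_hat), 6))
```

The point of the rescaling is that the spectral norm of the effective weight is set by γ alone, so the size of the attention logits produced by the query and key projections no longer depends on how large W itself has grown.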
For practitioners, these changes offer several practical benefits:
The ml-sigma-reparam repository provides both speech and vision modules, demonstrating the versatility of this technique across different domains. Here’s a breakdown of the key components:

Apple reports significant improvements in both speech and vision tasks:
To get started with Sigma Reparametrization, you can clone the repository and follow the provided instructions:
git clone https://github.com/apple/ml-sigma-reparam.git
cd ml-sigma-reparam
The repository includes detailed documentation and example scripts for training models on both speech and vision tasks.
Sigma Reparametrization is aimed at improving the stability and performance of transformer models. By addressing failure modes like entropy collapse, the technique can help researchers and practitioners build more robust and reliable models across speech, vision, and other domains.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
18 December 2023