
Researchers uncover why transformer-based language models flounder at basic multi-digit multiplication, tracing the failure to how the models handle long-range dependencies.
Transformers, the backbone of modern language models, perform well across a wide range of tasks, yet they still struggle with seemingly simple ones like multi-digit multiplication. In a recent study, researchers from the University of Chicago, MIT, University of Waterloo, Harvard University, and Google DeepMind investigated why these powerful models fail at the task.
The researchers compared two models: one trained with standard fine-tuning (SFT) and one trained with implicit chain-of-thought (ICoT). The SFT model, despite its large number of parameters, fails to learn multi-digit multiplication. The ICoT model learns the task by internalizing the intermediate steps during training.
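To make the "intermediate steps" concrete, here is a minimal Python sketch of schoolbook multiplication done digit by digit. It is an illustration only, not code from the study, and the function names `digits_lsd` and `multiply_by_digits` are hypothetical. It shows that each output digit depends on every pair of input digits whose positions sum to that output position, plus a carry from all earlier positions, which is the kind of long-range dependency the article refers to.

```python
def digits_lsd(n):
    """Return the decimal digits of a non-negative integer, least significant first."""
    return [int(ch) for ch in reversed(str(n))]


def multiply_by_digits(a, b):
    """Multiply two non-negative integers one output digit at a time.

    Returns the product and, for each output position k, the list of
    (i, j) input-digit index pairs that feed into that position.
    """
    a_digits, b_digits = digits_lsd(a), digits_lsd(b)
    out_digits, dependencies = [], []
    carry = 0
    for k in range(len(a_digits) + len(b_digits)):
        # Every pair with i + j == k contributes a partial product to digit k.
        pairs = [
            (i, k - i)
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        ]
        total = carry + sum(a_digits[i] * b_digits[j] for i, j in pairs)
        out_digits.append(total % 10)   # the digit written at position k
        carry = total // 10             # passed on to position k + 1
        dependencies.append(pairs)
    product = int("".join(str(d) for d in reversed(out_digits)))
    return product, dependencies


if __name__ == "__main__":
    product, deps = multiply_by_digits(1234, 5678)
    print(product)
    for k, pairs in enumerate(deps):
        print(f"output digit {k} depends on input pairs {pairs}")
```

For two four-digit operands, the middle output digit draws on four separate digit pairs spread across both inputs, in addition to the carry chain. A model that emits only the final answer has to account for all of them internally.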
Evidence of Long-Range Structure:
Mechanism: Directed Acyclic Graph (DAG) Construction:
Geometry of Representations:

The study also explored why the SFT model fails to learn multi-digit multiplication. The researchers found that, during fine-tuning, the SFT model converges to a local optimum that does not capture the required long-range dependencies. To address this, they introduced an auxiliary loss in which a linear regression probe on the model's internal representations predicts the "running sum".
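The idea can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the study's implementation: it assumes the "running sum" at output position k is the carry-in plus all partial products landing on that position, and the names `running_sums`, `probe_predict`, `auxiliary_loss`, `total_loss`, and the weighting parameter `aux_weight` are all hypothetical.

```python
def running_sums(a, b):
    """Assumed definition: for each output position k, the carry-in plus
    the sum of all partial products a_i * b_j with i + j == k."""
    a_digits = [int(ch) for ch in reversed(str(a))]
    b_digits = [int(ch) for ch in reversed(str(b))]
    sums, carry = [], 0
    for k in range(len(a_digits) + len(b_digits)):
        total = carry + sum(
            a_digits[i] * b_digits[k - i]
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        )
        sums.append(total)
        carry = total // 10
    return sums


def probe_predict(hidden, weights, bias):
    """Linear regression probe: a dot product with a hidden state plus a bias."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias


def auxiliary_loss(hidden_states, targets, weights, bias):
    """Mean squared error between probe predictions and running-sum targets."""
    errors = [
        (probe_predict(h, weights, bias) - t) ** 2
        for h, t in zip(hidden_states, targets)
    ]
    return sum(errors) / len(errors)


def total_loss(token_loss, aux_loss, aux_weight=1.0):
    """Combine the usual next-token loss with the weighted auxiliary term."""
    return token_loss + aux_weight * aux_loss


if __name__ == "__main__":
    targets = running_sums(12, 34)
    # Toy one-hot "hidden states", one per output position.
    hidden_states = [[1.0 if i == k else 0.0 for i in range(4)] for k in range(4)]
    weights, bias = [0.0, 0.0, 0.0, 0.0], 0.0
    aux = auxiliary_loss(hidden_states, targets, weights, bias)
    print(targets, aux, total_loss(token_loss=0.5, aux_loss=aux))
```

In a real training loop the hidden states would come from the transformer and the probe weights would be learned jointly with the model. The point of the sketch is only that the auxiliary term rewards representations from which the running sum can be read off linearly.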
This research highlights the importance of understanding and addressing the limitations of Transformers, particularly in tasks that require managing long-range dependencies. By reverse-engineering successful models and identifying effective mechanisms, researchers can develop more robust training strategies and inductive biases.
In summary, this study not only uncovers a significant pitfall in learning long-range dependencies with Transformers but also provides actionable insights for improving model performance on algorithmic tasks like multi-digit multiplication.
Original Sources
↗ https://arxiv.org/pdf/2510.00184
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
7 October 2025