Reverse-Engineering Transformers Reveals Long-Range Dependency Pitfalls in Multi-Digit Multiplication

The Engineer

7 Oct 2025 · 3 min read

Researchers uncover why transformer-based language models flounder at basic multi-digit multiplication, tracing the failure to how they learn long-range dependencies.

Transformers, the backbone of modern language models, perform well on a wide range of tasks, yet they still struggle with a seemingly simple one: multi-digit multiplication. In a recent study, researchers from the University of Chicago, MIT, the University of Waterloo, Harvard University, and Google DeepMind examined why these models fail at the task and what a model that succeeds does differently.

What Changed Technically?

The researchers compared two models: one trained with standard fine-tuning (SFT) and one trained with implicit chain-of-thought (ICoT). The SFT model, despite its large number of parameters, fails to learn multi-digit multiplication. In contrast, the ICoT model learns the task by internalizing intermediate steps during training.
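To make "intermediate steps" concrete, here is a minimal sketch of schoolbook multiplication written digit by digit. It is an illustration only, not the chain-of-thought format used in the study; the function name `running_sums` and the least-significant-digit-first layout are assumptions made for this example. It shows why each output digit depends on many input digit pairs plus a carry from all earlier positions.

```python
def running_sums(a: int, b: int):
    """Schoolbook multiplication, least-significant digit first.

    Returns (digits, sums): the product's digits and, for each output
    position k, the running sum (all partial products a_i * b_j with
    i + j == k, plus the carry from position k - 1).
    """
    a_digits = [int(ch) for ch in reversed(str(a))]
    b_digits = [int(ch) for ch in reversed(str(b))]
    n_out = len(a_digits) + len(b_digits)

    digits, sums, carry = [], [], 0
    for k in range(n_out):
        # Every pair (i, j) with i + j == k contributes to output digit k.
        partial = sum(
            a_digits[i] * b_digits[k - i]
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        )
        total = partial + carry    # running sum at position k
        sums.append(total)
        digits.append(total % 10)  # the digit that is actually emitted
        carry = total // 10        # passed on to position k + 1
    return digits, sums


digits, sums = running_sums(1234, 5678)
product = int("".join(str(d) for d in reversed(digits)))
assert product == 1234 * 5678
```

The middle output digits draw on every digit of both operands, and the carry chains each position to all the ones before it. A model that only sees the final answer has to discover this structure on its own.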

Key Findings

Evidence of Long-Range Structure:
- Logit attributions and linear probes revealed that the ICoT model encodes the necessary long-range dependencies for multi-digit multiplication.
- This suggests that the model can maintain and use information across distant digit positions, which multiplication requires because each output digit depends on many pairs of input digits.
Mechanism: Directed Acyclic Graph (DAG) Construction:
- The ICoT model uses attention mechanisms to construct a DAG, effectively "caching" and "retrieving" pairwise partial products.
- This mechanism allows the model to manage long-range dependencies efficiently by breaking down the multiplication into smaller, manageable steps.
Geometry of Representations:
- The ICoT model represents digits using a Fourier basis, an intuitive and efficient representation for multi-digit operations.
- Partial products are formed in attention heads as Minkowski sums between pairs of digits. The SFT model lacks both representations.
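The two geometric ideas above can be illustrated with a small toy sketch. It does not reproduce the model's learned representations: the frequency set `(1, 2, 5)`, the helper names, and the 2-D axis-aligned embeddings are all assumptions chosen for illustration. The first part embeds a digit as cosine/sine pairs over base 10; the second computes the Minkowski sum of two finite point sets, that is, every pairwise sum of one point from each set.

```python
import math

BASE = 10


def fourier_features(d, freqs=(1, 2, 5)):
    """Embed digit d as (cos, sin) pairs at the given frequencies mod BASE."""
    feats = []
    for k in freqs:
        angle = 2 * math.pi * k * d / BASE
        feats += [math.cos(angle), math.sin(angle)]
    return feats


def decode_digit(feats):
    """Recover the digit from the frequency-1 (cos, sin) pair."""
    angle = math.atan2(feats[1], feats[0])
    return round(angle * BASE / (2 * math.pi)) % BASE


def minkowski_sum(A, B):
    """Minkowski sum of two finite point sets: every a + b, coordinate-wise."""
    return {tuple(x + y for x, y in zip(a, b)) for a in A for b in B}


# Each digit maps to a distinct point on the circle and can be read back.
assert all(decode_digit(fourier_features(d)) == d for d in range(BASE))

# Toy embeddings (hypothetical): first operand on the x-axis, second on the y-axis.
A = {(a, 0) for a in range(BASE)}
B = {(0, b) for b in range(BASE)}
pairs = minkowski_sum(A, B)
assert len(pairs) == BASE * BASE  # every (a, b) pair lands on its own point
```

In this toy setup the frequency-5 cosine reduces to the digit's parity, and the summed embeddings keep every digit pair distinguishable, which is the property a downstream layer would need in order to look up a partial product.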

Learning Dynamics and Inductive Bias

The study also explored why the SFT model fails to learn multi-digit multiplication. The researchers found that during fine-tuning, the SFT model converges to a local optimum that lacks the required long-range dependencies. To test this explanation, they added an auxiliary loss that trains the model to predict the "running sum" at each step, read out from its internal representations by a linear regression probe.

Auxiliary Loss:
- This inductive bias helps the model learn the necessary long-range dependencies by providing additional guidance during training.
- With this modification, the model learned multi-digit multiplication without the chain-of-thought training used for the ICoT model.
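The arithmetic of such an auxiliary loss can be sketched as follows. This is a simplified illustration, not the study's implementation: the function names, the mean-squared-error form, and the `aux_weight` coefficient are assumptions, and a real training run would compute this inside an automatic-differentiation framework so that gradients reach both the probe and the model.

```python
def probe_predict(hidden, weights, bias):
    """Linear regression probe: a scalar prediction from one hidden vector."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias


def total_loss(task_loss, hiddens, targets, weights, bias, aux_weight=1.0):
    """Task loss plus weighted mean squared error of the probe.

    hiddens: one hidden vector per output position (hypothetical values here).
    targets: the running sum the probe should predict at each position.
    """
    errors = [
        (probe_predict(h, weights, bias) - t) ** 2
        for h, t in zip(hiddens, targets)
    ]
    aux = sum(errors) / len(errors)
    return task_loss + aux_weight * aux


# Toy numbers only: one position, a 2-D hidden state, target running sum 10.
loss = total_loss(
    task_loss=0.5,
    hiddens=[[1.0, 2.0]],
    targets=[10.0],
    weights=[3.0, 4.0],
    bias=1.0,
    aux_weight=0.1,
)
assert abs(loss - 0.9) < 1e-9
```

The extra term penalizes hidden states from which the running sum cannot be read off linearly, so it pushes the model to carry that quantity between positions rather than settle for the shortcut solution.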

Implications for Practitioners

This research highlights the importance of understanding the limitations of transformers, particularly on tasks that require long-range dependencies. By reverse-engineering a model that succeeds and identifying the mechanisms it relies on, researchers can develop more robust training strategies and inductive biases.

Practical Takeaways:
- Check whether attention heads actually learn to carry information between distant positions; in the study, standard fine-tuning alone did not get there.
- Consider using Fourier basis representations for efficient digit handling.
- Introduce auxiliary loss functions to guide the model towards capturing necessary dependencies.

In summary, this study not only uncovers a significant pitfall in how transformers learn long-range dependencies but also provides actionable insights for improving model performance on algorithmic tasks like multi-digit multiplication.