TLDR: This research investigates why Transformers struggle with multi-digit multiplication. By reverse-engineering a successful “Implicit Chain-of-Thought” (ICoT) model, the authors found it learns crucial long-range dependencies through attention-based “trees” and represents digits efficiently using Fourier bases. Standard models fail because they get stuck in local optima, unable to form these dependencies. Introducing an auxiliary loss that predicts intermediate sums can provide the necessary inductive bias, enabling models to learn multiplication successfully.
Large language models, despite their impressive capabilities in various complex tasks, often stumble on what appears to be a straightforward problem: multi-digit multiplication. A new research paper delves into this puzzling limitation, reverse-engineering a model that successfully performs multiplication to understand why standard Transformers fail and how this challenge can be overcome.
The study, titled “Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls,” was conducted by Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viégas, Martin Wattenberg, and Andrew Lee. Their work highlights a critical pitfall in how Transformers learn long-range dependencies, especially when trained with standard methods.
The Core Problem: Long-Range Dependencies
Multi-digit multiplication is not just about simple pairwise products; it requires combining these products and propagating carries across multiple positions, which creates complex long-range dependencies. For instance, a middle digit of the product depends on several pairs of input digits and on the carries produced by every lower-order position. The researchers found that Transformers trained with standard fine-tuning (SFT) consistently fail to learn these crucial dependencies, leading to poor performance even with billions of parameters.
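To make the dependency structure concrete, here is a minimal Python sketch of schoolbook column multiplication. It is an illustration only, not code from the paper: the function names and the least-significant-first digit ordering are assumptions made for this example. For each output position k it records which input digit pairs contribute, and shows that each digit also depends on the carry from the position below it.

```python
def digits_lsf(n, width):
    """Digits of n, least-significant first, zero-padded to `width`."""
    return [(n // 10**i) % 10 for i in range(width)]


def multiply_by_columns(a, b, width=4):
    """Return the product digits (least-significant first) and, for each
    output position k, the list of input index pairs (i, j) with i + j = k."""
    a_d, b_d = digits_lsf(a, width), digits_lsf(b, width)
    carry = 0
    out_digits, deps = [], []
    for k in range(2 * width):
        # Every pair of input digits whose positions add up to k contributes.
        pairs = [(i, k - i) for i in range(width) if 0 <= k - i < width]
        column = sum(a_d[i] * b_d[j] for i, j in pairs) + carry
        out_digits.append(column % 10)  # the digit written at position k
        carry = column // 10            # passed on to position k + 1
        deps.append(pairs)
    return out_digits, deps


digits, deps = multiply_by_columns(1234, 5678)
for k, pairs in enumerate(deps):
    print(f"c{k} = {digits[k]}  <- pairs {pairs} plus carry from below")
```

The first position depends on a single pair, while the middle positions of a 4×4 product depend on up to four pairs plus a carry chain reaching back to position 0, which is why they are the hardest to learn.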
Implicit Chain-of-Thought: A Path to Success
To understand what a successful model does differently, the team studied a Transformer trained with Implicit Chain-of-Thought (ICoT). ICoT models are initially trained with explicit step-by-step calculation tokens, which are gradually removed during training. This process forces the model to internalize the intermediate reasoning steps within its latent states. The ICoT model achieved 100% accuracy on 4×4 digit multiplication, a task where SFT models barely reached 1%.
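The gradual removal of reasoning tokens can be pictured with a small sketch. This is a hypothetical illustration, not the authors' training code: the function name, the linear schedule, the choice to drop tokens from the front, and the example token strings are all assumptions.

```python
def icot_training_text(question, cot_tokens, answer, epoch, removed_per_epoch=2):
    """Build the training string for a given epoch, dropping a growing
    prefix of the chain-of-thought tokens (assumed linear schedule)."""
    n_removed = min(len(cot_tokens), epoch * removed_per_epoch)
    kept = cot_tokens[n_removed:]
    return " ".join([question, *kept, answer])


# Hypothetical example with made-up step tokens.
question = "12 * 34 ="
cot = ["<step1>", "<step2>", "<step3>", "<step4>"]
answer = "408"

for epoch in range(3):
    print(epoch, icot_training_text(question, cot, answer, epoch))
```

Once every reasoning token has been removed, the model sees only the question and the answer, so whatever intermediate computation it still needs has to live in its hidden states.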
Key Insights from Reverse-Engineering the ICoT Model
The reverse-engineering process revealed three major findings about how the ICoT model masters multiplication:
1. Evidence of Long-Range Structure: Through techniques like logit attributions (measuring how much each input digit influences output digits) and linear probes (decoding intermediate sums from hidden states), the researchers confirmed that the ICoT model successfully encodes the necessary long-range dependencies. In contrast, SFT models showed a lack of these connections, particularly for middle digits.
2. Mechanism: Attention Trees: The ICoT model uses its attention mechanism to construct a sparse, binary-tree-like graph. In the first layer, attention heads focus on pairs of digits to compute and “cache” partial products in earlier tokens. The second layer then efficiently retrieves these cached partial products from specific “cache sites” to compute the final output digits. This structured attention pattern is crucial for handling the long-range dependencies.
3. Geometry: Minkowski Sums and Fourier Bases: The study also uncovered an elegant internal representation. Attention heads realize digit-wise partial products as Minkowski sums of digit embeddings. Furthermore, the model represents digits using a Fourier basis, leading to a striking “pentagonal prism” structure in the model’s hidden states. This geometric organization provides an intuitive and efficient way for the model to handle numerical information, a feature absent in the failing SFT models.
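One way to see how a Fourier basis could give rise to a pentagonal prism is the sketch below. It is not taken from the paper's code: the function name and the specific choice of frequencies (a period-5 component plus a parity component) are assumptions chosen because they reproduce that shape.

```python
import math


def fourier_features(digit):
    """Map a digit 0-9 to three Fourier features (assumed frequencies).

    The first two coordinates place the digit on a pentagon (period 5);
    the third is the parity term cos(pi * digit), equal to +1 or -1.
    """
    angle = 2 * math.pi * digit / 5
    return (math.cos(angle), math.sin(angle), float((-1) ** digit))


points = {d: fourier_features(d) for d in range(10)}
for d, (x, y, z) in points.items():
    print(f"{d}: ({x:+.3f}, {y:+.3f}, {z:+.0f})")
```

Under these assumptions, digits d and d + 5 share the same pentagon vertex but sit on opposite parity levels, so the ten digits occupy the ten corners of a pentagonal prism.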
Why Standard Fine-Tuning Fails
The researchers observed that during standard fine-tuning, models quickly learn the first, last, and sometimes the second digits of the product. However, they consistently struggle with the middle digits (c3 to c6 in a 4×4 multiplication). The loss for these middle digits plateaus, and gradient norms drop, indicating that the model gets stuck in a local optimum. This local optimum lacks the ability to form the necessary long-range dependencies required for accurate middle-digit calculation. Importantly, simply scaling up the SFT model (e.g., to 12 layers and 8 heads) did not resolve this issue.
A Simple Fix: Auxiliary Loss for Inductive Bias
To validate their understanding, the team introduced a simple auxiliary loss during training. This loss encourages the model to predict an intermediate “running partial sum” (denoted as ĉk) at each output timestep. By adding a lightweight linear regression head to the Transformer’s second layer, trained with a Mean Squared Error (MSE) loss, the model was guided to learn the proper long-range dependencies. This inductive bias allowed a 2-layer model to achieve 99% accuracy on 4×4 multiplication, without needing explicit chain-of-thought tokens.
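The regression targets and the combined objective can be sketched as follows. This is a rough illustration under stated assumptions rather than the authors' implementation: the target is taken to be the column sum at position k plus the incoming carry, digits are ordered least-significant first, and `aux_weight` and all function names are hypothetical. The model's linear head is not shown; `predictions` stands in for its outputs.

```python
def running_sum_targets(a_digits, b_digits):
    """Per-position running sums (assumed: column sum plus incoming carry)
    for digit lists given least-significant first."""
    n = len(a_digits)
    targets, carry = [], 0
    for k in range(2 * n):
        column = sum(a_digits[i] * b_digits[k - i]
                     for i in range(n) if 0 <= k - i < n)
        c_hat = column + carry
        targets.append(c_hat)
        carry = c_hat // 10
    return targets


def mse(predictions, targets):
    """Mean squared error between predicted and true running sums."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def total_loss(token_loss, predictions, targets, aux_weight=1.0):
    """Usual next-token loss plus the weighted auxiliary regression loss."""
    return token_loss + aux_weight * mse(predictions, targets)


# 1234 * 5678 with digits written least-significant first.
targets = running_sum_targets([4, 3, 2, 1], [8, 7, 6, 5])
print(targets)                     # running sums per output position
print([t % 10 for t in targets])   # their last digits are the product digits
```

Each target's last digit is the corresponding product digit, so a hidden state that linearly encodes the running sum already contains what is needed for the answer, including the hard middle positions.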
Interestingly, the model trained with this auxiliary loss developed attention-tree mechanisms similar to those of the ICoT model. In some cases it even learned to attend to all necessary digits simultaneously, forming a parallelogram-like attention pattern.
Conclusion
This research provides a clear explanation for why Transformers struggle with multi-digit multiplication: a fundamental difficulty in learning long-range dependencies under standard training regimes. By reverse-engineering a successful ICoT model, the study revealed the mechanisms (attention trees) and representations (Fourier bases forming a pentagonal prism) that enable this capability. The introduction of a task-specific inductive bias through an auxiliary loss demonstrates a promising direction for addressing this limitation, suggesting that carefully designed training objectives can guide Transformers to learn complex algorithmic tasks more effectively.


