
Researchers reveal filler tokens in transformer models boost computational power despite their apparent insignificance, sparking debate on model transparency and ethical considerations.
In a recent paper titled "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models," Jacob Pfau, William Merrill, and Samuel R. Bowman explore the role of filler tokens in transformer language models (TLMs). The study delves into how these seemingly meaningless tokens can significantly improve model performance on complex tasks, raising important questions about transparency and computational efficiency.
Traditionally, chain-of-thought (CoT) responses have been used to break down complex problems into simpler steps, enhancing the performance of language models. However, this paper reveals that on certain tasks transformers can achieve similar performance gains using filler tokens (sequences of meaningless characters like '......') instead of meaningful CoT tokens. This finding is significant because it suggests that the computational benefits of additional tokens do not depend solely on their semantic content.
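To make the contrast concrete, here is a minimal sketch of the two input layouts. The helper names, the arithmetic question, and the whitespace-separated "tokens" are all hypothetical illustrations, not taken from the paper; a real tokenizer would split the text differently.

```python
def with_chain_of_thought(question: str, steps: list[str], answer: str) -> str:
    """Question, then meaningful intermediate steps, then the answer."""
    return " ".join([question, *steps, answer])


def with_filler(question: str, n_filler: int, answer: str, filler: str = ".") -> str:
    """Same layout, but the intermediate positions hold a repeated filler token."""
    return " ".join([question, *([filler] * n_filler), answer])


# Hypothetical example problem (an assumption, not one of the paper's tasks).
question = "Q: 17 + 25 = ?"
steps = ["7+5=12", "carry 1", "1+2+1=4"]

cot = with_chain_of_thought(question, steps, "A: 42")
dots = with_filler(question, len(steps), "A: 42")

print(cot)
print(dots)
```

In the second sequence the intermediate positions carry no readable reasoning, yet they still give the model extra positions to compute over before it has to emit the answer.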
Filler Tokens Enhance Performance: Transformers were able to solve two hard algorithmic tasks using filler tokens, which they could not solve without these intermediate tokens.
Learning to Use Filler Tokens is Challenging: The researchers found that training models to effectively use filler tokens requires dense supervision and specific training techniques. Without this, the models struggle to converge.
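Theoretical Characterization: The study provides a theoretical framework for understanding when filler tokens are beneficial. It characterizes the class of problems where filler tokens help in terms of the quantifier depth of a first-order logic formula, that is, how deeply the formula's quantifiers are nested.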
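As an illustration of quantifier depth, consider a property of an input sequence $x_1, \dots, x_n$ that asks whether some triple of entries sums to zero. The formula below is an illustrative example chosen for this article, and the modulus 10 is an arbitrary assumption; it is not quoted from the paper.

```latex
\[
\varphi_3 \;=\; \exists i\, \exists j\, \exists k\;
  \bigl( x_i + x_j + x_k \equiv 0 \pmod{10} \bigr)
\qquad \text{(quantifier depth 3)}
\]
\[
\varphi_2 \;=\; \exists i\, \exists j\;
  \bigl( x_i + x_j \equiv 0 \pmod{10} \bigr)
\qquad \text{(quantifier depth 2)}
\]
```

Each additional nested quantifier adds another index that must be searched over jointly with the others, which gives a rough intuition for why deeper formulas may call for more intermediate positions.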

Transparency Concerns: The ability of models to perform hidden computations using filler tokens raises concerns about transparency and auditability. This could make it difficult to understand how a model arrives at its conclusions.
Training Considerations: The difficulty in training models to use filler tokens effectively means that practitioners need to invest more time and resources in dense supervision techniques.
The researchers ran their experiments on a standard transformer language-model architecture. They trained models on algorithmic tasks to evaluate the impact of filler tokens.
The paper "Let's Think Dot by Dot" challenges the conventional understanding of chain-of-thought responses in transformers. It demonstrates that additional tokens, even if they are meaningless filler, can provide computational benefits. This insight has implications for both the efficiency and transparency of language models, highlighting the need for further research into how these models process information.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
29 April 2024