
Researchers Sander Land and Max Bartolo unveil a method to detect under-trained tokens in large language models, shedding light on glitches caused by rarely seen vocabulary entries that can lead to erratic model behavior.
In a recent paper titled "Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models," Sander Land and Max Bartolo examine under-trained tokens in large language models (LLMs). These tokens are present in the tokenizer vocabulary but were rarely or never seen during model training, and feeding them to a model can trigger unexpected and often problematic behavior. The paper, accepted at EMNLP 2024, offers a detailed analysis and new methods for identifying these glitch tokens.
The key technical advance is a systematic approach to detecting under-trained tokens. Traditionally, building the tokenizer (which converts text into numerical inputs) and training the LLM have been largely decoupled steps. That disconnect can leave tokens in the vocabulary that are not adequately represented in the model's training data. The authors address this by combining tokenizer analysis, model weight-based indicators, and prompting techniques.
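To make the tokenizer side of this concrete, here is a minimal sketch of a round-trip check: decode each vocabulary entry to its string, encode that string again, and flag entries that never come back as the same single token. The vocabulary, the lowercasing step, and the greedy `encode` function are all toy assumptions made for illustration; they are not the tokenizers or the exact procedure used in the paper.

```python
# Toy vocabulary: token id -> string. Hypothetical data for illustration only.
VOCAB = {0: "a", 1: "b", 2: "ab", 3: "AB", 4: "c"}


def encode(text, vocab=VOCAB):
    """Greedy longest-match encoder that lowercases its input first.

    This stands in for a real tokenizer whose normalisation step can make
    some vocabulary entries impossible to produce.
    """
    text = text.lower()
    # Try longer pieces first; ties keep their original vocabulary order.
    by_length = sorted(vocab.items(), key=lambda item: -len(item[1]))
    ids, pos = [], 0
    while pos < len(text):
        for token_id, piece in by_length:
            if text.startswith(piece, pos):
                ids.append(token_id)
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


def unreachable_tokens(vocab=VOCAB):
    """Return ids whose own string does not encode back to that single id."""
    return [tid for tid, piece in vocab.items() if encode(piece, vocab) != [tid]]


if __name__ == "__main__":
    print(unreachable_tokens())
```

In this toy setup the entry "AB" can never be emitted, because the encoder lowercases its input before matching. A token flagged this way is only a candidate: a reachable token can still be under-trained if it was rare in the training data, which is why a tokenizer-only check is not sufficient on its own.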
For practitioners, understanding and addressing under-trained tokens matters because they bear directly on the safety, efficiency, and reliability of deployed models.
The authors' analysis reveals several important insights:

The researchers employed a multi-faceted approach:
Tokenizer Analysis:
Model Weight-Based Indicators:
Prompting Techniques:
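As a rough illustration of how a weight-based indicator and a prompting check might fit together, the sketch below ranks tokens by how close their embedding is to the average embedding of tokens known to be unused, then keeps only the candidates that a model fails to repeat. The cosine-distance indicator, the toy embeddings, the reference ids, the threshold, and the `repeat_probability` callable (a stand-in for a real model query) are all assumptions made for this example, not the paper's exact formulation.

```python
import math

# Hypothetical 2-D embeddings: token id -> vector. Ids 2 and 3 play the role
# of tokens known to be unused (for example, reserved placeholder entries).
EMBEDDINGS = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
    2: [-1.0, -1.0],
    3: [-0.9, -1.1],
    4: [-1.1, -0.9],
}
REFERENCE_IDS = {2, 3}


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def rank_candidates(embeddings, reference_ids):
    """Sort non-reference ids by distance to the mean reference embedding.

    Tokens whose embeddings sit closest to the unused-token average come first.
    """
    reference = mean_vector([embeddings[i] for i in sorted(reference_ids)])
    scores = {
        tid: cosine_distance(vec, reference)
        for tid, vec in embeddings.items()
        if tid not in reference_ids
    }
    return sorted(scores, key=scores.get)


def verify(candidates, repeat_probability, threshold=0.01):
    """Keep candidates the model rarely repeats when asked to echo them."""
    return [tid for tid in candidates if repeat_probability(tid) < threshold]


if __name__ == "__main__":
    ranked = rank_candidates(EMBEDDINGS, REFERENCE_IDS)
    # Stub probabilities standing in for real model queries (assumed values).
    stub = {0: 0.98, 1: 0.95, 4: 0.001}
    print(ranked, verify(ranked, stub.get))
```

The two stages are kept separate on purpose: the weight-based ranking is cheap and needs no inference, while the prompting step is more expensive but confirms whether a candidate actually misbehaves. In practice the embeddings would come from a model's weight matrices and `repeat_probability` would wrap a real generation call.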
The researchers implemented their methods in Python and PyTorch and tested the approach on several popular LLMs.
Their findings showed that the proposed methods flag under-trained tokens effectively, and that such tokens are prevalent across a diverse set of models.
The paper by Land and Bartolo provides a robust framework for detecting under-trained tokens in LLMs. By combining tokenizer analysis, model weight-based indicators, and prompting techniques, practitioners can better understand and mitigate the risks associated with these problematic tokens. This work is a significant step towards improving the safety, efficiency, and reliability of large language models.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
13 May 2024