The Em-Dash Enigma: Why AI Models Overuse This Punctuation Mark


The Engineer

3 Nov 2025 · 3 min read

Language models' penchant for em-dashes puzzles writers and readers alike, and some people have even dropped the punctuation mark so their writing isn't mistaken for AI-generated text. But why exactly do machines love em-dashes so much?

If you’ve spent any time interacting with language models, you’ve likely noticed their fondness for the em-dash. It’s become so prevalent that some human writers have abandoned it to avoid being mistaken for AI. Despite the habit’s ubiquity, the exact reason why language models overuse em-dashes remains a bit of a mystery.

Common Explanations and Their Shortcomings

Training Data Overrepresentation

One frequently cited explanation is that em-dashes are common in the training data, so models simply mimic what they have seen. However, if models used em-dashes only as often as human writers do, the habit wouldn’t stand out so much. The fact that it does stand out suggests there’s more to it than straightforward imitation of the training data.
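One way to make the "stands out" claim testable is to compare em-dash rates in human and model text directly. The sketch below is a minimal illustration, not a method from any published study: the function name and the two sample strings are hypothetical, and the word count uses a rough regex rather than a proper tokenizer.

```python
import re

EM_DASH = "\u2014"  # the em-dash character, written as an escape


def em_dashes_per_1000_words(text: str) -> float:
    """Return the number of em-dashes per 1,000 words in `text`.

    Words are counted with a rough regex (runs of word characters), so
    contractions such as "it's" count as two words. Good enough for a
    relative comparison between two corpora.
    """
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    return 1000 * text.count(EM_DASH) / len(words)


# Hypothetical samples, invented for illustration only.
human_sample = "I went home. It rained, so I stayed in."
model_sample = "I went home\u2014it rained\u2014so I stayed in."

human_rate = em_dashes_per_1000_words(human_sample)
model_rate = em_dashes_per_1000_words(model_sample)
print(f"human: {human_rate:.1f}, model: {model_rate:.1f} per 1,000 words")
```

On real data, the two inputs would be large, comparable corpora (same genre, same length range); a gap between the rates is what the training-data explanation alone would struggle to account for.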

Versatility and Safety

Another theory is that em-dashes offer versatility: they can either continue a thought or introduce a new one, making them a safe choice for a model predicting the next token. But other punctuation marks, such as commas and semicolons, are similarly flexible, so this explanation doesn’t account for why em-dashes specifically are overused.

Brevity and Token Efficiency

Some argue that em-dashes are favored because they allow for more concise writing, which aligns with the brevity bias in model training. However, experiments with OpenAI’s tokenizer show that em-dashes aren’t inherently more efficient than other punctuation marks. For example, in many common patterns (e.g., “it’s not X, it’s Y”) a comma does the same job as an em-dash at no cost in length. Moreover, if a model such as GPT-4 were so focused on brevity, it could simply reduce verbosity in other ways.
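The tokenizer comparison can be sketched as follows. This is an illustrative setup rather than the experiment the article refers to: `token_counts` and the example sentences are hypothetical, and the default `standin_encode` is a crude regex splitter, not a real BPE tokenizer. To check against an actual OpenAI tokenizer, one could pass the `encode` method of an encoding from the `tiktoken` package instead (see the comment in the code).

```python
import re
from typing import Callable, Dict, Sequence


def standin_encode(text: str) -> Sequence[str]:
    """Crude stand-in tokenizer: word runs and single punctuation marks.

    This is NOT how a BPE tokenizer splits text; it only keeps the
    example runnable without third-party packages.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def token_counts(
    variants: Dict[str, str],
    encode: Callable[[str], Sequence] = standin_encode,
) -> Dict[str, int]:
    """Return the token count of each variant under the given encoder."""
    return {name: len(encode(text)) for name, text in variants.items()}


# The same sentence with three different joining marks (invented example).
variants = {
    "em_dash": "It's not a bug\u2014it's a feature.",
    "comma": "It's not a bug, it's a feature.",
    "semicolon": "It's not a bug; it's a feature.",
}

counts = token_counts(variants)
print(counts)

# With tiktoken installed, a real encoder can be swapped in, e.g.:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   counts = token_counts(variants, encode=enc.encode)
```

If the counts for the em-dash variant are no lower than the others under a real tokenizer, the token-efficiency argument loses most of its force for that pattern.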

Could RLHF Be the Culprit?

One intriguing theory is that em-dash overuse might reflect the local English dialect of the human testers involved in Reinforcement Learning from Human Feedback (RLHF). During the final stage of training, hundreds of human evaluators provide feedback to refine model outputs. If these testers favor em-dashes in their own writing, they might inadvertently bias the model toward this punctuation.

RLHF Process and Potential Bias

Human Evaluators: The RLHF process involves a large number of human testers who interact with the model and rate its responses.
Feedback Loop: This feedback is used to fine-tune the model, potentially reinforcing certain stylistic choices, including em-dash usage.
Dialect Influence: If these evaluators predominantly use em-dashes in their own writing, it could lead to a bias in the model’s output.
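If a set of preference pairs from such a feedback process were available, the hypothesis could be probed with a simple audit: among pairs where the two responses differ in em-dash count, how often did raters choose the one with more em-dashes? The sketch below assumes a hypothetical data shape, a list of `(chosen, rejected)` response strings, and the toy pairs are invented; it does not describe any lab's actual RLHF data or tooling.

```python
from typing import Iterable, Optional, Tuple

EM_DASH = "\u2014"  # the em-dash character, written as an escape


def em_dash_preference_rate(
    pairs: Iterable[Tuple[str, str]],
) -> Optional[float]:
    """Fraction of informative pairs where the chosen response has more
    em-dashes than the rejected one.

    Pairs with equal em-dash counts carry no signal and are skipped.
    Returns None when no pair is informative.
    """
    chosen_more = 0
    rejected_more = 0
    for chosen, rejected in pairs:
        c = chosen.count(EM_DASH)
        r = rejected.count(EM_DASH)
        if c > r:
            chosen_more += 1
        elif r > c:
            rejected_more += 1
    informative = chosen_more + rejected_more
    if informative == 0:
        return None
    return chosen_more / informative


# Toy (chosen, rejected) pairs, invented for illustration only.
toy_pairs = [
    ("Fast\u2014and cheap.", "Fast and cheap."),          # chosen has more
    ("One\u2014two\u2014three.", "One\u2014two, three."),  # chosen has more
    ("Plain answer.", "Dashed\u2014answer."),              # rejected has more
    ("Same.", "Same again."),                              # tie, skipped
]

rate = em_dash_preference_rate(toy_pairs)
print(rate)
```

A rate well above 0.5 on real data would be consistent with rater bias but would not prove it, since responses with more em-dashes may also differ in length, tone, or quality.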

Implications for Practitioners

For practitioners and researchers, understanding the root cause of this behavior is crucial. It highlights the importance of diverse training data and the potential influence of human feedback in shaping model outputs. If em-dash overuse is indeed a result of RLHF biases, it underscores the need for more nuanced evaluation processes to ensure that models don’t adopt idiosyncratic writing styles.

Conclusion

The overuse of em-dashes by language models remains an open question with several plausible but incomplete explanations. While training data and the mechanics of next-token prediction likely play a part, the RLHF process and its potential biases offer a promising avenue for further investigation. As AI-generated text continues to evolve, understanding these nuances will be key to creating more natural and diverse writing styles.