
Language models' penchant for em-dashes puzzles writers and readers alike, and some people have even dropped the punctuation mark to set their work apart from AI-generated text. But why exactly do machines love em-dashes so much?
If you’ve spent any time interacting with language models, you’ve likely noticed their fondness for the em-dash, the long dash used to set off a clause or an aside. It has become so prevalent that some human writers have abandoned it to avoid having their work mistaken for AI-generated text. Despite that ubiquity, exactly why language models overuse em-dashes remains something of a mystery.
One frequently cited explanation is that em-dashes are common in the training data, so models simply mimic what they have seen. However, if models used em-dashes only as often as human writers do, their use wouldn’t stand out so much. This suggests there’s more to it than straightforward imitation of the training data.
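The comparison behind this argument can be made concrete by measuring how often em-dashes appear per 1,000 words in two bodies of text. The following is a minimal sketch, not a method taken from the original article: the function name and both sample strings are hypothetical, invented purely for illustration, and the word count is a rough one based on runs of word characters.

```python
import re

EM_DASH = "\u2014"  # the em-dash character, written as an escape sequence


def em_dashes_per_1000_words(text: str) -> float:
    """Return the number of em-dashes per 1,000 words in `text`.

    Words are counted roughly, as runs of word characters, so a
    contraction such as "it's" counts as two words.
    """
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    return 1000 * text.count(EM_DASH) / len(words)


# Hypothetical samples, made up for this illustration only.
human_sample = "The results were mixed, but the team pressed on."
model_sample = (
    "The results were mixed \u2014 but the team \u2014 undeterred "
    "\u2014 pressed on."
)

print(em_dashes_per_1000_words(human_sample))
print(em_dashes_per_1000_words(model_sample))
```

If models merely reproduced their training data, the two rates would be expected to land close together on large, comparable samples. A persistent gap is what the training-data explanation fails to account for.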
Another theory is that em-dashes offer versatility: they can either continue a thought or introduce a new one, making them a safe choice for a model trying to predict the next token. But other punctuation marks, such as commas and semicolons, are similarly flexible, so this explanation doesn’t account for why em-dashes specifically are overused.
Some argue that em-dashes are favored because they allow for more concise writing, which aligns with the brevity bias in model training. However, experiments with OpenAI’s tokenizer show that em-dashes aren’t inherently more efficient than other punctuation marks. Many common em-dash constructions, such as the “it’s not X, it’s Y” pattern, can be written with a comma instead at no cost in brevity. Moreover, if GPT-4 were so focused on brevity, it could simply reduce verbosity in other ways.

One intriguing theory is that em-dash overuse might reflect the local English dialect of the human testers involved in Reinforcement Learning from Human Feedback (RLHF). During the final stage of training, hundreds of human evaluators provide feedback to refine model outputs. If these testers predominantly use em-dashes in their own writing, they might inadvertently bias the model towards this punctuation.
For practitioners and researchers, understanding the root cause of this behavior is crucial. It highlights the importance of diverse training data and the potential influence of human feedback in shaping model outputs. If em-dash overuse is indeed a result of RLHF biases, it underscores the need for more nuanced evaluation processes to ensure that models don’t adopt idiosyncratic writing styles.
The overuse of em-dashes by language models remains an open question, with several plausible but inconclusive explanations. Training data and next-token prediction may well play a part, but the RLHF process and its potential biases offer a promising avenue for further investigation. As AI-generated text continues to evolve, understanding these nuances will be key to creating more natural and diverse writing styles.
Original Sources
↗ https://www.seangoedecke.com/em-dashes/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
3 November 2025