
Researchers studying OLMo 3 show how semantic duplicates in large language model training data can inflate benchmark results, so that scores overstate out-of-distribution performance and call existing evaluation standards into question.
A new paper built around OLMo 3 sheds light on a critical issue in large language model (LLM) training: the presence of semantic duplicates in training corpora and their impact on benchmark performance. The study focuses on OLMo 3, a model with open training data, and reports that these duplicates can significantly skew results, especially when benchmark scores are read as evidence of out-of-distribution (OOD) performance.
The key technical finding is that semantic duplicates are prevalent in LLM training corpora. These are not only exact matches but also passages that say the same thing in different words. They enable local generalization, where a model scores well on a benchmark by pattern-matching against near-equivalent training text rather than by reasoning.
For practitioners and researchers, this means that benchmark scores may not accurately reflect a model's true capabilities. If models are memorizing or pattern-matching to semantically equivalent data, they might perform well on benchmarks but fail in real-world, out-of-distribution scenarios. This is particularly concerning for applications like natural language understanding, code generation, and logical reasoning.
Detecting semantic duplicates is a computationally intensive task. Here’s how the researchers approached it:
Embedding and Search:
Categorization and Partitioning:
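The embedding-and-search step can be illustrated with a minimal sketch. This is not the paper's pipeline: `toy_embed` is a hypothetical stand-in for a real sentence encoder (it hashes word tokens, so it only captures word overlap, not meaning), and `DIM`, `find_candidates`, and the `0.8` threshold are assumptions chosen for illustration. At corpus scale the brute-force loop would be replaced by an approximate nearest-neighbor index.

```python
import hashlib
import math

DIM = 256  # assumed vector width for the toy encoder


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for a sentence encoder.

    Hashes lowercase word tokens into a count vector and L2-normalizes it.
    A real pipeline would swap in a learned embedding model here.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "little") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(u, v):
    """Dot product of two unit-norm vectors, i.e. their cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


def find_candidates(benchmark, corpus, threshold=0.8):
    """Return (benchmark_idx, corpus_idx, similarity) for pairs at or above threshold."""
    corpus_vecs = [toy_embed(text) for text in corpus]
    hits = []
    for i, item in enumerate(benchmark):
        query = toy_embed(item)
        for j, vec in enumerate(corpus_vecs):
            sim = cosine(query, vec)
            if sim >= threshold:
                hits.append((i, j, sim))
    return hits
```

Pairs returned by `find_candidates` are only candidates. Deciding whether a pair is a genuine semantic duplicate still needs a second pass, whether by a stronger model or by manual review.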

To estimate the impact of semantic duplicates, the researchers:
For those involved in LLM development and evaluation, this study highlights the need for more rigorous decontamination of training corpora. Here are some steps you can take:
Advanced Decontamination Techniques:
Translating Test Sets:
OOD Evaluation:
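On the evaluation side, one simple check is to report accuracy separately for benchmark items that were flagged as having a semantic duplicate in the training data and for items that were not. The sketch below assumes such flags already exist from an earlier detection pass. `ItemResult`, `has_semantic_duplicate`, and `contamination_gap` are hypothetical names, and this is not presented as the procedure used in the paper.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    item_id: str
    correct: bool                 # did the model answer this item correctly?
    has_semantic_duplicate: bool  # hypothetical flag from a prior detection pass


def accuracy(results):
    """Fraction of correct items, or NaN for an empty partition."""
    if not results:
        return float("nan")
    return sum(r.correct for r in results) / len(results)


def contamination_gap(results):
    """Accuracy on flagged vs. unflagged items, plus the difference."""
    flagged = [r for r in results if r.has_semantic_duplicate]
    clean = [r for r in results if not r.has_semantic_duplicate]
    acc_flagged = accuracy(flagged)
    acc_clean = accuracy(clean)
    return {
        "flagged": acc_flagged,
        "clean": acc_clean,
        "gap": acc_flagged - acc_clean,
    }
```

A large positive gap suggests that the headline score leans on near-duplicate training text. It is not proof on its own, since flagged items may also differ in difficulty from unflagged ones.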
The presence of semantic duplicates in LLM training corpora is a significant issue that can lead to inflated benchmark scores and misleading performance metrics. By adopting more advanced decontamination techniques, researchers and practitioners can better ensure that their models are truly capable of generalizing and reasoning, not just memorizing or pattern-matching.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
17 February 2026