Uncovering Semantic Duplicates in LLM Training Corpora: Implications for Benchmark Performance


The Engineer

17 Feb 2026

Researchers at OLMo reveal how semantic duplicates in large language model training data skew benchmark results, particularly affecting out-of-distribution performance and challenging existing evaluation standards.

A new paper from researchers at OLMo has shed light on a critical issue in large language model (LLM) training: the presence of semantic duplicates in training corpora and their impact on benchmark performance. The study, which focuses on OLMo 3 (a model with open training data), reveals that these duplicates can significantly skew results, especially for out-of-distribution (OOD) performance.

What Changed?

The key technical insight is the prevalence of semantic duplicates in LLM training corpora. These are not just exact matches but semantically equivalent pieces of text that can lead to local generalization, where models perform well on benchmarks by pattern-matching rather than true reasoning.

Key Findings:
- 50% of the ZebraLogic test set had exact duplicates in OLMo 3's training corpus.
- 78% of the CodeForces test set had at least one semantic duplicate.
- The estimated rate of semantic duplicates is greater than 4 in 10,000.

Why It Matters

For practitioners and researchers, this means that benchmark scores may not accurately reflect a model's true capabilities. If models are memorizing or pattern-matching to semantically equivalent data, they might perform well on benchmarks but fail in real-world, out-of-distribution scenarios. This is particularly concerning for applications like natural language understanding, code generation, and logical reasoning.

Technical Details

Detection of Semantic Duplicates

Detecting semantic duplicates is a computationally intensive task. Here’s how the researchers approached it:

Embedding and Search:
- The entire training corpus was embedded using an LLM.
- These embeddings were then searched for vectors close to the test data embeddings.

Categorization and Partitioning:
- The corpus was categorized into relevant partitions (e.g., maths > number theory).
- Intensive searches were conducted within these partitions to find potential duplicates.

Filter Model:
- A smaller, 300M parameter model was trained as a filter.
- This filter helped identify and remove likely duplicates from the training data.
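
The embedding-and-search step can be sketched in a few lines. The example below is a minimal illustration, not the researchers' pipeline: `toy_embed` is a hypothetical stand-in (a hashed bag of words) for the LLM embedding, and `nearest_training_docs`, the 1024-dimension width, and the brute-force cosine search are assumptions chosen for readability. A real corpus would need a proper embedding model and an approximate nearest-neighbour index.

```python
import zlib

import numpy as np

DIM = 1024  # assumed width of the toy embedding


def toy_embed(text: str, dim: int = DIM) -> np.ndarray:
    """Hypothetical stand-in for an LLM embedding: hashed bag of words, unit length."""
    vec = np.zeros(dim, dtype=np.float64)
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def nearest_training_docs(test_items, corpus, k=1):
    """For each test item, return its k most similar corpus docs as (index, cosine)."""
    corpus_matrix = np.stack([toy_embed(doc) for doc in corpus])    # (n_docs, dim)
    test_matrix = np.stack([toy_embed(item) for item in test_items])  # (n_test, dim)
    sims = test_matrix @ corpus_matrix.T  # cosine similarity, since rows are unit length
    top = np.argsort(-sims, axis=1)[:, :k]  # highest similarity first
    return [[(int(j), float(sims[i, j])) for j in row] for i, row in enumerate(top)]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "prove that the sum of two even numbers is even",
        "how to bake bread at home",
    ]
    test_items = ["show that the sum of two even numbers is even"]
    print(nearest_training_docs(test_items, corpus, k=1))
```

A word-hash embedding only catches near-verbatim overlap; the point of using an LLM embedding in its place is that paraphrases with little shared vocabulary can still land close together.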

Impact on Benchmark Scores

To estimate the impact of semantic duplicates, the researchers:

- Exhaustively checked for natural duplicates in OLMo 3's training corpus.
- Fine-tuned the model to see how removing these duplicates affected performance on benchmarks.
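
A simpler diagnostic than fine-tuning, and not the method described in the study, is to split a benchmark by whether each test item has a known duplicate and compare accuracy on the two halves. The sketch below assumes you already have per-item correctness and a duplicate flag; `Item`, `accuracy`, and `contamination_gap` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    correct: bool        # did the model answer this test item correctly?
    has_duplicate: bool  # was a semantic duplicate found in the training corpus?


def accuracy(items):
    """Fraction of items answered correctly; NaN for an empty list."""
    return sum(i.correct for i in items) / len(items) if items else float("nan")


def contamination_gap(items):
    """Return (accuracy on duplicated items, accuracy on clean items, difference)."""
    seen = [i for i in items if i.has_duplicate]
    unseen = [i for i in items if not i.has_duplicate]
    acc_seen, acc_unseen = accuracy(seen), accuracy(unseen)
    return acc_seen, acc_unseen, acc_seen - acc_unseen
```

A large positive gap is only suggestive: duplicated items may also be easier on average, which is one reason a controlled fine-tuning comparison gives a cleaner estimate.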

Practical Implications

For those involved in LLM development and evaluation, this study highlights the need for more rigorous decontamination of training corpora. Here are some steps you can take:

Advanced Decontamination Techniques:
- Use embeddings and semantic search to identify and remove duplicates.
- Employ smaller filter models to handle large-scale data efficiently.

Translating Test Sets:
- Translate test sets into multiple languages and remove these translations from the training data to avoid memorization.

OOD Evaluation:
- Focus on evaluating models using out-of-distribution data to ensure they generalize well beyond their training corpus.
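
As a rough sketch of the decontamination step, the function below takes precomputed embeddings and returns a mask of training documents to keep. It assumes the embeddings come from whatever model you already use; `decontaminate` and the 0.9 cosine threshold are illustrative choices, not values from the study. Translated copies of test items can be appended as extra rows of `test_vecs` so that their near matches are dropped as well.

```python
import numpy as np


def decontaminate(train_vecs, test_vecs, threshold=0.9):
    """Return a boolean mask over training docs: True means keep.

    A doc is dropped when its cosine similarity to any test item
    reaches the (assumed) threshold.
    """
    train = np.asarray(train_vecs, dtype=np.float64)
    test = np.asarray(test_vecs, dtype=np.float64)
    # Normalise rows to unit length; the floor avoids division by zero
    train = train / np.maximum(np.linalg.norm(train, axis=1, keepdims=True), 1e-12)
    test = test / np.maximum(np.linalg.norm(test, axis=1, keepdims=True), 1e-12)
    max_sim = (train @ test.T).max(axis=1)  # closest test item per training doc
    return max_sim < threshold


if __name__ == "__main__":
    train_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    test_vecs = [[1.0, 0.05]]
    print(decontaminate(train_vecs, test_vecs))
```

The threshold trades recall for data loss: set it too low and legitimate on-topic training text is discarded, too high and paraphrased duplicates slip through.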

Conclusion

The presence of semantic duplicates in LLM training corpora is a significant issue that can lead to inflated benchmark scores and misleading performance metrics. By adopting more advanced decontamination techniques, researchers and practitioners can better ensure that their models are truly capable of generalizing and reasoning, not just memorizing or pattern-matching.