
Researchers propose a method to improve large language models' fact memorization by selectively pruning training data, potentially reducing hallucinations and enhancing performance on factual tasks.
Large language models (LLMs) have made significant strides in natural language processing, but they often struggle with memorizing factual knowledge. This can lead to issues like hallucinations and poor performance on tasks that require a robust understanding of facts. A recent paper by Jiayuan Ye, Vitaly Feldman, and Kunal Talwar, presented at the ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models, offers a novel approach to this problem: data pruning.
LLMs are powerful but have limited capacity: during training, a model can memorize only a finite amount of information. If the training data contains more facts than the model can store, fact accuracy ends up suboptimal. The problem is worse when the distribution of facts in the training data is skewed (following a power-law distribution), because common facts are overrepresented while rare ones are underrepresented.
The authors formalize the issue from an information-theoretic perspective, showing that fact accuracy drops below the model's capacity limit when the information content of the training data exceeds this limit. This is exacerbated by skewed frequency distributions, where some facts appear much more frequently than others.
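To make the skew concrete, here is a minimal Python sketch of how a power-law frequency distribution concentrates training examples on a small set of facts. The function names, the Zipf-style weighting, and the exponent `alpha` are illustrative assumptions, not details taken from the paper.

```python
def zipf_weights(num_facts, alpha=1.0):
    """Return normalized frequencies where the fact at rank r has weight 1 / r**alpha.

    alpha=0 gives a flat (uniform) distribution; larger alpha gives a heavier skew.
    This weighting is an assumption chosen for illustration.
    """
    raw = [1.0 / (rank ** alpha) for rank in range(1, num_facts + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def head_share(weights, head_fraction):
    """Fraction of all training examples taken up by the most frequent facts.

    `weights` must be sorted from most to least frequent.
    """
    head = max(1, int(len(weights) * head_fraction))
    return sum(weights[:head])


skewed = zipf_weights(1000, alpha=1.0)
flat = zipf_weights(1000, alpha=0.0)

# Share of the data consumed by the top 1% of facts under each distribution.
print(f"skewed: {head_share(skewed, 0.01):.3f}")
print(f"flat:   {head_share(flat, 0.01):.3f}")
```

In this toy setup the most frequent 1% of facts take up well over a third of the examples under the skewed distribution, against 1% under the flat one. That imbalance is what flattening the frequency distribution is meant to reduce.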
To address this, the researchers propose data selection schemes based on training loss. The goal is to reduce the number of facts in the training data and flatten their frequency distribution.
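The paper's exact selection schemes are not reproduced in this article, so the following Python sketch only illustrates the two stated goals under loud assumptions. It assumes each training example carries a known fact identifier and a per-example training loss, which is plausible for a semi-synthetic setup but is not guaranteed for real data. The function `prune_by_loss`, its parameters, and the "keep the lowest-loss facts" rule are all hypothetical.

```python
from collections import defaultdict


def prune_by_loss(examples, max_facts, max_per_fact):
    """Toy loss-based pruning (hypothetical, not the authors' algorithm).

    examples: list of (fact_id, loss) pairs, one per training example.
    max_facts: keep only this many distinct facts (limits the number of facts).
    max_per_fact: keep at most this many examples per fact (flattens frequencies).
    """
    # Group losses by fact and compute the mean loss for each fact.
    losses_by_fact = defaultdict(list)
    for fact_id, loss in examples:
        losses_by_fact[fact_id].append(loss)
    mean_loss = {f: sum(ls) / len(ls) for f, ls in losses_by_fact.items()}

    # Assumption: prefer the facts with the lowest mean loss; ties break on fact_id.
    ranked = sorted(mean_loss, key=lambda f: (mean_loss[f], f))
    kept_facts = set(ranked[:max_facts])

    # Cap the number of examples per kept fact, preserving the original order.
    counts = defaultdict(int)
    pruned = []
    for fact_id, loss in examples:
        if fact_id in kept_facts and counts[fact_id] < max_per_fact:
            counts[fact_id] += 1
            pruned.append((fact_id, loss))
    return pruned


data = [("a", 0.1)] * 5 + [("b", 0.5)] * 2 + [("c", 2.0)]
print(prune_by_loss(data, max_facts=2, max_per_fact=2))
```

Here the frequent fact `a` is cut from five examples to two, and the high-loss fact `c` is dropped entirely, so the pruned set has fewer facts and a flatter frequency profile. Which facts a real scheme should favor, and how loss should map to keep-or-drop decisions, are choices the sketch makes for illustration only.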
The researchers tested their method on both semi-synthetic datasets and real-world data.

This research has several practical implications for practitioners working with LLMs:
While this work provides a promising approach to improving fact memorization, there are still areas for further exploration:
By pruning training data based on training loss, researchers have shown that it's possible to enhance fact memorization in LLMs. This not only improves the models' performance on knowledge-intensive tasks but also makes them more efficient and reliable. As the field continues to evolve, these insights could pave the way for even more advanced techniques in data selection and model training.
Original Sources
↗ https://machinelearning.apple.com/research/cram-less?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.