Pruning Training Data Enhances Fact Memorization in Large Language Models

The Engineer

14 Apr 2026 · 3 min read

Researchers propose a method to improve large language models' fact memorization by selectively pruning training data, potentially reducing hallucinations and enhancing performance on factual tasks.

Large language models (LLMs) have made significant strides in natural language processing, but they often struggle with memorizing factual knowledge. This can lead to issues like hallucinations and poor performance on tasks that require a robust understanding of facts. A recent paper by Jiayuan Ye, Vitaly Feldman, and Kunal Talwar, presented at the ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models, offers a novel approach to this problem: data pruning.

The Problem with Fact Memorization

LLMs are powerful but have limited capacity: a model of a given size can memorize only a finite amount of information. If the training data contains more facts than the model can store, fact accuracy ends up suboptimal. The problem is worse when the distribution of facts in the training data is skewed, for example following a power law, because common facts are overrepresented while rare ones are underrepresented.
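To make the skew concrete, here is a minimal sketch of how mentions spread across facts under a power law. The function name `zipf_counts`, the exponent of 1.0, and the corpus sizes are illustrative assumptions, not figures from the paper.

```python
def zipf_counts(num_facts: int, total_mentions: int, exponent: float = 1.0) -> list[float]:
    """Expected mentions per fact when the fact at rank r has weight r ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, num_facts + 1)]
    norm = sum(weights)
    return [total_mentions * w / norm for w in weights]


# Hypothetical corpus: 1,000 distinct facts mentioned 100,000 times in total.
counts = zipf_counts(num_facts=1000, total_mentions=100_000)

# Share of all mentions that goes to the ten most common facts.
top_share = sum(counts[:10]) / sum(counts)

# Number of facts expected to appear fewer than 20 times.
rare_facts = sum(1 for c in counts if c < 20)

print(f"top-10 facts take {top_share:.1%} of all mentions")
print(f"{rare_facts} of {len(counts)} facts appear fewer than 20 times in expectation")
```

In a corpus like this, a handful of facts absorb a large share of the training signal while a long tail of facts is seen only rarely.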

Information-Theoretic Perspective

The authors formalize the issue from an information-theoretic perspective. They show that fact accuracy is suboptimal, falling below the limit set by the model's capacity, whenever the information content of the facts in the training data exceeds that capacity. Skewed frequency distributions, where some facts appear far more often than others, make the shortfall worse.
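The capacity argument can be illustrated with a toy calculation. The sketch below assumes every fact costs the same number of bits and that the model has a fixed bit budget; the function `capacity_limit` and all the numbers are hypothetical and are not taken from the paper's formalization.

```python
def capacity_limit(num_facts: int, bits_per_fact: float, capacity_bits: float) -> float:
    """Upper bound on the fraction of facts that fit in the model, in this toy setting."""
    data_bits = num_facts * bits_per_fact
    if data_bits <= 0:
        raise ValueError("the dataset must contain a positive amount of information")
    # If everything fits, the bound is 1.0; otherwise only capacity/data can be stored.
    return min(1.0, capacity_bits / data_bits)


# Hypothetical model with room for 1,000,000 bits of facts, at 20 bits per fact.
for num_facts in (40_000, 50_000, 200_000):
    bound = capacity_limit(num_facts, bits_per_fact=20, capacity_bits=1_000_000)
    print(f"{num_facts:>7} facts -> best achievable accuracy {bound:.2f}")
```

Under these assumptions the bound stays at 1.0 until the data holds more information than the model can store, then shrinks as more facts are added. The finding described above is that standard training falls short of even this bound once the data exceeds capacity.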

Data Selection Schemes

To address this, the researchers propose data selection schemes based on training loss. The goal is to reduce the number of facts in the training data and flatten their frequency distribution. Here’s a breakdown of their approach:

Training Loss-Based Pruning: By selecting data points that contribute more to the training loss, they aim to retain the most informative examples while discarding less useful ones.
Flattening Frequency Distribution: This ensures that no single fact dominates the training process, leading to a more balanced and effective memorization of facts.
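As a rough illustration of the two goals above, the following sketch prunes a toy dataset. It is not the authors' procedure: it assumes each example already carries a fact label and a per-example training loss, and the function `prune_by_loss`, its parameters, and the ranking rule (total loss per fact) are hypothetical choices made for this example.

```python
from collections import defaultdict


def prune_by_loss(examples, max_facts, max_per_fact):
    """Return the indices of the examples to keep.

    examples: list of (fact_id, loss) pairs, one per training example.
    max_facts: cap on the number of distinct facts kept.
    max_per_fact: cap on repetitions of any single fact (flattens frequencies).
    """
    by_fact = defaultdict(list)
    for index, (fact_id, loss) in enumerate(examples):
        by_fact[fact_id].append((loss, index))

    # Assumed rule: rank facts by their total contribution to the training loss.
    ranked = sorted(
        by_fact,
        key=lambda fact: sum(loss for loss, _ in by_fact[fact]),
        reverse=True,
    )

    kept = []
    for fact_id in ranked[:max_facts]:
        # Within a fact, keep only the highest-loss occurrences.
        top = sorted(by_fact[fact_id], reverse=True)[:max_per_fact]
        kept.extend(index for _, index in top)
    return sorted(kept)


# Toy data: "paris" is common, "oslo" less so, "lima" is rare.
examples = [
    ("paris", 0.1), ("paris", 0.2), ("paris", 0.3), ("paris", 0.4),
    ("oslo", 0.9), ("oslo", 1.1),
    ("lima", 0.5),
]
kept_indices = prune_by_loss(examples, max_facts=2, max_per_fact=2)
print(kept_indices)  # -> [2, 3, 4, 5]
```

With these settings the common fact is cut from four occurrences to two, and the fact budget drops one fact entirely. Which facts to drop is the key design decision, and the rule used here is only one possible reading.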

Experimental Results

The researchers tested their method on both semi-synthetic datasets and real-world data. Here are the key findings:

Semi-Synthetic Datasets: On datasets containing high-entropy facts (facts with high information content), their selection method effectively boosted fact accuracy to the model's capacity limit.
Wikipedia Corpus: When pretraining a GPT2-Small model (110M parameters) on an annotated Wikipedia corpus, their method enabled the model to memorize 1.3 times more entity facts than standard training did. This matched the performance of a model more than ten times its size (1.3B parameters) pretrained on the full dataset.

Implications for Practitioners

This research has several practical implications for practitioners working with LLMs:

Efficient Training: Pruning the training data can improve fact memorization without increasing model size, which avoids the extra compute and memory that a larger model would require.
Reducing Hallucinations: Improved fact accuracy can lead to fewer hallucinations in generated text, making the models more reliable for knowledge-intensive tasks.
Scalability: Because the selection is based on training loss, the method could in principle be applied to other datasets and models, although the reported results cover only semi-synthetic data and a Wikipedia corpus.

Future Directions

While this work provides a promising approach to improving fact memorization, there are still areas for further exploration:

Generalizability: Testing the method on different types of datasets and tasks to ensure its effectiveness in diverse scenarios.
Privacy Considerations: Investigating how data pruning affects privacy concerns, especially when training on sensitive user data.

Conclusion

By pruning training data based on training loss, the researchers show that a model can memorize more facts without growing in size. That could improve performance on knowledge-intensive tasks and make smaller models a more practical choice for them. These results may also inform further work on data selection and model training.