HEADLINE: ERASE: A New Approach to Keeping Language Models Up-to-Date with Editable External Knowledge

Models & Research

The Engineer

19 Jun 2024

Researchers introduce ERASE, a technique that edits the external knowledge base a language model retrieves from, so the model stays current without being retrained.

In the rapidly changing world of natural language processing (NLP), keeping language models accurate as the world changes is a significant challenge. Common approaches either retrain the model or use retrieval-augmented generation (RAG), where new documents are inserted into a knowledge base and retrieved for downstream tasks. Retraining is expensive, and a knowledge base that only grows can keep stale entries that contradict newer ones, so answers may not reflect the latest information.

A recent paper titled "Language Modeling with Editable External Knowledge" by Belinda Z. Li, Emmy Liu, Alexis Ross, Abbas Zeitoun, Graham Neubig, and Jacob Andreas introduces ERASE (Enhancing Retrieval Augmentation with Self-consistent Editing), a method that improves model behavior at the moment new documents are added to the knowledge base. Instead of only appending the new document, ERASE incrementally deletes or rewrites other entries in the knowledge base each time a document is added, so that stored entries stay consistent with the latest information.

Key Technical Changes and Why They Matter

Incremental Updates: Unlike retrieval-augmented systems that only append to their knowledge base, ERASE edits it. Each time a new document is added, existing entries that the document affects are deleted or rewritten to stay consistent with it.
- Why It Matters: Outdated entries are removed or corrected when the document arrives, so they are less likely to be retrieved later and mislead the model.

Benchmark Performance: ERASE was evaluated on two new benchmark datasets designed to test models' ability to answer questions about a stream of news articles or conversations. Compared with conventional retrieval-augmented generation, accuracy improved by:
- Mixtral-8x7B: 7-13 percentage points (absolute).
- Llama-3-8B: 6-10 percentage points (absolute).
- Why It Matters: The gains suggest that editing the knowledge base helps in settings where information keeps changing.
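
To make "absolute" concrete: an absolute gain is a difference in percentage points, not a ratio. The short sketch below contrasts the two. The baseline accuracy is a made-up number chosen for illustration, not a figure from the paper; only the 7-point gain echoes the low end of the reported range.

```python
def absolute_gain(baseline_acc: float, new_acc: float) -> float:
    """Gain in percentage points: a plain difference."""
    return new_acc - baseline_acc


def relative_gain(baseline_acc: float, new_acc: float) -> float:
    """Gain as a percentage of the baseline."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc


# Hypothetical baseline accuracy (in percent); NOT taken from the paper.
baseline = 50.0
improved = baseline + 7.0  # a 7-point absolute gain

print(absolute_gain(baseline, improved))  # 7.0
print(relative_gain(baseline, improved))  # 14.0
```

The same absolute gain corresponds to a larger relative gain when the baseline is low, which is why it helps to know which of the two a result reports.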

How ERASE Works

ERASE operates through a multi-step process:

Document Ingestion: Each new document first goes through an ingestion step that finds the existing entries it might affect.
- Feature Extraction: The system extracts key features from the new document, such as entities and relationships.
- Similarity Search: It then searches for similar entries in the existing knowledge base using techniques like cosine similarity.
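
The similarity search above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the bag-of-words `embed` function is a toy stand-in for whatever encoder a real system would use, and the function names and example entries are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(new_doc: str, knowledge_base: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k entries most similar to the new document, best first."""
    query = embed(new_doc)
    scored = [(cosine(query, embed(entry)), entry) for entry in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up entries for illustration only.
kb = [
    "Alice works at Acme",
    "The bridge opened in 1990",
    "Bob lives in Paris",
]
hits = retrieve("Alice left Acme and joined Initech", kb, k=1)
print(hits[0][1])  # Alice works at Acme
```

Only the retrieved entries need to be checked against the new document, which keeps the cost of each update roughly independent of the total size of the knowledge base.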

Decision Making: Based on the similarity search, ERASE decides whether to:
- Delete irrelevant or outdated entries.
- Rewrite conflicting information to maintain consistency.
- Why It Matters: This decision-making process ensures that the knowledge base remains relevant and coherent, even as new information is added.

Model Update: The language model's weights are left unchanged. The edited knowledge base is what the model retrieves from when answering, so an update takes effect on the next query.
- No Retraining: Only knowledge base entries change; there is no fine-tuning step.
- Why It Matters: Editing entries is far cheaper than retraining, which is the point of keeping knowledge in an external, editable store.
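
The decide-then-edit step described above can be sketched as a small function. This is an illustrative sketch under stated assumptions, not the paper's implementation: `Action`, `Decision`, `apply_update`, and the `judge` callback are hypothetical names, and `toy_judge` is a hard-coded rule standing in for a model call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    KEEP = "keep"
    DELETE = "delete"
    REWRITE = "rewrite"


@dataclass
class Decision:
    action: Action
    new_text: Optional[str] = None  # only meaningful for REWRITE


# Hypothetical judge signature: (existing_entry, new_document) -> Decision
Judge = Callable[[str, str], Decision]


def apply_update(kb: list[str], candidates: list[str], new_doc: str, judge: Judge) -> list[str]:
    """Return a new knowledge base after judging each candidate against new_doc."""
    candidate_set = set(candidates)
    updated = []
    for entry in kb:
        if entry not in candidate_set:
            updated.append(entry)  # not retrieved, so left untouched
            continue
        decision = judge(entry, new_doc)
        if decision.action is Action.DELETE:
            continue  # drop the outdated entry
        if decision.action is Action.REWRITE and decision.new_text:
            updated.append(decision.new_text)  # store the corrected wording
        else:
            updated.append(entry)  # KEEP, or REWRITE with no text supplied
    updated.append(new_doc)  # finally, store the new document itself
    return updated


def toy_judge(entry: str, new_doc: str) -> Decision:
    """Stand-in for a model call: a hard-coded rule for this example only."""
    if "works at Acme" in entry and "left Acme" in new_doc:
        return Decision(Action.REWRITE, "Alice used to work at Acme")
    return Decision(Action.KEEP)


kb = ["Alice works at Acme", "Bob lives in Paris"]
new_kb = apply_update(kb, ["Alice works at Acme"], "Alice left Acme and joined Initech", toy_judge)
print(new_kb)
```

Returning a new list rather than mutating `kb` in place makes each update easy to inspect or roll back, which is a design choice of this sketch rather than something the article specifies.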

Implementation Details

Architecture: ERASE pairs a retrieval step, which finds the existing entries a new document may affect, with a language model that decides how each retrieved entry should change.
- Transformer Models: Pre-trained transformer language models read the new document and the retrieved entries.
- Decision-Making Module: This module judges each retrieved entry against the new document and marks it to be kept, deleted, or rewritten.

Benchmarks:
- News Article Dataset: Contains a stream of news articles with varying topics and time frames.
- Conversation Dataset: Simulates real-world conversations with evolving information.
- Why It Matters: These datasets test whether a system keeps up as information changes over time.

Code and Data Availability: The authors have made the code and data used in their experiments available on GitHub.
- Why It Matters: Open-source availability allows other researchers to reproduce and build upon this work.
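
To illustrate what evaluating on a stream involves, here is a minimal sketch of such a loop. It assumes questions are asked after each document arrives, which is an assumption of this sketch rather than a detail given above. All names (`evaluate_stream`, `append_only`, `replace_same_subject`, `toy_answer`) and the two-document stream are hypothetical, and the toy scores say nothing about the real benchmarks.

```python
from typing import Callable

# One timestep: a new document plus (question, expected_answer) pairs.
Stream = list[tuple[str, list[tuple[str, str]]]]


def evaluate_stream(
    stream: Stream,
    ingest: Callable[[list[str], str], list[str]],
    answer: Callable[[list[str], str], str],
) -> float:
    """Feed documents in order, ask questions after each, return accuracy."""
    kb: list[str] = []
    correct = total = 0
    for document, questions in stream:
        kb = ingest(kb, document)
        for question, expected in questions:
            total += 1
            if answer(kb, question).strip().lower() == expected.strip().lower():
                correct += 1
    return correct / total if total else 0.0


def append_only(kb: list[str], doc: str) -> list[str]:
    """Baseline ingestion: just add the document."""
    return kb + [doc]


def replace_same_subject(kb: list[str], doc: str) -> list[str]:
    """Toy editing ingestion: drop older entries that share the first word."""
    subject = doc.split()[0]
    return [entry for entry in kb if entry.split()[0] != subject] + [doc]


def toy_answer(kb: list[str], question: str) -> str:
    """Toy reader: last word of the first entry whose subject is in the question."""
    words = question.rstrip("?").split()
    for entry in kb:
        if entry.split()[0] in words:
            return entry.split()[-1]
    return ""


# Made-up stream for illustration only.
stream: Stream = [
    ("Alice works at Acme", [("Where does Alice work?", "Acme")]),
    ("Alice works at Initech", [("Where does Alice work?", "Initech")]),
]
print(evaluate_stream(stream, append_only, toy_answer))           # 0.5
print(evaluate_stream(stream, replace_same_subject, toy_answer))  # 1.0
```

In this toy setup the append-only baseline answers the second question from the stale entry, while the editing variant has already removed it.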

Conclusion

ERASE is a step toward keeping language models accurate in changing environments. By editing the knowledge base each time a document is added, it addresses a weakness of conventional retrieval-augmented systems, where stale entries can linger alongside newer ones. The reported gains of 6-13 percentage points on the two benchmarks support the approach. For practitioners, it offers a way to keep NLP models up to date without retraining.