Language Models Can Store 2 Bits of Knowledge Per Parameter, New Study Finds

The Engineer

28 Aug 2024 · 3 min read

Researchers find that language models store up to 2 bits of knowledge per parameter, and show how training duration and data preprocessing affect a model's capacity to retain information.

Key Takeaways:

Knowledge Storage Capacity: Language models can store up to 2 bits of knowledge per parameter.
Training and Architecture Impact: GPT-2 with rotary embedding matches or surpasses LLaMA/Mistral in knowledge storage, especially over shorter training durations.
Data Preprocessing: Prepending domain names to training data significantly boosts a model's knowledge capacity.

Introduction

In the latest installment of their "Physics of Language Models" series, Zeyuan Allen-Zhu and Yuanzhi Li delve into the scaling laws governing the knowledge storage capabilities of large language models (LLMs). Unlike previous studies that focus on loss or benchmark performance, this research quantifies how much factual knowledge a model can store. The findings are significant for practitioners looking to optimize their LLMs for specific tasks.

Technical Breakdown

Knowledge Capacity

Key Insight: At 2 bits per parameter, a 7-billion-parameter (7B) language model can store up to 14 billion bits of knowledge, which by the authors' estimate exceeds the combined knowledge content of English Wikipedia and textbooks.
Methodology: The authors estimate knowledge storage by analyzing how well models can recall factual tuples, such as (USA, capital, Washington D.C.), from controlled datasets.
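
The arithmetic behind the 14-billion-bit figure can be sketched in a few lines of Python. The function names are hypothetical, and max_uniform_facts rests on a simplifying assumption that is not the paper's estimator: each fact is treated as one value drawn uniformly from a fixed number of choices, so it costs log2(choices) bits.

```python
import math

BITS_PER_PARAM = 2.0  # headline figure reported in the study


def capacity_bits(num_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Upper bound on stored knowledge, in bits, for a model of this size."""
    return num_params * bits_per_param


def bits_to_gigabytes(bits: float) -> float:
    """Convert bits to decimal gigabytes (8 bits per byte, 1e9 bytes per GB)."""
    return bits / 8 / 1e9


def max_uniform_facts(num_params: int, num_choices: int,
                      bits_per_param: float = BITS_PER_PARAM) -> int:
    """Rough count of facts that fit, ASSUMING each fact is a single value
    drawn uniformly from num_choices options (a simplification for
    illustration, not the paper's measurement procedure)."""
    bits_per_fact = math.log2(num_choices)
    return int(capacity_bits(num_params, bits_per_param) // bits_per_fact)


if __name__ == "__main__":
    bits = capacity_bits(7_000_000_000)
    print(f"{bits:.0f} bits, about {bits_to_gigabytes(bits):.2f} GB of raw information")
```

For a 7B model this gives 14 billion bits, or about 1.75 GB of raw information. Real facts are not uniformly distributed, so the fact count is only an order-of-magnitude guide.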

Training Duration

Impact: Shorter training does not improve knowledge storage; what changes is the ranking between architectures. Over shorter training durations, the GPT-2 architecture with rotary embedding matches or outperforms LLaMA/Mistral in knowledge storage.
Reasoning: LLaMA and Mistral use GatedMLP, which is less stable and harder to train than the standard MLP used in GPT-2. Rotary embedding is a positional encoding and is not the component responsible for the difference.

Model Architecture

GPT-2 vs. LLaMA/Mistral:
- GPT-2 with Rotary Embedding: Matches or surpasses LLaMA/Mistral in knowledge storage.
- LLaMA and Mistral: Use GatedMLP, which is less stable and more challenging to train.
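
To make the architectural difference concrete, here is a minimal pure-Python sketch of the two feed-forward variants on a single input vector. It is illustrative only: biases are omitted, weights are plain nested lists, and the helper names are hypothetical. The activation choices (GELU for the GPT-2 style block, SiLU for the gated block) follow the public model definitions.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gelu(v):
    """Exact GELU, the activation used in GPT-2 style MLPs."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def silu(v):
    """SiLU (swish), the activation used in LLaMA/Mistral gated MLPs."""
    return v / (1.0 + math.exp(-v))


def standard_mlp(x, W_up, W_down):
    """GPT-2 style block: down(gelu(up(x))). Two weight matrices."""
    hidden = [gelu(v) for v in matvec(W_up, x)]
    return matvec(W_down, hidden)


def gated_mlp(x, W_gate, W_up, W_down):
    """LLaMA/Mistral style block: down(silu(gate(x)) * up(x)).
    Three weight matrices; the elementwise product is the gate."""
    gate = [silu(v) for v in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    x = [1.0, 2.0]
    print(standard_mlp(x, identity, identity))
    print(gated_mlp(x, identity, identity, identity))
```

The gated variant multiplies two learned projections of the same input, so it carries a third weight matrix and a multiplicative interaction that the standard block lacks.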

Quantization

Effect on Knowledge Storage: Even when quantized to int8 (8-bit integers), language models can still store 2 bits of knowledge per parameter. This is a crucial finding for resource-constrained environments where model size matters.
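
One way to read the int8 result is as a ratio of knowledge bits to storage bits. The sketch below is plain arithmetic on the figures quoted above; the function names are hypothetical and nothing here comes from the paper's code.

```python
def knowledge_density(knowledge_bits_per_param: float, weight_bits: int) -> float:
    """Fraction of each stored weight bit that carries knowledge."""
    if weight_bits <= 0:
        raise ValueError("weight_bits must be positive")
    return knowledge_bits_per_param / weight_bits


def weight_storage_gigabytes(num_params: int, weight_bits: int) -> float:
    """Size of the weights in decimal gigabytes at the given precision."""
    return num_params * weight_bits / 8 / 1e9


if __name__ == "__main__":
    # int8 weights holding 2 bits of knowledge each
    print(knowledge_density(2.0, 8))
    # footprint of a 7B-parameter model stored as int8
    print(weight_storage_gigabytes(7_000_000_000, 8))
```

At int8, 2 bits of knowledge per 8-bit weight means a quarter of the stored bits carry knowledge, and a 7B model occupies about 7 GB of weights.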

Sparsity Constraints

MoE (Mixture of Experts): Sparsity constraints, such as those in MoE architectures, do not significantly reduce knowledge storage capacity. They can, however, affect training stability and convergence.

Data Signal-to-Noise Ratio

Domain Prepending: Prepending domain names to the training data (e.g., wikipedia.org) significantly increases a model's ability to store knowledge.
- Mechanism: Language models can autonomously identify and prioritize domains rich in factual information, optimizing their storage capacity.
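
A preprocessing step of this kind could look like the following sketch. The helper names, the newline separator, and the choice to keep the full host name are all assumptions made for illustration; the article does not specify the exact format the authors used.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Extract the host name from a URL, with or without a scheme."""
    # urlparse only fills in netloc when a scheme or a leading "//" is present
    if "://" not in url:
        url = "//" + url
    return urlparse(url).netloc.lower()


def prepend_domain(text: str, url: str, sep: str = "\n") -> str:
    """Return the document with its source domain on the first line.
    The separator is a hypothetical choice, not taken from the paper."""
    domain = domain_of(url)
    return f"{domain}{sep}{text}" if domain else text


if __name__ == "__main__":
    doc = "Paris is the capital of France."
    print(prepend_domain(doc, "https://en.wikipedia.org/wiki/Paris"))
```

Applied to every document before tokenization, this gives the model a consistent signal about where each passage came from.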

Practical Implications

For practitioners, these findings offer several actionable insights:

Architecture Choice: GPT-2 with rotary embedding is a strong choice for tasks requiring high knowledge storage, especially when training time is limited.
Data Preprocessing: Including domain names in the training data can enhance a model's ability to store and recall factual information.
Resource Management: Quantization to int8 allows for efficient use of resources without compromising knowledge storage capacity.

Conclusion

The study by Allen-Zhu and Li provides a deeper understanding of how language models store and retrieve knowledge. By measuring the number of bits stored per parameter, the authors offer practical guidance for optimizing LLMs, whether you're working with resource-constrained devices or looking to improve your model's factual recall.