FACTS Grounding: A New Benchmark for Evaluating LLM Factuality and Grounding


The Engineer

7 Jan 2025

FACTS Grounding aims to tackle the pesky problem of AI misinformation by assessing whether large language models can reliably provide accurate answers grounded in the context they are given, a crucial step towards more trustworthy AI.

Large language models (LLMs) have revolutionized how we interact with information, but their Achilles' heel remains factual accuracy. These models can sometimes "hallucinate", that is, generate false information, especially when dealing with complex inputs. This issue not only erodes trust in LLMs but also limits their practical use in real-world scenarios.

To address this challenge, the FACTS team at Google DeepMind has introduced FACTS Grounding, a comprehensive benchmark designed to evaluate how well LLMs can generate responses that are factually accurate and grounded in the provided context. The benchmark is complemented by an online leaderboard on Kaggle, providing a transparent way to track progress in the field.

What Changed Technically?

Comprehensive Benchmark: FACTS Grounding evaluates LLMs based on their ability to produce long-form responses that are both factually accurate and well-grounded in provided source material.
Leaderboard for Transparency: The FACTS leaderboard on Kaggle allows researchers and practitioners to see how different models perform, fostering a competitive yet collaborative environment.

Key Features of FACTS Grounding

Dataset Size and Structure:
- 1,719 Examples: Each example includes a document (up to 32,000 tokens), an instruction for the LLM to reference only the provided document, and a user request.
- Public Set: 860 examples released for anyone to use in evaluating LLMs.
- Private Set: 859 examples held out to prevent benchmark contamination and leaderboard hacking.
Diverse Input Types:
- Documents cover various domains such as finance, technology, retail, medicine, and law.
- User requests include tasks like summarization, Q&A generation, and rewriting, ensuring a wide range of evaluation scenarios.
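
To make the structure above concrete, here is a minimal sketch of how one example might be modeled in Python. The class name, field names, prompt layout, and sample text are all hypothetical, chosen for illustration only; the benchmark's actual schema may differ. The split sizes come from the figures above.

```python
from dataclasses import dataclass


# Hypothetical field names; the benchmark's real schema may differ.
@dataclass(frozen=True)
class GroundingExample:
    system_instruction: str  # tells the model to rely only on the document
    document: str            # source text, up to 32,000 tokens
    user_request: str        # e.g. summarization, Q&A generation, rewriting


def build_prompt(example: GroundingExample) -> str:
    """Join the three parts into one prompt string (the layout is an assumption)."""
    return "\n\n".join(
        [example.system_instruction, example.document, example.user_request]
    )


# Split sizes as reported for the benchmark.
PUBLIC_SIZE = 860
PRIVATE_SIZE = 859
TOTAL_SIZE = PUBLIC_SIZE + PRIVATE_SIZE  # 1,719 examples in total

# A made-up toy example, not taken from the dataset.
example = GroundingExample(
    system_instruction="Answer using only the document below.",
    document="Acme Corp reported revenue of $12M in 2023.",
    user_request="Summarize the document in one sentence.",
)
prompt = build_prompt(example)
```

Keeping the instruction, document, and request as separate fields makes it easy to swap in a different prompt layout without touching the data.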

Why It Matters for Practitioners

Improved Trust and Reliability: By focusing on factuality and grounding, FACTS Grounding helps build trust in LLMs, making them more reliable for critical applications.
Benchmark Contamination Prevention: Because the private set is held out, models cannot be tuned on it, which helps keep the leaderboard a fair measure of model performance.

Initial Leaderboard Results

The initial leaderboard has been populated with scores from leading LLMs. Each score is the average of a model's performance across the public and private sets, so it reflects results on both the released and the held-out examples.
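
As a rough illustration of that averaging, the sketch below computes a combined score in two ways. Whether the leaderboard takes an unweighted mean of the two set-level scores or pools all examples is an assumption here, and the function names and input values are made up. Because the two sets are nearly the same size (860 and 859), the two readings give almost identical results.

```python
def mean_of_sets(public_score: float, private_score: float) -> float:
    """Unweighted mean of the two set-level scores (an assumed aggregation)."""
    for score in (public_score, private_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are expected as fractions in [0, 1]")
    return (public_score + private_score) / 2


def pooled_over_examples(
    public_score: float,
    private_score: float,
    n_public: int = 860,
    n_private: int = 859,
) -> float:
    """Alternative reading: weight each set by its number of examples."""
    total = n_public + n_private
    return (public_score * n_public + private_score * n_private) / total


# Made-up scores for illustration, not real leaderboard results.
simple = mean_of_sets(0.80, 0.70)
pooled = pooled_over_examples(0.80, 0.70)
print(f"mean of sets: {simple:.4f}, pooled: {pooled:.4f}")
```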

Implementation Details

Token Limits: Documents can be up to 32,000 tokens (approximately 20,000 words), so the benchmark tests grounding over long inputs as well as short ones.
Task Variety: User requests are diverse, ranging from summarization and Q&A to rewriting tasks. This variety helps evaluate a model's ability to handle different types of inputs and outputs.
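
For a quick sanity check on your own documents, the figures above imply roughly 1.6 tokens per word (32,000 / 20,000). The sketch below uses that ratio as a crude estimate. It is a heuristic only, not the benchmark's tokenizer, and the function names are hypothetical; use a real tokenizer when an exact count matters.

```python
MAX_DOCUMENT_TOKENS = 32_000
APPROX_WORDS_AT_LIMIT = 20_000

# Ratio implied by the figures above: about 1.6 tokens per word.
TOKENS_PER_WORD = MAX_DOCUMENT_TOKENS / APPROX_WORDS_AT_LIMIT


def estimate_tokens(text: str) -> int:
    """Crude token estimate from the whitespace-separated word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def fits_document_limit(text: str) -> bool:
    """True if the estimated token count is within the 32,000-token cap."""
    return estimate_tokens(text) <= MAX_DOCUMENT_TOKENS


sample = "The quarterly report covers revenue, costs, and outlook."
print(estimate_tokens(sample), fits_document_limit(sample))
```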

Next Steps

The FACTS team encourages researchers and practitioners to use the public set for evaluating their LLMs and to contribute to the leaderboard. By working together, the community can drive significant progress in improving the factuality and grounding of large language models.