AA-Omniscience Benchmark Exposes Hallucination Issues in Language Models

The Engineer

18 Nov 2025

As language models struggle with generating accurate information, AA-Omniscience offers a rigorous test to identify and quantify hallucinations across diverse fields, pushing developers to improve model reliability.

The landscape of language models is evolving rapidly, but one persistent issue is the tendency of these models to hallucinate, that is, to generate incorrect or fabricated information. To address this, Artificial Analysis (AA) has introduced a new benchmark called AA-Omniscience, designed to evaluate both knowledge and hallucination rates across a wide range of topics.

What Changed Technically

AA-Omniscience is a comprehensive benchmark that evaluates language models on 6,000 questions spanning 42 topics within six domains: Business, Humanities & Social Sciences, Health, Law, Software Engineering, and Science, Engineering & Mathematics. The key innovation here is the introduction of the Omniscience Index, a metric that penalizes models for hallucinations by deducting points when they provide incorrect answers instead of abstaining.

Why It Matters to Practitioners

Embedded knowledge in language models is crucial for real-world applications. Without accurate and reliable information, models can make incorrect assumptions, leading to potentially harmful outcomes. For instance, a model asked about the acronym MCP might search for "Multi Client Persistence" when it should be looking up "Model Context Protocol." This benchmark aims to push the development of more factual and reliable models by penalizing hallucinations.

Key Features of AA-Omniscience

Question Set: 6,000 questions across 89 sub-topics, providing a detailed view of model performance in nuanced domains.
Metrics:
- Accuracy: Percentage of correct answers.
- Hallucination Rate: Percentage of incorrect answers out of all questions the model did not answer correctly (incorrect answers plus abstentions).
- Omniscience Index: +1 for correct answers, -1 for incorrect answers where the model attempted to answer, and 0 for abstentions.
Open Source: AA is open sourcing 600 questions (10% of the dataset) to support labs in developing more factual and reliable models.
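
To make the three metrics concrete, here is a minimal Python sketch that computes them from a list of per-question grades. The grade labels, the function name, and the sample data are hypothetical and not taken from AA's tooling. Two assumptions are made: the hallucination rate is computed over questions the model did not get right, and the index is the mean of the +1/0/-1 scores multiplied by 100.

```python
from collections import Counter


def omniscience_metrics(grades):
    """Compute accuracy, hallucination rate and an Omniscience-style index.

    `grades` is a list of strings, each "correct", "incorrect" or "abstained"
    (hypothetical labels chosen for this sketch).
    """
    counts = Counter(grades)
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to score")

    correct = counts["correct"]
    incorrect = counts["incorrect"]
    abstained = counts["abstained"]
    if correct + incorrect + abstained != total:
        raise ValueError("unknown grade label")

    # Assumption: hallucination rate is measured over non-correct responses.
    not_correct = incorrect + abstained
    return {
        "accuracy": correct / total,
        "hallucination_rate": incorrect / not_correct if not_correct else 0.0,
        # +1 per correct, -1 per incorrect, 0 per abstention, averaged and
        # scaled by 100 (the scaling factor is an assumption).
        "omniscience_index": 100 * (correct - incorrect) / total,
    }


# Illustrative data only: 4 correct, 3 incorrect, 3 abstained.
example = ["correct"] * 4 + ["incorrect"] * 3 + ["abstained"] * 3
metrics = omniscience_metrics(example)
print(metrics)
```

In this made-up example the model is right 40% of the time, yet its index is only slightly above zero, because three wrong answers cancel out three of the four correct ones.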

Implementation Details

The Omniscience Index is designed to create a clear incentive for models to attempt an answer only when they are confident. This matters because many current evaluation datasets do not penalize incorrect answers, so a model loses nothing by guessing, and guessing is rewarded over admitting uncertainty. By deducting points for wrong answers, AA-Omniscience encourages models to be more cautious and reliable.
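
The incentive can be illustrated with a short expected-value calculation. This is a sketch under the +1 / -1 / 0 scoring described above, assuming the model has a calibrated probability p of answering a given question correctly (p is a symbol introduced here for illustration, not one used by AA):

```latex
\begin{aligned}
\mathbb{E}[\text{score} \mid \text{answer}] &= p \cdot (+1) + (1 - p) \cdot (-1) = 2p - 1, \\
\mathbb{E}[\text{score} \mid \text{abstain}] &= 0, \\
\text{answer only if}\quad 2p - 1 > 0 &\iff p > \tfrac{1}{2}.
\end{aligned}
```

Under these assumptions, answering pays off only when the model is more likely right than wrong. On a benchmark without a penalty, the expected score for answering is p, which is never below the score for abstaining, so guessing always looks at least as good.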

Benchmark Results

The initial results from AA-Omniscience are eye-opening:

Hallucination Rate: All but three models are more likely to hallucinate than provide a correct answer when faced with difficult questions.
Omniscience Index: This metric clearly shows the gap between accuracy and reliability, highlighting the need for better knowledge embedding in language models.

Practical Implications

For practitioners, this benchmark provides a tool for evaluating and improving model performance. Because it reports both accuracy and hallucination rates per domain, it helps pinpoint where a model is weak. For example, if a model performs well on business-related questions but struggles with health topics, developers can focus their efforts on improving the model's knowledge of health.
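
A per-domain breakdown of this kind might look like the following Python sketch. The record format, the function name, and the sample grades are hypothetical and purely illustrative, not AA results. It averages the +1/0/-1 scores per domain and picks the lowest-scoring domain.

```python
from collections import defaultdict

# Scoring taken from the Omniscience Index description above.
POINTS = {"correct": 1, "incorrect": -1, "abstained": 0}


def index_by_domain(records):
    """Average the +1/0/-1 score per domain.

    `records` is an iterable of (domain, grade) pairs, where grade is
    "correct", "incorrect" or "abstained" (hypothetical labels).
    """
    totals = defaultdict(lambda: {"n": 0, "score": 0})
    for domain, grade in records:
        totals[domain]["n"] += 1
        totals[domain]["score"] += POINTS[grade]
    return {d: t["score"] / t["n"] for d, t in totals.items()}


# Illustrative data only, not benchmark results.
records = [
    ("Business", "correct"), ("Business", "correct"),
    ("Business", "abstained"), ("Business", "incorrect"),
    ("Health", "incorrect"), ("Health", "incorrect"),
    ("Health", "correct"), ("Health", "abstained"),
]

per_domain = index_by_domain(records)
weakest = min(per_domain, key=per_domain.get)
print(per_domain, "weakest:", weakest)
```

A negative per-domain score means the model gave more wrong answers than right ones in that domain, which is a stronger signal for where to intervene than accuracy alone.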

Future Directions

AA plans to integrate AA-Omniscience into its broader suite of evaluation tools, including the Artificial Analysis Intelligence Index. This integration will provide a more holistic view of model performance, incorporating both knowledge and the probability of hallucination.

Conclusion

The introduction of AA-Omniscience marks a significant step forward in the evaluation of language models. By penalizing hallucinations and providing detailed insights into model performance across various domains, it sets a new standard for developing more reliable and accurate AI systems.