
As language models struggle with generating accurate information, AA-Omniscience offers a rigorous test to identify and quantify hallucinations across diverse fields, pushing developers to improve model reliability.
The landscape of language models is evolving rapidly, but one persistent issue has been the tendency of these models to hallucinate, that is, to generate incorrect or fabricated information. To address this, Artificial Analysis (AA) has introduced a new benchmark called AA-Omniscience, designed to evaluate both knowledge and hallucination rates across a wide range of topics.
AA-Omniscience is a comprehensive benchmark that evaluates language models on 6,000 questions spanning 42 topics within six domains: Business, Humanities & Social Sciences, Health, Law, Software Engineering, and Science, Engineering & Mathematics. The key innovation here is the introduction of the Omniscience Index, a metric that penalizes models for hallucinations by deducting points when they provide incorrect answers instead of abstaining.
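To make the idea of deducting points concrete, here is a minimal sketch of an index of this kind. It assumes a symmetric scheme (+1 for a correct answer, -1 for an incorrect one, 0 for an abstention) scaled to the range -100 to 100. The function names, the outcome labels, and the two sample models are illustrative assumptions, not AA's published implementation or results.

```python
from collections import Counter

# Assumed scoring scheme (illustrative, not AA's published implementation):
# +1 for a correct answer, -1 for an incorrect answer, 0 for an abstention.
POINTS = {"correct": 1, "incorrect": -1, "abstain": 0}


def omniscience_style_index(outcomes):
    """Return a score in [-100, 100] for a list of graded outcomes."""
    if not outcomes:
        raise ValueError("need at least one graded outcome")
    total = sum(POINTS[o] for o in outcomes)
    return 100 * total / len(outcomes)


def accuracy(outcomes):
    """Plain accuracy in percent: wrong answers and abstentions both count as zero."""
    if not outcomes:
        raise ValueError("need at least one graded outcome")
    return 100 * Counter(outcomes)["correct"] / len(outcomes)


# Two made-up models graded on 100 questions each.
cautious = ["correct"] * 40 + ["abstain"] * 50 + ["incorrect"] * 10
guesser = ["correct"] * 50 + ["incorrect"] * 50

for name, outcomes in [("cautious", cautious), ("guesser", guesser)]:
    print(name, accuracy(outcomes), omniscience_style_index(outcomes))
```

Under these assumptions the guessing model has the higher plain accuracy, yet the cautious model scores higher on the penalized index, because each wrong answer cancels out a right one.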
Embedded knowledge in language models is crucial for real-world applications. Without accurate and reliable information, models can make incorrect assumptions, leading to potentially harmful outcomes. For instance, a model that does not know what the acronym MCP stands for might search for "Multi Client Persistence" when it should be looking up "Model Context Protocol." This benchmark aims to push the development of more factual and reliable models by penalizing hallucinations.
The Omniscience Index is designed to create a clear incentive for models to attempt an answer only when they are confident. This matters because many current evaluation datasets do not penalize incorrect answers, so a guess is never scored worse than an abstention, which in effect rewards hallucination. By deducting points for wrong answers, AA-Omniscience encourages models to be more cautious and reliable.
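The incentive can be made explicit with a short expected-value calculation. This is a sketch under an assumed symmetric scheme (+1 correct, -1 incorrect, 0 abstain), where p denotes the model's probability of answering a given question correctly. The symbols and the scheme are illustrative assumptions rather than AA's stated formula.

```latex
% Accuracy-only scoring: a wrong answer costs nothing.
\mathbb{E}[\text{score} \mid \text{answer}] = p \cdot 1 + (1 - p) \cdot 0 = p \;\ge\; 0 = \mathbb{E}[\text{score} \mid \text{abstain}]

% Penalized scoring: a wrong answer costs one point.
\mathbb{E}[\text{score} \mid \text{answer}] = p \cdot 1 + (1 - p) \cdot (-1) = 2p - 1

% Answering beats abstaining only when
2p - 1 > 0 \iff p > \tfrac{1}{2}
```

Under accuracy-only scoring, answering is never worse than abstaining, however low p is. With the assumed penalty, answering pays off only when the model is more likely right than wrong.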

The initial results from AA-Omniscience are eye-opening:
For practitioners, this benchmark provides a valuable tool for evaluating and improving model performance. By reporting both accuracy and hallucination rates, it helps identify the domains where a model is weak. For example, if a model performs well on business-related questions but struggles with health topics, developers can focus their efforts on strengthening the model's knowledge in the weaker domain.
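A per-domain breakdown of this kind can be sketched as follows. The example assumes graded results arrive as (domain, outcome) pairs and reuses an assumed +1 / -1 / 0 scoring scheme. The helper names and the sample data are hypothetical and do not reflect real benchmark output.

```python
from collections import defaultdict

# Assumed scoring scheme (illustrative): +1 correct, -1 incorrect, 0 abstain.
POINTS = {"correct": 1, "incorrect": -1, "abstain": 0}


def index_by_domain(graded):
    """Map each domain to a penalized score in [-100, 100].

    `graded` is an iterable of (domain, outcome) pairs.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for domain, outcome in graded:
        totals[domain] += POINTS[outcome]
        counts[domain] += 1
    return {domain: 100 * totals[domain] / counts[domain] for domain in counts}


def weakest_domains(scores, n=1):
    """Return the n lowest-scoring domains, worst first."""
    return sorted(scores, key=scores.get)[:n]


# Made-up graded results for one model across two domains.
graded = (
    [("Business", "correct")] * 7
    + [("Business", "incorrect")] * 1
    + [("Business", "abstain")] * 2
    + [("Health", "correct")] * 3
    + [("Health", "incorrect")] * 5
    + [("Health", "abstain")] * 2
)

scores = index_by_domain(graded)
print(scores)
print(weakest_domains(scores))
```

With this sample data the Health score is negative, meaning the model gave more wrong answers than right ones there, which marks that domain as the first place to investigate.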
AA plans to integrate AA-Omniscience into its broader suite of evaluation tools, including the Artificial Analysis Intelligence Index. This integration will provide a more holistic view of model performance, incorporating both knowledge and the probability of hallucination.
The introduction of AA-Omniscience marks a significant step forward in the evaluation of language models. By penalizing hallucinations and providing detailed insights into model performance across various domains, it sets a new standard for developing more reliable and accurate AI systems.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
18 November 2025