
OpenAI's HealthBench aims to set a new standard for evaluating AI in healthcare, ensuring that advanced models are not only competent but also safe and reliable for use in medical settings.
Improving human health is one of the most critical challenges we face, and artificial intelligence (AI) has the potential to play a transformative role. From expanding access to health information to supporting clinicians in delivering high-quality care, large language models could significantly enhance our healthcare system. However, ensuring these models are both useful and safe is paramount. Today, OpenAI introduces HealthBench, a new benchmark designed to measure the capabilities of AI systems in healthcare more effectively.
Healthcare is deeply personal and complex. Errors or inaccuracies can have severe consequences, from misdiagnoses to incorrect treatments. As AI models become more integrated into clinical settings, they need to perform well in real-world scenarios, not only on narrow tests. HealthBench aims to meet this need by providing a rigorous evaluation framework that reflects the standards of healthcare professionals.
HealthBench is a benchmark designed to evaluate how well AI systems can handle realistic health conversations. It consists of 5,000 simulated interactions between AI models and users or clinicians. Each conversation comes with a custom rubric, written by physicians, that is used to grade the model's response. The goal is for the resulting scores to be meaningful, trustworthy, and open to further improvement.
Meaningful Scores: Reflect Real-World Impact
Trustworthy Scores: Faithful Indicators of Physician Judgment
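The article does not spell out how a physician-written rubric turns into a number, so the following is only a minimal sketch of one plausible scheme, not OpenAI's published implementation. It assumes each rubric criterion carries a point value (positive for desirable behavior, negative for harmful behavior), that a grader has already decided which criteria a response meets, and that the score is the earned points divided by the maximum achievable points, clipped to the range 0 to 1. The names Criterion and rubric_score are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, assumed for illustration)."""
    description: str
    points: int  # positive = desirable behavior, negative = harmful behavior


def rubric_score(criteria: Sequence[Criterion], met: Sequence[bool]) -> float:
    """Score one response against its rubric.

    Assumed scheme: sum the points of every criterion the response met,
    divide by the total of all positive points, and clip to [0, 1].
    """
    if len(criteria) != len(met):
        raise ValueError("need exactly one met/unmet flag per criterion")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric has no positive criteria to earn points on")
    earned = sum(c.points for c, is_met in zip(criteria, met) if is_met)
    return min(1.0, max(0.0, earned / max_points))


# Illustrative rubric; the criteria and point values are invented for this sketch.
example_rubric = [
    Criterion("Advises seeking emergency care for chest pain", 5),
    Criterion("Asks about symptom duration", 3),
    Criterion("Recommends a specific prescription dose without context", -4),
]

# A response that meets only the first criterion earns 5 of 8 possible points.
print(rubric_score(example_rubric, [True, False, False]))
```

Under these assumptions, a response that triggers a negative criterion loses points, and clipping keeps a response that earns only penalties from scoring below zero.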

HealthBench was built in collaboration with 262 physicians from 60 countries, ensuring a diverse and global perspective. The conversations included in the benchmark are multi-turn and multilingual, reflecting a range of layperson and healthcare provider personas across various medical specialties. This comprehensive approach ensures that the benchmark is both realistic and representative.
Alongside the release of HealthBench, OpenAI has shared how several of its models perform on this new benchmark. These results set a baseline for future improvements, providing a clear starting point for researchers and developers to build upon.
HealthBench represents a significant step forward in ensuring that AI systems are both effective and safe in healthcare settings. By focusing on real-world impact and aligning with the standards of healthcare professionals, this benchmark provides a valuable tool for advancing the field. As AI continues to evolve, HealthBench will play a crucial role in guiding its development and deployment, ultimately leading to better health outcomes for all.
About the author
Amara's entry point into AI was an epidemiology role at a London research hospital, where she spent five years studying how digital health tools reached — or conspicuously failed to reach — underserved communities. Watching early algorithmic systems in healthcare quietly entrench existing inequalities, she redirected her career toward the systemic consequences of AI at scale. She covers AI through an unflinching lens: who benefits, who bears the cost, and what evidence actually says versus what the press release claims. Her writing is calm and precise, but she doesn't mistake balance for neutrality.
More from The Steward →This Week's Edition
13 May 2025