In May 2025, OpenAI introduced HealthBench, a comprehensive benchmark designed to assess the performance of large language models (LLMs) in real-world healthcare settings. Developed with input from hundreds of physicians worldwide, HealthBench is more than just a technical metric: it aims to set a new standard for evaluating AI's role in patient care, medical accuracy, and doctor-AI collaboration.
What is HealthBench and how does it work?
HealthBench is a testing framework that evaluates how AI models respond to 5,000 realistic medical conversations. These simulations range from emergencies to general health inquiries, and each is judged against a detailed rubric written by a group of more than 260 physicians from 60 countries. The rubrics measure clinical accuracy, communication skills, and contextual understanding, and together they contain over 48,000 individual grading criteria.
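To make the rubric idea concrete, here is a minimal Python sketch of how a single response might be scored. It assumes each criterion carries a point value (positive for desirable behavior, negative for harmful behavior) and that the score is the points earned divided by the maximum positive points, clipped to the range 0 to 1. The `Criterion` class, the `rubric_score` function, and the sample criteria are all hypothetical illustrations, not OpenAI's actual code or data.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure for illustration)."""
    description: str
    points: int  # positive = desirable behavior, negative = undesirable
    met: bool    # the grader's judgment for this particular response


def rubric_score(criteria: list[Criterion]) -> float:
    """Points earned divided by maximum achievable points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    # Met criteria add their points; met negative criteria subtract.
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Invented example: an emergency-style conversation with three criteria.
example = [
    Criterion("Advises calling emergency services", 10, True),
    Criterion("Asks when the symptoms started", 5, False),
    Criterion("Recommends an unsafe dosage", -8, False),
]
score = rubric_score(example)
print(f"{score:.2f}")
```

In this sketch the response earns 10 of a possible 15 points. A response that triggered the negative criterion would lose 8 points, which is how a rubric of this kind can penalize harmful advice rather than only rewarding correct statements.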
Why is HealthBench different from past evaluations?
Unlike traditional tests that mimic medical exams, HealthBench mirrors real-world complexities. It includes thematic categories such as emergency care, global health, and diagnostic uncertainty. This provides a clearer picture of how an AI might perform in actual patient-clinician scenarios. Importantly, it's also open-source, allowing researchers to build upon and improve the system collaboratively.
Which AI models are leading on HealthBench?
OpenAI's o3 reasoning model currently tops the leaderboard with a 60% score. xAI's Grok 3 scored 54%, while Google's Gemini 2.5 Pro trailed at 52%. Interestingly, even smaller models like GPT-4.1 nano are showing strong results, outperforming earlier, more expensive models such as GPT-4o. The evaluations also found that model responses sometimes outscore those written by unaided physicians, while physicians who start from an AI-generated draft tend to produce stronger answers than they do alone.
What's the real-world impact of HealthBench?
HealthBench aims to guide AI development with physician-centered values. It doesn't just check whether an answer is technically right; it scores whether the answer reflects sound clinical judgment. This is vital as healthcare organizations begin to adopt AI tools more widely. The benchmark's transparent, rigorous grading could help shape policies and tools that safely integrate AI into hospitals and clinics.
Conclusion:
HealthBench is a timely response to the rapid evolution of AI in medicine. By grounding its evaluations in physician expertise and real-world complexity, it offers a much-needed framework to ensure AI tools are reliable, responsible, and ready to assist in high-stakes environments like healthcare. With open collaboration and real-world relevance at its core, HealthBench may soon become the standard against which all medical AIs are measured.