
OpenAI's new dataset evaluates how well AI answers medical questions
13 May 2025
OpenAI has launched HealthBench, a comprehensive dataset to assess the performance of AI models in answering health-related questions.
Backed by detailed evaluation tools, this open-source resource is touted as a major step forward for AI applications in healthcare.
HealthBench was developed in partnership with 262 doctors across 60 countries and features 5,000 simulated health conversations.
How are responses graded?
Steps
Each AI response is assessed against a guide designed by doctors, with criteria weighted according to medical judgment.
The responses are scored using GPT-4.1, an advanced language model developed by OpenAI.
Having doctors from many countries write the grading criteria is meant to keep the dataset thorough and reflective of medical perspectives from around the world.
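The weighted grading described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's implementation: the `Criterion` class, the `rubric_score` function, the point values, and the criteria text are all hypothetical, and the formula (points earned divided by the maximum positive points, clipped to the 0 to 1 range) is an assumption about how a weighted rubric could be turned into a percentage. In HealthBench, the met or not-met judgment for each criterion comes from the GPT-4.1 grader; here it is hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line (hypothetical structure, for illustration only)."""
    description: str
    points: int  # positive = desirable behavior, negative = harmful behavior
    met: bool    # would come from a model grader; hard-coded in this sketch


def rubric_score(criteria):
    """Return earned points / maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    # Negative-point criteria that are met subtract from the total.
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Made-up rubric for a single response; not taken from HealthBench.
example = [
    Criterion("Advises calling emergency services", 10, True),
    Criterion("Advises checking for breathing", 5, True),
    Criterion("Explains how to start chest compressions", 5, False),
    Criterion("Recommends giving food or water", -4, False),
]

score = rubric_score(example)
print(f"{score:.0%}")
```

With these made-up weights the response earns 15 of 20 possible points. Weighting lets the doctors who write the criteria make a critical step, such as calling for help, count for more than a minor one.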
OpenAI's o3 model outperforms competitors in HealthBench
Comparison
According to the HealthBench results published by OpenAI, the company's o3 reasoning model outperformed its competitors with a score of 60%.
It was followed by xAI's Grok 3 at 54% and Google's Gemini 2.5 Pro at 52%.
The dataset includes conversations in 49 languages and covers 26 medical specialties, such as neurology and ophthalmology, so it can be used to evaluate AI performance in healthcare across different regions and fields.
A look at how HealthBench works
Example
An example shared by OpenAI shows how the dataset can be used to assess an AI model's response to a medical emergency.
Here, the AI was asked what to do on finding a neighbor unresponsive on the floor. The model recommended calling emergency services, checking for breathing, and making sure the airway is clear.
HealthBench evaluated this response, marking correct actions and areas for improvement, and gave it a score of 77%.