Healthcare professionals are increasingly taking a leading role in evaluating artificial intelligence tools before they are widely adopted in hospitals and clinics.
At the HIMSS Global Health Conference & Exposition 2026 in Las Vegas, experts discussed how initiatives like the Healthcare AI Challenge are helping health systems assess AI technologies more safely and effectively.
The discussion featured Nabile Safdar, chief AI officer at Emory Healthcare, and Bernardo Bizzo, senior director of artificial intelligence at Mass General Brigham.
Need for Clear Evaluation Frameworks
Artificial intelligence is rapidly transforming healthcare, from medical imaging to automated documentation. However, many clinicians still lack structured frameworks to evaluate whether these tools are safe, reliable, and useful in real clinical settings.
According to Safdar and Bizzo, healthcare organizations often struggle to decide whether new AI systems truly deliver value before investing in them.
Deploying AI tools can require major investments, including:
- Integrating the technology into existing workflows
- Training healthcare staff
- Ensuring compatibility with hospital systems
Because of these costs, hospitals need reliable ways to measure the benefits of AI before adopting it.
The Healthcare AI Challenge Initiative
To address these concerns, researchers created the Healthcare AI Challenge, a collaborative initiative that allows clinicians and researchers to test different AI models using shared datasets and standardized evaluation methods.
The initiative includes a platform known as “AI Arena,” where clinical experts compare outputs from different AI models.
Through this platform, participants can evaluate AI systems across several criteria, including:
- Technical accuracy
- Clinical usefulness
- Speed and workflow efficiency
- Overall impact on healthcare delivery
Safdar noted that accuracy alone does not determine whether a tool is useful.
“Often your family practice clinician is thinking, ‘Does it make me faster?’” Safdar said.
Large‑Scale Testing of AI Models
So far, the Healthcare AI Challenge has organized five evaluation challenges involving more than 4,500 assessments from around 200 participants across 40 institutions.
Researchers have tested 18 foundation models, including both general-purpose systems and models designed specifically for healthcare applications.
Participants can also compare AI performance with human clinicians to determine whether the technology truly improves productivity.
Measuring Real‑World Impact
Experts say the goal is to create a repeatable evaluation process that healthcare organizations can use before adopting new AI tools.
This approach could help hospitals better understand the potential return on investment (ROI) and avoid deploying technologies that may not deliver meaningful benefits.
Looking ahead, Bizzo said the platform may eventually integrate directly with electronic health record (EHR) systems and support the evaluation of emerging agentic AI workflows.
These efforts aim to measure not just technical performance but also real improvements in clinical efficiency.
“We want to measure how much more efficient users are and have that information available so you know how much ROI you can expect,” Bizzo said.
A Growing Role for Clinicians in AI Oversight
As AI becomes more deeply embedded in healthcare, clinicians are expected to play a critical role in determining how and when these tools are used.
By involving doctors and medical researchers in the evaluation process, healthcare systems hope to ensure that AI technologies improve patient care while maintaining safety and trust.