The integration of artificial intelligence into the healthcare sector has moved beyond the realm of speculative science fiction and into the daily reality of clinical practice. Every day, millions of Americans turn to large language models (LLMs) to decipher symptoms, understand diagnoses, or validate treatment plans. This shift is not limited to patients; recent data suggests that two out of three U.S. physicians now incorporate AI tools into their professional workflows, with approximately 20% using these systems to assist with direct patient care decisions. However, as the adoption of these tools accelerates, a critical vacuum of evidence has persisted regarding which systems can be trusted with human life.
Traditional methods of evaluating medical AI have long relied on standardized testing, such as the United States Medical Licensing Examination (USMLE). While a model’s ability to pass a multiple-choice exam is impressive, it is a poor proxy for clinical competence. Passing a board exam requires the recall of textbook facts, but managing a complex patient requires nuanced reasoning, the ability to prioritize interventions, and an acute awareness of when an omission is just as dangerous as an incorrect prescription. Addressing this gap, a landmark study known as NOHARM (Numerous Options Harm Assessment for Risk in Medicine)—conducted by a multi-institutional team from Stanford, Harvard, and other leading centers—has provided the most comprehensive evaluation of medical AI to date. The results offer a startling look at the current state of digital health, revealing that the best AI models are already outperforming board-certified internists in key safety and accuracy metrics.
Beyond the Board Exam: The NOHARM Methodology
To move beyond the limitations of "textbook" vignettes, the NOHARM researchers developed a rigorous testing framework based on 100 real-world clinical cases. These were not sanitized scenarios; they were drawn from the electronic consultation systems of Stanford Health Care, representing actual queries submitted by primary care physicians to specialists. These cases are inherently messy, containing the ambiguities and conflicting data points that define real-world medicine.
The study employed 29 board-certified specialists and sub-specialists to review the potential actions recommended by 31 different AI models. Each recommendation was scrutinized for clinical appropriateness and potential for harm. The experts generated over 12,000 annotations across more than 4,000 specific decision points, such as whether to order a specific imaging study, prescribe a particular medication, or escalate a patient to emergency care. This "ground truth" was established with high consensus, as the specialists agreed on the correct course of action more than 95% of the time.
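To make the structure of such an evaluation concrete, the sketch below shows how per-decision annotations could roll up into a case-level score. The field names, penalty weights, and scoring rule are illustrative assumptions for this article, not the study's actual schema or metric.

```python
from dataclasses import dataclass

@dataclass
class DecisionPoint:
    """One expert-annotated action for a case (illustrative schema, not the study's)."""
    action: str                  # e.g. "order CT angiogram"
    expert_says_indicated: bool  # consensus ground truth: should this be done?
    severe_if_wrong: bool        # would the wrong call risk severe harm?

def score_case(decisions: list[DecisionPoint], model_recommends: dict[str, bool]) -> float:
    """Toy scoring: reward agreement with consensus, penalize severe errors more.

    Both omissions (missing an indicated action) and commissions (recommending a
    contraindicated one) count against the model.
    """
    score = 0.0
    for d in decisions:
        recommended = model_recommends.get(d.action, False)
        if recommended == d.expert_says_indicated:
            score += 1.0
        else:
            score -= 2.0 if d.severe_if_wrong else 1.0
    return score / len(decisions)

# Example: two decision points, one severe omission error
case = [
    DecisionPoint("order CT angiogram", True, True),
    DecisionPoint("start broad-spectrum antibiotics", False, False),
]
print(score_case(case, {"order CT angiogram": False,
                        "start broad-spectrum antibiotics": False}))
```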
The Leaderboard: Specialized Systems vs. Generalist Giants
The performance data revealed a clear hierarchy among AI models. The overall winner was AMBOSS LiSA 1.0, a system that achieved a score of 62.3%. While this percentage might seem low to a layperson, it reflects an incredibly high bar: the scoring system penalized models for "safety traps" and harmful recommendations across dozens of decision points per case. AMBOSS LiSA’s success is largely attributed to its architecture as a retrieval-augmented generation (RAG) system, which grounds its responses in a curated, high-quality medical knowledge base rather than relying solely on the probabilistic "next-word" logic of a general-purpose LLM.
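The internals of AMBOSS LiSA are not public, but the retrieval-augmented pattern described above can be sketched generically. Everything below, including the tiny knowledge base, the keyword-overlap retrieve() function, and the prompt template, is a hypothetical illustration rather than the product's actual pipeline.

```python
# Minimal retrieval-augmented sketch: a small "curated" knowledge base and a
# crude keyword-overlap retriever. Real systems use embedding search over a
# vetted corpus; this only illustrates the grounding step.

KNOWLEDGE_BASE = {
    "chest pain workup": "Guideline: risk-stratify with ECG and troponin before discharge.",
    "anticoagulation reversal": "Guideline: for major bleeding on warfarin, give vitamin K and PCC.",
    "community-acquired pneumonia": "Guideline: use CURB-65 to decide inpatient vs outpatient care.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank snippets by keyword overlap between the query and each topic key."""
    q_words = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda kv: len(q_words & set(kv[0].split())),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]

def build_prompt(query: str) -> str:
    """Ground the model's answer in retrieved guideline text rather than free recall."""
    context = "\n".join(retrieve(query))
    return f"Use ONLY the guidance below to answer.\n{context}\n\nQuestion: {query}"

print(build_prompt("workup for acute chest pain in clinic"))
```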
Closely following the leader were Google’s Gemini 2.5 Pro (59.9%), Glass Health 4.0 (59.0%), OpenAI’s GPT-5 (58.3%), and Anthropic’s Claude Sonnet 4.5 (58.2%). The statistical separation among these top-tier models was minimal, suggesting that the industry’s "frontier" models are converging on a similar level of clinical capability. However, the study also highlighted a significant "performance floor." Smaller "mini" models, such as GPT-4o mini and various iterations of the o1 and o3 series, struggled significantly, with scores languishing in the 40% range. This suggests that the "distillation" of models to make them faster and cheaper often strips away the complex reasoning required for medical safety.
The Safety-Restraint Paradox
One of the most profound findings of the NOHARM study is what researchers have termed the "Safety-Restraint Paradox." In the world of AI development, "restraint" refers to the model’s tendency to withhold a recommendation when it is uncertain. To avoid liability and prevent "hallucinations," many developers tune their models to be extremely cautious, often defaulting to generic advice like "consult your doctor."
However, the study found that excessive caution is a double-edged sword. Recommendations carrying a risk of severe harm appeared in 22% of cases, and 77% of those harmful instances were errors of omission: the model failed to suggest a critical action, such as a life-saving test or an urgent referral. This creates an inverted-U relationship between restraint and safety. If a model has too little restraint, it makes reckless, "trigger-happy" recommendations. If it has too much, it becomes uselessly silent in the face of a medical emergency. The safest models, such as Gemini 2.5 Pro, occupied the middle ground, balancing the courage to recommend necessary interventions with the wisdom to avoid unnecessary ones.
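A toy model makes the inverted-U shape easy to see. The curves and weights below are invented purely to illustrate how commission harm falls and omission harm rises as restraint increases; they are not fitted to the NOHARM data.

```python
# Toy model of the inverted-U: as a model becomes more restrained, reckless
# (commission) errors fall but omission errors rise. Shapes are illustrative only.

def expected_harm(restraint: float) -> float:
    """restraint in [0, 1]: 0 = recommends everything, 1 = recommends nothing."""
    commission_harm = (1.0 - restraint) ** 2   # unnecessary or risky interventions
    omission_harm = restraint ** 2             # missed critical tests and referrals
    return commission_harm + omission_harm

# Harm is highest at the extremes and lowest at intermediate restraint.
for r in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"restraint={r:.2f}  expected harm={expected_harm(r):.2f}")
```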

Man vs. Machine: A Provocative Comparison
Perhaps the most controversial aspect of the research was the direct comparison between AI and human physicians. The team tested 10 board-certified internal medicine physicians, allowing them to use standard digital resources like UpToDate and internet searches, but prohibiting the use of AI.
The data showed that the top-performing AI models outperformed the human internists by 15 percentage points overall and by 10 points on safety metrics. This does not suggest that AI is ready to replace doctors; rather, it highlights the limitations of human cognitive bandwidth when processing complex, multi-factorial cases under time pressure. Human physicians still possess essential qualities that AI lacks—physical examination skills, emotional intelligence, ethical accountability, and the ability to navigate the social determinants of health. However, the study makes a compelling case that a doctor working with AI is likely to be significantly safer than a doctor working alone.
The Power of the "AI Tumor Board"
In oncology, a "tumor board" is a meeting where specialists from different disciplines—surgeons, radiologists, and oncologists—collaborate to determine the best treatment plan. The NOHARM study found that AI works best when it mimics this collaborative approach.
The researchers tested "multi-agent" configurations where one model acted as an advisor and others acted as "guardians" to review and refine the output. These multi-agent systems were nearly six times more likely to reach the top quartile of safety performance compared to solo models. Interestingly, the best results came from "heterogeneous" teams—combinations of models from different developers. For instance, a team consisting of Meta’s Llama 4 Scout (open-source), Google’s Gemini 2.5 Pro (proprietary), and AMBOSS LiSA (medically grounded) proved to be more robust than a team of three different versions of GPT. This diversity of "thought" prevents the system from falling into the same cognitive traps, as different architectures have different strengths and blind spots.
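The study's exact orchestration is not reproduced here, but the advisor-and-guardian pattern it describes can be sketched with stub functions standing in for real model calls. The case text, action lists, and vetoes below are hypothetical.

```python
# Sketch of the advisor/guardian pattern: one model proposes, others review.
# The stub functions stand in for calls to real LLMs; no actual model behavior is implied.

def advisor(case: str) -> list[str]:
    """Hypothetical advisor model: proposes candidate actions for the case."""
    return ["order d-dimer", "start empiric anticoagulation", "discharge home"]

def guardian_safety(case: str, actions: list[str]) -> set[str]:
    """Hypothetical guardian: vetoes proposed actions it judges unsafe."""
    return {"start empiric anticoagulation", "discharge home"}

def guardian_completeness(case: str, actions: list[str]) -> set[str]:
    """Hypothetical guardian: flags critical actions the advisor omitted."""
    return {"obtain CT pulmonary angiogram"}

def multi_agent_plan(case: str) -> list[str]:
    proposed = advisor(case)
    vetoed = guardian_safety(case, proposed)
    additions = guardian_completeness(case, proposed)
    # Keep advisor actions that survive review, plus guardian-identified omissions.
    return [a for a in proposed if a not in vetoed] + sorted(additions)

print(multi_agent_plan("dyspnea and pleuritic chest pain after a long flight"))
```

In a heterogeneous deployment, each of these stubs would be backed by a model from a different developer, the configuration the study found to be most robust.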
Industry Implications and the Path Forward
The NOHARM study serves as a wake-up call for the healthcare technology industry. It demonstrates that the future of medical AI lies in specialization and grounding. General-purpose models, while impressive, are prone to dangerous omissions if they are not specifically tuned for clinical workflows.
For healthcare administrators and clinicians, the takeaway is that "which AI" matters just as much as "whether AI." The massive performance gap between the best and worst models—where the worst made three times as many severe errors as the best—means that hospital systems must be discerning in their procurement processes. The NOHARM leaderboard, which the researchers intend to maintain as a live, public resource, provides the first objective "Consumer Reports"-style guide for medical AI.
Looking ahead, we are likely to see a shift toward "agentic" medical systems—AI that doesn’t just answer questions but actively monitors patient data and suggests interventions in real-time, cross-checked by guardian models. As these systems move from administrative support into the heart of clinical decision-making, the rigorous, harm-based evaluation pioneered by the NOHARM team will become the gold standard for regulatory approval and ethical deployment.
In the final analysis, the goal is not to automate the physician but to augment the clinical environment to a point where "preventable harm" becomes a relic of the past. By identifying the specific ways AI can fail—and the specific configurations that make it succeed—this research provides a roadmap for a future where digital intelligence serves as a reliable, life-saving ally at the bedside.
