Differences

This shows you the differences between two versions of the page.

--- healthadminbench [2026/03/26 09:12] nigam created
+++ healthadminbench [2026/03/26 09:15] (current) nigam
@@ Line 1: / Line 1: @@
-Placeholder page for ...
+HealthAdminBench: Evaluating Computer-Use Agents on Realistic Healthcare Administration Tasks
+Healthcare administration accounts for $1 trillion in annual spending, making it an ideal application for LLM-based computer-use agents. While clinical applications of LLMs have received significant attention, no benchmark currently exists for evaluating LLMs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, which consists of four realistic GUI environments (an EHR, two payer portals, and a fax system) with 70 expert-defined tasks, including insurance verification, prior authorization, clinical reasoning, and medical equipment ordering.
+Each task is annotated with fine-grained, verifiable success criteria, totaling 947 unique points of evaluation. We evaluate six frontier computer-use agents under multiple prompting and observation settings.
+The best-performing agent on full-task success, Claude Opus 4.5, completes only 22.9% of tasks, while Gemini 3 attains the highest subtask success rate (76.4%), revealing a substantial gap between current agent capabilities and the demands of real-world administrative work. Across models, document-handling subtasks remain the lowest-performing category, with five of six models achieving under 20% accuracy, which makes cross-application data transfer the dominant source of failure. HealthAdminBench establishes a rigorous foundation for measuring progress toward safe, reliable automation of high-stakes healthcare administrative tasks, and we welcome future contributions from the community.
