HealthAdminBench: Evaluating Computer-Use Agents on Realistic Healthcare Administration Tasks
Healthcare administration accounts for $1 trillion in annual spending, making it an ideal application for LLM-based computer-use agents. While clinical applications of LLMs have received significant attention, no benchmark currently exists for evaluating LLMs on end-to-end administrative workflows. To address this gap, we introduce HEALTHADMINBENCH, which consists of four realistic GUI environments (an EHR, two payer portals, and a fax system) and 70 expert-defined tasks spanning insurance verification, prior authorization, clinical reasoning, and medical equipment ordering. Each task is annotated with fine-grained, verifiable success criteria, totaling 947 unique points of evaluation. We evaluate six frontier computer-use agents under multiple prompting and observation settings.
Claude Opus 4.5, the best-performing agent on full-task success, completes only 22.9% of tasks, while Gemini 3 attains the highest subtask success rate (76.4%). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative work. Across models, document-handling subtasks remain the lowest-performing category, with five of six models achieving below 20% accuracy, which makes cross-application data transfer the dominant source of failure. HEALTHADMINBENCH establishes a rigorous foundation for measuring progress toward safe, reliable automation of high-stakes healthcare administrative tasks, and we welcome future contributions from the community.
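The difference between full-task success and subtask success can be illustrated with a minimal sketch. It assumes that a task counts as completed only when every one of its success criteria passes, and that the subtask success rate is the fraction of all criteria passed, pooled across tasks. Both aggregation rules, the `TaskResult` type, and the task names are hypothetical illustrations, not the benchmark's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one agent run on one task (hypothetical structure)."""
    task_id: str
    criteria_passed: list[bool]  # one entry per verifiable success criterion


def full_task_success_rate(results: list[TaskResult]) -> float:
    # Assumption: a task succeeds only if all of its criteria pass.
    return sum(all(r.criteria_passed) for r in results) / len(results)


def subtask_success_rate(results: list[TaskResult]) -> float:
    # Assumption: criteria are pooled across tasks (a micro-average).
    total = sum(len(r.criteria_passed) for r in results)
    passed = sum(sum(r.criteria_passed) for r in results)
    return passed / total


# Illustrative data only; these are not results from the benchmark.
results = [
    TaskResult("insurance-verification", [True, True, True, True]),
    TaskResult("prior-authorization", [True, True, True, False]),
    TaskResult("equipment-order", [True, False, True, True]),
]

print(f"full-task success: {full_task_success_rate(results):.1%}")
print(f"subtask success:   {subtask_success_rate(results):.1%}")
```

In this toy data, 10 of 12 criteria pass but only one of three tasks is completed end to end, which shows how a high subtask success rate can coexist with a low full-task success rate: a single failed criterion is enough to fail the whole task.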