healthadminbench

healthadminbench [2026/03/26 09:12]
nigam created
healthadminbench [2026/03/26 09:15] (current)
nigam
HealthAdminBench: Evaluating Computer-Use Agents on Realistic Healthcare Administration Tasks

Healthcare administration accounts for $1 trillion of annual spend, representing an ideal application for LLM-based computer-use agents. While clinical applications of LLMs have received significant attention, no benchmark currently exists for evaluating LLMs on end-to-end administrative workflows. To address this gap, we introduce HEALTHADMINBENCH, which consists of four realistic GUI environments (an EHR, two payer portals, and a fax system) with 70 expert-defined tasks including insurance verification, prior authorization, clinical reasoning, and medical equipment ordering. Each task is annotated with fine-grained, verifiable success criteria, totaling 947 unique points of evaluation. We evaluate six frontier computer-use agents under multiple prompting and observation settings.

The best-performing agent on full-task success, Claude Opus 4.5, successfully completes only 22.9% of tasks, while Gemini 3 attains the highest subtask success rate (76.4%), revealing a substantial gap between current agent capabilities and the demands of real-world administration work. Across models, document handling subtasks remain the lowest-performing category, with five of six models achieving under 20% accuracy, making cross-application data transfer the dominant source of failure. HEALTHADMINBENCH establishes a rigorous foundation for measuring progress toward safe, reliable automation of high-stakes healthcare administrative tasks, and we welcome future contributions from the community.
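
The gap between full-task success and subtask success can be illustrated with a minimal sketch. It assumes (this page does not state the scoring rules) that a task counts as a full success only when every one of its success criteria passes, and that the subtask success rate pools all criteria across tasks. The task names, the pass/fail values, and both function names are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List


def full_task_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose criteria ALL passed (assumed definition)."""
    return sum(all(checks) for checks in results.values()) / len(results)


def subtask_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of individual criteria passed, pooled over tasks (assumed)."""
    total = sum(len(checks) for checks in results.values())
    passed = sum(sum(checks) for checks in results.values())
    return passed / total


# Hypothetical toy data: three tasks with four success criteria each.
toy_results = {
    "verify_insurance": [True, True, True, True],
    "prior_authorization": [True, True, True, False],
    "order_equipment": [True, False, True, True],
}

full_rate = full_task_success_rate(toy_results)
subtask_rate = subtask_success_rate(toy_results)
print(f"full-task success: {full_rate:.1%}")
print(f"subtask success:   {subtask_rate:.1%}")
```

In this toy data most individual criteria pass, yet only one of the three tasks passes all of them, so the full-task rate sits well below the subtask rate. Under these assumptions, a single failed step, such as one document-handling subtask, is enough to fail an otherwise correct task.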
healthadminbench.1774541568.txt.gz · Last modified: 2026/03/26 09:12 by nigam