This shows you the differences between two versions of the page.
|
healthadminbench [2026/03/26 09:12] nigam created |
healthadminbench [2026/03/26 09:15] (current) nigam |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | Placeholder page for ... | + | HealthAdminBench: |
| + | |||
| + | Healthcare administration accounts | ||
| + | medical equipment ordering. Each task is annotated with fine-grained, | ||
| + | |||
| + | The best-performing agent on full-task success, Claude Opus 4.5, completes only 22.9% of tasks, while Gemini 3 attains the highest subtask success rate (76.4%), revealing a substantial gap between current agent capabilities and the demands of real-world administration work. Across models, document-handling subtasks remain the lowest-performing category, with five of six models achieving under 20% accuracy, making cross-application data transfer the dominant source of failure. HealthAdminBench establishes a rigorous foundation for measuring progress toward safe, reliable automation of high-stakes healthcare administrative tasks, and we welcome future contributions from the community. | ||
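
The gap between the full-task figure (22.9%) and the subtask figure (76.4%) is easier to read with the two metrics side by side. The sketch below is illustrative only: it assumes a task counts as a full success only when every one of its subtasks passes, which the page text does not state explicitly. The `TaskResult` type, the function names, and the sample data are hypothetical and are not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record: one task and the pass/fail outcome of each subtask."""
    task_id: str
    subtask_passed: list[bool]


def full_task_success_rate(results: list[TaskResult]) -> float:
    # Assumption: a task succeeds only if all of its subtasks pass.
    return sum(all(r.subtask_passed) for r in results) / len(results)


def subtask_success_rate(results: list[TaskResult]) -> float:
    # Fraction of individual subtasks passed, pooled across all tasks.
    total = sum(len(r.subtask_passed) for r in results)
    passed = sum(sum(r.subtask_passed) for r in results)
    return passed / total


# Made-up outcomes for four tasks with four subtasks each.
results = [
    TaskResult("a", [True, True, True, True]),
    TaskResult("b", [True, True, True, False]),
    TaskResult("c", [True, False, True, True]),
    TaskResult("d", [True, True, False, True]),
]

print(full_task_success_rate(results))  # 0.25
print(subtask_success_rate(results))    # 0.8125
```

Under this assumption a single failed subtask sinks the whole task, so an agent can pass most subtasks and still complete few tasks end to end.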