healthgpt [2023/04/30 18:02] nigam
healthgpt [2023/04/30 18:03] (current) nigam
Although foundation models (FMs), including large language models (LLMs), have immense potential in healthcare, evaluating their usefulness, fairness, and reliability is challenging because shared evaluation frameworks and datasets are lacking. Over 80 clinical FMs have been created, but their evaluation regimes do not establish or validate their presumed clinical value. In addition, until their factual correctness and robustness are ensured, it is difficult to justify the use of LLMs in clinical practice. Read about [[https://hai.stanford.edu/news/shaky-foundations-foundation-models-healthcare|The Shaky Foundations of Foundation Models in Healthcare]] and see the preprint on [[https://arxiv.org/abs/2303.12961|arXiv]].
We examined the safety and accuracy of GPT-4 in serving the curbside consultation needs of doctors. Read about [[https://hai.stanford.edu/news/how-well-do-large-language-models-support-clinician-information-needs|How Well Do Large Language Models Support Clinician Information Needs?]] and check out the preprint on [[https://arxiv.org/abs/2304.13714|arXiv]]. We also evaluated the ability of GPT-4 to generate realistic USMLE Step 2 exam questions by asking licensed physicians to distinguish between AI-generated and human-generated questions and to assess their validity. The results indicate that GPT-4 can create questions that are largely indistinguishable from human-generated ones, with a majority of the questions deemed "valid". Read more on [[https://www.medrxiv.org/content/10.1101/2023.04.25.23288588v1|medRxiv]].
The video below summarizes the work described above and outlines the [[https://www.linkedin.com/feed/update/urn:li:activity:7031128969005432832/|questions we should always ask]] when considering LLMs for clinical use.