====== Large language models in Healthcare ======

The past year has seen significant advancements in artificial intelligence (AI) across modalities such as text, image, and video. Foundation models, which are AI models trained on large, unlabeled datasets and highly adaptable to new applications, are driving these innovations. This new class of models offers opportunities for a better paradigm of doing "AI in healthcare" by providing adaptability with fewer manually labeled examples, modular and robust AI, multimodality, and new interfaces for human-AI collaboration. Read about [[https://hai.stanford.edu/news/how-foundation-models-can-advance-ai-healthcare|How Foundation Models Can Advance AI in Healthcare]].

Although foundation models (FMs), including large language models (LLMs), have immense potential in healthcare, evaluating their usefulness, fairness, and reliability is challenging because shared evaluation frameworks and datasets are lacking. Over 80 clinical FMs have been created, but their evaluation regimes do not establish or validate their presumed clinical value. In addition, until their factual correctness and robustness are ensured, it is difficult to justify the use of LLMs in clinical practice. Read about [[https://hai.stanford.edu/news/shaky-foundations-foundation-models-healthcare|The Shaky Foundations of Foundation Models in Healthcare]] and see the preprint on [[https://arxiv.org/abs/2303.12961|arXiv]].

We examined the safety and accuracy of GPT-4 in serving the curbside consultation needs of doctors. Read about [[https://hai.stanford.edu/news/how-well-do-large-language-models-support-clinician-information-needs|How Well Do Large Language Models Support Clinician Information Needs?]] and check out the preprint on [[https://arxiv.org/abs/2304.13714|arXiv]].
We also evaluated the ability of GPT-4 to generate realistic USMLE Step 2 exam questions by asking licensed physicians to distinguish between AI-generated and human-generated questions and to assess their validity. The results indicate that GPT-4 can create questions that are largely indistinguishable from human-generated ones, with a majority of the questions deemed "valid". Read more on [[https://www.medrxiv.org/content/10.1101/2023.04.25.23288588v1|medRxiv]].

The video below summarizes the work described above and outlines the [[https://www.linkedin.com/feed/update/urn:li:activity:7031128969005432832/|questions we should always ask]] when considering LLMs for clinical use. //