Can large language models reason about medical questions?

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Documents

  • Fulltext

    Publisher's published version, 3.72 MB, PDF document

  • Valentin Liévin
  • Christoffer Egeberg Hother
  • Andreas Geert Motzfeldt
  • Ole Winther
Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Finally, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.
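The CoT-plus-ensemble recipe mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_cot_completion` is a hypothetical stand-in for the actual model call (the paper uses GPT-3.5 and Llama 2), returning canned chains of thought so the example is self-contained. The idea is to sample several CoT completions per question, parse each one's final answer, and take a majority vote.

```python
from collections import Counter

def sample_cot_completion(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for a sampled LLM completion.

    A real implementation would call a language model with a nonzero
    temperature; here we return canned chains of thought for illustration.
    """
    canned = [
        "The presentation and labs point to option B ... answer: (B)",
        "Step by step: ruling out A and D leaves B ... answer: (B)",
        "Considering the mechanism, C seems plausible ... answer: (C)",
    ]
    return canned[seed % len(canned)]

def extract_answer(completion: str) -> str:
    """Parse the final 'answer: (X)' marker from a chain of thought."""
    marker = "answer: ("
    idx = completion.rfind(marker)
    return completion[idx + len(marker)] if idx != -1 else ""

def ensemble_answer(question: str, options: list[str], n_samples: int = 3) -> str:
    """Sample several CoT completions and return the majority-vote answer."""
    prompt = (
        question + "\n" + "\n".join(options) + "\nLet's think step by step."
    )
    votes = Counter(
        extract_answer(sample_cot_completion(prompt, seed))
        for seed in range(n_samples)
    )
    return votes.most_common(1)[0][0]
```

With the canned completions above, two of three samples end in `(B)`, so the vote resolves to `B`; with a real model, the same voting step aggregates genuinely independent samples.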
Original language: English
Article number: 100943
Journal: Patterns
Volume: 5
Issue number: 3
Number of pages: 12
DOI
Status: Published - 2024

Bibliographic note

Funding Information:
We thank OpenAI for granting access to the Codex beta program. We acknowledge the EuroHPC Joint Undertaking for awarding us access to MeluXina at LuxProvide, Luxembourg. V.L.'s work was funded in part by Google DeepMind through a PhD grant. O.W.'s work was funded in part by the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606). V.L., A.G.M., and O.W. acknowledge support from the Pioneer Center for AI, DNRF grant number P1.

Publisher Copyright:
© 2024 The Authors

ID: 385123211