Can large language models reason about medical questions?
Research output: Contribution to journal › Journal article › Research › peer-review
Standard
Can large language models reason about medical questions? / Liévin, Valentin; Hother, Christoffer Egeberg; Motzfeldt, Andreas Geert; Winther, Ole.
In: Patterns, Vol. 5, No. 3, 100943, 2024.
RIS
TY - JOUR
T1 - Can large language models reason about medical questions?
AU - Liévin, Valentin
AU - Hother, Christoffer Egeberg
AU - Motzfeldt, Andreas Geert
AU - Winther, Ole
N1 - Publisher Copyright: © 2024 The Authors
PY - 2024
Y1 - 2024
N2 - Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.
KW - GPT-3.5
KW - large language models
KW - Llama 2
KW - machine learning
KW - medical
KW - MedQA
KW - open source
KW - prompt engineering
KW - question answering
KW - uncertainty quantification
DO - 10.1016/j.patter.2024.100943
M3 - Journal article
AN - SCOPUS:85186698997
VL - 5
JO - Patterns
JF - Patterns
SN - 2666-3899
IS - 3
M1 - 100943
ER -