Can large language models reason about medical questions?

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Standard

Can large language models reason about medical questions? / Liévin, Valentin; Hother, Christoffer Egeberg; Motzfeldt, Andreas Geert; Winther, Ole.

In: Patterns, Vol. 5, No. 3, 100943, 2024.

Harvard

Liévin, V, Hother, CE, Motzfeldt, AG & Winther, O 2024, 'Can large language models reason about medical questions?', Patterns, vol. 5, no. 3, 100943. https://doi.org/10.1016/j.patter.2024.100943

APA

Liévin, V., Hother, C. E., Motzfeldt, A. G., & Winther, O. (2024). Can large language models reason about medical questions? Patterns, 5(3), [100943]. https://doi.org/10.1016/j.patter.2024.100943

Vancouver

Liévin V, Hother CE, Motzfeldt AG, Winther O. Can large language models reason about medical questions? Patterns. 2024;5(3). 100943. https://doi.org/10.1016/j.patter.2024.100943

Author

Liévin, Valentin ; Hother, Christoffer Egeberg ; Motzfeldt, Andreas Geert ; Winther, Ole. / Can large language models reason about medical questions? In: Patterns. 2024 ; Vol. 5, No. 3.

BibTeX

@article{40ccdad79e954ae39463d3587149358e,
  title = "Can large language models reason about medical questions?",
  abstract = "Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.",
  keywords = "GPT-3.5, large language models, Llama 2, machine learning, medical, MedQA, open source, prompt engineering, question answering, uncertainty quantification",
  author = "Valentin Li{\'e}vin and Hother, {Christoffer Egeberg} and Motzfeldt, {Andreas Geert} and Ole Winther",
  note = "Publisher Copyright: {\textcopyright} 2024 The Authors",
  year = "2024",
  doi = "10.1016/j.patter.2024.100943",
  language = "English",
  volume = "5",
  number = "3",
  pages = "100943",
  journal = "Patterns",
  issn = "2666-3899",
  publisher = "Cell Press"
}
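
The abstract above mentions chain-of-thought prompting combined with few-shot examples and ensembling to obtain calibrated answer distributions on multiple-choice medical questions. The following is a minimal, hypothetical Python sketch of that general idea, not the authors' implementation: the prompt template and the names sample_cot, ensemble_answer, and extract_choice are illustrative assumptions, and sample_cot stands in for whatever LLM sampling call the reader has available.

from collections import Counter
import re

# Illustrative sketch of ensembled chain-of-thought (CoT) prompting for a
# multiple-choice question. `sample_cot` is a hypothetical stand-in for an
# LLM sampling call; this is NOT the paper's actual pipeline.

COT_TEMPLATE = (
    "{question}\n"
    "Options:\n{options}\n"
    "Answer: Let's think step by step."
)

def format_prompt(question: str, options: dict[str, str]) -> str:
    # Render the question and lettered options into a single CoT-style prompt.
    opts = "\n".join(f"{key}) {text}" for key, text in options.items())
    return COT_TEMPLATE.format(question=question, options=opts)

def extract_choice(completion: str, options: dict[str, str]) -> str | None:
    # Take the last standalone option letter mentioned in the generated CoT.
    letters = re.findall(r"\b([A-E])\b", completion)
    return letters[-1] if letters and letters[-1] in options else None

def ensemble_answer(question, options, sample_cot, n_samples=5):
    # Sample several CoTs and majority-vote over the extracted answers;
    # the vote frequencies act as a crude distribution over the options.
    votes = Counter()
    for _ in range(n_samples):
        completion = sample_cot(format_prompt(question, options))
        choice = extract_choice(completion, options)
        if choice is not None:
            votes[choice] += 1
    total = sum(votes.values()) or 1
    distribution = {key: votes[key] / total for key in options}
    return max(distribution, key=distribution.get), distribution

# Toy usage with a dummy sampler; replace `dummy_sampler` with a real LLM call.
if __name__ == "__main__":
    dummy_sampler = lambda prompt: "The presentation is classic, so the answer is B."
    options = {"A": "option A", "B": "option B", "C": "option C", "D": "option D"}
    print(ensemble_answer("A 54-year-old patient presents with ...", options, dummy_sampler))

With a real sampler, the relative vote counts give a rough distribution over the answer options, in the spirit of the ensemble predictions described in the abstract.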

RIS

TY - JOUR
T1 - Can large language models reason about medical questions?
AU - Liévin, Valentin
AU - Hother, Christoffer Egeberg
AU - Motzfeldt, Andreas Geert
AU - Winther, Ole
N1 - Publisher Copyright: © 2024 The Authors
PY - 2024
Y1 - 2024
N2 - Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.
AB - Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.
KW - GPT-3.5
KW - large language models
KW - Llama 2
KW - machine learning
KW - medical
KW - MedQA
KW - open source
KW - prompt engineering
KW - question answering
KW - uncertainty quantification
U2 - 10.1016/j.patter.2024.100943
DO - 10.1016/j.patter.2024.100943
M3 - Journal article
AN - SCOPUS:85186698997
VL - 5
JO - Patterns
JF - Patterns
SN - 2666-3899
IS - 3
M1 - 100943
ER -

ID: 385123211