Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases
  1. Daniel Milad1,2,
  2. Fares Antaki1,3,4,
  3. Jason Milad5,
  4. Andrew Farah6,
  5. Thomas Khairy6,
  6. David Mikhail7,
  7. Charles-Édouard Giguère8,
  8. Samir Touma1,2,
  9. Allison Bernstein1,2,
  10. Andrei-Alexandru Szigiato1,9,
  11. Taylor Nayman1,2,
  12. Guillaume A Mullie1,10,
  13. Renaud Duval1,2
  1. Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
  2. Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, Quebec, Canada
  3. Institute of Ophthalmology, University College London, London, UK
  4. CHUM School of Artificial Intelligence in Healthcare (SAIH), Centre Hospitalier de l'Université de Montréal (CHUM), Montreal, Quebec, Canada
  5. Department of Software Engineering, University of Waterloo, Waterloo, Ontario, Canada
  6. Faculty of Medicine, McGill University, Montreal, Quebec, Canada
  7. Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
  8. Centre de recherche de l'Institut universitaire en santé mentale de Montréal, Montreal, Quebec, Canada
  9. Department of Ophthalmology, Hôpital du Sacré-Coeur de Montréal, Montreal, Quebec, Canada
  10. Department of Ophthalmology, Cité-de-la-Santé Hospital, Laval, Quebec, Canada

  Correspondence to Dr Renaud Duval, Department of Ophthalmology, University of Montreal, Montreal, Canada; renaud.duval{at}gmail.com

Abstract

Background/aims This study assesses the proficiency of Generative Pre-trained Transformer-4 (GPT-4) in answering questions about complex clinical ophthalmology cases.

Methods We tested GPT-4 on 422 Journal of the American Medical Association (JAMA) Ophthalmology Clinical Challenges, prompting the model to determine the diagnosis (open-ended question) and to identify the next step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the model's reasoning. We then benchmarked the best-performing prompting strategy against human respondents (board-certified ophthalmologists and trainees).
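For illustration, the sketch below shows how a single zero-shot PS+ query to GPT-4 might be issued through the OpenAI Python API. This is a minimal sketch under stated assumptions: the trigger phrase is paraphrased from the published plan-and-solve+ template (Wang et al, 2023), and `case_text`, `ask_gpt4` and the question wordings are hypothetical stand-ins, not the exact prompts or code used in this study.

```python
# Illustrative sketch only: the exact prompts and parameters used in this
# study are not reproduced here. The PS+ trigger below is paraphrased from
# Wang et al (2023); "case_text" is a hypothetical stand-in for the text of
# one JAMA Ophthalmology Clinical Challenge.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PS_PLUS_TRIGGER = (
    "Let's first understand the problem, extract the relevant clinical "
    "variables, and devise a plan. Then, let's carry out the plan, solve "
    "the problem step by step, and state the final answer."
)

def ask_gpt4(case_text: str, question: str) -> str:
    """Send one zero-shot PS+ query about a clinical case and return the reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic-leaning output for reproducibility
        messages=[
            {"role": "user",
             "content": f"{case_text}\n\n{question}\n\n{PS_PLUS_TRIGGER}"},
        ],
    )
    return response.choices[0].message.content

# Open-ended diagnosis question, then a multiple-choice next-step question.
diagnosis = ask_gpt4(case_text, "What is the most likely diagnosis?")
next_step = ask_gpt4(case_text, "Which of the following is the best next "
                                "step? A) ... B) ... C) ... D) ...")
```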

Results Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI 43.1% to 52.9%) for diagnosis and 63.0% (95% CI 58.2% to 67.6%) for the next step. Next-step accuracy did not differ significantly across subspecialties (p=0.44), but diagnostic accuracy was significantly higher in pathology and tumours than in uveitis (p=0.027). When the diagnosis was correct, 75.2% (95% CI 68.6% to 80.9%) of the next steps were also correct; when the diagnosis was incorrect, 50.2% (95% CI 43.8% to 56.6%) of the next steps were correct. The next step was three times more likely to be correct when the initial diagnosis was correct (p<0.001). Diagnostic accuracy and decision-making did not differ significantly between board-certified ophthalmologists and GPT-4. Among trainees, senior residents outperformed GPT-4 in both diagnostic accuracy (p≤0.001 and p=0.049) and next-step accuracy (p=0.002 and p=0.020).
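As a rough illustration of the reported statistics, the sketch below recomputes a 95% CI for diagnostic accuracy and the diagnosis/next-step association. The cell counts are approximate reconstructions from the published percentages, not the authors' raw data, and the Wilson interval and Fisher's exact test are common choices assumed here; the authors' exact statistical methods may differ.

```python
# Illustrative only: counts are reconstructed approximately from the
# reported percentages (48.0% of 422 cases, etc.); they are not the
# authors' raw data, and the tests below are assumed, not reported.
from statsmodels.stats.proportion import proportion_confint
from scipy.stats import fisher_exact

n_cases = 422
n_correct_dx = round(0.480 * n_cases)  # ~203 correct diagnoses

# 95% CI for diagnostic accuracy (Wilson score interval, one common choice).
low, high = proportion_confint(n_correct_dx, n_cases, alpha=0.05, method="wilson")
print(f"Diagnosis accuracy: {n_correct_dx / n_cases:.1%} "
      f"(95% CI {low:.1%} to {high:.1%})")

# 2x2 table: next-step accuracy stratified by whether the diagnosis was correct.
# Rows: diagnosis correct / incorrect; columns: next step correct / incorrect.
table = [[153, 50],   # ~75.2% of 203 diagnosed-correct cases
         [110, 109]]  # ~50.2% of 219 diagnosed-incorrect cases
odds_ratio, p_value = fisher_exact(table)
print(f"Odds ratio ~ {odds_ratio:.2f}, p = {p_value:.2g}")  # roughly threefold
```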

Conclusion Improved prompting enhances GPT-4’s performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.

Data availability statement

Data are available on reasonable request. All data produced in the present study are available on reasonable request to the authors. JAMA Ophthalmology's Clinical Challenges are proprietary and are distributed through the journal's website.


Footnotes

  • Twitter @DanielMiladMD, @FaresAntaki

  • Contributors Conception and design of the study (DMilad, FA and RD); data collection (DMilad, FA, JM, AF, TK, DMikhail, ST, AB, A-AS, TN and GAM); data analysis (DMilad, FA, JM, DMikhail and C-EG); writing of the manuscript and preparation of figures (DMilad, FA, JM and DMikhail); supervision (RD); review and discussion of the results (all authors); editing and revision of the manuscript (all authors). DMilad and RD act as guarantors and accept full responsibility for the work and/or the conduct of the study, had access to the data, and controlled the decision to publish.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.