Article Text

Clinical science
Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering
  1. Fares Antaki1,2,3,4,5,
  2. Daniel Milad4,5,6,
  3. Mark A Chia1,2,
  4. Charles-Édouard Giguère7,
  5. Samir Touma4,5,6,
  6. Jonathan El-Khoury4,5,6,
  7. Pearse A Keane1,2,8,
  8. Renaud Duval4,6
  1. 1 Moorfields Eye Hospital NHS Foundation Trust, London, UK
  2. 2 Institute of Ophthalmology, UCL, London, UK
  3. 3 The CHUM School of Artificial Intelligence in Healthcare, Montreal, Quebec, Canada
  4. 4 Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
  5. 5 Department of Ophthalmology, Centre Hospitalier de l'Université de Montréal (CHUM), Montreal, Quebec, Canada
  6. 6 Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, Quebec, Canada
  7. 7 Institut universitaire en santé mentale de Montréal (IUSMM), Montreal, Quebec, Canada
  8. 8 NIHR Moorfields Biomedical Research Centre, London, UK
  1. Correspondence to Dr Renaud Duval; renaud.duval@gmail.com; Mr Pearse A Keane; p.keane@ucl.ac.uk

Abstract

Background Evidence on the performance of Generative Pre-trained Transformer 4 (GPT-4), a large language model (LLM), in the ophthalmology question-answering domain is needed.

Methods We tested GPT-4 on two 260-question multiple choice question sets from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions question banks. We compared the accuracy of GPT-4 models with varying temperatures (creativity setting) and evaluated their responses in a subset of questions. We also compared the best-performing GPT-4 model to GPT-3.5 and to historical human performance.

Results GPT-4-0.3 (GPT-4 with a temperature of 0.3) achieved the highest accuracy among GPT-4 models, with 75.8% on the BCSC set and 70.0% on the OphthoQuestions set. The combined accuracy was 72.9%, which represents an 18.3% raw improvement in accuracy compared with GPT-3.5 (p<0.001). Human graders preferred responses from models with a temperature higher than 0 (more creative). Exam section, question difficulty and cognitive level were all predictive of GPT-4-0.3 answer accuracy. GPT-4-0.3’s performance was numerically superior to human performance on the BCSC (75.8% vs 73.3%) and OphthoQuestions (70.0% vs 63.0%), but the difference was not statistically significant (p=0.55 and p=0.09, respectively).

Conclusion GPT-4, an LLM trained on non-ophthalmology-specific data, performs significantly better than its predecessor on simulated ophthalmology board-style exams. Remarkably, its performance tended to be superior to historical human performance, but that difference was not statistically significant in our study.

  • Medical Education

Data availability statement

Data may be obtained from a third party and are not publicly available. All data produced in the present study are available on reasonable request to the authors. The specific question sets are the property of the BCSC Self-Assessment Program and OphthoQuestions and cannot be shared.

WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Large language models (LLMs) are a novel type of artificial intelligence algorithm that can generate text after being trained on large amounts of unlabelled data. Generative Pre-trained Transformer 4 (GPT-4) is a popular LLM that has shown impressive accuracy in answering general medicine questions but has not yet been extensively evaluated for its test-taking ability in ophthalmology.

WHAT THIS STUDY ADDS

  • Our study reports the accuracy of GPT-4 on questions from the Basic and Clinical Science Course Self-Assessment Program and the OphthoQuestions online question banks. We provide insights on ideal model settings (temperature/creativity) and compare the best model to GPT-3.5 and historical human performance.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • Our study provides evidence on the capabilities of LLMs in our specialty. We show that GPT-4, despite being a general-purpose model that has not been fine-tuned for ophthalmology, performs better than GPT-3.5 and does not differ significantly from the average human trainee when answering board-style questions.

Introduction

Over the past months, natural language processing—a specialisation of artificial intelligence (AI)—has gained substantial attention in academia and in the press due to the release of so-called ‘foundation models’.1 Foundation models represent a novel paradigm for building AI systems: they are pretrained at scale on billions of unannotated, multimodal data points in a self-supervised manner and then fine-tuned for specific tasks through transfer learning.1 2 Large language models (LLMs) are fine-tuned foundation models that are trained on vast text corpora from the internet and that can generate responses in natural language.3 Their training objective is to minimise the discrepancy between the predicted word and the actual word within their training dataset.4 Following successful training, the model can generate new text: given an initial prompt, it predicts the subsequent words based on statistical patterns learnt from its training data. Two prominent examples of such models are OpenAI’s Generative Pre-trained Transformer (GPT) and Google’s Pathways Language Model (PaLM). Both LLMs were trained on multilingual text data from the internet and can generate human-like text, perform advanced reasoning and generate code.5 6

There has been a growing interest in exploring the potential of LLMs in medicine. A first step in evaluating their medical-domain capabilities has been to explore the challenging task of answering medical questions. This task necessitates comprehension of medical context, recall of medical knowledge and reasoning—a skill set that requires years of training and hands-on experience to master.7 In December 2022, Singhal et al demonstrated state-of-the-art performance of Flan-PaLM in responding to US Medical Licensing Examination (USMLE) style questions, reaching 67.6% accuracy.8 Less than 5 months later, in May 2023, they reported an accuracy of 86.5% on the same dataset with Med-PaLM 2, marking a 19% improvement over its predecessor.9 Comparably rapid and substantial improvements in performance were reported by OpenAI when GPT-4 was introduced. GPT-4 performed significantly better than GPT-3.5 on numerous academic benchmarks, exhibiting human-level performance.4

In January 2023, we reported the first results on the performance of LLMs in the ophthalmology question-answering space. We showed that ChatGPT (using GPT-3.5) answered questions from the American Academy of Ophthalmology (AAO)’s Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question banks with improving accuracy, reaching 59.4% and 49.2%, respectively.10 Since our initial report, numerous studies have expanded on our findings, reporting equivalent or superior performance of various LLMs over GPT-3.5 on a variety of ophthalmic question banks.11–14

In this study, we investigate the accuracy of GPT-4 on the BCSC and OphthoQuestions datasets. We generated responses at different ‘temperature’ settings, controlling the entropy or creativity of GPT-4, with the primary aim of identifying the optimal setting for question-answering within ophthalmology. This included both a quantitative and qualitative analysis of GPT-4 responses through physician rating of answers. Then, we compared the best GPT-4 model to GPT-3.5 and contextualised our findings with historical human performance data.

Methods

Exploring BCSC and OphthoQuestions

In January 2023, after obtaining written permission from the AAO, we randomly sampled 260 questions from a pool of 4458 available in the BCSC Self-Assessment Program. Alongside this, we drew an additional 260 questions from a total of 4539 questions available on OphthoQuestions (www.ophthoquestions.com). We chose to use only questions that did not incorporate visual data, such as clinical, radiological or graphical images, as the GPT-4 model we used was unable to process this kind of data. Although GPT-4 does have image processing capabilities, this feature was not publicly available at the time of writing in July 2023.15 We drew 20 random questions from each of the 13 ophthalmology subspecialties, as categorised by the BCSC curriculum.16
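
To make this sampling step concrete, the sketch below shows one way such a stratified draw could be performed in R. It is illustrative only: the data frame and variable names (question_pool, section, has_image) are assumptions for illustration, not the authors' actual code or data structures.

```r
# Illustrative sketch: draw 20 image-free questions at random from each of the
# 13 BCSC subspecialty sections (20 x 13 = 260 questions per question bank).
# 'question_pool', 'section' and 'has_image' are hypothetical names.
library(dplyr)
set.seed(2023)

sampled_bcsc <- question_pool %>%
  filter(!has_image) %>%     # exclude questions with clinical, radiological or graphical images
  group_by(section) %>%      # 13 subspecialty sections, as categorised by the BCSC curriculum
  slice_sample(n = 20) %>%   # 20 random questions per section
  ungroup()
```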

Our prior publication thoroughly outlines the features of the BCSC and OphthoQuestions test sets, including question distribution by examination section, cognitive level and difficulty.10 We labelled the questions by cognitive level (high or low) and question difficulty. Low-level questions focused on fact recall, while high-level questions assessed data interpretation and patient management. A difficulty index was derived for each question, indicating the percentage of human users of the respective question bank who answered it correctly.15 Due to the similar distribution of questions in both sets, we combined them for subsequent statistical analyses.

Accessing GPT-4 through the Application Programming Interface

ChatGPT (OpenAI, San Francisco) is a chatbot application that was originally based on a fine-tuned model from the GPT-3.5 series called ‘gpt-3.5-turbo’.17 In March 2023, OpenAI released GPT-4, a new-generation LLM exhibiting human-level performance on various academic benchmarks, surpassing GPT-3.5.4 GPT-4 became available to the public through a limited research preview on the ChatGPT application and through the Application Programming Interface (API). We gained early access to GPT-4 via its API and used it in this research. Using GPT-4 through the API grants unrestricted access to the model and ensures data privacy, as the data are not used to improve GPT-4—a contrast to the research preview available on the ChatGPT application. Moreover, it facilitates integration with other software such as Google Sheets, enabling mass prompting and automation.
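
As an illustration of this kind of API-based access, the minimal R sketch below sends a single prompt to the chat completions endpoint. This is not the authors' actual tooling (they describe integration with Google Sheets); the endpoint, request fields and response structure follow OpenAI's public API documentation, and the helper name ask_gpt4 is our own.

```r
# Minimal sketch of a single GPT-4 API call (not the authors' actual pipeline).
library(httr)

ask_gpt4 <- function(prompt, temperature = 0.3,
                     api_key = Sys.getenv("OPENAI_API_KEY")) {
  resp <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    body = list(
      model       = "gpt-4",
      temperature = temperature,                              # creativity setting, 0 to 1
      messages    = list(list(role = "user", content = prompt))
    ),
    encode = "json"
  )
  content(resp)$choices[[1]]$message$content                  # the model's text response
}
```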

Adjusting GPT-4’s temperature

GPT-4 is probabilistic by design, which means it can produce varying responses when given identical prompts. The degree of this variability can be manipulated via the ‘temperature’ parameter. The ideal temperature setting depends on the specific use case and is often determined a priori based on an educated guess.4 It is generally understood that a temperature of 0 yields coherent and conservative results, while a temperature of 1 fosters high creativity at the expense of coherence. To our knowledge, the ideal temperature for GPT-4 has not yet been defined in the realm of ophthalmology question-answering. Consequently, we decided to identify the optimal temperature for our use case by testing GPT-4 at four distinct temperature settings. For ease of reference in this paper, we will label these as GPT-4-0 for temperature 0, GPT-4-0.3 for 0.3, GPT-4-0.7 for 0.7 and GPT-4-1 for temperature 1.
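
Reusing the hypothetical ask_gpt4() helper sketched above, generating responses at the four temperature settings amounts to a simple sweep; the labels mirror those used in this paper.

```r
# Query the same prompt at the four temperature settings compared in this study.
temperatures <- c(0, 0.3, 0.7, 1)
responses <- lapply(temperatures, function(t) ask_gpt4(prompt, temperature = t))
names(responses) <- paste0("GPT-4-", temperatures)   # GPT-4-0, GPT-4-0.3, GPT-4-0.7, GPT-4-1
```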

Human evaluation of GPT-4 responses

We carried out human evaluations of long-form responses produced by the GPT-4 models with different temperatures. We randomly sampled 50 questions, with 25 each from BCSC and OphthoQuestions, without controlling for difficulty index, cognitive level, exam section or response accuracy. Our three raters consisted of a recently board-certified ophthalmologist who excelled on OphthoQuestions (top 10 on the leaderboard), and two ophthalmology residents from Canada in their third and fourth years of training. In line with the approach proposed by Singhal et al, we directed our raters to rank the model responses based on alignment with medical consensus, knowledge recall, inclusion of irrelevant content and omission of important information.9 These factors were not judged individually; instead, the raters assigned a comprehensive rank considering all of these domains, with the freedom to weigh them as they deemed appropriate.

Formatting questions and zero-shot prompting

We maintained the original multiple-choice format of questions with one correct answer and three incorrect options (distractors). We employed a zero-shot approach for the lead-in prompt, as in our previous study, because this technique most closely mirrors human test-taking.8 We used the prompt ‘please select the correct answer and provide an explanation’ followed by the question and answer options (figure 1).
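
The sketch below assembles such a zero-shot prompt from a question stem and its four options. The lead-in instruction is quoted from this paper; the option lettering, the format_question name and the sample question are our own assumptions for illustration and are not drawn from the proprietary question banks.

```r
# Illustrative zero-shot prompt construction (stem plus four options, no worked examples).
format_question <- function(stem, options) {
  paste0(
    "Please select the correct answer and provide an explanation.\n\n",
    stem, "\n",
    paste(paste0(LETTERS[seq_along(options)], ") ", options), collapse = "\n")
  )
}

# Hypothetical example, not taken from BCSC or OphthoQuestions.
prompt <- format_question(
  stem    = "Which cranial nerve innervates the lateral rectus muscle?",
  options = c("Oculomotor nerve", "Trochlear nerve", "Abducens nerve", "Trigeminal nerve")
)
```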

Figure 1

Example of GPT-4-0.3’s correct response to this question from the neuro-ophthalmology section of the OphthoQuestions dataset. For reference, this question is considered high cognitive level and of easy difficulty (88% of humans answered correctly). GPT-4, Generative Pre-trained Transformer 4.

Historical human performance on BCSC and OphthoQuestions

To contextualise the performance of GPT-4, we gathered historical data on human performance for each of the datasets. This information was provided per section and reflected average human performance across all 4458 BCSC and 4539 OphthoQuestions questions. However, these figures do not represent the average accuracies for the sample exams of 520 questions that were used to evaluate GPT-4’s performance, as these specific data are not available. The BCSC platform offered average peer scores, but these did not include a breakdown by year of training or any historical data. OphthoQuestions provided historical data matched to the user’s year of training. The mean accuracies were computed using data from three sequential years of training: the first year (2019–2020), the second year (2020–2021) and the third year (2021–2022). Considering this limitation, we decided to adopt a cautious analysis strategy, opting not to establish a non-inferiority threshold. Instead, we assessed whether the performance of GPT-4 differed from that of humans.

Statistical analysis

We determined accuracy by comparing GPT-4 answers to the answer key provided by the question banks. For each GPT-4 model, accuracy was determined using a single run, as we have previously shown substantial to almost perfect repeatability of GPT-3.5.10 To compare answer accuracy across the different models, we employed generalised estimating equations (GEE) with an exchangeable correlation structure and a binomial distribution with a logit link. Since the models were tested on the same questions, we used the geepack package to allow modelling of correlated data. When we found significant effects, we performed post hoc analyses and applied Tukey corrections to the p values. For the human evaluation of GPT-4, we measured rater agreement using Kendall’s W. We analysed clinician ratings across the different GPT-4 models with analysis of variance and adjusted using Tukey’s method for post hoc analyses. We used logistic regression to study the influence of exam section, cognitive level and difficulty on model accuracy. Given that we are dealing with a dichotomous outcome (correct or incorrect answers), we present our results in the form of area under the receiver operating characteristic curve (AUC). ORs could not be used to assess the importance of each variable because we were dealing with both categorical (exam section, cognitive level) and continuous (difficulty index) variables, which makes them non-comparable. We employed Tukey’s test to evaluate the effect of each variable while controlling for the others. Lastly, we carried out a meta-analysis to compare the best-performing GPT-4 model with historical human data, with adjustments performed using the metafor package. We used R V.4.3.1 for all analyses, with a 5% alpha level.
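
A minimal sketch of the primary accuracy comparison is shown below. It assumes a long-format data frame with one row per question per model, a binary correct outcome and a question_id grouping variable; the geeglm() call mirrors the geepack specification described above, while the emmeans post hoc step and the variable names are our assumptions about one plausible implementation rather than the authors' actual code.

```r
# Sketch of the primary accuracy comparison with generalised estimating equations.
# 'results_long', 'correct', 'model' and 'question_id' are assumed names.
library(geepack)
library(emmeans)

fit <- geeglm(correct ~ model,
              id     = question_id,                 # the same questions are answered by every model
              data   = results_long,
              family = binomial(link = "logit"),
              corstr = "exchangeable")
summary(fit)

# Post hoc pairwise contrasts between models with Tukey-adjusted p values.
emmeans(fit, pairwise ~ model, adjust = "tukey")

# The other analyses described above could follow a similar pattern, eg,
# irr::kendall() for rater agreement (Kendall's W) and metafor::rma() for the
# comparison with historical human performance.
```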

Results

Model temperature does not impact overall accuracy or section performance

Among the GPT-4 models with various temperatures, GPT-4-0.3 achieved the highest combined accuracy. It reached 72.9%, with 75.8% accuracy on the BCSC set and 70.0% on the OphthoQuestions set. Comparatively, the lowest overall accuracy was achieved by GPT-4-0, scoring 71.7%. The maximum difference in overall performance between the best and worst performing models was marginal (1.2%), which is equivalent to 6 questions on the 520-question set. There was no statistically significant difference between the GPT-4 models (p=0.49). The results are summarised in table 1 and online supplemental table 1. There were anecdotal performance variations across different exam sections for each GPT-4 model, as seen in figure 2, but these differences did not reach statistical significance (p=0.27) (online supplemental figure 1).

Supplemental material

Table 1

Comparison of GPT-4 model results at different temperatures

Figure 2

Performance comparison of GPT models across exam sections for BCSC and OphthoQuestions. This heatmap provides a colour-coded representation of the performance scores of the various GPT models with varying temperatures across different examination sections and question banks. The scores (percentages) are represented as integers annotated within each cell, and the colours vary from light yellow to dark purple, with lighter colours representing higher performance scores according to the viridis colour palette. BCSC, Basic and Clinical Science Course; GPT, Generative Pre-trained Transformer.

Human raters preferred more probabilistic (creative) models

Our three human raters were in substantial agreement, with a Kendall’s W of 0.744 (95% CI 0.519 to 0.804), as illustrated in online supplemental figure 2. The mean rankings for the different GPT-4 models were as follows: GPT-4-0 ranked 3.4 (±0.7), GPT-4-0.3 ranked 2.4 (±0.8) and both GPT-4-0.7 and GPT-4-1 ranked 2.1, with SDs of 0.8 and 0.9, respectively. Based on the mean rankings, GPT-4-0 was least preferred compared with all other GPT-4 models (p<0.001). There were no statistically significant differences in ranking between the remaining GPT-4-0.3, GPT-4-0.7 and GPT-4-1 models (online supplemental table 2). An example of different GPT-4 responses and the preferred ranking is shown in figure 3.

Figure 3

Illustrative example comparing responses and rankings across different GPT-4 models. This figure presents an example of responses generated by the GPT-4 models to a question from the neuro-ophthalmology section of OphthoQuestions. This question was of moderate difficulty, with a 67% correct response rate among human responders on OphthoQuestions. All four models provided the correct answer. The response of GPT-4-1 was favoured by clinicians (ranked #1). The GPT-4-1 response is notable for its structured layout: it explicitly states the diagnosis and presents the pathophysiology (mentioning the important ‘Guillain-Mollaret’ triangle) before discussing a potential aetiology and clinical findings. However, the explanation is inaccurate (a hallucination), as this type of nystagmus is typically described as ‘pendular and vertical’ rather than horizontal. GPT-4, Generative Pre-trained Transformer 4.

GPT-4-0.3 outperforms GPT-3.5

In all exam sections and across various temperature settings, GPT-4’s performance was either on par with or exceeded that of GPT-3.5. The shift from darker to lighter colours on the heatmap in figure 2 demonstrates this superior performance. There were two exceptions: the glaucoma section (BCSC) during the GPT-4-0.7 run and the clinical optics section (OphthoQuestions) during the GPT-4-1 run. GPT-4-0.3 outperformed GPT-3.5 by a statistically significant margin, presenting an 18.3% improvement in raw accuracy (p<0.001). There were improvements in multiple exam sections, particularly in lens and cataract, oculofacial plastic and orbital surgery, and retina and vitreous, as seen in online supplemental table 3.

GPT-4-0.3’s accuracy depends on exam section, cognitive level and question difficulty

Taking the datasets together, GPT-4-0.3 performed best in retina and vitreous (85%), general medicine (82.5%) and lens and cataract (82.5%), but not as well in paediatrics and strabismus (62.5%), glaucoma (62.5%) and clinical optics (60%). Answer accuracy was most dependent on question difficulty (AUC=0.69), followed by exam section (AUC=0.60) and cognitive level (AUC=0.56), as seen in online supplemental figure 3. We also found that accuracy improved with increased difficulty index (easier questions) while controlling for the examination section and cognitive level. Similar effects were seen for cognitive level (better performance on low cognitive level questions) when controlling for the other two factors. On post hoc analyses, while controlling for question difficulty and cognitive level, there were significant differences in performance between numerous exam sections as seen in online supplemental figure 4. For example, compared with its strongest performance in retina and vitreous, GPT-4-0.3 performed significantly worse in paediatrics and strabismus (p=0.017), oculoplastics (p=0.045) and glaucoma (p=0.002).

GPT-4-0.3’s accuracy is not different from human-level performance

We compared the performance of GPT-4 to historical human performance as reported in the BCSC and OphthoQuestions platforms (online supplemental table 4). The mean accuracy of the GPT-4-0.3 model exceeded historical human averages on both the BCSC (75.8% vs 73.3%) and OphthoQuestions (70.0% vs 63.0%) datasets. As depicted in figure 4, GPT-4-0.3 tended to exhibit superior performance compared with human responders, although this varied across different sections. However, an effect size analysis showed no statistically significant difference in performance between GPT-4-0.3 and historical human performance for each of BCSC (p=0.55) and OphthoQuestions (p=0.09). The analyses are shown in online supplemental figure 5.

Figure 4

Performance of GPT-3.5 and GPT-4 compared with historical human performance data. GPT-4-0.3 significantly outperforms GPT-3.5 overall (p<0.001) and on most examination sections. While it exceeds historical human performance in some sections, there was no significant difference in performance between GPT-4-0.3 and humans for each of BCSC (p=0.55) and OphthoQuestions (p=0.09), and overall (p=0.10). BCSC, Basic and Clinical Science Course; GPT, Generative Pre-trained Transformer.

Discussion

In a previous study, we curated two question datasets from the BCSC and OphthoQuestions to test GPT-3.5 and examine determinants of its performance in the ophthalmology question-answering domain.10 In this study, we evaluated an updated iteration of GPT, specifically GPT-4, across a spectrum of temperature settings ranging from 0 to 1, which control the creativity of model responses.

The GPT-4 model with a temperature of 0.3 (GPT-4-0.3) had the highest numeric accuracy, but there were no statistically significant differences between GPT-4 models with different temperatures. GPT-4-0.3 had 75.8% accuracy on the BCSC set and 70.0% on the OphthoQuestions set, with a combined overall accuracy of 72.9%. GPT-4-0.3 performed similarly to the other GPT-4 models across different exam sections. To our knowledge, the optimal temperature setting for question-answering in ophthalmology has not been established. The GPT-4 technical report mentions the use of a 0.3 temperature for multiple-choice questions and a 0.6 temperature for free-response questions, although the authors clarify that this is merely their best estimation.4 We found that human raters preferred responses from models operating at temperatures of 0.3, 0.7 or 1 over those from a temperature of 0. We speculate that these responses were favoured because they are more creative and possibly draw on a wider range of knowledge, making them more useful for learning than rigid responses. However, creative abilities in LLMs can lead to ‘hallucinations’: model responses that sound plausible but are factually inaccurate.18 As such, we believe that a temperature setting between 0.3 and 0.7 is a safer choice than 1 for medical question-answering. Nevertheless, more comprehensive experiments are necessary to confirm this with certainty.

LLMs are prone to hallucinations for a variety of reasons, including the potential presence of inaccuracies in the training corpora and mismatches between the prompted topic and the availability of training data.19 These can lead the model to generate responses that prioritise dialogue flow over factual accuracy, in an effort to be engaging and to ‘please’ the user.18 20 In practice, hallucinations can be hard to detect due to the phenomenon of ‘automation bias’.21 We illustrate such an example in figure 3, where our graders preferred the more creative response, effectively overlooking a hallucination about the direction of the nystagmus in oculopalatal myoclonus. Automation bias, or over-reliance on AI systems, is more frequently seen in situations involving complex tasks and high workloads.21 In our study, the grading task was multidimensional and required the graders to weigh multiple factors at once. This may have led them to overlook hallucinations in certain cases. Hallucinations are especially hard to detect when they occur within a context of otherwise accurate information, as we demonstrate in figure 3. Various approaches have been proposed to mitigate the occurrence of hallucinations; these can be categorised into model-level techniques, applied during training or fine-tuning, and enhanced prompting strategies, such as chain-of-thought prompting.19 22 For clinicians, being cognizant of automation bias can facilitate the detection of erroneous AI outputs.23

GPT-4-0.3 (72.9%) showed an 18.3% improvement over GPT-3.5 (54.6%) when tested on the same dataset. This improvement was statistically significant. A similar gain was reported for Med-PaLM 2 when tested on USMLE-style general medicine questions, with a 19% improvement over its predecessor Med-PaLM.9 Improvements of a similar magnitude have been reported in the ophthalmology question-answering literature. Mihalache and colleagues evaluated the performance of the research preview of GPT-4 (ChatGPT) on a small sample dataset from OphthoQuestions, finding an impressive jump from 58% with GPT-3.5 to 84% with GPT-4.11 Their reported accuracy exceeds our findings with OphthoQuestions (70%). However, since Mihalache et al used public domain questions from the OphthoQuestions free trial, these questions might have contaminated ChatGPT’s training data, leading to an overestimation of accuracy.24 Similarly, Teebagy et al saw an improvement of 24%, from 57% to 81%, when assessing the BCSC question set using GPT-4.13 Meanwhile, Cai et al reported similar results to ours for GPT-3.5 (58.8%) and GPT-4 (71.6%) when using BCSC questions.25 The discrepancies in reported GPT-4 accuracy could be attributed to differences in the sample datasets (different question difficulties and cognitive level distributions), or even to inherent variability in model performance. Indeed, there have been reports of inconsistent behaviour of GPT-4 over time.26 This raises crucial questions regarding the reproducibility of results from LLMs and issues related to their integration in clinical workflows, particularly if their performance is unpredictable.

We found that GPT-4’s answer accuracy depends first on question difficulty, followed by exam section and cognitive level. Simpler, low cognitive level questions—those akin to recall tasks—yield better performance than complex, clinical decision-making ones. While this observation might seem intuitive, its empirical demonstration through our experiments is important. This suggests that GPT-4’s current strength lies in memorisation-based questions, hinting at limitations in advanced reasoning. Our study also found performance variations across ophthalmic subspecialties such as glaucoma and ocular oncology, even after controlling for question difficulty and cognitive level. These variations could be attributed either to study-level factors related to the BCSC and OphthoQuestions datasets, or more broadly to the ‘intelligence’ of GPT-4. From a dataset perspective, divergent writing styles across subspecialty clinical vignettes, each curated by a unique group of domain experts or editors, could explain some performance differences. Likewise, for questions with a similar cognitive level, some subspecialties may favour single-step decision-making (eg, clinical medicine), as opposed to others that necessitate simultaneous recall and decision-making (eg, clinical optics). From a model perspective, such discrepancies reinforce the notion that, although LLMs are trained on broad corpora of text, their knowledge representation might not uniformly cover all domain subspecialties. In ophthalmology, this could be attributed to factors such as the volume of learning materials available online for specific topics (a factor of disease prevalence), the frequency of publications and other related metrics.

GPT-4-0.3 surpassed historical human scores on the BCSC and OphthoQuestions datasets, individually and combined, with variations observed in different exam sections. This result is important. Yet, when analysing effect size, the difference was not statistically significant. Such analysis is vital to contextualise GPT-4’s performance, especially given its high accuracy. We believe that our analysis provides compelling evidence that GPT-4’s performance is on par with human performance in the ophthalmology question-answering domain. To determine this, we used the aggregate performance of trainees on question banks as a proxy for human-level performance. This method is appealing because the averages we reference come from the experience of thousands of international ophthalmology trainees (residents and fellows), obtained across multiple years and averaged over more than 8000 questions. Nonetheless, given that these questions are intended as an educational tool, trainees might not perform at their best on those question banks, potentially lowering the average scores. This is balanced out by the likelihood that users, when revisiting questions they have seen before, will answer them correctly.

Despite our encouraging results, we emphasise that they should not be interpreted as suggesting that GPT-4 operates at the same overall proficiency as a human ophthalmologist. Just as with human trainees, performance on multiple-choice examinations (or exam scores) does not dictate overall clinical competency. Such metrics overlook crucial physician competencies, such as communication, professionalism and collaboration.27 While our study reports on the accuracy of GPT-4 in ophthalmology question-answering, it does not delve into the real-world clinical implications of GPT-4 or LLMs in general. To truly demonstrate their clinical benefits, we need to use holistic indicators that matter in healthcare, such as multimodality, applicability and cost-effectiveness.28 In ophthalmology, multimodality is crucial, as we frequently depend on imaging data for patient diagnosis and monitoring. LLMs that are integrated into systems equipped for multimodal inputs are poised to be the most beneficial. Over time, building specialised in-domain training for LLMs could prove valuable, and the development of bespoke foundation models for ophthalmology would be optimal.29

Data availability statement

Data may be obtained from a third party and are not publicly available. All data produced in the present study are available on reasonable request to the authors. The specific question sets are the property of the BCSC Self-Assessment Program and OphthoQuestions and cannot be shared.

Ethics statements

Patient consent for publication

Acknowledgments

We are deeply grateful to the American Academy of Ophthalmology for generously granting us permission to use the underlying BCSC Self-Assessment Program materials.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • PAK and RD are joint senior authors.

  • X @FaresAntaki, @pearsekeane

  • Contributors Overall responsibility/guarantor (RD); conception and design of the study (FA, MAC, PAK and RD); data collection (FA, DM, ST and JE-K); data analysis (FA and C-EG); writing of the manuscript and preparation of figures (FA and MAC); supervision (PAK and RD); review and discussion of the results (all authors); editing and revision of the manuscript (all authors).

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
