Abstract
Background/aims This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases.
Methods We tested GPT-4 on 422 Journal of the American Medical Association Ophthalmology Clinical Challenges, and prompted the model to determine the diagnosis (open-ended question) and identify the next step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the reasoning of the model. We compared the best-performing model to human graders in a benchmarking effort.
Results Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI (43.1% to 52.9%)) and 63.0% (95% CI (58.2% to 67.6%)) in diagnosis and next step, respectively. Next-step accuracy did not significantly differ by subspecialty (p=0.44). However, diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI (68.6% to 80.9%)) of the next steps were correct. Conversely, when the diagnosis was incorrect, 50.2% (95% CI (43.8% to 56.6%)) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed in diagnostic accuracy and decision-making between board-certified ophthalmologists and GPT-4. Among trainees, senior residents outperformed GPT-4 in diagnostic accuracy (p≤0.001 and p=0.049) and in next-step accuracy (p=0.002 and p=0.020).
Conclusion Improved prompting enhances GPT-4’s performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Clinicians are exploring the use of large language models (LLMs) like Generative Pre-trained Transformer (GPT) to improve diagnostic accuracy and clinical decision-making in medicine, notably in ophthalmology. Studies show that GPT-4 outperforms previous models in ophthalmology question banks, but its text generation method reveals limitations in critical thinking. Early research using ophthalmology case reports suggests a high agreement between LLMs and experts, yet the application of LLMs in a large set of ophthalmology clinical challenges remains unexplored.
WHAT THIS STUDY ADDS
This study assesses GPT-4’s performance on ophthalmological cases featured in the Journal of the American Medical Association Ophthalmology Clinical Challenges section, showcasing its diagnostic and decision-making capabilities. It also evaluates the efficacy of various prompting strategies and positions GPT-4’s performance in relation to ophthalmology trainees.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
This study underscores the potential of LLMs within ophthalmology, suggesting a future where AI complements clinical expertise. By demonstrating that GPT-4 can achieve commendable performance in complex ophthalmology cases, this study may catalyse the discussion on integrating AI in clinical decision support systems and encourage policy frameworks that facilitate the responsible deployment of LLMs in patient care.
Introduction
Globally, clinicians and scientists alike are contemplating the potential uses of large language models (LLMs) in improving diagnostic precision and supporting clinical decision-making processes.1 LLMs, which represent fine-tuned foundation models trained on large datasets, can produce coherent text and demonstrate complex reasoning capabilities.2–5 Generative Pre-trained Transformer (GPT)-4 currently sets the industry standard in the LLM domain, showing considerable improvements over its predecessors in the medical domain.4 Notably, GPT-4’s diagnostic and clinical decision-making abilities seem to be enhanced as it continues to learn.5
In ophthalmology, our group has previously studied the performance of GPT in medical question answering. We have shown that GPT-4 can achieve an accuracy of 72.9% on large ophthalmology question banks, outperforming GPT-3.5 by 18.3%.6 7 Since our original work, numerous subsequent studies have corroborated our findings.6–11 We have also shown that GPT-4 performs better on recall questions than on those involving clinical decision-making.7 Thus, the ability of LLMs to engage in true critical thinking, beyond simply generating text by predicting the next most probable word, or ‘token’, remains to be determined.5
Evaluating the performance of LLMs on diagnosing case reports from the literature may be useful to determine how well they can handle complex, real-world medical cases. To date, only a handful of studies have examined this in ophthalmology, with sample sizes between 11 and 22 cases covering neuro-ophthalmology, glaucoma and cornea.12–14 These initial findings indicate a high level of agreement between LLMs and experts, highlighting a potential role for LLMs in clinical decision-making.
In this work, we explore the performance of GPT-4 in answering questions about complex ophthalmological cases published in the Journal of the American Medical Association (JAMA) Ophthalmology Clinical Challenges section. These reports represent challenging ophthalmological cases, where clinicians attempt to determine the diagnosis (open-ended) and the best next diagnostic or treatment step (multiple-choice question). We explore multiple prompting strategies to enhance the model’s performance. We then compare this performance to the accuracy of ophthalmology trainees as a benchmark.
Materials and methods
JAMA Ophthalmology’s Clinical Challenges
In July 2023, GPT-4 was prompted using 422 case studies from JAMA Ophthalmology’s Clinical Challenges section. These case studies were designed to assess both diagnostic prediction and identification of the best next step using a multiple-choice question. The challenges were classified into 1 of the 13 ophthalmology subspecialties, as categorised by the American Academy of Ophthalmology in their Basic and Clinical Science Course.15 The study title, case and figure descriptions were provided to GPT-4 and human graders. Figures were excluded as GPT-4 could not process images at the time of writing (August 2023). Discussions were also excluded to avoid data leakage, as the answers were often revealed in this section.
GPT-4 access and parameters
We accessed GPT-4, OpenAI’s latest LLM, using the Application Programming Interface (API).4 This allowed us to design customised automated mass prompting techniques using Google Sheets. The API, unlike the ChatGPT web application, guarantees data privacy by not using user data to enhance the GPT model. Furthermore, GPT-4’s ‘temperature’, referring to the degree of randomness in its responses when given identical prompts, was set to 0.3. The temperature scale goes from 0 to 1, with 0 yielding the most conservative responses, and 1 yielding highly creative responses. Although the ideal temperature has not yet been defined for this use case, our most recent paper determined that a temperature of 0.3 achieved the highest accuracy.4 7
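As an illustration of this setup, a minimal Python sketch of an equivalent single API call at temperature 0.3 is shown below. The authors automated prompting through Google Sheets rather than Python, so the client usage, model identifier and message structure here are assumptions for illustration only.

# Minimal sketch of one GPT-4 API call at temperature 0.3 (illustrative only;
# the study automated prompting through Google Sheets rather than Python).
from openai import OpenAI

client = OpenAI()  # assumes an API key is available in the environment

def ask_gpt4(case_text: str, question: str) -> str:
    """Send one clinical challenge to the model and return its free-text answer."""
    response = client.chat.completions.create(
        model="gpt-4",    # assumed model identifier
        temperature=0.3,  # low randomness, as used in the study
        messages=[{"role": "user", "content": f"{case_text}\n\n{question}"}],
    )
    return response.choices[0].message.content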
Prompt engineering
The ‘What to Do Next?’ questions from JAMA Ophthalmology’s Clinical Challenges follow a standardised multiple-choice question format, with one correct option and three incorrect options (distractors). The exact same information (case report, multiple-choice question and answer options) was provided to GPT-4 and human graders.
Recent studies have shown that different strategies of zero-shot prompting lead to different results.16 Thus, we compared two zero-shot prompting strategies: the first consisted of what our team collectively agreed would be most logical, while the second consisted of a zero-shot plan-and-solve+ (PS+) prompt (figure 1). Proposed by Wang et al, zero-shot PS+ prompting asks GPT to devise a plan that divides the task into smaller subtasks, and then to carry out those subtasks following detailed instructions. Although the original PS prompting strategy uses a similar methodology, it suffers from calculation errors and low-quality reasoning steps.16 In PS+, more detailed instructions address these weaknesses. PS+ demonstrates superiority over PS and basic zero-shot chain-of-thought (CoT) strategies, such as the ‘Let’s think step by step’ prompt.16
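To make the two strategies concrete, the Python sketch below contrasts a plain zero-shot instruction with a PS+-style instruction adapted from the idea described by Wang et al. The exact wording used in this study appears in figure 1, so these templates are illustrative assumptions rather than the study prompts.

# Illustrative prompt templates; the study's exact prompts are shown in figure 1.
PLAIN_ZERO_SHOT = (
    "You are an ophthalmologist. Read the case below, state the most likely "
    "diagnosis, then choose the best next step from the options provided.\n\n{case}"
)

# PS+-style template: first devise a plan that breaks the task into subtasks,
# then carry out each subtask with detailed, careful reasoning before answering.
PS_PLUS_ZERO_SHOT = (
    "Let's first understand the case and extract the key clinical findings. "
    "Then let's devise a plan that divides the task into subtasks: summarise the "
    "findings, build a differential, select the single most likely diagnosis and "
    "choose the best next step. Finally, let's carry out the plan step by step, "
    "paying attention to detail, and state the final diagnosis and chosen option."
    "\n\n{case}"
)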
Human benchmarking
Since historical data on human performance are not publicly available on JAMA Ophthalmology, 3 practising board-certified ophthalmologists and 3 ophthalmology trainees were recruited to answer 5 randomly selected clinical challenges from each of the 13 ophthalmology subspecialties. The ophthalmologists specialised in comprehensive ophthalmology, glaucoma and medical retina. The trainees had various levels of training: postgraduate years 2, 3 and 4. We compared the results of human graders to GPT-4 on the same subset of clinical challenges to contextualise our findings.
Statistical analysis
We compared GPT-4 answers to those provided by JAMA Ophthalmology. When grading the open-ended diagnosis questions, we prioritised specificity in evaluating correct answers. Initially, three junior trainees jointly assessed the answers. Answers were deemed correct if both the general primary diagnosis and the specific aetiology or subtype were correct. For example, if the specific aetiology was ‘acute posterior multifocal placoid pigment epitheliopathy’, mentioning only the general primary diagnosis like ‘posterior uveitis’ was deemed insufficient and marked incorrect. In another example from our dataset, if the specific aetiology was ‘UL97-resistant and UL54-resistant cytomegalovirus retinitis’, mentioning ‘cytomegalovirus retinitis’ was marked as correct. When junior trainees were unsure, further adjudication was performed by a senior clinician. The efficacy of both zero-shot prompting strategies was evaluated using generalised estimating equations (GEEs), considering the overlap in question sets. The GEE, fitted with the geepack package, accounted for correlation within the data, with significant findings further examined via post hoc analysis and Dunnett’s method for p value adjustment. Logistic regression allowed us to study the influence of subspecialty on accuracy. All analyses were conducted with R V.4.3.1 using a 5% significance threshold.
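For readers who prefer a code-level view, a rough analogue of the GEE comparison is sketched below in Python using statsmodels. The published analysis used geepack in R V.4.3.1, so the data layout, column names and exchangeable correlation structure shown here are assumptions.

# Analogous GEE sketch in Python (the study used R's geepack); column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per (clinical challenge, prompting strategy) with a binary accuracy outcome.
df = pd.read_csv("gpt4_jama_challenges.csv")  # hypothetical file

model = smf.gee(
    "correct ~ strategy",           # traditional zero-shot vs zero-shot PS+
    groups="case_id",               # repeated measures on the same clinical challenge
    data=df,
    family=sm.families.Binomial(),  # binary outcome
    cov_struct=sm.cov_struct.Exchangeable(),
)
print(model.fit().summary())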
The same approach was employed to compare the performance of both GPT-4 prompting strategies to human graders. Human grader concordance was quantified using kappa statistics, interpreted as follows: 0–0.20, none to slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.00, near-perfect agreement. In this section, GPT-4 was tested on the subset of clinical challenges that underwent human grading; thus, the accuracy reported may differ slightly from that reported in the previous section.
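As a sketch of how inter-grader agreement could be computed, the snippet below derives pairwise Cohen’s kappa values. The exact multi-rater kappa variant used in the study is not specified in the text, so this pairwise approach and the toy gradings are assumptions.

# Pairwise Cohen's kappa between graders (illustrative; toy data only).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary gradings (1 = correct, 0 = incorrect) over the same questions.
graders = {
    "ophthalmologist_1": [1, 0, 1, 1, 0, 1],
    "ophthalmologist_2": [1, 0, 1, 0, 0, 1],
    "ophthalmologist_3": [1, 1, 1, 1, 0, 1],
}

for (name_a, a), (name_b, b) in combinations(graders.items(), 2):
    print(f"{name_a} vs {name_b}: kappa = {cohen_kappa_score(a, b):.2f}")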
Results
Within the collection of 422 Clinical Challenges, the subspecialties of Retina and Vitreous, Uveitis and Neuro-Ophthalmology were the most represented, comprising 23% (96/422), 16% (67/422) and 16% (67/422) of the total, respectively. No challenges were published on the topics of refractive surgery, clinical optics and fundamentals (online supplemental figure 1).
Traditional zero-shot GPT-4 prompting
Using traditional zero-shot prompting strategies, GPT-4 achieved mean accuracies of 41.5% (95% CI (36.8% to 46.3%)) and 60.4% (95% CI (55.6% to 65.1%)) in diagnosis and next step, respectively. Diagnostic and next-step accuracy did not significantly differ by subspecialty (p=0.13 and p=0.41, respectively).
We observed the following patterns: when the diagnosis was accurate, 74.9% (95% CI (67.6% to 81.0%)) of the next steps were correct. Conversely, when the diagnosis was incorrect, 50.2% (95% CI (43.8% to 56.6%)) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). This was seen among all subspecialty cases, with no significant differences found between subspecialties (p=0.41).
GPT-4 zero-shot plan-and-solve+ prompting outperforms traditional zero-shot prompting
Using zero-shot PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI (43.1% to 52.9%)) and 63.0% (95% CI (58.2% to 67.6%)) in diagnosis and next step, respectively (figure 2). Next-step accuracy did not significantly differ by subspecialty (p=0.44). However, diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027).
When the diagnosis was accurate, 75.2% (95% CI (68.6% to 80.9%)) of the next steps were correct. Conversely, when the diagnosis was incorrect, 50.2% (95% CI (43.8% to 56.6%)) of the next steps were accurate (figure 3). The next step remained approximately three times more likely to be accurate when the initial diagnosis was correct (p<0.001).
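As a back-of-the-envelope check, the threefold figure is consistent with an unadjusted odds ratio computed from these two proportions; the published estimate comes from the GEE model and may be adjusted, so the calculation below is illustrative only.

# Unadjusted odds ratio implied by the reported conditional accuracies (illustrative check).
p_correct_dx = 0.752   # next-step accuracy when the diagnosis was correct
p_wrong_dx = 0.502     # next-step accuracy when the diagnosis was incorrect

odds_correct_dx = p_correct_dx / (1 - p_correct_dx)  # ~3.03
odds_wrong_dx = p_wrong_dx / (1 - p_wrong_dx)        # ~1.01
print(round(odds_correct_dx / odds_wrong_dx, 1))     # ~3.0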
Across all subspecialty challenges, zero-shot PS+ prompting outperformed traditional zero-shot prompting in diagnostic accuracy (p=0.006) but did not show a statistically significant difference in accuracy for determining the next step (p=0.18) (table 1). There was no observed subspecialty effect in the relationship for diagnosis (p=0.13) or the next step (p=0.89) (online supplemental table 1).
GPT-4 versus ophthalmologists and ophthalmology trainees
We then compared the performance of GPT-4 to the six human graders. Since the overall agreement among the board-certified ophthalmologists was moderate to substantial (kappa=0.66, 95% CI (0.44 to 0.86) for diagnostic accuracy and kappa=0.63, 95% CI (0.39 to 0.85) for next step), the comparison with GPT-4 was done with each ophthalmologist separately. There were no statistically significant differences in diagnostic performance when comparing ophthalmologists to GPT-4 zero-shot PS+, with respective accuracies of 48.9% (p=0.562), 59.6% (p=0.649) and 68.1% (p=0.477). Similarly, there were no statistically significant differences in performance for next step determination, with respective scores of 59.6% (p=0.998), 59.6% (p=0.998) and 72.3% (p=0.416) (figure 4).
The agreement among trainee graders was lower (kappa=0.45, 95% CI (0.29 to 0.62) for next step and kappa=0.27, 95% CI (0.10 to 0.43) for diagnostic accuracy), and as such the comparison with GPT-4 was also done with each trainee separately. Both senior residents significantly outperformed GPT-4 zero-shot PS+ in diagnostic performance, with respective accuracies of 78.7% (p=0.049) and 85.1% (p<0.001). Similarly, both senior residents outperformed GPT-4 in next-step determination, with respective accuracies of 78.7% (p=0.020) and 85.1% (p=0.002). There were no significant differences in diagnostic performance and next-step determination when compared with the junior resident, with respective accuracies of 51.1% (p=0.75) and 57.4% (p=1.00) (figure 4).
Discussion
In this study, we demonstrate that enhanced prompting strategies can improve GPT-4’s performance, that GPT-4 performs well in complex clinical scenarios and that GPT-4 does not currently outperform ophthalmology trainees. We selected GPT-4 as the state-of-the-art LLM for this study since it has been shown to outperform its predecessors and other publicly available LLMs such as Google Bard and Claude-2.17 18
Enhanced prompting techniques have demonstrated their potential to augment the performance of GPT-4.16 19 While there are countless prompting strategies (such as few-shot CoT prompting, which provides the model with several exemplary chains of thought across multiple user prompts), such strategies were incompatible with the constraints of ChatGPT’s API, which is currently only designed for zero-shot prompting. However, zero-shot prompting can be refined to improve GPT’s accuracy. A novel advancement in this area, zero-shot PS+, entails directing the LLM to formulate a strategy by breaking down the main task into simpler subtasks and executing these with meticulous logical instructions.16 The enhancement of GPT’s performance by various prompting strategies highlights a major limitation in our current methods of evaluating LLMs. Since testing all possible prompting strategies is unrealistic, there is a pressing need for standard frameworks and guidelines to evaluate LLMs in medicine.
With the implementation of zero-shot PS+, GPT-4 achieved a diagnostic accuracy of 48% and was 63% accurate in identifying the most appropriate next step. Diagnostic accuracy was significantly higher in pathology and tumours than in uveitis (p=0.027), possibly due to the complex and often difficult diagnoses in uveitis. The likelihood of subsequent step accuracy was tripled when the initial diagnosis was correct, a trend that held across various subspecialty cases. This was likely the case due to GPT’s method of generating content by predicting the next most probable ‘token’.5 While conjectural, this may indicate that GPT-4’s proficiency lies in its ability to recall information and draw rapid inferences, rather than in iterative reasoning or reevaluation of decisions as new information, such as multiple-choice answer options, is presented. Consequently, GPT-4 appears predisposed to determining the optimal next step based on its initial diagnosis, without reconsidering this decision in light of subsequent information.
Prior research has explored GPT’s utility in analysing ophthalmology case reports on a limited basis. Madadi et al detailed GPT’s concordance with neuro-ophthalmologists in 22 case reports, highlighting strong alignment with experts.12 Delsoz et al evaluated GPT’s performance on 11 glaucoma cases, with findings indicative of a diagnostic precision comparable to that of senior ophthalmology residents.13 Lastly, Delsoz et al explored GPT-4’s application to 20 cornea case reports, showcasing once again a robust performance.14 Collectively, these studies signal a growing interest and recognition of GPT’s potential in ophthalmological evaluations. The diagnostic accuracy of GPT-4 in these studies ranged from 72.7% to 85%, higher than our achieved combined diagnostic accuracy of 48%. The limited number of cases in these studies makes further comparison challenging. Beyond ophthalmology, interest persists: a study testing GPT-4 Vision (GPT-4V) on general medical cases found it outperformed physicians in 934 cases, but its performance declined when images were introduced.20
Since no historical human performance metrics were published on these clinical challenges, we created a human benchmark for performance comparison. We used the performances of ophthalmology trainees at varying educational stages (first through third years) and of practising, board-certified ophthalmologists as a benchmark. This approach was chosen to capture a snapshot of the progression in clinical proficiency and to contextualise GPT-4’s performance within the current landscape of clinical learning. Within our limited comparative framework, we observed poor agreement among trainees and higher consensus among ophthalmologists. This suggests more variability in performance among trainees, reflecting the nature of continuous learning during residency. With each year of residency representing a significant jump in knowledge, senior residents performed better as they approached board examinations. Surprisingly, GPT-4 performed similarly to the ophthalmologists, a noteworthy finding that should be interpreted with caution due to the limited sample size. Also, both senior trainees outperformed GPT-4 and the consultants included in our study. This discrepancy could be attributed to a potential sampling bias: the trainees may have been preparing for upcoming examinations, making them more familiar with the specific minutiae often presented in these cases. Additionally, it is crucial to note that two of the three ophthalmologists are subspecialists, which may have made their exposure more focused and more removed from other subspecialties, unlike trainees who are currently undergoing broader training. Furthermore, the complexity of the cases might have influenced those with more clinical experience to answer based on their real-world experiences rather than adhering strictly to textbook answers, to which residents are more exposed. This adds another layer of complexity to the interpretation of our findings. The future potential for GPT-4 or subsequent language models to equal or surpass the proficiency of senior trainees, or even experienced ophthalmologists, remains a provocative and open question. Given the fast pace of innovation in this domain, it is plausible to conjecture that these models may soon approximate the diagnostic capabilities of human clinicians.
Since JAMA Ophthalmology’s Clinical Challenges are behind a paywall, it is likely that GPT was not trained on these data. However, due to the opaque nature of GPT’s training dataset, we cannot know for certain. If the task training examples from JAMA Ophthalmology were included in GPT’s pretraining data, this would introduce the risk of task contamination, disrupting this study’s zero-shot nature. Recent findings demonstrate that for classification tasks with no possibility of task contamination, LLMs rarely exhibit noteworthy enhancements in both zero-shot and few-shot methodologies.21 This critical limitation, when applied broadly to LLMs, may constrain their overall potential, revealing that they may not evolve and learn as rapidly as initially speculated. While our results are promising, they should not be misconstrued as an indication that GPT-4’s operational proficiency is equivalent to that of an ophthalmologist. Performance on online clinical challenges, much like for physicians, does not encompass the full spectrum of the practice of medicine. Soft skills, such as communication, professionalism and bedside manner, all represent essential skills which are not accounted for in this evaluation.22 23 Our study design intentionally focused on using a single, highly vetted dataset with a large sample size. The JAMA Ophthalmology dataset, with its rigorous review process and low acceptance rate, ensures a high level of quality. However, it is important to acknowledge that this single dataset may suffer from publication bias, potentially containing more impressive cases than what ophthalmologists encounter in their daily practice. This highlights the need for the development and availability of new benchmarking datasets for research purposes in ophthalmology.
In ophthalmology, a specialty that heavily relies on imaging, the forthcoming GPT-4 Vision, an extension of the GPT-4 model, aims to add visual information processing, representing a significant step towards creating large multimodal models (LMMs) that can handle the complexities of medical data.24 This advancement could revolutionise our approach, allowing us to include image data from clinical challenges in our evaluations. Specialised foundation models designed for ophthalmology are expected to greatly influence our field, and we are now starting to see their emergence.25 In the future, accuracy, safety and validity of these specialised LLMs will need to be assessed before considering clinical implementation.26 Lastly, there is a growing trend towards using LLMs for generating differential diagnoses, employing a few-shot prompting technique characterised by multiple prompts, each adding new clinical information to iteratively refine the final list of potential diagnoses. This approach will likely offer the greatest utility to clinicians and should be prioritised in future projects.27
To conclude, GPT-4’s performance on complex clinical challenges in ophthalmology is promising, although it does not yet rival the expertise of human trainees. For now, it is likely to be most useful in educational settings, and it points to a valuable role for specialised LLMs in the future of medical decision assistance.
Data availability statement
Data are available on reasonable request. All data produced in the present study are available on reasonable request to the authors. JAMA Ophthalmology’s Clinical Challenges are proprietary, and access is available through their website.
Footnotes
X @DanielMiladMD, @FaresAntaki
Contributors Conception and design of the study (DMilad, FA and RD); data collection (DMilad, FA, JM, AF, TK, DMikhail, ST, AB, A-AS, TN and GAM); data analysis (DMilad, FA, JM, DMikhail and C-EG); writing of the manuscript and preparation of figures (DMilad, FA, JM and DMikhail); supervision (RD); review and discussion of the results (all authors); edition and revision of the manuscript (all authors). DM and RD act as guarantors and accept full responsibility for the work and/or the conduct of the study, had access to the data, and controlled the decision to publish.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.