Article Text
Abstract
Background/aims To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).
Methods We evaluated four chatbots: generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT) in a cross-sectional study. Their response accuracy to 45 questions (15 AMD, 15 DR and 15 others) was evaluated and compared. Three masked retinal specialists graded the responses using a three-point Likert scale: either 2 (good, error-free), 1 (borderline) or 0 (poor with significant inaccuracies). The scores were aggregated, ranging from 0 to 6. Based on majority consensus among the graders, the responses were also classified as ‘Good’, ‘Borderline’ or ‘Poor’ quality.
Results Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median scores (IQR) of 6 (1), compared with 4.5 (2) in Google Bard, and 2 (1) in OcularBERT (all p ≤8.4×10−3). Based on the consensus approach, 83.3% of ChatGPT-4’s responses and 86.7% of ChatGPT-3.5’s were rated as ‘Good’, surpassing Google Bard (50%) and OcularBERT (10%) (all p ≤1.4×10−2). ChatGPT-4 and ChatGPT-3.5 had no ‘Poor’ rated responses. Google Bard produced 6.7% Poor responses, and OcularBERT produced 20%. Across question types, ChatGPT-4 outperformed Google Bard only for AMD, and ChatGPT-3.5 outperformed Google Bard for DR and others.
Conclusion ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.
- Macula
- Public health
- Retina
Data availability statement
No data are available.
Statistics from Altmetric.com
Data availability statement
No data are available.
Footnotes
X @Yih Chung Tham
SJS and YCT contributed equally.
Contributors KXC, CZ, KYCT, TYW, SJS and YCT conceived and designed the study. KXC, CZ, T-ET, BJF and WMW collected data. KXC, CZ and YCT analysed and interpreted the data. KXC, CZ, SS, PAK, CSL, AYL, CMGC, TYW, SJS and YCT wrote the manuscript. YCT is the guarantor.
Funding This work was supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub).
Disclaimer The funder had no role in study design, data collection, data analysis, data interpretation or writing of the report.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
Linked Articles
- Highlights from this issue