
Clinical science
Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy
  1. Kai Xiong Cheong1,
  2. Chenxi Zhang2,
  3. Tien-En Tan1,
  4. Beau J Fenner1,3,
  5. Wendy Meihua Wong4,5,
  6. Kelvin YC Teo1,3,
  7. Ya Xing Wang6,
  8. Sobha Sivaprasad7,
  9. Pearse A Keane8,
  10. Cecilia Sungmin Lee9,
  11. Aaron Y Lee9,
  12. Chui Ming Gemmy Cheung1,3,
  13. Tien Yin Wong10,11,
  14. Yun-Gyung Cheong12,
  15. Su Jeong Song13,
  16. Yih Chung Tham1,3,5
  1. Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
  2. Chinese Academy of Medical Sciences & Peking Union Medical College Hospital, Beijing, China
  3. Ophthalmology & Visual Sciences Academic Clinical Program (Eye-ACP), Duke-NUS Medical School, Singapore
  4. Department of Ophthalmology, National University Hospital, Singapore
  5. Centre for Innovation and Precision Eye Health; and Department of Ophthalmology, National University of Singapore Yong Loo Lin School of Medicine, Singapore
  6. Beijing Institute of Ophthalmology, Beijing Tongren Hospital, Capital University of Medical Science, Beijing, China
  7. Moorfields Eye Hospital NHS Foundation Trust, London, UK
  8. Medical Retina, Moorfields Eye Hospital NHS Foundation Trust, London, UK
  9. Department of Ophthalmology, University of Washington, Seattle, Washington, USA
  10. Tsinghua Medicine, Tsinghua University, Beijing, China
  11. School of Clinical Medicine, Beijing Tsinghua Changgung Hospital, Beijing, China
  12. Sungkyunkwan University, Jongno-gu, Seoul, South Korea
  13. Kangbuk Samsung Hospital, Jongno-gu, Seoul, South Korea
  Correspondence to Dr Yih Chung Tham; thamyc@nus.edu.sg

Abstract

Background/aims To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).

Methods In this cross-sectional study, we evaluated four chatbots: three generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and one retrieval-based model (OcularBERT). Their accuracy in answering 45 questions (15 on AMD, 15 on DR and 15 others) was evaluated and compared. Three masked retinal specialists graded each response on a three-point Likert scale: 2 (good, error-free), 1 (borderline) or 0 (poor, with significant inaccuracies). The three graders' scores were summed to give an aggregate score ranging from 0 to 6. Based on majority consensus among the graders, each response was also classified as 'Good', 'Borderline' or 'Poor' quality.
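
To make the grading scheme concrete, the sketch below illustrates the score aggregation and majority-consensus classification described above. This is a minimal illustration in Python; the function names are hypothetical, and the abstract does not specify how a three-way disagreement among graders was resolved.

    from collections import Counter

    # Likert labels used by the graders: 2 = good (error-free),
    # 1 = borderline, 0 = poor (significant inaccuracies).
    LABELS = {2: "Good", 1: "Borderline", 0: "Poor"}

    def aggregate_score(grades):
        """Sum the three graders' scores, yielding a value from 0 to 6."""
        assert len(grades) == 3 and all(g in LABELS for g in grades)
        return sum(grades)

    def consensus_quality(grades):
        """Return the label assigned by at least two of the three graders,
        or None if all three disagree (tie handling is not specified)."""
        label, count = Counter(grades).most_common(1)[0]
        return LABELS[label] if count >= 2 else None

    # Example: two graders rate a response 'good', one 'borderline'.
    print(aggregate_score([2, 2, 1]))    # 5
    print(consensus_quality([2, 2, 1]))  # Good

Note that under this scheme a response can be rated 'Good' by consensus while scoring below the maximum aggregate of 6 (as in the example above), which is presumably why the two measures are reported separately.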

Results Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median (IQR) scores of 6 (1), compared with 4.5 (2) for Google Bard and 2 (1) for OcularBERT (all p≤8.4×10⁻³). Based on the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's responses were rated 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p≤1.4×10⁻²). ChatGPT-4 and ChatGPT-3.5 had no responses rated 'Poor', whereas Google Bard and OcularBERT produced 6.7% and 20% 'Poor' responses, respectively. Across question types, ChatGPT-4 outperformed Google Bard only for AMD questions, while ChatGPT-3.5 outperformed Google Bard for DR and other questions.

Conclusion ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.

  • Macula
  • Public health
  • Retina

Data availability statement

No data are available.


Footnotes

  • X @Yih Chung Tham

  • SJS and YCT contributed equally.

  • Contributors KXC, CZ, KYCT, TYW, SJS and YCT conceived and designed the study. KXC, CZ, T-ET, BJF and WMW collected data. KXC, CZ and YCT analysed and interpreted the data. KXC, CZ, SS, PAK, CSL, AYL, CMGC, TYW, SJS and YCT wrote the manuscript. YCT is the guarantor.

  • Funding This work was supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub).

  • Disclaimer The funder had no role in study design, data collection, data analysis, data interpretation or writing of the report.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
