Abstract
Background Large language models (LLMs), such as ChatGPT, have considerable implications for various medical applications. However, ChatGPT’s training primarily draws from English-centric internet data and is not tailored explicitly to the medical domain. Thus, an ophthalmic LLM in Chinese is clinically essential for both healthcare providers and patients in mainland China.
Methods We developed a large language model of ophthalmology (MOPH) using Chinese corpora and evaluated its performance in three clinical scenarios: ophthalmic board exams in Chinese, answering evidence-based medicine-oriented ophthalmic questions, and diagnostic accuracy for clinical vignettes. Additionally, we compared MOPH's performance with that of human doctors.
Results In the ophthalmic exam, MOPH's average score closely aligned with the mean score of trainees (64.7 (range 62–68) vs 66.2 (range 50–92), p=0.817), and MOPH achieved a score above 60 in all seven mock exams. In answering ophthalmic questions, 83.3% (25/30) of MOPH's responses adhered to Chinese guidelines (Likert scale 4–5). Only 6.7% (2/30, Likert scale 1–2) and 10% (3/30, Likert scale 3) of responses were rated by reviewers as 'poor or very poor' or as containing 'potentially misinterpretable inaccuracies', respectively. In diagnostic accuracy, although ophthalmologists' rate of correct diagnosis was higher than MOPH's (96.1% vs 81.1%), the difference was not statistically significant (p>0.05).
Conclusion This study demonstrated the promising performance of MOPH, a Chinese-specific ophthalmic LLM, in diverse clinical scenarios. MOPH has potential real-world applications in Chinese-language ophthalmology settings.
- Public health
- Medical Education
Data availability statement
Data are available upon reasonable request.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Artificial intelligence-based large language models (LLMs) have significant implications for medical applications and have shown great potential in diagnosing ophthalmic conditions and in preparing for board certification in ophthalmology. However, LLMs are rarely trained on non-English languages or on up-to-date content from specific medical domains.
WHAT THIS STUDY ADDS
This study developed a Chinese-specific LLM of ophthalmology (MOPH), and further demonstrated its accuracy and reliability in three different clinical scenarios.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
Our exploration revolves around safeguarding user privacy and security while leveraging LLMs in the healthcare domain; building on this, further research can evaluate MOPH's real-world clinical performance.
Introduction
Artificial intelligence (AI) has been expanding its applications in various medical domains, such as image analysis, patient risk stratification and clinical note processing.1 2 However, recent AI advancements mostly focus on narrow, well-defined tasks, such as detecting diabetic retinopathy from fundus images.3 With the rapid development of the latest generation of AI models, which are trained on massive and diverse datasets, the field is moving from 'narrow AI' towards 'artificial general intelligence' (AGI), which demonstrates broad capabilities of intelligence. A noteworthy recent development is ChatGPT (OpenAI, San Francisco, California), an AI-based large language model (LLM) with significant implications for diverse scientific and medical applications.4 5 These neural network models are based on the Transformer architecture, are trained on massive corpora of web-text data and can be applied to numerous downstream tasks. In ophthalmology, studies have been conducted to evaluate the performance and potential of LLMs.6 7 One study showed that ChatGPT answered approximately half of the questions correctly in the OphthoQuestions free trial for ophthalmic board certification preparation.8 Another study found that ChatGPT has potential in the diagnosis of ophthalmic conditions, particularly for primary care providers.9
However, LLMs have crucial limitations. For instance, ChatGPT's training predominantly relies on English-centric internet data, which may affect the quality and diversity of its outputs, especially for non-English languages and domains.10 Additionally, LLMs are usually deployed remotely and may have access to a wide range of patient characteristics, posing serious privacy risks.11 Furthermore, ChatGPT's knowledge cut-off is 2021, so its output in the medical field may be outdated.
In this study, we aimed to develop an LLM of ophthalmology (MOPH) using Chinese corpora. We further assessed MOPH’s performance in three different clinical scenarios: ophthalmic board exams in Chinese, answering ophthalmic questions following evidence-based medicine (EBM) and diagnostic accuracy for clinical vignettes. Our exploration revolves around safeguarding user privacy and security while leveraging LLMs in the healthcare domain.
Materials and methods
Overview
This study aimed to develop a Chinese LLM that can be deployed locally and dedicated to ophthalmology. We also tested its early AGI ability in various clinical scenarios. This observational study was approved by the Xinhua Hospital Ethics Review Committee (Approval No. XHEC-D-2023–131), and the study protocol followed the tenets of the Declaration of Helsinki. The review committee indicated that patient consent was not required in this research as we only used publicly accessible or deidentified data.
Development of MOPH in Chinese
We developed MOPH by adopting the open-source LLM ChatGLM2-6B, an open bilingual language model based on the General Language Model (GLM) framework.12 13 In brief, ChatGLM2-6B was trained on approximately one trillion tokens drawn equally from Chinese and English corpora, enabling the model to perform well in both languages (see Section A in the online supplemental material 1).
To customise ChatGLM2-6B for our application scenarios, we first adopted prompt engineering to preprocess users' input. Prompt engineering involves creating prompts based on specific questions or statements within a given domain. This approach allowed us to leverage MOPH's semantic understanding while also providing the model with the most relevant information. We used publicly available and self-built Chinese ophthalmic knowledge databases, mainly drawing on ophthalmic textbooks, guidelines and selected review papers in Chinese, as well as the AAO EyeWiki (translated with Internet Explorer's built-in function; online supplemental material 2).14 15 To further address the unreliable and deceptive output of LLMs, we performed prompt tuning (p-tuning) on our Chinese fine-tuning dataset to refine MOPH (see Sections B and C in online supplemental material 1); only Chinese ophthalmic content was selected for p-tuning. P-tuning is an efficient fine-tuning technique that optimises continuous prompts, significantly reducing storage and memory usage per task, and has been shown to perform comparably to full-parameter fine-tuning with only 0.1%–3% of the fine-tuning parameters.16 Figure 1 illustrates the implementation of MOPH's framework.
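As a rough illustration of this prompt-engineering step, the Python sketch below shows how retrieved reference passages might be prepended to a user's question before it reaches the model; the template wording and the placeholder passages are assumptions for illustration and do not reproduce MOPH's actual Chinese prompts.

```python
# Illustrative sketch only: the template text and example passages are hypothetical
# and do not reproduce MOPH's actual Chinese prompts.

def build_prompt(question: str, references: list[str]) -> str:
    """Prepend the most relevant knowledge-base passages to the user's question."""
    context = "\n".join(f"[{i + 1}] {passage}" for i, passage in enumerate(references))
    return (
        "You are an ophthalmology assistant. Answer using only the references below.\n"
        f"References:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Example usage with placeholder passages retrieved from the knowledge database:
retrieved = [
    "Guideline excerpt on primary open-angle glaucoma management ...",
    "AAO EyeWiki entry on intraocular pressure ...",
]
print(build_prompt("What is the first-line treatment for primary open-angle glaucoma?", retrieved))
```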
We conducted the p-tuning using the source code from ChatGLM2-6B's GitHub repository.12 The hyperparameters employed in training were as follows: a batch size of 1, a learning rate of 2e-4 with 16 gradient accumulation steps, a maximum source length of 128 tokens and a maximum target length of 512 tokens. For prompt engineering, we used GanymedeNil/text2vec-large-chinese for embedding and Facebook AI Similarity Search (Faiss) for efficient similarity search and clustering of dense vectors.17 18 Figure 2 illustrates the details of the prompt engineering and prompt generation process in our study. The hardware for MOPH training comprised an Intel 8th-generation central processing unit (i5-8400, 2.81 GHz, 32 GB main memory) and two NVIDIA A4000 GPUs, running for 35 hours.
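The snippet below is a minimal sketch of how such a dense-retrieval step can be assembled from the text2vec-large-chinese encoder and Faiss; the mean pooling, placeholder passages and top-k value are assumptions for illustration rather than MOPH's exact configuration.

```python
# Hedged sketch of dense retrieval: embed knowledge-base passages with the
# text2vec-large-chinese encoder and search them with Faiss. Pooling choice,
# passage content and top-k are illustrative assumptions.
import faiss
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GanymedeNil/text2vec-large-chinese")
encoder = AutoModel.from_pretrained("GanymedeNil/text2vec-large-chinese").eval()

def embed(texts: list[str]) -> np.ndarray:
    """Mean-pool the last hidden states into one vector per text (assumed pooling)."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state            # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    vectors = (hidden * mask).sum(1) / mask.sum(1)             # masked mean pooling
    return vectors.numpy().astype("float32")

passages = ["Guideline passage on diabetic retinopathy screening ...",
            "Textbook passage on cataract surgery indications ..."]   # placeholder corpus
doc_vecs = embed(passages)
faiss.normalize_L2(doc_vecs)                                   # cosine similarity via inner product
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

query_vec = embed(["When should screening for diabetic retinopathy start?"])
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 2)                       # top-2 most similar passages
top_passages = [passages[i] for i in ids[0]]
```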
Evaluation of MOPH in ophthalmology
Some researchers believe that LLMs, such as ChatGPT, could be viewed as an early (yet still incomplete) version of an AGI system. Inspired by this, we proposed three clinical scenarios to investigate the capabilities of our MOPH model.
We first tested the performance of MOPH on the Board of Ophthalmic Exams in Shanghai, China. We used a dataset of single-choice questions (SCQ) and Written Qualifying Exam (WQE) questions from OphthoQuestions of the National Medical E-Book Packages, a common resource for board certification examination preparation.19 We only included text-based questions and excluded questions requiring the input of images. We also asked trainees in the same department to take the mock exams and compared their scores with MOPH's. Three senior ophthalmologists (all with over 10 years of clinical experience) independently reviewed each WQE answer, and the overall mean score was determined by averaging the scores given by each grader. To avoid confirmation bias, we did not tell the graders in advance that the language model and humans were taking the exam together.
In the second clinical scenario, we investigated whether MOPH can respond in line with EBM. Based on the clinical guidelines of the Chinese Medical Association, we generated 30 questions covering the following six subspecialties of ophthalmology: glaucoma, lens and cataract, paediatric ophthalmology and strabismus, retina and vitreous, external disease and cornea, and uveitis and ocular inflammation (Section A in the online supplemental material 3). Three graders (each with more than 10 years of clinical experience) assessed MOPH's responses using a Likert scale from 1 to 5 (1: very poor/unacceptable inaccuracies, 2: poor/minor potentially harmful inaccuracies, 3: moderate/potentially misinterpretable inaccuracies, 4: good/only minor non-harmful inaccuracies, 5: very good).
In the third scenario, we evaluated MOPH's diagnostic accuracy on outpatient clinic notes from the clinical setting. The deidentified clinical vignettes included only the patient's chief complaint, present illness, past ocular history, ocular medications, general medical and surgical history and physical examination with vital signs, following the electronic medical record of the Hospital Information System, Xinhua Hospital. The following subspecialties of ophthalmology were included (30 clinical vignettes for each subspecialty): glaucoma, lens and cataract, paediatric ophthalmology and strabismus, retina and vitreous, external disease and cornea, and uveitis and ocular inflammation (Section B in the online supplemental material 3). We also measured the accuracy of diagnoses made by the above three graders using a majority consensus-based approach.
Finally, we compared MOPH's performance with that of a commercial LLM (ChatGPT). Assessing an LLM's performance has always been challenging. To this end, we selected MedQA as a general medical benchmark alongside the ophthalmology SCQs.20 The MedQA dataset comprises questions (compiled as SCQs) in the style of the US Medical Licensing Examination (USMLE). We used an online translation tool (https://www.deepl.com/en/translator) to translate the MedQA questions into Chinese. For inference, we employed the default settings from ChatGLM2-6B's GitHub (top_p=0.7, temperature=0.95).21 By default, MOPH's model parameters are loaded with F16 precision, requiring approximately 13 GB of GPU memory. After quantisation, MOPH can be deployed locally on consumer-grade graphics cards (eg, only 6 GB of GPU memory is required at the INT8 quantisation level). We then evaluated the three models (MOPH with Q8 and F16 precision, and ChatGPT) on questions from MedQA's general medical domain and the ophthalmology SCQs.
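For reference, the sketch below follows the inference pattern documented in the ChatGLM2-6B repository; the model path and prompt are placeholders, the step that merges MOPH's p-tuned weights is omitted, and top_p and temperature mirror the values quoted above.

```python
# Hedged sketch of local inference in the style of the ChatGLM2-6B repository's examples.
# "THUDM/chatglm2-6b" is a placeholder: MOPH would load its own fine-tuned checkpoint,
# and loading the p-tuned prefix weights is omitted here for brevity.
from transformers import AutoModel, AutoTokenizer

model_path = "THUDM/chatglm2-6b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# F16 precision (roughly 13 GB of GPU memory) ...
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda().eval()
# ... or INT8 quantisation so the model fits on consumer-grade cards:
# model = AutoModel.from_pretrained(model_path, trust_remote_code=True).quantize(8).cuda().eval()

response, history = model.chat(
    tokenizer,
    "What is the first-line treatment for primary open-angle glaucoma?",  # placeholder prompt
    history=[],
    top_p=0.7,
    temperature=0.95,
)
print(response)
```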
Statistical analysis
We used the intraclass correlation coefficient (ICC) to measure inter-rater reliability. A t-test was used to compute the difference between observed means in two independent samples. Diagnostic accuracy was presented as numbers (percentages) and compared using the χ2 test. P values were two tailed, and a p value <0.05 was considered statistically significant.
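To make these comparisons concrete, the sketch below runs the same three tests on made-up numbers; scipy supplies the t-test and χ2 test, and the ICC is computed with the pingouin package, which is our assumption rather than the software used in the study.

```python
# Illustrative statistics only; all numbers are made up and pingouin is an assumed
# ICC implementation (the paper does not state which software was used).
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

# Independent-samples t-test comparing two sets of exam scores.
moph_scores = np.array([62.0, 63, 64, 65, 66, 67, 68])
trainee_scores = np.array([50.0, 55, 60, 65, 70, 80, 92])
t_stat, p_t = stats.ttest_ind(moph_scores, trainee_scores)

# Chi-squared test comparing correct/incorrect diagnosis counts between two groups.
counts = np.array([[146, 34],   # group 1: correct, incorrect (illustrative)
                   [173, 7]])   # group 2: correct, incorrect (illustrative)
chi2, p_chi, dof, expected = stats.chi2_contingency(counts)

# Intraclass correlation coefficient from long-format ratings (case, grader, score).
ratings = pd.DataFrame({
    "case":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "grader": ["A", "B", "C"] * 3,
    "score":  [4, 4, 5, 3, 3, 3, 5, 4, 5],
})
icc = pg.intraclass_corr(data=ratings, targets="case", raters="grader", ratings="score")
print(t_stat, p_t, chi2, p_chi)
print(icc[["Type", "ICC"]])
```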
Results
We assessed MOPH's ophthalmic knowledge by preparing seven sets of mock exams, each consisting of 25 SCQs (50 points in total) and 5 WQEs (50 points in total). MOPH was required to complete all seven exams, while each of the seven trainees was randomly assigned one set. On average, MOPH correctly answered 56% (range 52% (13/25) to 60% (15/25)) of the SCQs, which was lower than the trainees' average (p<0.05) (table 1 and figure 3).
For the WQEs, which were scored by graders who had not been told in advance that the language model and humans were taking the exam together, no statistically significant difference was found between MOPH and the trainees (73.4% (range 70%–82%) vs 59.5% (range 40%–88%), p=0.07). Overall, MOPH's average score was close to that of the trainees (64.7 (range 62–68) vs 66.2 (range 50–92), p=0.817). Notably, MOPH scored over 60 in all seven mock exams, whereas three of the seven trainees failed to reach the passing threshold of 60 points.
Table 2 presents examples of MOPH's responses. MOPH generated high-quality general information and provided good evidence-based responses: 83.3% (25/30) of responses followed Chinese guidelines (Likert scale 4–5) (table 2a). Table 2b shows the responses with 'moderate/potentially misinterpretable inaccuracies' (10% (3/30), Likert scale 3), which were mainly due to hallucinations produced by MOPH. For instance, when asked 'For patients diagnosed with type 1 diabetes before puberty, should they start screening for diabetic retinopathy after puberty?', MOPH gave a correct recommendation ('these guidelines recommend screening for diabetic retinopathy after puberty') but offered a potentially misinterpretable rationale: 'This is because before puberty, the patient's physical development is not fully mature, and the examination of fundus lesions may cause discomfort to the patient, with a high risk'. Only 6.7% (2/30, Likert scale 1–2) of responses were graded as 'poor or very poor' by the reviewers (table 2c). For example, as shown in table 2c, MOPH inaccurately stated that 'The lower limit of normal vision reference value for children aged 3 to 5 years is 0.7, not 0.5'.
Table 3 shows MOPH's diagnostic accuracy on the deidentified clinical vignettes. Overall, human doctors performed better than MOPH (accuracy 96.1% vs 81.1%), although the difference was not statistically significant (p>0.05) and MOPH's performance was still considered good. In certain ophthalmic subspecialties, such as lens and cataract, MOPH attained near-human diagnostic accuracy (96.7% vs 100%, p=1.00). Conversely, in other subspecialties, such as retinal diseases, MOPH lagged considerably behind human doctors (63.3% vs 90%, p<0.03). The inter-rater reliability among senior graders was excellent, with ICC values of 0.95 and 0.91 for the first and third clinical scenarios, respectively.
Finally, we compared the performance of MOPH with that of a commercial LLM. The datasets comprised seven distinct sets of 100 questions randomly selected from MedQA's testing dataset (1273 questions in total) and the above-mentioned SCQs. Figure 4 shows that MOPH outperformed ChatGPT on the ophthalmology SCQs, with accuracies of 57.4%, 56.3% and 49.1% for MOPH(F16), MOPH(Q8) and ChatGPT, respectively. However, on MedQA in the general medical domain, ChatGPT achieved a higher score than MOPH (44.6%, 43.4% and 53.7% for MOPH(F16), MOPH(Q8) and ChatGPT, respectively). After quantisation, MOPH's performance decreased slightly, but the difference was negligible (all p>0.05). Throughout these experiments, we observed that MOPH did not produce outputs in English when prompted in Chinese, and vice versa.
Discussion
AGI refers to systems that demonstrate broad capabilities of intelligence, including reasoning, planning and the ability to learn from experience, with these capabilities at or above human level.22 In this paper, we developed MOPH, a Chinese ophthalmic LLM that can be deployed offline and locally and that shows early AGI characteristics. In the ophthalmic knowledge assessment, MOPH achieved a mark of 65%, comparable to that of ophthalmology trainees in a university teaching hospital. In answering medical questions, 83.3% of MOPH's responses followed the Chinese guidelines. In diagnostic accuracy, although the rate of correct diagnosis by ophthalmologists was higher than that by MOPH, no statistical difference was found.
LLMs have demonstrated their effectiveness in various general-domain tasks. Nevertheless, they have not yet performed optimally in biomedical tasks, which require medical expertise in the responses. Additionally, since LLMs are mainly trained on English data, their limited ability to understand and respond in languages quite distinct from English, such as Chinese, hinders their effective use in Chinese contexts. Consequently, ChatGPT might face challenges in grammar, accuracy and fluency when dealing with Chinese queries, particularly in specialised fields like ophthalmology.23 China faces a wide spectrum of eye diseases that affect a considerable number of patients.24 The prevalence of eye diseases continues to rise, presenting a significant challenge to global eye health.25 26 Therefore, the demand for an ophthalmic LLM in Chinese cannot be ignored. Several studies have demonstrated LLMs' impressive multilingual capability, but their performance varies substantially across languages.27 According to Zeng's report, ChatGLM is a bilingual LLM pretrained on over 1 trillion English and Chinese tokens.13 Although the report does not explicitly state whether there is internal knowledge transfer or translation between the two languages, it is safe to assume that ChatGLM has been trained on a diverse range of data from both. This means the model has learnt to recognise and understand the nuances of both languages and can generate outputs in either language based on the input prompt.
LLMs may generate text that is semantically or syntactically plausible but incorrect or nonsensical (known as hallucination).28 In Potapenko's study, ChatGPT provided inadequate responses to questions on the management of retinal diseases.6 A study of ChatGPT's responses to vernal keratoconjunctivitis queries found that it provided inaccurate and potentially harmful information, especially regarding treatment and medication side effects.7 To mitigate hallucinations during reference data retrieval, we adopted a method combining vector database retrieval with keyword retrieval, which allows MOPH to integrate external knowledge bases and effectively reduces hallucinations.29 Rather than training on public Chinese medical databases, such as the Chinese Medical Knowledge Graph,14 as many previous language models did, we carefully designed ophthalmic domain fine-tuning datasets. Our preliminary results demonstrate that MOPH not only offers highly accurate general information about ophthalmology but also provides evidence-based responses regarding the treatment of diseases.
Reasoning is the ability to draw logical conclusions from given information. Some studies have shown that LLMs, such as GPT-3.5, can achieve impressive performance on reasoning tasks, such as mathematical or logical reasoning.30 Lin et al compared GPT-3.5 with human performance on American board certification in ophthalmology: GPT-3.5 scored 63.1%, while humans scored higher at 72.6%.31 In the study led by Mihalache et al, ChatGPT achieved an accuracy of 46%, which did not meet the threshold for providing substantial assistance in board certification preparation.8 Interestingly, in both our study and previous studies, the average scores of humans are higher than those of LLMs. There are several possible explanations. First, clinical reasoning (CR) is essential for clinicians, as it is the process they use to reach a diagnosis, treatment and/or management plan. However, in the clinical domain, most LLMs focus mainly on clinical classification or reading comprehension and underexplore CR for disease diagnosis because rationale annotation by clinicians is expensive.32 Our results indicate that MOPH may have preliminary CR abilities, although these still lag behind human doctors. Technical limitations, such as the lack of multimodal capability, may be another reason.33 For example, MOPH has poor diagnostic ability for retinal diseases, which may result from its inability to process ophthalmic images in its current version.
The use of digital health data raises concerns regarding security and privacy.34 As an LLM that can operate offline and be deployed locally, MOPH ensures that patients' private information is not stored or disclosed on the network, helping healthcare institutions strengthen their defences for protecting patient privacy and information. Responsible usage of these AI systems in clinical practice is of utmost importance.35 Currently, MOPH serves as an assistive tool, highlighting the necessity of human supervision. MOPH delivers health information conversationally, making it easier to comprehend than professional guidelines.
This study has several limitations. First, one way to enhance a model's in-context learning ability is few-shot prompting, which provides several examples in the prompt to guide the model towards better performance. We did not assess the effect of few-shot prompting on MOPH in this study; previous research has highlighted its instability arising from variations in training examples, their order and prompt formats. Second, although we demonstrated MOPH's preliminary AGI capabilities in ophthalmology through three different tasks, these cannot cover the entire spectrum of clinical diagnosis and treatment, such as bedside consultation training for resident physicians, actual clinical visits, and giving medication or even surgical suggestions. Future work will explore MOPH's performance in other clinical scenarios. Third, MOPH currently handles only textual information and cannot analyse images or videos, whereas ophthalmic examination results rely primarily on images. Finally, given the rapid development of new LLM models and versions, the reported results should be interpreted cautiously. For instance, recent studies have shown GPT-4's improved performance over GPT-3.5 on medical assessments.31 Further comparisons encompassing more advanced LLMs, such as GPT-4, Gemini (Google) or Ernie-4 (Baidu), would provide stronger insights into how LLMs might facilitate clinical workflows across different countries and languages.
We have developed MOPH, a Chinese-specific ophthalmic LLM, and demonstrated its accuracy and reliability in different clinical scenarios. As MOPH is an AGI-oriented system committed to patient privacy and data security, we will continue to evaluate its real-world clinical performance.
Data availability statement
Data are available upon reasonable request.
Ethics statements
Patient consent for publication
Ethics approval
The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This observational study was approved by the Xinhua Hospital Ethics Review Committee (Approval No. XHEC-D-2023-131), and the study protocol followed the tenets of the Declaration of Helsinki. The review committee indicated that patient consent was not required in this research as we only used publicly accessible or de-identified data.
Footnotes
CZ and HY contributed equally.
LC and MZ contributed equally.
Contributors Planning and design: CZ, HY, JG, JY, LC. Conduct and reporting: CZ, HY, JG, JY, JP, XX, MX and LC. Analysis and interpretation of data: CZ, HY, JG, JY, PF, YY, DH, YH, JP, XX, MX and LC. Critical revision: CZ, HY, MX and LC. Supervision and guarantor: CZ, LC, PZ and MZ.
Funding This study was supported by the National Natural Science Foundation of China (82171044), the Special Strategic Project on Innovative Science and Technology of Guangdong Province (STKJ202209073), the Hospital Funded Clinical Research of Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine (21XJMR02), the Hospital Management Research Program of the Institute of Hospital Development Strategy, China Hospital Development Institute, Shanghai Jiao Tong University (HDSI-2022-A-001) and the Interdisciplinary Program of Shanghai Jiao Tong University (YG2021QN52).
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer-reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.