Abstract
Background Artificial intelligence (AI) has great potential in medical imaging diagnostics, but human judgement remains indispensable. We propose an AI-aided teaching method that leverages generative AI to train students on many images while preserving patient privacy.
Methods A web-based course was designed using 600 synthetic ultra-widefield (UWF) retinal images to teach students to detect disease in these images. The images were generated by stable diffusion, a large generative foundation model, which we fine-tuned with 6285 real UWF images from six categories: five retinal diseases (age-related macular degeneration, glaucoma, diabetic retinopathy, retinal detachment and retinal vein occlusion) and normal. A total of 161 trainee orthoptists took the course. They were evaluated with two tests: one consisting of UWF images and another of standard field (SF) images, which the students had not encountered in the course. Both tests contained 120 real patient images, 20 per category. The students took both tests once before and once after training, with a cool-off period in between.
Results On average, students completed the course in 53 min, significantly improving their diagnostic accuracy. For UWF images, student accuracy increased from 43.6% to 74.1% (p<0.0001 by paired t-test), matching the previously published state-of-the-art AI model’s accuracy of 73.3%. For SF images, student accuracy rose from 42.7% to 68.7% (p<0.0001), surpassing the state-of-the-art AI model’s accuracy of 40%.
Conclusion Synthetic images can be used effectively in medical education. We also found that humans are more robust to novel situations than AI models, thus showcasing human judgement’s essential role in medical diagnosis.
- Imaging
- Retina
- Telemedicine
- Diagnostic tests/Investigation
- Public health
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Artificial intelligence (AI) models can accurately detect retinal disease but still face multiple challenges in real-world implementation. Humans can generalise better but have limited training opportunities. A scalable teaching method was needed to improve human performance.
WHAT THIS STUDY ADDS
Using AI-generated synthetic images for training rapidly enhances humans' diagnostic accuracy to match state-of-the-art AI models while preserving robustness.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
This teaching approach could be applied in other medical imaging fields to train experts worldwide, improving patient outcomes efficiently. It also showcases AI’s potential to empower human skills rather than replace them.
Introduction
Ageing populations worldwide are placing unprecedented pressure on healthcare systems,1–3 and the shortage of medical experts is emerging as a critical challenge that must be addressed to ensure the sustainability of healthcare delivery.4 Artificial intelligence (AI) for medical diagnosis holds great potential, demonstrating high performance in detecting various diseases.5–7 However, challenges such as decreased performance on unseen imaging modalities and the difficulty of covering commercially available AI diagnostics under medical insurance systems8 need to be addressed to fully harness AI’s potential and ensure its accessibility and effectiveness for patients worldwide.
The lack of large, curated imaging repositories often prevents humans from reaching their full potential in medical image diagnosis.9 Medical students might only see a few examples per disease, making it difficult to recognise all cases in practice, where disease presentation and patient morphology vary widely. AI models, on the other hand, are typically trained on many thousands or even millions of images.
Recently, large text-to-image generative foundation models like stable diffusion (SD)10 have been developed that can generate many images faster and with higher quality than generative adversarial networks,11 the previous state of the art. We hypothesise that training humans on many examples is an effective way of teaching, allowing them to match or even outperform specialist state-of-the-art AI models.
We developed an AI-assisted teaching approach that intensively trains non-experts by focusing on image analysis, similar to how AI models are trained.12 We leverage generative AI to create synthetic images, allowing us to show students many examples without concerns about individual medical data privacy.13 We test our hypothesis and approach on retinal disease detection in ultra-widefield (UWF) (220°)14 retinal images. We compared the identification abilities of learners who completed the training with those of AI and experienced experts. We also examined the robustness of their diagnostic capability when faced with a novel imaging modality not covered during training, namely a 50° standard field (SF) of view obtained with a completely different imaging device.
Methods
Study overview
We designed a web-based training course that uses synthetic images to teach students how to recognise disease in UWF retinal images. We then used a test consisting of real images to evaluate the students’ performance. Each student took the test before and after the training course to assess the effectiveness of our teaching method. Notably, the students were not given feedback when taking the test, and at least 1 week passed between the pre-training and post-training tests. We also compared experienced experts and an AI system on the same test. Additionally, we used a second test consisting of SF retinal images to evaluate how well participants generalise to an unseen modality. We considered images belonging to one of six classes, namely normal, healthy retinas (Normal) and five critical retinal diseases: diabetic retinopathy (DR), age-related macular degeneration (AMD), glaucoma (Gla), retinal vein occlusion (RVO) and retinal detachment (RD).
Study participants: students and experts
We tested our method on 161 trainee orthoptists at different stages of a 4-year university programme. This was the entire student body of the programme, and the participation rate was 100%. The trainee orthoptist curriculum includes ophthalmic diseases, and students’ knowledge in this area is expected to deepen with every year of study. Including students at different stages of their education allows us to evaluate whether our approach is practical for varying levels of pre-existing knowledge. Students were instructed by the student leader (FM) not to engage in any additional learning activities until the experiment’s conclusion.
To contextualise the students' performance, we also had eight experienced experts take the test: five certified orthoptists and three retinal specialists, ophthalmologists experienced in the field of surgical retinal disease; both groups are expert at interpreting retinal images. Certified orthoptists are qualified professionals who specialise in performing ophthalmic examinations based on ophthalmic knowledge and hold a national medical qualification in Japan. Each of the five orthoptists involved in this study had over 10 years of practical experience. The three retinal specialists all held the ophthalmic specialist certification recognised by the Japan Ophthalmological Society and had more than 10 years of experience in retinal and vitreous surgery. An overview of the demographics and level of education of the participants is shown in table 1.
All students are affiliated with a 4-year university that trains certified orthoptists (a national ophthalmic examination specialty certification in Japan). The nurses in this study are a group without specialisation in the ophthalmic field. Retinal specialists are ophthalmologists who specialise in retinal diseases.
Evaluation tests
To evaluate the ability of study participants to recognise disease in UWF images (Optos 200Tx, Nikon, Japan), we used a web-based test consisting of 120 images, 20 per class. For each image, the participant could choose one of seven options: the six target labels and an ‘I don’t know’ option. Unlike the training course, correct answers were not displayed after a student made a selection. We used a second test with a different type of retinal imaging to evaluate how well students generalise to an unseen image modality. This test is identical in design but uses SF images (optical coherence tomography-assisted SF fundus camera, Triton, Topcon, Japan) instead of UWF images. An example of each type is shown in online supplemental figure 3. Both provide a picture of the retina, but there are two key differences. First, SF images only capture a narrow 50° field of view around the posterior pole, the central part of the retina, whereas UWF images capture a 220° field of view. Second, the UWF camera uses two lasers to scan the retina and produce a pseudocolour image, whereas SF images are captured with a standard true-colour optical camera. This leads to differences in appearance and scale, and different types of imaging artefacts can occur in each modality. Both SF and UWF images were selected by an ophthalmologist (HT) based on the clarity of the lesion within the respective imaging ranges, making them suitable for diagnosis. Consequently, the patients in the SF images differed from those in the UWF images. The background of each group of patients is presented in table 2.
AI-based image generation
SD V.1.4,10 a large pretrained text-to-image foundation model, was fine-tuned on 6285 UWF retinal images captured with an Optos 200Tx UWF camera (Nikon, Japan) using DreamBooth15 to produce novel, synthetic UWF images. The dataset consisted of 1666 DR images, 215 AMD images, 1316 Gla images, 393 RVO images, 468 RD images and 2227 normal images; the images used in the evaluation test were not included. All images were selected based on diagnoses agreed upon by two or more ophthalmologists (RS, HT and TY). We fine-tuned a separate SD model for each class and generated images until 100 suitable SD images had been selected per class. The same ophthalmologists (RS, HT and TY) assessed the generated images and selected those suitable for teaching based on the clarity of features unique to the target pathology without additional findings suggesting other diseases. The background of these patients is presented in table 2. The number of generated SD images and the selection rate per class were 8000 images for DR with a rate of 1.3%, 1500 for AMD (6.7%), 1000 for Gla (10%), 500 for RVO (20%), 1100 for RD (9.1%) and 3500 for normal (0.29%). We fine-tuned the SD models and generated the images on an NVIDIA RTX 3090 Ti 24 GB GPU using Python V.3.10 and the PyTorch V.1.13.0 and Diffusers V.0.10.2 libraries. More details are provided in the ‘SD optimisation’ section of online supplemental file 1. Synthetic images generated with our fine-tuned SD models are shown in figure 1.
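As an illustration of this image generation step, the following is a minimal sketch of sampling candidate synthetic images from a DreamBooth-fine-tuned SD checkpoint with the Diffusers library; the checkpoint path, instance prompt and sampling settings are hypothetical placeholders rather than the exact values used in our pipeline (see the ‘SD optimisation’ section of online supplemental file 1 for the actual settings).

```python
# Minimal sketch only: sample candidate synthetic UWF images from a
# DreamBooth-fine-tuned Stable Diffusion checkpoint (one checkpoint per class).
# The checkpoint path and instance prompt are hypothetical placeholders.
import torch
from diffusers import StableDiffusionPipeline

model_path = "./sd-dreambooth-uwf-dr"              # hypothetical checkpoint for the DR class
prompt = "a photo of sks ultra-widefield fundus"   # hypothetical DreamBooth instance prompt

pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate candidate images; ophthalmologists later screen them for teaching suitability.
for i in range(10):
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save(f"candidate_dr_{i:04d}.png")
```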
Web-based training course using synthetic images
During the course, the student is shown an image and asked to select the class they think it belongs to (online supplemental figure 1). After responding, the correct answer is immediately displayed alongside image annotations highlighting and explaining the relevant parts of the image (online supplemental figure 2). We generated 600 synthetic images with SD, 100 per class, and divided them into five teaching sets, numbered No. 1 to No. 5. Each set consists of 120 synthetic images, 20 of each of the six classes. Students completed all five sets in sequence from set No. 1 to set No. 5, unless their performance exceeded the accuracy of the worst-performing expert, in which case training ended early.
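For illustration, the sketch below shows one way of assembling the five balanced teaching sets (120 images each, 20 per class) from the 600 synthetic images; the file names and random seed are hypothetical details, not our actual implementation.

```python
# Minimal sketch: split 600 synthetic images (100 per class) into five teaching
# sets of 120 images, each containing 20 images per class. File names are placeholders.
import random

CLASSES = ["Normal", "DR", "AMD", "Gla", "RVO", "RD"]
rng = random.Random(42)

images_by_class = {c: [f"{c.lower()}_{i:03d}.png" for i in range(100)] for c in CLASSES}
for imgs in images_by_class.values():
    rng.shuffle(imgs)

teaching_sets = []
for set_idx in range(5):                              # sets No. 1 to No. 5
    current = []
    for c in CLASSES:
        current.extend(images_by_class[c][set_idx * 20:(set_idx + 1) * 20])
    rng.shuffle(current)                              # mix classes within each set
    teaching_sets.append(current)

assert all(len(s) == 120 for s in teaching_sets)
```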
Verification that the SD model did not memorise training images
To check that the SD model did not memorise any real images (ie, to ensure that SD was not simply reproducing the training images), we compared all 600 synthetic images with all 6285 real images used for fine-tuning. We used self-supervised copy detection16 to find the most similar pairs of real and synthetic images, which were then manually inspected. None of these pairs were visually identical, suggesting that our SD model did not memorise images. More details are provided in the ‘Evaluation of Similarity’ section of online supplemental file 1.
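A minimal sketch of this kind of similarity screening is shown below. It assumes real and synthetic images in two placeholder directories and uses a generic ImageNet-pretrained ResNet-50 embedding as a stand-in for the dedicated self-supervised copy-detection model16 used in the study; the pairs with the highest cosine similarity would then be inspected manually.

```python
# Minimal sketch: for each synthetic image, find the most similar real training
# image by cosine similarity of deep-feature embeddings, then inspect the top
# pairs manually. A generic ResNet-50 stands in for the copy-detection model.
import glob
import torch
import torchvision
from torchvision import transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()                      # keep the 2048-d pooled features
backbone.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    feats = []
    for p in paths:
        x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0).to(device)
        feats.append(torch.nn.functional.normalize(backbone(x), dim=1))
    return torch.cat(feats)

real_paths = sorted(glob.glob("real_uwf/*.png"))        # placeholder directory (real images)
synth_paths = sorted(glob.glob("synthetic_uwf/*.png"))  # placeholder directory (synthetic images)

sim = embed(synth_paths) @ embed(real_paths).T          # cosine similarity matrix
best_score, best_idx = sim.max(dim=1)                   # closest real image per synthetic image
```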
State-of-the-art AI model for comparison
As a further comparison, we evaluated a recently proposed, state-of-the-art AI model previously published by researchers specialising in machine learning, which has achieved excellent performance on internal and external test sets. This model was trained on a dataset of 5376 patients (8570 eyes, 13 026 Optos 200Tx images) and achieved an area under the curve (AUC) of 0.9848 (±0.0004)17 for detecting disease in UWF images. We evaluated this model on the same evaluation tests provided to the students. The model produces a probabilistic output by default, which needs to be mapped to discrete predictions. We used the ‘conservative’ decision threshold for detecting images with disease, as proposed by the original authors. If an image is classified as showing disease, the disease with the highest probability is taken as the model’s prediction.
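To make the mapping from probabilistic output to a discrete label concrete, the sketch below implements the two-step rule described above under the assumption that the model outputs one probability per class; the class ordering and threshold value are illustrative placeholders, not the published ‘conservative’ threshold.

```python
# Minimal sketch: map a probabilistic output to a discrete prediction by first
# thresholding 'any disease vs normal', then taking the most probable disease.
# The class ordering, output format and threshold are illustrative assumptions.
import numpy as np

CLASSES = ["Normal", "DR", "AMD", "Gla", "RVO", "RD"]
DISEASE_THRESHOLD = 0.5   # placeholder for the authors' 'conservative' threshold

def discrete_prediction(probs: np.ndarray) -> str:
    """probs: array of six class probabilities in the order of CLASSES."""
    disease_probs = probs[1:]                          # probabilities of the five diseases
    if disease_probs.sum() >= DISEASE_THRESHOLD:       # image classified as showing disease
        return CLASSES[1 + int(np.argmax(disease_probs))]
    return "Normal"

print(discrete_prediction(np.array([0.30, 0.40, 0.10, 0.08, 0.07, 0.05])))  # -> DR
```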
Statistical analysis
Statistical analysis was performed using JMP V.16.2.0 (SAS Institute, Cary, NC, USA). Paired t-tests were used for within-subject significance tests regarding learning effects. We conducted a non-parametric multiple comparison test (Steel-Dwass test) when comparing study times and the time required to answer each question on the evaluation test among groups of students.
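For readers who prefer a scriptable alternative to JMP, the within-subject learning effect can be reproduced with a paired t-test as sketched below; the accuracy values are hypothetical placeholders, and the Steel-Dwass all-pairs comparisons between student subgroups are not shown.

```python
# Minimal sketch: paired t-test of pre-training vs post-training accuracy for the
# same students (values are hypothetical placeholders, not study data).
import numpy as np
from scipy import stats

pre = np.array([40.0, 35.8, 52.5, 45.0, 38.3, 47.5])   # pre-training accuracies (%)
post = np.array([70.8, 65.0, 80.8, 74.2, 69.2, 75.8])  # post-training accuracies (%)

t_stat, p_value = stats.ttest_rel(post, pre)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4g}")
```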
Ethics and data
The retinal images used in this study were acquired during clinical practice at Tsukazaki Hospital, Himeji, Japan. We obtained express written consent from each patient for the research use of their data. Written explanations and consent were received from the students, and prior explanations and consent were obtained from the experts. Our research was conducted in accordance with the Declaration of Helsinki and approved by the ethics committee of Tsukazaki Hospital.
Results
The performance of all study participants and the AI model is reported in online supplemental table 1.
The eight expert clinicians had an average accuracy of 91.1% (±4.2%) on UWF images. The AI model achieved an accuracy of 73.3%. The students completed the course in 53 min 1 s (±16 min 0 s) (online supplemental figure 4), and their average accuracy improved from 43.6% (±18.8%) to 74.1% (±9.3%) (p<0.0001 by paired t-test), with performance improving significantly across all subgroups (figure 2). For the trainee orthoptists, prestudying performance increased with each year of study, as expected, but even fourth-year students saw their accuracy rise by over 15%. Interestingly, fourth-year students spent the least time studying, almost half as long as first-year students (32.0 min vs 59.4 min, p<0.0001 by Steel-Dwass test), yet spent more time per image during the UWF image evaluation test (9 s vs 7 s, p=0.0002 by Steel-Dwass test).
On the second test with SF images, which was not part of the training course, students’ prestudying performance was similar to that on UWF images, at 42.7% (±18.5%). Accuracy improved to 68.7% (±11.5%) (p<0.0001 by paired t-test) after training, with performance also increasing significantly in all subgroups (figure 3). However, except for fourth-year students, the students’ poststudying performance was lower than for the UWF images. The experts achieved similar performance to before (92.8% (±6.8%)), which is expected as they encounter both types of imaging regularly in their work. However, the performance of the AI model, which had only been trained on UWF images, dropped dramatically to just 40%.
The average similarity scores (ranging from 0 to 1, with 1 indicating an identical image) between the original images and the generated SD images for each of the six classes were as follows: RD 0.45 (±0.09), RVO 0.45 (±0.07), Gla 0.48 (±0.07), DR 0.45 (±0.09), AMD 0.49 (±0.07) and normal 0.47 (±0.08). The highest similarity score among all 600 generated images was 0.747 for a normal SD image, followed by 0.739 for another normal SD image, and tied for third place were DR and Gla SD images with a score of 0.763 (online supplemental figures 5–8). Pairs of images with similarity scores ranked fifth to 10th are presented in online supplemental figures 9–14.
Discussion
Students’ performance increased significantly after taking the web-based training course and matched or exceeded the performance of a recent, state-of-the-art AI model. This is remarkable, as students studied for less than an hour on average, and the images used in the training course were entirely synthetic, AI-generated images made with a fine-tuned SD model. Another key finding of our work is that humans are far more robust to changes in the imaging device than AI models. Although students’ performance was 5.4 percentage points lower on average for the SF images, they did not experience the dramatic drop of 33.3 percentage points that the AI model experienced. This suggests that humans generalise their learning much better, highlighting the importance of keeping humans in the loop.
Training techniques for improving medical imaging interpretation have been extensively studied. Reports describe improved X-ray diagnostic skills among technicians in medically underserved areas such as South Africa18 and Australia,19 as well as specialised diagnostic improvements for CT colonographic data,20 endometrial tumours21 and malignant skin tumours.22 This aligns with the educational benefits obtained with our method. Moreover, our results show that AI-level diagnostic capability can be acquired quickly, suggesting that past efforts in image interpretation training for humans will remain beneficial in the coming era.
Compared with past teaching methods, our current approach also excels in cost-effectiveness. The time required to improve a non-expert’s performance substantially is relatively short. In addition, a web-based, fully automated training course can be easily scaled and made accessible worldwide. While there are concerns that AI technology might take away jobs,23 24 there is a significant opportunity to use AI in a new way: to enhance human skills.
Furthermore, AI-generated images can be shared more readily as they do not belong to any particular patient, and thus students can be exposed to far more examples. Ethical and data privacy issues arise when actual patient images are used for medical training courses. Our approach of using synthetic images generated by SD avoids these problems and could be applied in many other areas of clinical education. Another potential concern is that the SD model could memorise individual images, thus failing to preserve patient privacy.25 However, our detailed examination using state-of-the-art methods for detecting memorised images did not find such cases, implying that this approach may preserve privacy.
We found that non-experts could be trained to match or exceed the performance of the AI model with about an hour of training. However, experts still performed at least 10 percentage points better than students, and the AI model could complement humans. For example, students might sometimes miss cases that the AI model would have identified, but could also recognise when the model is making a mistake. Preliminary analysis suggests that there are indeed cases that are hard for humans but correctly classified by the AI model, and vice versa. The correlation between the mean accuracy of the students per image and whether the AI model classified the image correctly is relatively weak (Pearson r=0.1537; p=0.0936) (online supplemental figure 15). Future work could explore whether providing the students with the output of the AI model would further improve their performance and help close the performance gap with experienced experts.
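A sketch of this per-image agreement analysis is shown below; the per-image accuracies and AI correctness indicators are randomly generated placeholders rather than our actual results.

```python
# Minimal sketch: correlate the mean student accuracy on each of the 120 test
# images with a binary indicator of whether the AI model classified that image
# correctly. Data are random placeholders, not the study's results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
student_accuracy_per_image = rng.uniform(0.2, 1.0, size=120)  # mean student accuracy per image
ai_correct_per_image = rng.integers(0, 2, size=120)           # 1 if the AI model was correct

r, p = stats.pearsonr(student_accuracy_per_image, ai_correct_per_image)
print(f"Pearson r = {r:.4f}, p = {p:.4f}")
```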
There are several limitations to this study. First, there is no direct comparison with current, widely used medical education methods that rely on textbooks with few images and lengthy explanations. In other words, the results of this study do not necessarily negate current education methods. Second, the learners in this study were limited to medical personnel. Further investigation is needed to determine whether this approach works well for learners with no medical background. Third, the study does not examine the persistence of learners’ diagnostic capabilities. While AI’s diagnostic capabilities are fixed, the theory of the forgetting curve26 suggests that continuous learning would be necessary to maintain learners' diagnostic capabilities. On the other hand, further learning could lead to even greater improvements in diagnostic capability, so future studies are needed to determine the frequency and type of additional learning required to maintain and improve performance. Finally, the low adoption rate of 0.29% for normal images is due to the need to exclude many borderline cases showing even subtle signs of disease before an image can be classified as normal, since normal images lack distinctive features of their own. This process required extra effort from the ophthalmologists, indicating a point for future improvement of this method.
Future work could further explore the potential for using AI to enhance clinical education and whether non-experts benefit from having a diagnostic AI model available for decision support. We hope that our human-centric AI work will help improve clinical education and the level of care in ophthalmology and other fields of medicine.
Data availability statement
Data are available upon reasonable request. In this study, the synthetic images created and adopted for training purposes can be shared for scientific purposes. However, the patient images used for image generation and those used for evaluation tests cannot be shared due to divided legal opinions on the sharing of medical images within Japan.
Ethics statements
Patient consent for publication
Ethics approval
This study involves human participants and was approved by Tsukazaki Hospital Ethics Committee (approval number: 221053). Participants gave informed consent to participate in the study before taking part.
Acknowledgments
Nobuto Ochiai and Shoji Morita provided extensive technical advice on AI for image generation.
Footnotes
X @JustEngelmann, @mobernabeu
Contributors HT and YK conceived the idea of the study. RN played a central role in constructing and managing the online tests. FM was pivotal in planning, implementing and supervising student tests. TY and TN conducted the ophthalmological evaluation of the data and managed and supervised the medical aspects of the experiment. MT, MA and KK produced the artificial images and performed statistical analyses. YN, JE and MB contributed to the interpretation of the results. HT drafted the original manuscript. JE and MB supervised the conduct of this study. HT is guarantor.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.