Background/Aims To develop a deep learning system for automated glaucomatous optic neuropathy (GON) detection using ultra-widefield fundus (UWF) images.
Methods We trained, validated and externally evaluated a deep learning system for GON detection based on 22 972 UWF images from 10 590 subjects that were collected at 4 different institutions in China and Japan. The InceptionResNetV2 neural network architecture was used to develop the system. The area under the receiver operating characteristic curve (AUC), sensitivity and specificity were used to assess the performance of detecting GON by the system. The data set from the Zhongshan Ophthalmic Center (ZOC) was selected to compare the performance of the system to that of ophthalmologists who mainly conducted UWF image analysis in clinics.
Results The system for GON detection achieved AUCs of 0.983–0.999 with sensitivities of 97.5–98.2% and specificities of 94.3–98.4% in four independent data sets. The most common reasons for false-negative results were confounding optic disc characteristics caused by high myopia or pathological myopia (n=39 (53%)). The leading cause for false-positive results was having other fundus lesions (n=401 (96%)). The performance of the system in the ZOC data set was comparable to that of an experienced ophthalmologist (p>0.05).
Conclusion Our deep learning system can accurately detect GON from UWF images in an automated fashion. It may be used as a screening tool to improve the accessibility of screening and promote the early diagnosis and management of glaucoma.
Glaucoma, characterised by optic disc cupping and visual field impairment, is the leading cause of irreversible blindness, affecting more than 70 million individuals worldwide.1–3 Due to population growth and global ageing, the number of patients with glaucoma is expected to exceed 112 million by 2040.3 Most vision loss caused by glaucoma may be avoidable if early diagnosis and timely treatment are available. However, detecting glaucoma at an early stage, especially primary open-angle glaucoma and normal-pressure glaucoma, is very challenging. First, glaucoma is often asymptomatic and frequently remains undiagnosed until very late stages, when central visual acuity is compromised.1 2 Second, current glaucoma screening is mainly based on optic nerve examination by professionals through ophthalmoscopy or fundus images.4–6 Manual assessment of optic discs is time-consuming, labour-intensive and difficult to apply to large-scale screening.
Deep learning-based artificial intelligence (AI) has shown great promise in healthcare for improving population health, owing to its efficiency in the automated detection of pathological lesions and in diagnosis.7–11 Previous studies have reported high-accuracy deep learning systems for automated glaucoma detection based on traditional fundus images (30- to 60-degree visible scope of the retina).12–16 Ultra-widefield fundus (UWF) images, which provide a 200-degree panoramic view of the retina,17 18 have been used to establish intelligent systems that screen for lattice degeneration, retinal breaks, retinal detachment and retinal haemorrhages,19–21 whereas automated glaucoma screening from UWF images has not been well investigated. As a result, individuals may need to undergo intelligent UWF imaging for retinal lesion screening and separate intelligent traditional fundus imaging for glaucoma screening, which makes the screening process inconvenient, inefficient and more expensive. UWF images have been used by a glaucoma specialist for glaucoma diagnosis because they offer high reproducibility in evaluating the vertical cup-to-disc ratio (VCDR) and good agreement with stereoscopic optic disc images,22 suggesting that UWF images could also support automated glaucoma detection by deep learning. This would make the best use of UWF images to detect both retinal lesions and glaucoma.
To address this open question, in this study, we aimed to develop a deep learning system for the automated detection of glaucomatous optic neuropathy (GON) from UWF images and to evaluate this system in a large, multiethnic population. In addition, we compared the GON detection performance of this system with that of ophthalmologists experienced in UWF image analysis in clinics.
A total of 5915 digital UWF images (3417 subjects) collected from the Chinese Medical Alliance for Artificial Intelligence (CMAAI) between November 2016 and February 2019 were used to develop a deep learning system for GON detection. CMAAI is composed of medical organisations, computer science research groups and related enterprises in the AI field with the purpose of improving the research and translational applications of AI in medicine. The CMAAI data set includes subjects who underwent fundus lesion examinations, ophthalmology consultations and routine ophthalmic health evaluations. UWF images were captured without mydriasis using an Optos nonmydriatic camera (Optos Daytona, Dunfermline, UK) with 200-degree fields of view.
Three additional data sets including 17 057 UWF images obtained at three other centres in two countries (China and Japan) were used to externally test the system. One was acquired from the outpatient clinics at ZOC in Guangzhou (south-east of China), consisting of 1317 UWF images from 698 subjects; one was acquired from the outpatient clinics and health screening centre at Xudong Ophthalmic Hospital (XOH) in Inner Mongolia (north-west of China), consisting of 2693 UWF images from 1086 subjects; and the remaining one was acquired from Tsukazaki Optos Public Project (TOPP) in Himeji (west of Japan), consisting of 13 047 UWF images from 5389 subjects.
All the UWF images were deidentified before being transferred to research investigators. This study was approved by the Ethics Committee of Zhongshan Ophthalmic Center (ZOC) (identifier, 2019KYPJ107) and performed in accordance with the principles of the Declaration of Helsinki.
Image labelling and reference standard
All UWF images were classified into two categories, referable GON (including suspected and certain GON) and non-GON. For screening purposes, the referable GON was determined when achieving any of the following criteria: (1) VCDR ≥0.7, (2) rim width ≤0.1 of disc diameter, (3) retinal nerve fibre layer defect or localised notches or (4) disc splinter haemorrhages.12–16 Quality control was conducted for all images. Poor-quality and unreadable images for GON detection were defined when vessels within a one-disc diameter of the optic disc margin could not be identified, and were removed before training the deep learning system. Non-GON refers to other fundus conditions except referable GON, such as normal fundus, retinal haemorrhages and retinal exudates.
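The referral criteria above can be read as a simple disjunction: an image is labelled referable GON if any one criterion is met. As a minimal sketch (the function and parameter names are illustrative, not from the paper), the rule could be encoded as:

```python
def is_referable_gon(vcdr, min_rim_width_dd, rnfl_defect, disc_haemorrhage):
    """Return True if any screening criterion for referable GON is met.

    vcdr             -- vertical cup-to-disc ratio
    min_rim_width_dd -- narrowest neuroretinal rim width, in disc diameters
    rnfl_defect      -- retinal nerve fibre layer defect or localised notch present
    disc_haemorrhage -- disc splinter haemorrhage present
    """
    return (vcdr >= 0.7
            or min_rim_width_dd <= 0.1
            or rnfl_defect
            or disc_haemorrhage)
```

Note that the deep learning system learns this decision implicitly from labelled images; the rule here only formalises how the labels themselves were assigned.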
Three board-certified glaucoma specialists who had 5, 7 and 8 years of experience, respectively, were recruited to label UWF images. An accurate reference standard is required for training a deep learning system.23 The reference standard for the label of a UWF image was determined when the same annotation was achieved among all three glaucoma specialists. For disputed images, arbitration was performed by another senior glaucoma specialist with over 20 years of experience. The performance of the deep learning system in GON detection was compared to this reference standard.
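The consensus-with-arbitration procedure described above amounts to accepting a label only when all three specialists agree and otherwise deferring to the senior arbitrator. A minimal sketch (names are illustrative):

```python
def consensus_label(specialist_labels, arbitrator):
    """Return the unanimous label if all graders agree; otherwise defer
    to the senior arbitrator (a callable mapping the disputed labels to
    a final label)."""
    if len(set(specialist_labels)) == 1:
        return specialist_labels[0]
    return arbitrator(specialist_labels)
```

For example, `consensus_label(["GON", "GON", "GON"], arb)` returns `"GON"` without consulting the arbitrator, while any disagreement triggers arbitration.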
Image preprocessing and augmentation
Image standardisation was performed before deep learning. The pixel values of the UWF images were scaled to a range of 0–1, and the UWF images were resampled to a resolution of 512×512 pixels. Data augmentation was adopted to increase the diversity of the training data set, reducing the chance of overfitting during training. The training data set was augmented to five times its original size via a combination of random horizontal and vertical flips, random rotations of up to 90° around the image centre and random brightness shifts within the range of 0.8–1.6. A total of 18 560 UWF images were used as training data.
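A dependency-light sketch of this preprocessing and augmentation pipeline (resizing to 512×512 is assumed to happen upstream with an imaging library, and the paper's arbitrary rotations of up to 90° are approximated here by 90° steps to keep the sketch NumPy-only):

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(img):
    """Scale 8-bit pixel values to the range [0, 1]."""
    return img.astype(np.float32) / 255.0

def augment(img):
    """One random augmentation pass: flips, rotation, brightness shift."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                    # random horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]                    # random vertical flip
    img = np.rot90(img, k=int(rng.integers(0, 2)))   # 0 or 90 degrees
    # random brightness shift in [0.8, 1.6], clipped back into [0, 1]
    return np.clip(img * rng.uniform(0.8, 1.6), 0.0, 1.0)
```

Applying `augment` repeatedly to each training image (here, about four extra copies per original) yields the fivefold enlargement described above.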
Development and evaluation of the deep learning classification model
The flow chart of our deep learning model development and evaluation is shown in figure 1. The UWF images from the CMAAI data set were randomly assigned to the training set, validation set, and test set with a ratio of 7:1.5:1.5. No subjects overlapped among these sets. The training set was used to optimise the parameters of the deep learning model, the validation set was used to guide the selection of hyperparameters and the test set was used to evaluate the selected model. Three external data sets (ZOC, XOH and TOPP) were used to further assess the effectiveness of the selected model. The reference standard of these three data sets was the same as the CMAAI data set.
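Because no subjects may overlap among the sets, the 7:1.5:1.5 split must be performed at the subject level rather than the image level. A minimal sketch of such a group-aware split (function and variable names are illustrative):

```python
import random

def split_by_subject(images_by_subject, ratios=(0.7, 0.15, 0.15), seed=42):
    """Randomly assign whole subjects (not individual images) to
    train/validation/test sets, so no subject spans two sets."""
    subjects = sorted(images_by_subject)
    random.Random(seed).shuffle(subjects)
    n = len(subjects)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    groups = (subjects[:n_train],
              subjects[n_train:n_train + n_val],
              subjects[n_train + n_val:])
    # expand each subject group back into its images
    return tuple([img for s in g for img in images_by_subject[s]]
                 for g in groups)
```

Splitting by image instead would leak near-duplicate eyes of the same subject across sets and inflate the measured performance.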
The deep learning model was trained in TensorFlow using a state-of-the-art deep convolutional neural network (CNN) architecture, InceptionResNetV2, which mimics the architectural features of two previous CNNs (the Inception Network and the Residual Network).24 CNN architectures were initialised by the weights pretrained for ImageNet classification.25
The model was trained for up to 180 epochs. During training, the validation loss was assessed on the validation set after each epoch and served as the criterion for model selection. Training was stopped when the validation loss did not improve over 60 consecutive epochs. The model state with the lowest validation loss was retained as the final model.
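The early-stopping logic described above (a patience of 60 epochs, keeping the best-validation-loss state) can be sketched framework-independently; in Keras the same behaviour comes from the built-in `EarlyStopping` callback:

```python
class EarlyStopper:
    """Track validation loss each epoch; signal a stop when there has
    been no improvement for `patience` consecutive epochs, remembering
    which epoch achieved the lowest loss."""

    def __init__(self, patience=60):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.stale = 0          # epochs since last improvement

    def update(self, epoch, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

The training loop would checkpoint the model weights whenever `best_epoch` advances and restore that checkpoint once `update` returns True.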
Visualisation of image features
We leveraged the saliency map technique to visualise the pivotal regions that influenced our deep learning system when detecting GON. This technique calculates the gradient of the CNN output with respect to each pixel in the UWF image to discern the pixels with the greatest impact on the final outcome. The intensity of the heatmap directly reflects each pixel's impact on the deep learning system's classification. Using this approach, the heatmap traces back to specific locations in the UWF image to highlight the features that contributed to the classification. The effectiveness of the heatmap was judged by a senior glaucoma specialist according to whether the highlighted regions colocalised with the GON lesion regions.
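The core idea, the magnitude of the output's gradient with respect to each pixel becomes the heatmap intensity, can be illustrated without a deep learning framework. This toy sketch approximates the gradient by central finite differences on an arbitrary scoring function (real systems use autodiff, and `score_fn` here stands in for the CNN's GON output):

```python
import numpy as np

def saliency_map(score_fn, img, eps=1e-3):
    """Approximate |d score / d pixel| by central finite differences
    and normalise to [0, 1]: pixels whose perturbation changes the
    output most receive the highest heatmap intensity."""
    grad = np.zeros_like(img, dtype=np.float64)
    it = np.nditer(img, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        up, down = img.copy(), img.copy()
        up[idx] += eps
        down[idx] -= eps
        grad[idx] = (score_fn(up) - score_fn(down)) / (2 * eps)
    heat = np.abs(grad)
    return heat / heat.max() if heat.max() > 0 else heat
```

For a well-behaved GON classifier, the bright region of such a heatmap should sit over the optic disc, which is exactly the colocalisation check the senior specialist performed.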
Characteristics of classification errors
In a post hoc analysis, a senior glaucoma specialist reviewed all misclassified UWF images made by the deep learning system. The possible reasons for the misclassification were analysed and recorded according to the observed characteristics in the images.
Deep learning versus ophthalmologists
To evaluate our deep learning system in the context of GON screening, we recruited two ophthalmologists with 3 and 6 years of experience in UWF image analysis, respectively. The ZOC data set was employed to compare the performance of the system with that of the ophthalmologists. Notably, to reflect their performance in routine clinical practice and to avoid competition bias, the ophthalmologists were not informed that they were being compared against the deep learning system.
To determine the performance of the deep learning system, the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, positive predictive value (PPV) and negative predictive value (NPV) of GON detection were calculated against the reference standard. The 95% CIs were estimated for all performance metrics. Unweighted Cohen's kappa coefficients were used to compare the results of the system to the reference standard and were interpreted as follows: values ≤0 indicate no agreement, 0.01–0.20 none to slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial and 0.81–1.00 almost perfect agreement.26 Differences in sensitivity, specificity and accuracy between the system and the ophthalmologists were assessed using the McNemar test. Statistical significance was set at two-sided p<0.05. All statistical analyses were performed using Python 3.7.3 (Python Software Foundation, Wilmington, Delaware, USA).
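The threshold-based metrics and the unweighted kappa all derive from the four cells of the 2×2 confusion matrix. A self-contained sketch (illustrative names; 1 denotes referable GON):

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, PPV, NPV and unweighted
    Cohen's kappa from binary labels (1 = referable GON)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = tp + tn + fp + fn
    po = (tp + tn) / n                                   # observed agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2  # chance
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": po,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (po - pe) / (1 - pe),
    }
```

Computing the AUC additionally requires the model's continuous scores rather than binarised predictions (e.g. `sklearn.metrics.roc_auc_score`), which is why it is reported separately from these confusion-matrix metrics.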
Characteristics of the data sets
In total, 22 972 UWF images from 10 590 subjects (Han ethnicity (China): 38.8%, Mongol ethnicity (China): 10.3%, and Yamato ethnicity (Japan): 50.9%) were used to develop and externally verify the performance of the deep learning system. The demographics and characteristics of all the data sets (from the CMAAI, ZOC, XOH and TOPP) are summarised in table 1.
Deep learning system performance of GON detection
For GON detection from UWF images, our deep learning system achieved AUCs of 0.999 (95% CI 0.998 to 1.000), 0.983 (95% CI 0.977 to 0.988), 0.990 (95% CI 0.987 to 0.993) and 0.990 (95% CI 0.988 to 0.992) in the CMAAI test set, ZOC data set, XOH data set and TOPP data set, respectively (figure 2). Detailed information about the system’s performance in each data set, encompassing the sensitivity, specificity, accuracy, PPV and NPV, is shown in table 2. Compared to the reference standard of the CMAAI test set, ZOC data set, XOH data set and TOPP data set, the unweighted Cohen’s kappa coefficients of the system were 0.932 (95% CI 0.897 to 0.967), 0.845 (95% CI 0.808 to 0.882), 0.845 (95% CI 0.812 to 0.877) and 0.928 (95% CI 0.921 to 0.936), respectively.
Rationale of the deep learning system in GON detection
To explore the rationale of the deep learning system in detecting GON from UWF images, the CNN network was visualised using saliency maps. We found that GON regions of the images were effectively highlighted by heatmaps, which located the areas that contributed most to the final outcomes. Typical examples of heatmaps for GON images are shown in online supplemental figure S1.
False-negative and false-positive findings
The proportions of reasons for images misclassified by the deep learning system are shown in figure 3A. In the CMAAI test set, ZOC data set, XOH data set and TOPP data set, a total of 74 GON images (2.1%) were misclassified into the non-GON group (false-negative classification), among which 39 images (52.7%) showed GON with pathological or high myopia, 9 images (12.2%) showed GON with diabetic retinopathy, 3 images (4.1%) showed GON with pigmented optic disc and the remaining 23 images (31.1%) showed GON without other fundus conditions. Typical examples of false-negative images are shown in figure 3B. In contrast, a total of 424 non-GON images (3.0%) were erroneously assigned to the GON group (false-positive classification), among which 153 images (36.1%) indicated physiological large cupping, 60 images (14.2%) indicated pathological or high myopia, 7 images (1.7%) indicated non-glaucomatous optic atrophy, 64 images (15.1%) indicated age-related macular degeneration, 41 images (9.7%) indicated diabetic retinopathy, 51 images (12.0%) indicated retinal vein occlusion, 5 images (1.2%) indicated retinal artery occlusion, 20 images (4.7%) indicated retinal detachment, 3 images (0.7%) indicated myelinated retinal nerve fibre layer around the optic disc and the remaining 20 images (4.7%) indicated a normal fundus. Typical examples of false-positive images are shown in figure 3C.
Comparison of the deep learning system against ophthalmologists
In the ZOC data set, for detecting GON from UWF images, the ophthalmologist with 6 years of experience achieved a sensitivity of 96.6% (95% CI 94.3% to 98.9%) and specificity of 93.2% (95% CI 91.7% to 94.7%), and the ophthalmologist with 3 years of experience achieved a sensitivity of 80.9% (95% CI 75.9% to 85.9%) and specificity of 88.2% (95% CI 86.3% to 90.1%), while the deep learning system achieved a sensitivity of 97.9% (95% CI 96.1% to 99.7%) and specificity of 94.3% (95% CI 92.9% to 95.7%), with an AUC of 0.983 (95% CI 0.977 to 0.988) (online supplemental figure S2). The performance of our system was comparable to that of the ophthalmologist with 6 years of experience and significantly better than that of the ophthalmologist with 3 years of experience (online supplemental table S1).
In this study, we established a deep learning system for automated GON detection based on 22 972 UWF images. The system could identify GON from UWF images with high accuracy. Even in the external data sets (collected by different types of cameras) from subjects with a variety of ethnic backgrounds in two countries, our system consistently performed with high sensitivity and specificity, demonstrating the broad generalisability of our system in the clinic. Moreover, the agreement between the outcomes of the system and the reference standard is high in each data set according to the unweighted Cohen’s kappa coefficients (all over 0.84), further confirming the effectiveness of our system.
Recently, based on traditional fundus images (30–60° of retina visible scope), several studies using AI for glaucoma detection have been published.15 27 Liu et al 15 established a deep learning system for discerning GON, which achieved an AUC of 0.996, with a sensitivity of 96.2% and a specificity of 97.7%. Hemelings et al 27 applied deep learning to computer-assisted glaucoma identification, reporting an AUC of 0.995, with a sensitivity and specificity of 98.0% and 91%, respectively. The present study showed that deep learning-based GON detection from UWF images (200° of retina visible scope) achieved an average AUC of 0.991, which was close to the AUCs reported in previous studies based on traditional fundus images.15 27 Since we have proven that peripheral fundus lesions can be accurately detected from UWF images by deep learning,20 this result indicates that UWF images can also be used for automated central fundus lesion detection. Therefore, we can leverage deep learning to simultaneously screen for both central and peripheral fundus lesions from UWF images, which is more convenient and efficient than screening based on traditional fundus images, which cannot be used for the automated detection of peripheral fundus lesions.
Compared to the ophthalmologists experienced in UWF image analysis in clinics, the performance of our system for GON detection was equal to that of the ophthalmologist with 6 years of experience and significantly better than that of the ophthalmologist with 3 years of experience. Based on the robust performance for GON detection, our system can be applied in regions with a shortage of glaucoma specialists to increase the accessibility of glaucoma screening for high-risk populations, such as older people and people with a family history of glaucoma. In addition, our system, as an efficient first reading/screening tool, can reduce the burden of glaucoma specialists who work in large hospitals with numerous patients by focusing on patients with positive findings.
To understand how the system distinguishes GON images from non-GON images, we created heatmaps to visualise the learning procedure. The areas of the optic disc that contributed most to GON detection by the system were highlighted in heatmaps, which denoted that the system focused on the optic disc when performing GON detection. This reasonable decision-making rationale may promote the application of our system in real-world settings.
Although our system has reliable performance for GON detection, misclassification still exists. Approximately 53% of the false-negative misclassified images resulted from confounding optic disc characteristics caused by high myopia or pathological myopia. These optic discs often have features such as peripapillary atrophy (β-zone), shallow cups and optic disc tilting. The majority of the remaining false-negative images (31%) indicated GON without other fundus conditions. To increase the sensitivity for GON screening, more studies are needed to investigate and minimise these false-negative classifications. Most false-positive misclassified images (36%) were due to physiological large cupping which had a similar appearance to GON. For the other false-positive images, 6% represented normal fundus and myelinated retinal nerve fibre layers, and the remaining were a result of various fundus lesions. Notably, a referral is also needed in the cases with physiological large cupping and fundus lesions in the false-positive group. Therefore, the additional workload resulting from the false-positive images appears to be acceptable, as most of these cases would benefit from further clinical evaluation. In addition, a multimodal deep learning system that is developed based on multiple clinical data (eg, cup-to-disc ratio difference between two eyes, intraocular pressure, and visual field) may help to reduce both false-negative and false-positive results in GON detection.
This study has several limitations. First, we developed a deep learning system using two-dimensional UWF images lacking stereoscopic qualities instead of three-dimensional images. This might be a primary reason why our system has limited capability to differentiate GON from non-glaucomatous optic atrophy. In addition, the cost of UWF imaging is often higher than that of traditional fundus imaging. However, we established this intelligent GON detection system as one part of the UWF image-based comprehensive fundus lesion screening system, which could simultaneously detect multiple central and peripheral fundus lesions. On this account, the higher cost of UWF imaging might be acceptable.
In summary, we found that the deep learning system that was trained on UWF images had high sensitivity and specificity for GON detection, which was comparable to those of an experienced ophthalmologist. Hence, our system can promote early GON detection in large-scale screenings and improve the prognosis of glaucoma. A prospective multicentre study is needed to further validate the real-world performance of our system in various clinical settings.
We thank the following ophthalmologists of Tsukazaki Hospital (Japan) for sharing the Optos images: Daisuke Nagasato, Shunsuke Nakakura, Masahiro Kameoka, Hitoshi Tabuchi, Ryota Aoki, Takahiro Sogawa, Shinji Matsuba, Hirotaka Tanabe, Toshihiko Nagasawa, Yuki Yoshizumi, Tomoaki Sonobe and Tomofusa Yamauchi.
ZL, CG, and DL contributed equally
Correction notice This article has been updated since it was published online. We have added the following information: ‘ZL, CG, and DL contributed equally’.
Contributors Conception and design: ZL, CG, DL and HL. Funding obtainment: HL. Provision of study data: HL, DN and PZ. Collection and assembly of data: ZL, DL, XZ, DW, MD, FX, PY, JW and PZ. Data analysis and interpretation: ZL, CG, DN, YZ, CC, XZ, DW, MD, FX, HL, DL, CJ, YH, PY, LZ and YH. Manuscript writing: all authors. Final approval of the manuscript: all authors.
Funding This study received funding from the National Key R&D Program of China (grant no. 2018YFC0116500), the National Natural Science Foundation of China (grant no. 81770967), the National Natural Science Fund for Distinguished Young Scholars (grant no. 81822010), the Science and Technology Planning Projects of Guangdong Province (grant no. 2018B010109008) and the Key Research Plan for the National Natural Science Foundation of China in Cultivation Project (grant no. 91846109). The sponsors and funding organisations had no role in the design or conduct of this research.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement The data sets generated and/or analysed during the current study are available upon reasonable request from the corresponding author. Correspondence and requests for data materials should be addressed to HL ( ).
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.