Aim To explore and evaluate an appropriate deep learning system (DLS) for the detection of 12 major fundus diseases using colour fundus photography.
Methods Diagnostic performance of a DLS was tested on the detection of normal fundus and 12 major fundus diseases including referable diabetic retinopathy, pathologic myopic retinal degeneration, retinal vein occlusion, retinitis pigmentosa, retinal detachment, wet and dry age-related macular degeneration, epiretinal membrane, macular hole, possible glaucomatous optic neuropathy, papilledema and optic nerve atrophy. The DLS was developed with 56 738 images and tested with 8176 images from one internal test set and two external test sets. Its performance was also compared with that of human doctors.
Results The areas under the receiver operating characteristic curve of the DLS on the internal test set and the two external test sets were 0.950 (95% CI 0.942 to 0.957) to 0.996 (95% CI 0.994 to 0.998), 0.931 (95% CI 0.923 to 0.939) to 1.000 (95% CI 0.999 to 1.000) and 0.934 (95% CI 0.929 to 0.938) to 1.000 (95% CI 0.999 to 1.000), with sensitivities of 80.4% (95% CI 79.1% to 81.6%) to 97.3% (95% CI 96.7% to 97.8%), 64.6% (95% CI 63.0% to 66.1%) to 100% (95% CI 100% to 100%) and 68.0% (95% CI 67.1% to 68.9%) to 100% (95% CI 100% to 100%), respectively, and specificities of 89.7% (95% CI 88.8% to 90.7%) to 98.1% (95% CI 97.7% to 98.6%), 78.7% (95% CI 77.4% to 80.0%) to 99.6% (95% CI 99.4% to 99.8%) and 88.1% (95% CI 87.4% to 88.7%) to 98.7% (95% CI 98.5% to 99.0%), respectively. When compared with human doctors, the DLS obtained a higher diagnostic sensitivity but lower specificity.
Conclusion The proposed DLS is effective in diagnosing normal fundus and 12 major fundus diseases, and thus has much potential for fundus disease screening in the real world.
- diagnostic tests/investigation
Data availability statement
Data are available upon reasonable request.
Colour fundus photography (CFP) plays an important role in detecting prevalent vision-threatening fundus diseases such as diabetic retinopathy (DR), retinal vein occlusion (RVO), age-related macular degeneration (AMD) and glaucoma. According to recent epidemiological studies, approximately 79.6 million people worldwide will have glaucoma by 2020,1 while the number of people with AMD is expected to reach around 200 million.2 The prevalence of diabetes around the world will reach 592 million people by 2035,3 with one-third affected by DR.4 5 However, medical services are extremely limited worldwide. For example, in mainland China, the ophthalmic human resource at the country level was only 0.14 per thousand people according to a survey in 2014.6 This serious situation imposed a substantial burden on the large-scale screening of multiple fundus diseases for early detection.
Deep learning system (DLS)-based diagnosis and grading in ophthalmology has progressed rapidly in many conditions, including cataracts,7 8 DR,9–11 glaucoma,12 retinopathy of prematurity (ROP),13 14 AMD15 16 and macular telangiectasia type 2.17 18 However, current studies mostly focus on one or only a few (fewer than five) diseases.19 20 To the best of our knowledge, there is still a lack of efficient deep learning (DL) models for recognising multiple diseases (especially more than 10) using CFPs. We attribute this absence to two factors: the difficulty of establishing a large-scale multidisease data set for training and validation, and the technical challenge of developing a DLS suited not only for separating abnormal from normal CFPs but also for distinguishing one disease from many others.
Recently, Son et al 21 proposed a DLS for the detection of 12 major fundus abnormalities using 12 binary classification models, which could help greatly in the detection of retinal lesions. However, disease recognition still requires professional interpretation of the detected lesions, which may hinder screening and AI-assisted diagnosis where no trained ophthalmologists are available. In addition, applying a panel of binary classification models takes much more time and computing resources than a single multiclassification model. This paper aims to develop an automated screening DLS for multiple major fundus diseases, which could be of great significance for clinical practice in the future.
The current study complied with the Declaration of Helsinki and was approved by the Ethics committee of Peking Union Medical College Hospital (Number S-K631). The review board waived the need to obtain informed patient consent because of the retrospective study design and the use of fully anonymised CFPs.
Image acquisition and data sets
The diseases were selected according to their prevalence and morbidity, also taking into account their clinical potential for screening using CFPs. Hence, in addition to normal fundus images, we selected 12 major fundus diseases, comprising nine retinal diseases: referable DR, pathologic myopic (PM) retinal degeneration, RVO, retinitis pigmentosa (RP), retinal detachment (RD), wet and dry AMD, epiretinal membrane (ERM) and macular hole (MH), and three optic nerve disorders: possible glaucomatous optic neuropathy (GON), papilledema and optic nerve atrophy. The imaging diagnosis was based on standard diagnostic criteria (online supplemental eTable1). Although dry and wet AMD can be considered different stages of the same disease,22 we still classified them into two categories considering their potential differences in treatment and prognosis.
Since there were no publicly available data sets for the detection of multiple fundus diseases, we acquired and annotated a data set for the development and internal test of the DLS. To test the generalisability of the model, we also collected CFPs from an independent tertiary medical centre forming the external test set A and three primary hospitals forming the external test set B.
A total of 56 738 CFPs taken between January 2014 and December 2018 were collected from three participating centres (Henan Provincial Peoples’ Hospital, Zhengzhou, Henan, Beijing Tongren Hospital, Beijing and Beijing Aier Intech Eye Hospital, Beijing). These images formed the development data set for the models’ training and validation.
Another 8176 CFPs were collected for testing the DLS. Among them, 3579 came from the same sources as the development set, with the sample size of each disease exceeding 100, and formed the internal test set. Another 1245 CFPs from 757 patients were collected from an independent tertiary medical centre (Peking Union Medical College Hospital) from 1 January 2019 to 30 June 2019 as the external test set A. The remaining 3352 CFPs from 2558 patients were collected from three primary hospitals from 4 July 2017 to 14 September 2020 as the external test set B.
For each patient enrolled, only one image of each eye could be included. The detailed inclusion and exclusion criteria are provided in the online material (online supplemental file).
After preprocessing and desensitisation, the development data set was randomly separated into a training set and a validation set at a ratio of 4:1. The split was made at the patient level, which means that the bilateral CFPs of the same patient were assigned together to either the training set or the validation set. The three test sets were maintained independently to test the performance and generalisation of the DLS.
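As an illustration, a patient-level 4:1 split of this kind could be sketched as follows. This is a minimal sketch and not the code used in the study: the function name `split_by_patient`, the image and patient identifiers and the random seed are all hypothetical.

```python
import random
from collections import defaultdict

def split_by_patient(image_to_patient, train_fraction=0.8, seed=0):
    """Randomly assign whole patients to training or validation (4:1 by default)."""
    images_by_patient = defaultdict(list)
    for image_id, patient_id in image_to_patient.items():
        images_by_patient[patient_id].append(image_id)

    patients = sorted(images_by_patient)       # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(patients)

    n_train = round(len(patients) * train_fraction)
    train = [img for p in patients[:n_train] for img in images_by_patient[p]]
    val = [img for p in patients[n_train:] for img in images_by_patient[p]]
    return train, val

# Hypothetical toy data: 10 patients with one image per eye.
toy = {f"p{p}_{eye}": f"p{p}" for p in range(10) for eye in ("OD", "OS")}
train_ids, val_ids = split_by_patient(toy)
```

Because the split is made over patient identifiers rather than images, both eyes of a patient always land in the same subset, which avoids leakage between training and validation.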
Online annotation was carried out to label the images as normal fundus or the 12 selected diseases. A total of 17 senior board-certified ophthalmologists (with 5–12 years of experience) were randomly assigned for image annotation. Thirteen of them were assigned to label the development data set and internal test set. The other four doctors were assigned to label the external test sets. Images in the test sets were labelled three times by different ophthalmologists to obtain high reliability. Labels on which all three doctors agreed were retained. If only two doctors agreed on a label, the final decision was made by a fourth, more senior ophthalmologist (with over 10 years of experience). Images with no consistent labels or those annotated as poor quality, such as loss of focus, misalignment, excessive brightness or dimness, were excluded.
Development and evaluation of the DLS
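The adjudication rule for the test sets can be expressed compactly. The sketch below is only an illustration of the rule as described: the function `adjudicate`, the representation of an annotation as a set of label names and the `senior_decision` callback (a stand-in for the fourth ophthalmologist) are assumptions made for this example.

```python
from collections import Counter

def adjudicate(labels, senior_decision=None):
    """Return the final label for one test image, or None if it is excluded.

    labels: the three graders' annotations, each a frozenset of label names.
    senior_decision: callable used only when exactly two graders agree
        (hypothetical stand-in for the fourth, more senior ophthalmologist).
    """
    if len(labels) != 3:
        raise ValueError("expected exactly three annotations")
    top_label, votes = Counter(labels).most_common(1)[0]
    if votes == 3:
        return top_label                   # unanimous: retained as is
    if votes == 2 and senior_decision is not None:
        return senior_decision(labels)     # two of three agree: senior grader decides
    return None                            # no consistent label (or no senior reviewer supplied)

# Hypothetical annotations.
dr = frozenset({"referable DR"})
erm = frozenset({"ERM"})
rvo = frozenset({"RVO"})

unanimous = adjudicate([dr, dr, dr])
referred = adjudicate([dr, dr, erm], senior_decision=lambda ls: erm)
excluded = adjudicate([dr, erm, rvo])
```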
The DLS was designed as a multilabel model using the SeResNext50 23 convolutional neural network (CNN), which was selected from four candidate CNNs. The model has two parallel branches at the fully connected layer: one for distinguishing normal from abnormal fundus and the other for recognising the diseases present, of which there can be more than one simultaneously. The details are available in online materials (online supplemental eFigure1).
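To make the two-branch design concrete, the following NumPy sketch shows one way such an output head could be laid out on top of a shared feature vector. It is an assumption-laden illustration rather than the study's implementation: the class name `TwoBranchHead`, the feature size of 2048, the sigmoid outputs and the random, untrained weights are all hypothetical, and the backbone that would produce the features is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TwoBranchHead:
    """Two parallel fully connected branches on a shared feature vector."""

    def __init__(self, n_features, n_diseases=12, seed=0):
        rng = np.random.default_rng(seed)
        # Branch 1: a single output for normal versus abnormal.
        self.w_abnormal = rng.normal(0.0, 0.01, size=(n_features, 1))
        self.b_abnormal = np.zeros(1)
        # Branch 2: one independent output per disease (multilabel).
        self.w_disease = rng.normal(0.0, 0.01, size=(n_features, n_diseases))
        self.b_disease = np.zeros(n_diseases)

    def forward(self, features):
        p_abnormal = sigmoid(features @ self.w_abnormal + self.b_abnormal)
        p_disease = sigmoid(features @ self.w_disease + self.b_disease)
        return p_abnormal, p_disease

# Hypothetical batch of 4 pooled feature vectors (assumed size 2048).
features = np.random.default_rng(1).normal(size=(4, 2048))
head = TwoBranchHead(n_features=2048)
p_abnormal, p_disease = head.forward(features)
```

Each disease output is scored independently, so several diseases can be flagged for the same image.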
The performance of the DLS was evaluated on the three test sets. We used the area under the receiver operating characteristic (ROC) curve (AUC), sensitivity and specificity for assessment. The metrics were calculated for each label instead of each image, since one image could be annotated with more than one label. Information learnt by our automated method was visualised for further clinical review using the Class Activation Map (CAM),24 a CNN visualisation technique that identifies the importance of image regions by projecting the weights of the classification layer back onto the convolutional feature maps obtained from the last convolution layer.
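The per-label evaluation can be illustrated with a short sketch in which each column of a label matrix is treated as its own binary problem. The function names, the toy data and the thresholds of 0.5 are hypothetical; the thresholds actually used in the study are those reported in the online supplemental material.

```python
import numpy as np

def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def sensitivity_specificity(y_true, y_score, threshold):
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_score, dtype=float) >= threshold
    tp = np.sum(y_pred & y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    return float(tp / (tp + fn)), float(tn / (tn + fp))

def per_label_metrics(Y_true, Y_score, thresholds):
    """Compute AUC, sensitivity and specificity separately for each label column."""
    Y_true, Y_score = np.asarray(Y_true), np.asarray(Y_score)
    results = []
    for j, thr in enumerate(thresholds):
        sens, spec = sensitivity_specificity(Y_true[:, j], Y_score[:, j], thr)
        results.append({"auc": auc(Y_true[:, j], Y_score[:, j]),
                        "sensitivity": sens, "specificity": spec})
    return results

# Hypothetical toy data: 4 images, 2 labels; the third image carries both labels.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.5]]
metrics = per_label_metrics(Y_true, Y_score, thresholds=[0.5, 0.5])
```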
Comparison of the DLS with human doctors
To assess whether the DLS has reached a diagnostic performance comparable with that of human doctors, four ophthalmic residents were tested using the external test set B. Each resident was randomly assigned one quarter of the samples in the set and annotated them online. Their performance was then compared with that of the DLS on the same images.
All statistical analyses, including ROC curves, were carried out using the programming language Python (V.2.7; Python Software Foundation; Wilmington, Delaware, USA). The results of the indicators are presented as values with 95% CIs.
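The text does not state how the 95% CIs were computed. Purely as an illustration of one common option, the sketch below estimates a percentile bootstrap interval for sensitivity; the choice of bootstrap, the number of resamples, the function names and the toy data are all assumptions and should not be read as the method used in the study.

```python
import numpy as np

def sensitivity(y_true, y_pred):
    positives = y_true == 1
    if positives.sum() == 0:
        return np.nan                      # undefined when a resample has no positives
    return float((y_pred[positives] == 1).mean())

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI, resampling images with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resampled image indices
        stats.append(metric(y_true[idx], y_pred[idx]))
    lower, upper = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_pred), float(lower), float(upper)

# Hypothetical toy data: 50 diseased images (40 detected) and 50 normal images.
y_true = np.array([1] * 50 + [0] * 50)
y_pred = np.array([1] * 40 + [0] * 10 + [0] * 50)
point, lower, upper = bootstrap_ci(y_true, y_pred, sensitivity)
```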
A total of 64 914 CFPs were enrolled in this study, each with a field of 35–55 degrees of the posterior pole covering the whole area of the macula and the optic disc. The DLS was trained and validated using 46 501 and 10 237 images, respectively, and evaluated on the three test sets with 3 579 images (2 635 patients with a mean age (±SD) of 55.4±18.3 ranging from 2 to 96), 1 245 images (757 patients with a mean age (±SD) of 48.7±18.0 ranging from 4 to 89) and 3 352 images (2 558 patients with a mean age (±SD) of 52.6±20.6 ranging from 3 to 97), respectively. The numbers of images in each category of the internal test set were all over 100, which ensured the reliability of the test results. The two external test sets represented a real clinical scenario and the disease distribution of both a tertiary medical centre and primary hospitals in China over a certain period of time (table 1). CFPs with more than one label in the training set, validation set, internal test set and external test sets A and B numbered 3 202 (6.9%), 488 (4.8%), 334 (9.3%), 70 (5.6%) and 217 (6.5%), respectively.
The model performance on the test sets
We developed a late-fusion multilabel model as well as 12 binary classification models for comparison, and the former achieved a significantly higher mean average precision on the validation set (p=0.020) (online supplemental eTables 2 and 3). The ROC curves are also provided online (online supplemental eFigure 2 and 3). We therefore selected the late-fusion multilabel model for testing. The thresholds of the model on the validation set are listed in the online material (online supplemental eTable 4). The AUCs in the internal test set and the two external test sets were 0.950 (95% CI 0.942 to 0.957) to 0.996 (95% CI 0.994 to 0.998), 0.931 (95% CI 0.923 to 0.939) to 1.000 (95% CI 0.999 to 1.000) and 0.934 (95% CI 0.929 to 0.938) to 1.000 (95% CI 0.999 to 1.000), with corresponding sensitivities of 80.4% (95% CI 79.1% to 81.6%) to 97.3% (95% CI 96.7% to 97.8%), 64.6% (95% CI 63.0% to 66.1%) to 100% (95% CI 100% to 100%) and 68.0% (95% CI 67.1% to 68.9%) to 100% (95% CI 100% to 100%), and corresponding specificities of 89.7% (95% CI 88.8% to 90.7%) to 98.1% (95% CI 97.7% to 98.6%), 78.7% (95% CI 77.4% to 80.0%) to 99.6% (95% CI 99.4% to 99.8%) and 88.1% (95% CI 87.4% to 88.7%) to 98.7% (95% CI 98.5% to 99.0%), respectively. For the major blinding diseases, the AUCs in the external test sets for referable DR, possible GON, and AMD (dry and wet forms) were 0.965 (95% CI 0.960 to 0.971) to 0.986 (95% CI 0.984 to 0.988), 0.931 (95% CI 0.923 to 0.939) to 0.946 (95% CI 0.942 to 0.950) and 0.968 (95% CI 0.964 to 0.971) to 0.988 (95% CI 0.986 to 0.990), respectively. Table 2 shows the AUC, sensitivity and specificity of the DLS on the three test sets. The ROC curves of the DLS on the internal test set are shown in figure 1. The ROC curves for the external test sets are provided in the online material (online supplemental eFigure 4 and 5).
To further understand the model’s performance, we used heat maps for visualisation and clinical review. Figure 2 shows heat maps of the true-positive reports of normal fundus and 12 fundus diseases on the external test sets. Different colours mark subregions with different degrees of activation of the DLS, which increase progressively from blue to red as indicated by the colour bar. The heat maps indicate that the features extracted by the model generally present a high consistency with human doctors’ diagnostic basis in real clinical work according to the specific lesions on CFPs. Some false-positive and false-negative cases indicated that the DLS seemed to miss some fine abnormalities like the change of the disc rim, optic disc pit in possible GON or small MH (figure 3).
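The heat maps are produced with the CAM technique, in which the classification-layer weights of one class are used to form a weighted sum of the last convolutional feature maps. A minimal NumPy sketch of that computation follows; the function name, the toy feature maps and the weights are hypothetical, and in practice the resulting map would still need to be upsampled to the image size and rendered with a colour scale.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps for one class, rescaled to the range [0, 1].

    feature_maps: array of shape (K, H, W) from the last convolution layer.
    class_weights: array of shape (K,), the classification-layer weights of one class.
    """
    cam = np.tensordot(class_weights, feature_maps, axes=([0], [0]))  # shape (H, W)
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Hypothetical toy input: two 3x3 feature maps, each active at a single location.
feature_maps = np.zeros((2, 3, 3))
feature_maps[0, 0, 0] = 1.0
feature_maps[1, 2, 2] = 1.0
class_weights = np.array([2.0, 0.5])

cam = class_activation_map(feature_maps, class_weights)
hottest = np.unravel_index(np.argmax(cam), cam.shape)
```

In this toy case the location driven by the feature map with the larger class weight receives the highest activation, which is what the red end of the colour bar would mark.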
We also noticed that the model achieved a relatively lower sensitivity on the detection of possible GON. To further interpret and verify the model’s performance, we compared our DLS with other specialised GON detecting models using a publicly available data set. The test was performed on the Retinal Fundus Glaucoma Challenge (REFUGE, https://refuge.grand-challenge.org) test set, which contains 400 fundus images comprising 360 normal fundus and 40 glaucoma images. We achieved an AUC of 0.955 and a reference sensitivity of 0.931, ranking sixth and fourth, respectively, among the 12 participating teams, which is comparable to the state-of-the-art models (reference sensitivity: 0.725 to 0.976, AUC: 0.846 to 0.989).25 The detailed comparison results are available in the online material (online supplemental eTable 5 and eFigure 4).
The comparison between human doctors and the DLS model
The mean sensitivity and specificity of the four human doctors were 69.5%, 75.7%, 74.0% and 71.1%, and 98.1%, 97.8%, 97.8% and 97.6%, respectively. The corresponding DLS model’s sensitivity and specificity were 90.2%, 86.8%, 84.0% and 82.4%, and 97.6%, 92.6%, 93.7% and 93.6%, respectively. Statistical analysis (Mann-Whitney U test) showed that the DLS achieved significantly higher sensitivity compared with two of the four doctors and lower specificity compared with all four doctors. Detailed results are available in online materials (online supplemental eTable 6).
DL models for the detection of multiple fundus diseases
Previous studies have reported a large number of DLSs used for multiclassification, such as the detection of several diseases or of the severity of DR and AMD using CFPs or optical coherence tomography.9 16 26 There have also been recent studies focused on the detection of multiple fundus lesions.21 However, the detection of more than 10 categories of fundus diseases using a DLS remains very rare. Choi et al 27 described automated differentiation between normal fundus and nine retinal diseases but achieved an accuracy of only 36.7% for all 10 classes. Compared with their study, our work was carried out using a large data set with over 60 000 images acquired from real clinical patients. Son et al 21 proposed a deep learning method for detecting multiple lesion-level abnormalities in colour fundus images. The strength of their study is that the detected lesions provide a more intuitive interpretation than the holistic predictions made by prior work. However, as there is no one-to-one correspondence between lesions and fundus diseases, a gap naturally exists when converting lesion-level findings to diseases, which was left untouched by Son et al. In this work, we take an orthogonal direction, making a novel attempt to directly recognise 12 fundus diseases from a given colour fundus image. Moreover, we adopt the CAM technique to visualise which part of the given image is responsible for the final prediction.
Furthermore, the diseases selected in this study mostly comprise leading causes of blindness that need early detection and intervention covering a broad spectrum including retinal vascular diseases (RVO, referable DR), retinal degeneration diseases (PM retinal degeneration, RP, RD), macular disease (ERM, AMD and MH) and optic nerve disorders (possible GON, papilledema and optic nerve atrophy). Most of them have rarely been reported in previous studies.
Development and selection of the models
The models developed for multidisease detection were diverse in previous studies. The scenario targeted most often by machine learning methods for applications in ophthalmology is image classification,28 which is typically used in retinal analysis for automatic screening. Multiclass classification is used28 to detect the type of disease present or to accurately determine the stage of disease. This has been done for DR10 11 and ROP.29 30 In the case of multiclass classification, images belong to only one of the mutually exclusive categories. Choi et al 27 reported a multidisease recognition model that classified fundus images into different categories of retinal diseases for diagnosis. The authors attributed part of the unsatisfactory performance of the model to the decrease in expected accuracy as the number of categories increased, which has been demonstrated in previous studies.31 However, a mutually exclusive multiclassification model may be unsuitable for multiple disease recognition since some fundus diseases may coexist. For example, patients could have DR and ERM simultaneously,32 and the incidence rate of open-angle glaucoma in patients with RVO is significantly higher than that in the general population.33 Our multilabel model was developed with the modified feature layer of SeResNext50 in order to simultaneously classify abnormal versus normal CFP images and to accurately detect the presence of multiple diseases. We combined the two steps into a single model to simplify implementation in future clinical practice.
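The difference between mutually exclusive multiclass output and multilabel output can be seen in a small numerical sketch. The logits, the three label names and the 0.5 threshold below are hypothetical values chosen for illustration and do not come from the study.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()        # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

labels = ["referable DR", "ERM", "RVO"]
logits = np.array([3.0, 3.0, -2.0])        # hypothetical: strong evidence for both DR and ERM

multiclass_probs = softmax(logits)         # sums to 1, so DR and ERM compete for probability
multilabel_probs = sigmoid(logits)         # each label is scored independently

multiclass_pick = labels[int(np.argmax(multiclass_probs))]
multilabel_picks = [name for name, p in zip(labels, multilabel_probs) if p >= 0.5]
```

With softmax, the two coexisting diseases split the probability mass and only one can be reported, whereas the independent sigmoid outputs allow both to be flagged.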
The data sets and the model’s performance
Our model was trained and tested on real clinical data sets, an important feature of the study that mimics real screening scenarios as closely as possible at this early stage of development. To assure the accuracy, diversity and reliability of the data sets, we used CFPs from real-life data sets from three different clinical centres that were annotated by 17 experienced ophthalmologists. The amount of work involved in annotating the images was formidable, and this data set was much larger than that of a previous study on multidisease classification, which used only 279 images.27 To our knowledge, this is also the largest multidisease recognition data set thus far.
Considering that the future application scenario of the model is screening, especially in lower level medical institutions where conditions may be more complex and subject to more interference, we provided two external test sets from a tertiary medical centre and primary hospitals, respectively. The results showed that the disease distribution in the primary hospitals differed from that of the tertiary hospital. For example, the proportion of dry AMD and possible GON was much higher. Even so, the results showed that the DLS could perform well in both scenarios, which supports the feasibility of large-scale screening in future work.
Note that for glaucoma detection, the sensitivity of our DLS varies: it is 0.913, 0.797 and 0.646 on REFUGE, the external test set B and the external test set A, respectively. We attribute this variation to the distinct sources of the three test sets. REFUGE, as a public benchmark data set, tends to include images of less ambiguity to ensure the reliability of its ground truth. Indeed, we observed that images from this data set are typical with respect to glaucoma. Recall that the external test sets B and A were collected from primary hospitals and tertiary hospitals, respectively. Given the common practice of a referral medical system, where cases that are less typical and thus more difficult to diagnose are referred from a primary hospital to a tertiary hospital, it is fair to claim that images from A were the most challenging. The increasing difficulty in glaucoma diagnosis from REFUGE to the test set B and to the test set A explains the decreasing sensitivity of the DLS to detect this condition.
The interpretation of the heat maps
The ‘black box’ problem of DLS has greatly limited its application and acceptance in real clinical practice. In this study, we used heat maps for visualisation. As the heat maps indicated, the features extracted by the model for prediction are very similar to those that human doctors consider. Taking referable DR as an example (figure 2B), the model precisely extracted the appropriate retinal lesions (intraretinal and preretinal haemorrhages) and provided a correct prediction. The heat maps are also helpful for understanding the false results. For example, the heat map indicated that in a false-negative case of possible GON (figure 3 A2), the model paid almost no attention to the optic disc and failed to give the correct answer. The DLS presented a limited performance on the detection of specific diseases such as possible GON. To further interpret the results, we tested the model on the publicly available REFUGE data set and found that our DLS presented a performance comparable with that of some of the other specialised GON detecting models. We attribute the variation in sensitivity across data sets to their distinct sources. REFUGE, as a public benchmark data set, tends to include images of less ambiguity to ensure the reliability of its ground truth. Indeed, we observed that images from this data set are typical with respect to glaucoma. Recall that the external test sets B and A were collected from primary hospitals and tertiary hospitals, respectively. Given the common practice of a referral medical system, where cases that are less typical and thus more difficult to diagnose are referred from a primary hospital to a tertiary hospital, it is fair to claim that images from A were the most challenging.
Limitations and future works
Our work has some limitations. First, although we made considerable efforts to expand our external test sets, the testing sample sizes for MH and RD, which are 19 and 15 in total, remain relatively small compared with the other conditions. To improve the reliability of the detection performance for these two diseases, more test samples need to be collected for future exploration. Second, the external evaluation on a clinical data set collected from a tertiary hospital (external test set A) shows that our DLS detects glaucoma with a relatively lower sensitivity of 0.646. Given that glaucoma is a major blinding disease, much work remains to be done before real-world deployment. Third, some diseases included in this study, such as RP and RD, originate in the peripheral retina, but most of the images we used for analysis were centred on the fovea with a maximal field of 55 degrees. Therefore, the detection of these diseases may be limited. As ultrawide field fundus cameras come into common use, DLS models for this kind of image will be of high research value. Finally, future prospective trials are needed to assess the DLS in multiple independent real clinical scenarios.
The proposed DLS performed well on the three test sets for the detection of normal fundus as well as 12 major fundus diseases. The application of this model may alleviate the workload of trained specialists and provide an efficient, low-cost approach for preliminary screening in places with scarce medical resources and few ophthalmologists. Further acquisition of data to broaden the extent of screening to more fundus diseases will be the next step of our work.
The authors thank Di Gong, Hong Du, Ning Chen, Dongmei Huo, Nan Chen, Hongling Chen, Donghui Li, Meiyan Zhu, Yanting Wang, Xiao Chen, Hui Liu, Huan Chen and Tong Zhao for their valuable contribution to this research. They devoted considerable time and effort to this work during the process of online annotation that lasted for more than 8 months.
WY and YC contributed equally.
Contributors BL contributed to the statistical analysis, drafting and revising of the manuscript. HC, BZ and MY contributed to the standard operating procedure and quality control of the datasets. XJ, BL, JX and WG contributed to the acquisition of the colour fundus photographs of the datasets. DCSW contributed to the revision of the manuscript. XH and HW contributed to the development of the models, statistical analysis and preparation of the figures for the work. XL and DD contributed to the development of the models, interpretation of data and revision of the manuscript for this study. YC and WY contributed to the conception and design of the work, revision of the manuscript and final approval of the version to be published.
Funding CAMS Initiative for Innovative Medicine (CAMS-12M)(2018-I2M-AI-001). Pharmaceutical collaborative innovation research project of Beijing Science and Technology Commission (Z191100007719002). Beijing Natural Science Foundation Haidian original innovation joint fund (19L2062). Natural Science Foundation of Beijing Municipality 4202033. The priming scientific research foundation for the junior researcher in Beijing Tongren Hospital, Capital Medical University (2018-YJJ-ZZL-052).
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.