Aims To investigate the efficacy of a bi-modality deep convolutional neural network (DCNN) framework to categorise age-related macular degeneration (AMD) and polypoidal choroidal vasculopathy (PCV) from colour fundus images and optical coherence tomography (OCT) images.
Methods A retrospective cross-sectional study was conducted of patients with AMD or PCV who attended Peking Union Medical College Hospital. Diagnoses of all patients were confirmed by two retinal experts based on the diagnostic gold standard for AMD and PCV. Patients with concurrent retinal vascular diseases were excluded. Colour fundus images and spectral domain OCT images were taken from dilated eyes of patients and healthy controls, and anonymised. All images were pre-labelled as normal, dry AMD, wet AMD or PCV. ResNet-50 models were used as the backbone, and alternative machine learning models including random forest classifiers were constructed for further comparison. For human-machine comparison, the same testing data set was diagnosed by three retinal experts independently. All images from the same participant were assigned to a single partition subset.
Results On a test set of 143 fundus and OCT image pairs from 80 eyes (20 eyes per group), the bi-modal DCNN demonstrated the best performance, with accuracy 87.4%, sensitivity 88.8% and specificity 95.6%, and almost perfect agreement with the diagnostic gold standard (Cohen’s κ 0.828), slightly exceeding the best expert (Human1, Cohen’s κ 0.810). For recognising PCV, the model outperformed the best expert as well.
Conclusion A bi-modal DCNN for automated classification of AMD and PCV is accurate and promising in the realm of public health.
Age-related macular degeneration (AMD) is a disease affecting the macular area of the retina and causing severe visual impairment.1 In the early stage, it is characterised by drusen and basal laminar deposits between the retinal pigment epithelium (RPE) and Bruch’s membrane.2 Patients with early AMD are usually asymptomatic or present with mild central vision distortion. Late AMD includes a neovascular form and an atrophic form. In the neovascular form, choroidal neovascularisation (CNV) is further classified into three subtypes, and central vision loss can deteriorate rapidly.1 Atrophic AMD progresses more steadily and is less urgent.3 Clinical diagnosis of AMD is based on auxiliary examinations such as colour fundus photographs, spectral-domain optical coherence tomography (SD-OCT) and OCT angiography with higher resolution.1
Polypoidal choroidal vasculopathy (PCV) is a subtype of Type 1 CNV, in which the neovascularisation grows below the RPE.4 PCV is characterised by recurrent exudation, haemorrhages and aneurysmal dilations.5 The prevalence of PCV among patients with AMD is much higher in African and Asian populations (22% to 62%) than in European populations (8% to 13%).6 In both the Japanese Study Group Guidelines and the EVEREST criteria for PCV diagnosis, colour fundus photographs and indocyanine green angiography (ICGA) are essential for lesion detection.7 ICG is a dye with an absorption and reflection spectrum different from that of the RPE, which facilitates better visualisation of the choroid and related haemorrhage.8 It has a fast first-pass effect in the liver and is metabolised through the kidney.8 ICGA is invasive, and severe adverse effects such as urticaria and anaphylactic reactions have been reported.9 This study aimed to develop an artificial intelligence system to assist the non-invasive diagnosis of PCV when ICGA is not applicable.
Currently available clinical trials strongly support initial combination therapy of photodynamic therapy and anti-vascular endothelial growth factor (anti-VEGF) therapy in some patients with PCV, which can significantly prolong the interval before retreatment.10 By contrast, anti-VEGF therapy, the effective strategy for neovascular AMD, has weaker evidence of efficacy in PCV.4 Because the subtype of AMD can affect the clinician’s choice of therapy, it is necessary to make an accurate diagnosis and to differentiate PCV from other neovascular AMDs.
Machine learning has been widely used in medical imaging and computer-aided diagnosis (CAD).11 Its subfield, deep learning, has shifted the learning process from handcrafted feature input, which relies on object segmentation and feature extraction, to direct image input at the pixel level. This preserves all pixel information and avoids errors introduced by faulty feature extraction.12 Deep convolutional neural networks (DCNN) have been widely applied to the analysis of AMD.13 14 Burlina et al first applied a DCNN to automated two-class grading of AMD (disease-free/early stage vs referable intermediate/advanced stage) using colour fundus images,15 and then refined the system to perform a four-step classification first before fusing the results into the final two classes.16 Besides AMD grading, detection of specific lesions and subtypes of AMD has been studied.17–20 Examples include DCNNs for automated discrimination of exudative AMD from healthy controls using cross-sectional SD-OCT images,19 detection of geographic atrophy against both healthy eyes and other retinal diseases in fundus autofluorescence (FAF) images,20 and recognition of the foveal centre by pixel-wise classification in OCT scans from patients with AMD at different severity stages.18 Technical refinements have broadly expanded the capability of artificial intelligence (AI), as demonstrated by previous studies of retinal layer segmentation,21 and of detection and segmentation of pigment epithelium detachment, subretinal fluid and intraretinal fluid in OCT B-scans of AMD and other retinal diseases.22 DCNNs have further been applied to decision-making23 and to evaluating the indication for anti-VEGF treatment in AMD.24 However, all of these studies involved only one examination method, either colour fundus photographs or SD-OCT.15 25–28 In addition, none of these deep learning networks for AMD classification identifies PCV.
The present study therefore combined these two examination tools to aid the diagnosis of PCV and AMD.
Given a colour fundus image and an OCT image from a specific eye, the proposed deep learning model aims to recognise which type of AMD is present in the eye. The model therefore takes the fundus image and the OCT image simultaneously as a bi-modal input. The output of the model has four classes: wet AMD (excluding PCV), dry AMD (atrophic AMD and early stage AMD), PCV and normal. The primary goal of this study is to analyse and compare the performance of DCNNs with different inputs: fundus images alone, SD-OCT images alone or their combination. The secondary goal is to compare the performance of the models with that of ophthalmology specialists for PCV categorisation.
This was a retrospective cross-sectional study of patients from the outpatient clinic of the Peking Union Medical College Hospital (PUMCH) from January 2013 to July 2018. The inclusion criteria were a diagnosis of AMD or PCV and the availability of colour fundus photograph, SD-OCT and ICGA images. Exclusion criteria were concurrent retinal vascular diseases including diabetic retinopathy, glaucoma and posterior staphylomas, which may confound the classification. Images of normal eyes for the control group, including both fundus photographs and SD-OCT images, were collected from the Medical Examination Center of PUMCH.
All patients were first evaluated by dilated fundus examinations according to the Wisconsin grading consortium (online supplementary table 1).29 Early AMD and late atrophic AMD were grouped as dry AMD. In late exudative AMD, PCV was confirmed based on the EVEREST diagnosis criteria (online supplementary table 1), with focal hyperfluorescent lesions before 6 min on ICGA and at least one of the four evidences from ICGA or orange subretinal nodule on colour photography, or associated massive submacular haemorrhage.7 The remaining eyes were classified as wet AMD. Example images for each class are shown in figure 1. All diagnoses were confirmed by two ophthalmology specialists independently, with disagreements adjudicated by a senior specialist. These predefined groups served as the gold standard for evaluating the machine learning algorithms.
Colour fundus images from dilated pupils were acquired with a Topcon device (Topcon Corporation, Tokyo, Japan) at 45°, covering the whole macula and optic disc. One colour fundus image per eye was downloaded. One or more horizontal cross-sectional B-scan images of the diagnostically significant lesion sites in the macular region were manually selected and obtained with Topcon and Heidelberg (SPECTRALIS, Heidelberg Engineering, Heidelberg, Germany) devices. All images were downloaded in a standard JPEG format according to the manufacturer’s instructions. They were preprocessed to remove any diagnostic information that might reveal personal information, as well as artefacts such as cross-section lines and scales, which may interfere with model training.
DCNNs have been the most successful approach to medical image categorisation.15 26 30 A typical DCNN architecture consists of two building blocks, a convolutional block and a task block. The convolutional block, mainly consisting of convolutional layers and pooling layers, is used to extract multilevel discriminative features from a raw input. In the task block, these features are passed through multiple fully connected layers and converted into the desired output for a specific task such as classification or regression.
A bi-modal DCNN (Model DCNN-Combo) was constructed. It is simultaneously fed with a fundus image and an OCT image (both with a size of 3×448×448, detailed resizing strategy in online supplementary figure 1) and gives a diagnostic result accordingly. Figure 2 depicts the schematic diagram of Model DCNN-Combo: a fundus branch and an OCT branch extract a 2048-dim feature vector from each modality in parallel. The two vectors are concatenated to form a combined 4096-dim vector. The combined feature is fed into a fully connected layer followed by a softmax layer. The output of the softmax layer is a 4-dim vector, corresponding to the probability scores of the four categories. Detailed parameters are shown in online supplementary table 2. For comparison, we train two uni-modal DCNNs on fundus images (DCNN-F) and OCT images (DCNN-O), respectively. The architecture of the uni-modal models is based on ResNet-50.
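The fusion step described above (two 2048-dim branch features, concatenation to 4096 dimensions, one fully connected layer, softmax over four classes) can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the random vectors merely stand in for ResNet-50 branch outputs, the weights are untrained placeholders, and the function name `fuse_and_classify` and the class order are assumptions made for illustration.

```python
import math
import random

CLASSES = ["wet AMD", "dry AMD", "PCV", "normal"]  # class order is an assumption
FEATURE_DIM = 2048  # per-branch feature size stated in the text


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(fundus_feat, oct_feat, weights, biases):
    """Concatenate both branch features, apply one fully connected layer, then softmax."""
    combined = fundus_feat + oct_feat  # list concatenation gives a 4096-dim vector
    logits = [
        sum(w * x for w, x in zip(row, combined)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Placeholder inputs: random numbers stand in for the two ResNet-50 branch outputs.
rng = random.Random(0)
fundus_feat = [rng.random() for _ in range(FEATURE_DIM)]
oct_feat = [rng.random() for _ in range(FEATURE_DIM)]
weights = [[rng.gauss(0.0, 0.01) for _ in range(2 * FEATURE_DIM)] for _ in CLASSES]
biases = [0.0] * len(CLASSES)

scores = fuse_and_classify(fundus_feat, oct_feat, weights, biases)
predicted = CLASSES[scores.index(max(scores))]
print(dict(zip(CLASSES, scores)), predicted)
```

Because the weights are random, the predicted class here carries no meaning; the sketch only shows how the 4-dim probability vector arises from the combined feature.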
Deep transfer learning was implemented for better performance. First, weights from a ResNet-50 pretrained on ImageNet are transferred to two uni-modal models that take fundus images and OCT images as input, respectively. These uni-modal models are then fine-tuned on our fundus and OCT data sets, separately. All layers of the uni-modal models are updated during the fine-tuning procedure. Second, weights of the uni-modal models are transferred to the corresponding branches in DCNN-Combo. We then train only the task block of DCNN-Combo, with its convolutional blocks fixed.
To evaluate the impact of the task block, we substitute a random forest classifier for the softmax layer in the bi-modal DCNN, designated as RF-Combo. In addition, we re-implemented the bi-modal method (the baseline model RF-Fixed) described in Yoo et al,30 where a VGGNet pretrained on ImageNet is used for feature extraction and a random forest classifier for classification. For a fair comparison, the VGGNet was replaced with the pretrained ResNet-50, which is well recognised to be better for visual recognition.
For DCNN training, we use cross-entropy, a standard loss function for multiclass classification. The loss is minimised by stochastic gradient descent (momentum=0.9, weight decay=1e-4). The model that obtains the best F1 score on the validation set is selected. Our DCNN models were implemented in the PyTorch (V.1.0.0) framework.
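As a rough illustration of the training objective, the sketch below computes the cross-entropy loss for one sample and applies a single stochastic gradient descent update with the stated momentum (0.9) and weight decay (1e-4). The update rule shown is one common formulation, in which weight decay is added to the gradient before the momentum buffer is updated. The learning rate is not given in the text, so the value used here is purely an assumption.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a single true class index."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


def sgd_step(params, grads, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One in-place SGD update with L2 weight decay and classical momentum."""
    for i, (p, g) in enumerate(zip(params, grads)):
        g = g + weight_decay * p                   # weight decay folded into the gradient
        velocity[i] = momentum * velocity[i] + g   # momentum buffer
        params[i] = p - lr * velocity[i]
    return params


# Four equal logits mean a uniform prediction, so the loss equals ln(4).
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)

# One update on a single toy parameter; lr=0.1 is an assumed value, not from the paper.
params, grads, velocity = [1.0], [0.5], [0.0]
sgd_step(params, grads, velocity, lr=0.1)
print(round(loss, 4), params)
```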
According to the selection criteria mentioned above, a total of 1099 eyes (195 healthy, 107 dry AMD, 301 PCV, 496 wet AMD) with fundus images (online supplementary table 3) and 821 eyes (195 healthy, 62 dry AMD, 197 PCV, 367 wet AMD) with OCT images (online supplementary table 4) were enrolled in this study. A fundus image and an OCT scan from the same eye are paired and labelled by diagnosis. The performance metrics are calculated based on the prediction of image pairs rather than eyes. Because some eyes have more than one OCT image, the 20 eyes randomly selected from each class yielded 143 image pairs for the test set. The same strategy was applied to the validation set, yielding 137 pairs. The rest were assigned to the training set (85% of fundus data, 81% of OCT data). All images of a given participant were assigned to a single partition subset.31
To compare the classification performance of the models with that of humans, the same test set was assigned to three retinal specialists, who independently categorised each image pair into one of the four classes. The performance of both the specialists and our AI system was evaluated against the predefined gold standard classifications.
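The constraint that all images of one participant stay in a single subset amounts to splitting by participant rather than by image. The following sketch shows one way to do this on toy records. The record fields (`participant`, `label`, `pair_id`) and the simplifying assumption that each participant carries a single label are hypothetical, and the subset sizes are scaled down from the 20 eyes per class used in the study.

```python
import random
from collections import defaultdict


def split_by_participant(records, n_test, n_val, seed=0):
    """Assign whole participants to test, val or train, drawing n_test and n_val per class."""
    by_class = defaultdict(set)
    for r in records:
        by_class[r["label"]].add(r["participant"])

    rng = random.Random(seed)
    assignment = {}
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        for k, pid in enumerate(ids):
            subset = "test" if k < n_test else "val" if k < n_test + n_val else "train"
            assignment.setdefault(pid, subset)  # a participant keeps its first assignment

    splits = {"train": [], "val": [], "test": []}
    for r in records:
        splits[assignment[r["participant"]]].append(r)
    return splits


# Toy data: 4 classes, 6 participants each, with 1 or 2 image pairs per participant.
records = [
    {"participant": f"{label}-{p}", "label": label, "pair_id": f"{label}-{p}-{k}"}
    for label in ("wet AMD", "dry AMD", "PCV", "normal")
    for p in range(6)
    for k in range(1 + p % 2)
]
splits = split_by_participant(records, n_test=2, n_val=1)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting at the participant level means the number of image pairs per subset varies with how many scans each selected participant has, which is consistent with 20 eyes per class yielding 143 test pairs in the study.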
Our performance metrics included accuracy, defined as the proportion of correctly classified cases among the total number of cases tested. Sensitivity and specificity were calculated by dividing the number of true positive and true negative classifications by the number of actual positive and actual negative cases, respectively.26 Similarly, positive predictive value (PPV) and negative predictive value (NPV) used the number of all test positives and all test negatives, respectively, as the denominator. Cohen’s kappa coefficient (κ) was used to measure agreement on categorical items between models and the gold standard.32 The F1 score is another measure of a test’s accuracy, defined in binary classification as the harmonic mean of precision (ie, PPV) and recall (ie, sensitivity). For this multiclass system, we used the weighted average of the per-class F1 scores as the overall F1 score.33 Different classifiers trade off specificity against sensitivity. Therefore, receiver operating characteristic (ROC) curves were used to evaluate the performance of different models in distinguishing PCV from the other classes. For human-machine comparison, the performance of each retinal specialist was shown as an operating point on the ROC curve plot.15 26
The experiments used two single-modality models (first group), DCNN-F and DCNN-O. Multi-modality deep learning models (second group) include DCNN-Combo, RF-Combo and RF-Fixed. The overall multi-classification performance levels for those algorithms and three retinal experts (third group) are presented in table 1. Performance levels of categorising PCV from the best models/expert in each group are further analysed in table 2 (see the confusion matrix in online supplementary figure 2).
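The metrics defined above can all be derived from a confusion matrix. The sketch below computes accuracy, Cohen's κ, per-class sensitivity, specificity, PPV and NPV, and the support-weighted F1 score. The 4×4 matrix contains made-up counts for illustration only and is not the study's data; the helper names are assumptions.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarise(cm):
    """Metrics from a square confusion matrix indexed as cm[true][predicted]."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(cm[i]) for i in range(n)]                       # actual counts per class
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predicted counts per class

    accuracy = ratio(sum(cm[k][k] for k in range(n)), total)
    chance = sum(row_sums[k] * col_sums[k] for k in range(n)) / total ** 2
    kappa = ratio(accuracy - chance, 1 - chance)

    per_class, weighted_f1 = [], 0.0
    for k in range(n):
        tp = cm[k][k]
        fn = row_sums[k] - tp
        fp = col_sums[k] - tp
        tn = total - tp - fn - fp
        sens, spec = ratio(tp, tp + fn), ratio(tn, tn + fp)
        ppv, npv = ratio(tp, tp + fp), ratio(tn, tn + fn)
        f1 = ratio(2 * ppv * sens, ppv + sens)
        weighted_f1 += f1 * row_sums[k] / total  # weight by the number of true cases
        per_class.append(
            {"sensitivity": sens, "specificity": spec, "ppv": ppv, "npv": npv, "f1": f1}
        )
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "weighted_f1": weighted_f1,
        "per_class": per_class,
    }


# Made-up counts for illustration only; rows are true classes, columns are predictions.
toy_cm = [
    [8, 1, 1, 0],
    [1, 9, 0, 0],
    [2, 0, 8, 0],
    [0, 0, 0, 10],
]
result = summarise(toy_cm)
print(result["accuracy"], round(result["kappa"], 3), round(result["weighted_f1"], 3))
```

Note that sensitivity divides by the actual positives (TP + FN) and specificity by the actual negatives (TN + FP), whereas PPV and NPV divide by the predicted positives and predicted negatives.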
As seen in table 1, the DCNN-Combo model showed the best performance among the three groups, with the highest accuracy (87.4%), sensitivity (88.8%) and specificity (95.6%) and almost perfect agreement with the diagnostic gold standard (κ=0.828), slightly exceeding the best retinal expert (Human1, κ=0.810). Its final F1 score was 0.878. The RF-Fixed model showed the lowest accuracy (61.5%) and only moderate agreement with the gold standard (κ=0.472). The training loss and validation performance of the two uni-modal models and the bi-modal DCNN are shown in online supplementary figure 3 and online supplementary table 5. The bi-modal DCNN converges fast with the pretrained parameters from the uni-modal models.
In the single-modality group, DCNN-O (accuracy 83.2%, κ score 0.771) clearly outperformed DCNN-F (accuracy 75%, κ score 0.667) (table 1). The sensitivity of DCNN-O was also more than 10 percentage points higher than that of DCNN-F (86.0% and 75.0%, respectively). In the multimodality group, DCNN-Combo surpassed the other two models, RF-Combo and RF-Fixed (accuracy=87.4%, 80.4% and 61.5%, κ=0.828, 0.734 and 0.472, respectively).
Moderate differences in performance levels were noted among the three retinal specialists. Their accuracy ranged from 81.1% to 86.0%, with an average of 84.1%. All experts reached substantial to almost perfect agreement with the gold standard (κ=0.810, 0.799 and 0.741) (table 1).
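The verbal labels attached to κ values in this section ("moderate", "substantial", "almost perfect") match the widely used Landis and Koch bands. The text does not name the scale it follows, so the mapping below is an assumption, shown only to make the thresholds explicit.

```python
def agreement_label(kappa):
    """Map a Cohen's kappa value to a Landis and Koch style verbal label (assumed scale)."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Kappa values reported in the text for the models and the three specialists.
for value in (0.828, 0.810, 0.799, 0.741, 0.472):
    print(value, agreement_label(value))
```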
The performance of all machine models in categorising PCV is presented as ROC curves with areas under the curve (AUC) (figure 3), on which the operating points of the three retinal specialists are marked as well. In aggregate, the performance results were comparable between our deep learning models and the retinal specialists. The AUC of DCNN-Combo was 93.9%, followed by DCNN-O (AUC 93.1%) and RF-Combo (AUC 92.6%).
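To make the ROC comparison concrete, the sketch below treats PCV as the positive class and everything else as negative. It computes the AUC from model scores using the ranking interpretation (the probability that a random PCV case scores higher than a random non-PCV case) and derives a single operating point from a reader's hard yes/no decisions. All scores, labels and decisions are invented toy values, and the function names are assumptions.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def operating_point(decisions, labels):
    """(false positive rate, true positive rate) for hard yes/no decisions."""
    tp = sum(1 for d, y in zip(decisions, labels) if d == 1 and y == 1)
    fp = sum(1 for d, y in zip(decisions, labels) if d == 1 and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos


# Toy values: 1 marks PCV, 0 marks any other class.
labels = [1, 1, 1, 0, 0, 0]
pcv_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]  # model probability assigned to PCV
expert_calls = [1, 1, 0, 0, 0, 1]             # one reader's hard decisions

auc = auc_by_ranking(pcv_scores, labels)
fpr, tpr = operating_point(expert_calls, labels)
print(round(auc, 3), (round(fpr, 3), round(tpr, 3)))
```

A model yields a whole curve because its threshold can be varied, while a human reader yields one point; a reader whose point lies below the model's curve is outperformed at that trade-off.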
Table 2 shows metrics for the best model/expert in each group in categorising PCV against the other classes. Our machine algorithm DCNN-Combo (κ=0.841) achieved almost perfect agreement with the gold standard, whereas the best human expert achieved substantial agreement (κ=0.786).
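Evaluating PCV against all other classes corresponds to collapsing the four-class confusion matrix into a 2×2 matrix before computing agreement. A minimal sketch, using made-up counts and an assumed class order in which index 2 is PCV:

```python
def collapse_to_binary(cm, target):
    """Collapse cm[true][predicted] into [[TP, FN], [FP, TN]] for one target class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[target][target]
    fn = sum(cm[target]) - tp
    fp = sum(cm[i][target] for i in range(n)) - tp
    tn = total - tp - fn - fp
    return [[tp, fn], [fp, tn]]


def cohen_kappa(cm):
    """Cohen's kappa from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[k][k] for k in range(n)) / total
    chance = sum(
        sum(cm[k]) * sum(cm[i][k] for i in range(n)) for k in range(n)
    ) / total ** 2
    return (observed - chance) / (1 - chance)


# Made-up four-class counts for illustration; index 2 is assumed to be PCV.
toy_cm = [
    [8, 1, 1, 0],
    [1, 9, 0, 0],
    [2, 0, 8, 0],
    [0, 0, 0, 10],
]
binary_cm = collapse_to_binary(toy_cm, target=2)
print(binary_cm, round(cohen_kappa(binary_cm), 3))
```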
In this study, we described a multimodality deep learning model for the multiclass classification of AMD. To our knowledge, this was the first deep convolutional neural network to employ paired images from colour fundus and SD-OCT devices. Early work on multimodality models focussed on other deep learning approaches based on the restricted Boltzmann machine and the deep belief network instead of the CNN.30 By applying transfer learning, we efficiently acquired updated weights of the convolutional layers from each single-modality DCNN model to construct the multimodality model. This AI system achieved the highest performance among all models and reached higher agreement with the gold standard than the human experts. Moreover, our experiments comparing a random forest classifier with a fully connected layer classifier further supported the advantage of the fully connected task block within the DCNN framework.
Similar to the strategy implemented in a previous study of AMD classification by Burlina et al,25 we first performed the aforementioned four-group classification and then fused the three groups other than PCV, to compare how well the models categorise PCV. Furthermore, to compare the performance of the models with that of humans, the same test set was assigned to retinal specialists as well, similar to previous AI model studies.15 16 To perform the human-machine comparison under the same test conditions, no ICGA images of the testing data set were provided to the retinal specialists either. This simulated the realistic situation in which the invasive ICGA examination is unavailable, although the retinal specialists’ judgement would be affected by the lack of ICGA evidence. Nevertheless, the results of the human-machine comparison could provide a meaningful assessment of the potential future clinical value of our models. Only the performance of DCNN-Combo was comparable with that of the retinal specialists. This suggests that although OCT imaging can provide more information about the retina in three-dimensional space, it still cannot be used alone or take the place of colour fundus photography in diagnosing PCV.10 This is to date the first multimodality AI system designed for differentiating PCV and AMD. It showed considerable potential for clinical translation. Its safety and non-invasiveness enable it to be widely applied in community medical institutions.
The principal limitation of our study is limited data, a common situation in single-centre research. Nevertheless, we collected over 2000 images in total, which is a relatively large quantity for a study involving PCV, and achieved almost perfect agreement with the diagnostic gold standard. According to a previous study by Zhang et al published in Cell in 2018, a transfer learning-based model can achieve similar performance with a limited data set (n=4000, accuracy=93.4%) as with a large data set (n=207 130, accuracy=96.6%).26 Another limitation is the class imbalance between the dry AMD group and the other three groups, which is inevitable because OCT imaging is often clinically unnecessary for dry AMD. Nevertheless, owing to the prominent image features of dry AMD, our final test result still showed excellent performance in categorising dry AMD. A third limitation is that different SD-OCT devices were used for healthy and diseased eyes, although all images from these devices were preprocessed to the same file format and size before training. In the future, we will focus on collecting more data from different examination devices and different retinal diseases.
This study proposed a novel bi-modality deep convolutional neural network framework for classification of PCV and AMD. The high accuracy achieved by training on more than 2000 standardised clinical images further suggests that a transfer learning approach allows an AI system to learn effectively from a limited repository of medical data. Moreover, this work could accelerate the development of multimodality AI systems and of more efficient screening systems for different subtypes of AMD in community medical institutions, promoting the wide application of clinical translation and medical engineering in the realm of public health.
Contributors ZX, WY and YC designed the experiments. JY, ZX, WY and YC collected and labelled samples. WW, JZ, DD and XL created and trained the AI models. ZY, DC and FH performed the human-machine comparison test. ZX, WW and XL contributed to the interpretation of the results. ZX and WW wrote the manuscript. WY, XL and YC guided revision of the manuscript.
Funding This work was supported by The Non-profit Central Research Institute Fund of Chinese Academy of Medical Sciences grant number 2018PT32029, CAMS Initiative for Innovative Medicine grant number 2018-I2M-AI-001, Pharmaceutical collaborative innovation research project of Beijing Science and Technology Commission grant number Z191100007719002, National Natural Science Foundation of China grant number 61672523, Beijing Natural Science Foundation Haidian original innovation joint fund grant number 19L2062 and Beijing Natural Science Foundation grant number 4202033.
Competing interests None declared.
Patient consent for publication Not required.
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement Data are available upon reasonable request. The data sets generated during and/or analysed during the current study are not publicly available due to the issue of copyright but are available from the corresponding author on reasonable request.