Artificial intelligence (AI) based on deep learning (DL) has sparked tremendous global interest in recent years. DL has been widely adopted in image recognition, speech recognition and natural language processing, but is only beginning to impact on healthcare. In ophthalmology, DL has been applied to fundus photographs, optical coherence tomography and visual fields, achieving robust classification performance in the detection of diabetic retinopathy and retinopathy of prematurity, the glaucoma-like disc, macular oedema and age-related macular degeneration. DL in ocular imaging may be used in conjunction with telemedicine as a possible solution to screen, diagnose and monitor major eye diseases for patients in primary care and community settings. Nonetheless, there are also potential challenges with DL application in ophthalmology, including clinical and technical challenges, explainability of the algorithm results, medicolegal issues, and physician and patient acceptance of the AI ‘black-box’ algorithms. DL could potentially revolutionise how ophthalmology is practised in the future. This review provides a summary of the state-of-the-art DL systems described for ophthalmic applications, potential challenges in clinical deployment and the path forward.
Artificial intelligence (AI) is the fourth industrial revolution in mankind’s history.1 Deep learning (DL) is a class of state-of-the-art machine learning techniques that has sparked tremendous global interest in the last few years.2 DL uses representation-learning methods with multiple levels of abstraction to process input data without the need for manual feature engineering, automatically recognising the intricate structures in high-dimensional data through projection onto a lower dimensional manifold.2 Compared with conventional techniques, DL has been shown to achieve significantly higher accuracies in many domains, including natural language processing, computer vision3–5 and voice recognition.6
In medicine and healthcare, DL has been primarily applied to medical imaging analysis, in which DL systems have shown robust diagnostic performance in detecting various medical conditions, including tuberculosis from chest X-rays,7 8 malignant melanoma on skin photographs9 and lymph node metastases secondary to breast cancer from tissue sections.10 DL has similarly been applied to ocular imaging, principally fundus photographs and optical coherence tomography (OCT). Major ophthalmic diseases which DL techniques have been used for include diabetic retinopathy (DR),11–15 glaucoma,11 16 age-related macular degeneration (AMD)11 17 18 and retinopathy of prematurity (ROP).19 DL has also been applied to estimate refractive error and cardiovascular risk factors (eg, age, blood pressure, smoking status and body mass index).20 21
A primary benefit of DL in ophthalmology could be in screening, such as for DR and ROP, for which well-established guidelines exist. Other conditions, such as glaucoma and AMD, may also require screening and long-term follow-up. However, screening requires tremendous manpower and financial resources from healthcare systems, in both developed countries and in low-income and middle-income countries. The use of DL, coupled with telemedicine, may be a long-term solution to screen and monitor patients within primary eye care settings. This review summarises new DL systems for ophthalmology applications, potential challenges in clinical deployment and potential paths forward.
DL applications in ophthalmology
Diabetic retinopathy
Globally, 600 million people will have diabetes by 2040, with a third having DR.22 A pooled analysis of 22 896 people with diabetes from 35 population-based studies in the USA, Australia, Europe and Asia (between 1980 and 2008) showed that the overall prevalence of any DR (in type 1 and type 2 diabetes) was 34.6%, with 7% vision-threatening diabetic retinopathy.22 Screening for DR, coupled with timely referral and treatment, is a universally accepted strategy for blindness prevention. DR screening can be performed by different healthcare professionals, including ophthalmologists, optometrists, general practitioners, screening technicians and clinical photographers. The screening methods comprise direct ophthalmoscopy,23 dilated slit lamp biomicroscopy with a hand-held lens (90 D or 78 D),24 mydriatic or non-mydriatic retinal photography,23 teleretinal screening,25 and retinal video recording.26 Nonetheless, DR screening programmes are challenged by issues related to implementation, availability of human assessors and long-term financial sustainability.27
Over the past few years, DL has revolutionised diagnostic performance in the detection of DR.2 Using this technique, many groups have shown excellent diagnostic performance (table 1).14 Abràmoff et al 14 showed that a DL system was able to achieve an area under the receiver operating characteristic curve (AUC) of 0.980, with sensitivity and specificity of 96.8% and 87.0%, respectively, in the detection of referable DR (defined as moderate non-proliferative DR or worse, including diabetic macular oedema (DMO)) on the Messidor-2 data set. Similarly, Gargeya and Leng15 reported an AUC of 0.97 using cross-validation on the same data set, and 0.94 and 0.95 in two independent test sets (Messidor-2 and E-Ophtha).
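To make the metrics quoted above concrete, the following minimal Python sketch computes sensitivity, specificity and AUC for a binary referable/non-referable task. The labels, scores and the 0.5 operating threshold are invented for illustration and are not taken from any of the cited studies. The AUC is computed here as the proportion of positive-negative pairs that the score ranks correctly, with ties counted as one half.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) at a given operating threshold.

    labels: 1 = referable disease, 0 = non-referable (hypothetical encoding).
    scores: model outputs in [0, 1]; a score >= threshold is called positive.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: 4 eyes with referable DR, 6 without.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.15, 0.10, 0.05]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
area = auc(labels, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={area:.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarises ranking performance across all thresholds. Results from different studies are therefore comparable only when the operating point and the reference standard are stated.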
More recently, Gulshan and colleagues12 from Google AI Healthcare reported another DL system with excellent diagnostic performance. The DL system was developed using 128 175 retinal images, graded between 3 and 7 times for DR and DMO by a panel of 54 US licensed ophthalmologists and ophthalmology residents between May and December 2015. The test set consisted of approximately 10 000 images retrieved from two publicly available databases (EyePACS-1 and Messidor-2), graded by at least seven US board-certified ophthalmologists with high intragrader consistency. The AUC was 0.991 and 0.990 for EyePACS-1 and Messidor-2, respectively (table 1).
Although a number of groups have demonstrated good results using DL systems on publicly available data sets, the DL systems were not tested in real-world DR screening programmes. In addition, the generalisability of a DL system to populations of different ethnicities, and to retinal images captured using different cameras, remains uncertain. Ting et al 11 reported a clinically acceptable diagnostic performance of a DL system, developed and tested using the Singapore Integrated Diabetic Retinopathy Programme over a 5-year period, and 10 external data sets recruited from 6 different countries, including Singapore, China, Hong Kong, Mexico, USA and Australia. The DL system, developed using the DL architecture VGG-19, was reported to have AUC, sensitivity and specificity of 0.936, 90.5% and 91.6% in detecting referable DR. For vision-threatening DR, the corresponding statistics were 0.958, 100% and 91.1%. The AUC ranged from 0.889 to 0.983 for the 10 external data sets (n=40 752 images). More recently, the DL system developed by Abràmoff et al 28 obtained US Food and Drug Administration approval for the diagnosis of DR. It was evaluated in a prospective, although observational, setting, achieving 87.2% sensitivity and 90.7% specificity.28
Age-related macular degeneration
AMD is a major cause of vision impairment in the elderly population globally. The Age-Related Eye Disease Study (AREDS) classified AMD stages into none, early, intermediate and late AMD.29 The American Academy of Ophthalmology recommends that people with intermediate AMD should be seen at least once every 2 years. It is projected that 288 million patients may have some form of AMD by 2040,30 with approximately 10% having intermediate AMD or worse.29 With the ageing population, there is an urgent clinical need to have a robust DL system to screen these patients for further evaluation in tertiary eye care centres.
Ting et al 11 reported clinically acceptable diagnostic performance of a DL system in detecting referable AMD (table 1). Specifically, the DL system was trained and tested using 108 558 retinal images from 38 189 patients. Fovea-centred images without macula segmentation were used in this study. Given that this was a DR screening population, there were relatively few patients with referable AMD. For the other two studies,17 18 DL systems were developed using the AREDS data set, with a high number of referable AMD cases (intermediate AMD or worse). Using fivefold cross-validation, Burlina et al 17 reported a diagnostic accuracy of between 88.4% and 91.6%, with an AUC of between 0.94 and 0.96. Unlike Ting et al,11 the authors presegmented the macula region prior to training and testing, with an 80/20 split between training and testing in each fold. In terms of the DL architecture, both AlexNet and OverFeat have been used, with AlexNet yielding better performance. Using the same AREDS data set, Grassmann et al 18 reported a sensitivity of 84.2% in the detection of any AMD. In this study, the authors used six convolutional neural networks (AlexNet, GoogleNet, VGG, Inception-V3, ResNet and Inception-ResNet-V2) to train different models. Data augmentation was also used to increase the diversity of the data set and to reduce the risk of overfitting. For the AREDS data set, all the photographs were captured as analogue photographs and digitised later. Whether this affects the DL system’s performance remains uncertain. In addition, none of the three abovementioned studies reported external validation results for the individual DL systems.
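The fivefold cross-validation with an 80/20 split described above could be sketched as follows. This is an illustration only: the helper name `k_fold_splits`, the fixed random seed and the choice to split at the patient level (so that images from one patient never appear in both training and testing) are assumptions, not details reported by the cited studies.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation.

    With k=5, each fold holds out 20% of the items for testing and
    trains on the remaining 80%.
    """
    items = list(items)
    random.Random(seed).shuffle(items)          # reproducible shuffle
    folds = [items[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical patient identifiers; splitting by patient rather than by
# image is an assumption made here to avoid leakage between folds.
patient_ids = [f"patient_{n:03d}" for n in range(100)]
splits = list(k_fold_splits(patient_ids, k=5))

for fold_index, (train, test) in enumerate(splits):
    print(f"fold {fold_index}: {len(train)} training, {len(test)} testing")
```

Every patient is held out exactly once across the five folds, so the reported accuracy is an average over the whole data set. It remains an internal estimate and does not replace validation on an external data set.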
DMO, choroidal neovascularisation and other macular diseases
OCT has had a transformative effect on the management of macular diseases, specifically neovascular AMD and DMO. OCT also provides a near-microscopic view of the retina in vivo with quick acquisition protocols revealing structural detail that cannot be seen using other ophthalmic examination techniques. Thus, the number of macular OCTs has grown from 4.3 million in 2012 to 6.4 million in 2016 in the US Medicare population alone, and will most likely continue to grow worldwide.31
From a DL perspective, macular OCTs possess a number of attractive qualities as a modality for DL. First is the explosive growth in the number of macular OCTs that are routinely collected around the world. This large number of OCTs is required to train DL systems where having many training examples can aid in the convergence of many-layered networks with millions of parameters. Second, macular OCTs have dense three-dimensional structural information that is usually consistently captured. Unlike real-world images or even colour fundus photographs, the field of view of the macula and the foveal fixation is usually consistent from one volume scan to another. This lowers the complexity of the computer vision task significantly and allows networks to reach meaningful performance with smaller data sets. Third, OCTs provide structural detail that is not easily visible using conventional imaging techniques and provide an avenue for uncovering novel biomarkers of the disease.
One of the first applications of DL to macular OCTs was in automated classification of AMD. Approximately 100 000 OCT B-scans were used to train a DL classifier based on VGG-16 to achieve an AUC of 0.97 (table 2).32 A few studies used a technique known as transfer learning, where a neural network is pretrained on ImageNet and subsequently trained on OCT B-scans for retinal disease classification.33–35 Of note, these initial studies involve the use of two-dimensional DL models trained on single OCT B-scans rather than three-dimensional models trained on OCT volumes. This may be a barrier to their potential clinical applicability.
DL has also had a transformative impact on boundary and feature-level segmentation, using neural networks that have been developed for semantic segmentation such as the U-Net.36 Specifically, these networks have been trained to segment intraretinal fluid cysts and subretinal fluid on OCT B-scans.13 37 38 Deep convolutional networks surpassed traditional methods in the quality of segmentation of retinal anatomical boundaries.39–41 Similar approaches have also been used on en-face OCT angiography (OCTA) images to segment the foveal avascular zone.42
More recently, DeepMind and the Moorfields Eye Hospital have combined the power of neural networks for both segmentation and classification tasks using a novel AI framework. In this approach, a segmentation network is first used to delineate a range of 15 different retinal morphological features and OCT acquisition artefacts. The output of this network is then passed to a classification network which makes a referral triage decision from four categories (urgent, semiurgent, routine, observation) and classifies the presence of 10 different OCT pathologies (choroidal neovascularisation (CNV), macular oedema without CNV, drusen, geographic atrophy, epiretinal membrane, vitreomacular traction, full-thickness macular hole, partial thickness macular hole, central serous retinopathy and ‘normal’).43 Using this approach, the Moorfields-DeepMind system reports a performance on par with experts for these classification tasks (although in a retrospective setting). Moreover, the generation of an intermediate tissue representation by the first, segmentation network means that the framework can be generalised across OCT systems from multiple different vendors without prohibitive requirements for retraining. In the near term, this DL system will be implemented in an existing real-world clinical pathway—the rapid access ‘virtual’ clinics that are now widely used for triaging of macular disease in the UK.44 In the longer term, the system could be used in triaging patients outside the hospital setting, particularly as OCT systems are increasingly being adopted by optometrists in the community.45
Glaucoma
The global prevalence of glaucoma for people aged 40–80 is 3.4%, and by the year 2040 it is projected there will be approximately 112 million affected individuals worldwide.46 Clinicians and patients alike would welcome improvements in disease detection, assessment of progressive structural and functional damage, treatment optimisation so as to prevent visual disability, and accurate long-term prognosis.
Glaucoma is an optic nerve disease characterised by excavation and erosion of the neuroretinal rim that manifests clinically as increased optic nerve head (ONH) cupping. Yet, because the ONH area varies fivefold, there is virtually no cup to disc ratio (CDR) that defines pathological cupping, hampering disease detection.47 Li et al 16 and Ting et al 11 trained computer algorithms to detect the glaucoma-like disc, defined as a vertical CDR of 0.7 and 0.8, respectively. Investigators have also applied machine learning methods to distinguish glaucomatous nerve fibre layer damage from normal scans on wide-angle OCTs (9×12 mm).48 Future opportunities include training a neural network to identify the disc that would be associated with manifest visual field (VF) loss across the spectrum of disc size, as our current treatment strategies are aligned with slowing disease progression. Furthermore, DL could be used to detect progressive structural optic nerve changes in glaucoma.
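The effect of using different vertical CDR definitions can be illustrated with a short sketch. The function names, the pixel measurements and the use of "greater than or equal to" as the comparison are assumptions made for illustration; the cited studies trained DL systems on graded images rather than applying a rule of this kind directly.

```python
def vertical_cdr(cup_height, disc_height):
    """Vertical cup to disc ratio from two heights in the same unit."""
    if disc_height <= 0:
        raise ValueError("disc_height must be positive")
    return cup_height / disc_height


def is_glaucoma_like(cup_height, disc_height, threshold):
    """Label a disc as glaucoma-like when its vertical CDR reaches the threshold."""
    return vertical_cdr(cup_height, disc_height) >= threshold


# Hypothetical measurements in pixels from a single fundus photograph.
cup_px, disc_px = 300, 400
ratio = vertical_cdr(cup_px, disc_px)

label_at_07 = is_glaucoma_like(cup_px, disc_px, threshold=0.7)
label_at_08 = is_glaucoma_like(cup_px, disc_px, threshold=0.8)
print(ratio, label_at_07, label_at_08)
```

The same disc receives a different label under the two definitions, which is one reason why performance figures from systems trained against different reference definitions cannot be compared directly.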
In glaucoma, retinal ganglion cell axons atrophy in a confined space within the ONH and ophthalmologists typically rely on low dimensional psychophysical data to detect the functional consequences of that damage. The outputs from these tests typically provide reliability parameters, age-matched normative comparisons and summary global indices, but more detailed analysis of this functional data is lacking. Elze et al 49 developed an unsupervised computer program to analyse VF that recognises clinically relevant VF loss patterns and assigns a weighting coefficient for each of them (figure 1). This method has proven useful in the detection of early VF loss from glaucoma.50 Furthermore, a myriad of computer programs to detect VF progression exist, ranging from assessment of global indices over time to point-wise analyses, to sectoral VF analysis; however, these approaches are often not aligned with clinical ground truth nor with one another.51 52 Yousefi et al 53 developed a machine-based algorithm that detected VF progression earlier than these conventional strategies. More machine learning algorithms that provide quantitative information about regional VF progression can be expected in the future.
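The simplest of the conventional strategies mentioned above, trend analysis of a global index over time, can be sketched as an ordinary least-squares slope. The choice of mean deviation (in dB) as the global index, the visit times and the values are all assumptions for illustration. This is not the method of Yousefi et al, only a baseline of the kind their algorithm was compared against.

```python
def least_squares_slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired observations")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all visit times are identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


# Hypothetical follow-up: years since baseline and a global index in dB.
years = [0.0, 1.0, 2.0, 3.0, 4.0]
mean_deviation_db = [-2.0, -2.5, -3.0, -3.5, -4.0]

slope = least_squares_slope(years, mean_deviation_db)
print(f"rate of change: {slope:.2f} dB/year")
```

A single global slope discards the spatial pattern of loss, which is why point-wise, sectoral and pattern-based analyses can reach different conclusions from the same series of fields.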
Although intraocular pressure (IOP)-lowering has been shown to be therapeutically effective in delaying glaucoma progression, some studies have demonstrated that disease progression is still inevitable,54–56 suggesting that we have not arrived at optimised treatment regimens for the various forms of glaucoma. Kazemian et al 57 developed a clinical forecasting tool that uses tonometric and VF data to project disease trajectories at different target IOPs. Further refinement of this tool to integrate other ophthalmic and non-ophthalmic data would be useful to establish target IOPs and the best strategies to achieve them on a case-by-case basis. Finally, it is documented that patients with newly diagnosed glaucoma harbour fears of going blind58; perhaps the use of machine learning that incorporates genome-wide data, lifestyle behaviour and medical history into a forecasting algorithm will allow early prognostication regarding the future risk of requiring invasive surgery or losing functional vision from glaucoma.
As machine learning algorithms are revised, the practising ophthalmologist will have a host of tools available to diagnose glaucoma, detect disease progression and identify optimised treatment strategies using precision medicine approaches. In an ideal future scenario, they may also have clinical forecasting tools that inform patients as to their overall prognosis and expected clinical course with or without treatment.
Retinopathy of prematurity
ROP is a leading cause of childhood blindness worldwide, with an annual incidence of ROP-related blindness of 32 000 worldwide.59 The regional epidemiology of the disease varies based on a number of factors, including the number of preterm births, neonatal mortality of preterm children and capacity to monitor exposure to oxygen. ROP screening either directly via ophthalmoscopic examination or telemedical evaluation using digital fundus photography can identify the earliest signs of severe ROP, and with timely treatment can prevent most cases of blindness from ROP.60 61 Due to the high number of preterm births, reductions in neonatal mortality, and limited capacity for oxygen monitoring and ROP screening, the highest burden of blinding ROP today is in low-income and middle-income countries.62
There are two main barriers to effective implementation of ROP screening: (1) the diagnosis of ROP is subjective, with significant interexaminer variability in the diagnosis leading to inconsistent application of evidence-based interventions63; and (2) there are too few trained examiners in many regions of the world.64 Telemedicine has emerged as a viable model to address the latter problem, at least in regions where the cost of a fundus camera is not prohibitive, by allowing a single physician to virtually examine infants over a large geographical area. However, telemedicine itself does not solve the subjectivity problem in ROP diagnosis. Indeed, the acute-phase ROP study found nearly 25% of telemedicine examinations by trained graders required adjudication because the graders disagreed on one of three criteria for clinically significant ROP.65
There have been a number of early attempts to use DL for automated diagnosis of ROP,19 66 which could potentially address both implementation barriers for ROP screening. Most recently, Brown et al 19 reported the results of a fully automated DL system that could diagnose plus disease, the most important feature of severe ROP, with an AUC of 0.98 compared with a consensus reference standard diagnosis combining image-based diagnosis and ophthalmoscopy (table 1). When directly compared with eight international experts in ROP diagnosis, the i-ROP DL system agreed with the consensus diagnosis more frequently than six of the eight experts. Subsequent work found that the i-ROP DL system could also produce a severity score for ROP that demonstrated promise for objective monitoring of disease progression, regression and response to treatment.67 When compared with the same set of 100 images ranked in order of disease severity by experts, the algorithm had 100% sensitivity and 94% specificity in the detection of pre-plus or worse disease.
Despite the high level of accuracy of AI-based models in many ophthalmic diseases, there are still many clinical and technical challenges for clinical implementation and real-time deployment of these models in clinical practice (table 3). These challenges could arise at different stages in both research and clinical settings. First, many of the studies have used training data sets from relatively homogeneous populations.12 14 15 AI training and testing using retinal images is often subject to numerous sources of variability, including width of field, field of view, image magnification, image quality and participant ethnicities. Diversifying the data set in terms of ethnicities and image-capture hardware could help to address this challenge.11
Another challenge in the development of AI models in ophthalmology has been the limited availability of large amounts of data, both for rare diseases (eg, ocular tumours) and for common diseases that are not imaged routinely in clinical practice, such as cataracts. Furthermore, there are diseases such as glaucoma and ROP where there will be disagreement and interobserver variability in the definition of the disease phenotype. Algorithms learn from what they are presented with. The software is unlikely to produce accurate outcomes if the training set of images given to the AI tool is too small or not representative of real patient populations. More evidence on ways of obtaining high-quality ground-truth labels is required for different imaging tools. Krause et al 68 reported that adjudication grades by retina specialists were a more rigorous reference standard than a majority decision, especially for detecting artefacts and missed microaneurysms in DR, and improved the algorithm performance.
Second, many AI groups have reported robust diagnostic performance for their DL systems, although some papers did not show how the power calculation was performed for the independent data sets. A power calculation should take the following into consideration: the prevalence of the disease, type 1 and 2 errors, CIs, desired precision and so on. It is important to first preset the desired operating threshold on the training set, followed by analysis of performance metrics such as sensitivity and specificity on the test set to assess calibration of the algorithm.
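As a minimal sketch of one ingredient of such a calculation, the following code estimates how many participants would be needed to estimate sensitivity to within a desired precision, using a commonly used normal-approximation formula: the number of diseased cases is z² × Se × (1 − Se) / d², and the total sample is that number divided by the disease prevalence. The function name and the example inputs (expected sensitivity 90%, precision ±5%, prevalence 10%) are assumptions. A full power calculation would also address specificity and type 1 and type 2 errors for any hypothesis being tested.

```python
import math
from statistics import NormalDist


def sample_size_for_sensitivity(expected_sens, precision, prevalence,
                                confidence=0.95):
    """Return (diseased cases needed, total participants needed).

    expected_sens: anticipated sensitivity (0-1).
    precision: desired half-width of the confidence interval (0-1).
    prevalence: expected disease prevalence in the test population (0-1).
    """
    # Two-sided z value for the requested confidence level (about 1.96 at 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_cases = z ** 2 * expected_sens * (1 - expected_sens) / precision ** 2
    n_total = n_cases / prevalence
    return math.ceil(n_cases), math.ceil(n_total)


# Hypothetical planning inputs, not taken from any cited study.
cases_needed, total_needed = sample_size_for_sensitivity(
    expected_sens=0.90, precision=0.05, prevalence=0.10
)
print(cases_needed, total_needed)
```

Because the number of diseased cases drives the precision of the sensitivity estimate, low-prevalence conditions require much larger screening cohorts for the same CI width.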
Third, large-scale adoption of AI in healthcare is still not on the horizon, as clinicians and patients are still concerned about AI and DL being ‘black-boxes’. In healthcare, physician acceptance depends not only on quantitative algorithmic performance but also on the underlying features through which the algorithm classifies disease. Generating heat maps highlighting the regions of the image that contributed to the algorithm conclusion may be a first step (figure 2), although such maps are often challenging to interpret (what does it mean if a map highlights an area of vitreous on an OCT of a patient with drusen?).69 They may also struggle to deal with negations (what would it mean to highlight the most important part of an ophthalmic image that demonstrates that there is no disease present?).70 71 An alternative approach has been used for the DL system developed by the Moorfields Eye Hospital and DeepMind: in this system, the generation of an intermediate tissue representation by a segmentation network is used to highlight for the clinician (and quantify) the relevant areas of retinal pathology (figure 3).43 It is also important to highlight that ‘interpretability’ of DL systems may mean different things to a healthcare professional than to a machine learning expert. Although it seems likely that interpretable algorithms will be more readily accepted by ophthalmologists, future applied clinical research will be necessary to determine whether this is the case and whether it leads to tangible benefits for patients in terms of clinical effectiveness.
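One simple way to produce such a heat map is occlusion sensitivity: mask one patch of the image at a time and record how much the model output falls. The sketch below is illustrative only. It is not necessarily the method used in the cited studies, and `toy_score` is a hypothetical stand-in for a trained network, chosen so that the expected behaviour is easy to verify.

```python
def occlusion_map(image, score_fn, patch=2, baseline=0.0):
    """Return a heat map of score drops when each patch is masked.

    image: 2D list of pixel intensities.
    score_fn: callable mapping an image to a single score.
    """
    height, width = len(image), len(image[0])
    base_score = score_fn(image)
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]    # copy, then mask one patch
            for r in rows:
                for c in cols:
                    occluded[r][c] = baseline
            drop = base_score - score_fn(occluded)  # large drop = influential
            for r in rows:
                for c in cols:
                    heat[r][c] = drop
    return heat


def toy_score(image):
    """Hypothetical model: mean intensity of the lower-right 2x2 region."""
    return sum(image[r][c] for r in (2, 3) for c in (2, 3)) / 4.0


image = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.9, 0.9],
]
heat = occlusion_map(image, toy_score)
for row in heat:
    print(row)
```

Only the lower-right patch changes the toy score when masked, so it is the only region highlighted. With a real network the map shows where the output is sensitive, not why, which is the interpretive difficulty described above.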
Lastly, the current AI screening systems for DR have been developed and validated using two-dimensional images and lack stereoscopic qualities, thus making identification of elevated lesions like retinal tractions challenging. Incorporating information from multimodal imaging in future AI algorithms may potentially address this challenge. In addition, the medicolegal aspects and the regulatory approvals vary in different countries and settings, and more work will be needed in these areas. An important challenge to the clinical adoption of AI-based technology is how patients entrust clinical care to machines. Keel et al 72 evaluated the patient acceptability of AI-based DR screening within an endocrinology outpatient setting and reported that 96% of participants were satisfied or very satisfied with the automated screening model.72 However, in different populations and settings, patient acceptability of AI-based screening may vary and may pose a challenge to its implementation.
DL is the state-of-the-art AI machine learning technique that has revolutionised the AI field. For ophthalmology, DL has shown clinically acceptable diagnostic performance in detecting many retinal diseases, in particular DR and ROP. Future research is crucial in evaluating the clinical deployment and cost-effectiveness of different DL systems in clinical practice. To improve clinical acceptance of DL systems, it is important to unravel the ‘black-box’ nature of DL using existing and future methodologies. Although there are challenges ahead, DL will likely impact on the practice of medicine and ophthalmology in the coming decades.
Contributors DSWT, LRP, LP, JPC, AYL, RR, GSWT, LS, PAK and TYW have all contributed to manuscript drafting, literature review, critical appraisal and final approval of the manuscript.
Funding This project received funding from the National Medical Research Council (NMRC), Ministry of Health (MOH), Singapore National Health Innovation Center, Innovation to Develop Grant (NHIC-I2D-1409022), SingHealth Foundation Research Grant (SHF/FG648S/2015), and the Tanoto Foundation, and unrestricted donations to the Retina Division, Johns Hopkins University School of Medicine. For the Singapore Epidemiology of Eye Diseases (SEED) study, we received funding from NMRC, MOH (grants 0796/2003, IRG07nov013, IRG09nov014, STaR/0003/2008 and STaR/2013; CG/SERI/2010) and Biomedical Research Council (grants 08/1/35/19/550 and 09/1/35/19/616). The Singapore Integrated Diabetic Retinopathy Programme (SiDRP) received funding from the MOH, Singapore (grants AIC/RPDD/SIDRP/SERI/FY2013/0018 and AIC/HPD/FY2016/0912). In USA, it is supported by the National Institutes of Health (K12 EY027720, R01EY019474, P30EY10572, P41EB015896), by the National Science Foundation (SCH-1622542, SCH-1622536, SCH-1622679) and by unrestricted departmental funding from Research to Prevent Blindness. PAK is supported by a UK National Institute for Health Research (NIHR) Clinician Scientist Award (NIHR-CS--2014-12-023). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.
Competing interests DSWT and TYW are the coinventors of a deep learning system for retinal diseases. LP is a member of Google AI Healthcare. LRP is a non-paid consultant for Visulytix. PAK is a consultant for DeepMind.
Patient consent Not required.
Provenance and peer review Not commissioned; internally peer reviewed.