Background/aims Human grading of digital images from diabetic retinopathy (DR) screening programmes represents a significant challenge, due to the increasing prevalence of diabetes. We evaluate the performance of an automated artificial intelligence (AI) algorithm to triage retinal images from the English Diabetic Eye Screening Programme (DESP) into test-positive/technical failure versus test-negative, using human grading following a standard national protocol as the reference standard.
Methods Retinal images from 30 405 consecutive screening episodes from three English DESPs were manually graded following a standard national protocol and by an automated process with machine learning enabled software, EyeArt v2.1. Screening performance (sensitivity, specificity) and diagnostic accuracy (95% CIs) were determined using human grades as the reference standard.
Results Sensitivity (95% CIs) of EyeArt was 95.7% (94.8% to 96.5%) for referable retinopathy (human graded ungradable, referable maculopathy, moderate-to-severe non-proliferative or proliferative). This comprises sensitivities of 98.3% (97.3% to 98.9%) for mild-to-moderate non-proliferative retinopathy with referable maculopathy, 100% (98.7% to 100%) for moderate-to-severe non-proliferative retinopathy and 100% (97.9% to 100%) for proliferative disease. EyeArt agreed with the human grade of no retinopathy (specificity) in 68% (67% to 69%), with a specificity of 54.0% (53.4% to 54.5%) when combined with non-referable retinopathy.
Conclusion The algorithm demonstrated safe levels of sensitivity for high-risk retinopathy in a real-world screening service, with specificity that could halve the workload for human graders. AI machine learning and deep learning algorithms such as this can provide clinically equivalent, rapid detection of retinopathy, particularly in settings where a trained workforce is unavailable or where large-scale and rapid results are needed.
- Public health
- Treatment Medical
- Diagnostic tests/Investigation
- Medical Education
- Clinical Trial
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Diabetic eye disease is a microvascular complication of Type 1 and Type 2 diabetes, and is a leading cause of incident blindness in people of working age. Early detection through screening programmes for diabetic retinopathy (DR) with digital retinal imaging can reduce the impact of this condition.1 National screening programmes for DR, including the National Health Service (NHS) Diabetic Eye Screening Programme (DESP), represent a major challenge to healthcare providers. The incidence and prevalence of diabetes mellitus are increasing; 425 million adults were estimated by the International Diabetes Federation to be living with diabetes in 2017, a prevalence that has doubled since 1980 and is projected to rise to 629 million by 2045.2 Screening for DR is generally based on human grading, which is labour intensive, requiring human graders, who should be trained, undergo regular quality assurance and be retained. Applying the current UK annual screening protocol globally would require 2.2 billion retinal images to be graded in 2030. Emerging automated retinal image analysis systems (ARIAS) are artificial intelligence (AI) (machine learning) algorithms3 4 that may provide cost-effective alternatives to human grading, designed to have a high sensitivity for detection of sight-threatening retinopathy in need of clinical intervention. These systems could be used to triage those who have sight-threatening DR or other retinal abnormalities, from those at low risk of sight-threatening retinopathy.
This study expands on our previous work, published in 2016–2017,5 6 which quantified the screening performance and diagnostic accuracy of three ARIAS using NHS DESP human grading as a reference standard. Two of the ARIAS achieved acceptable sensitivity when compared with human graders and had specificities that made them cost-effective alternatives to human grading alone. We reported how such software could be incorporated into pre-existing screening pathways6 and that replacement of a primary human grader was the most cost-effective strategy, compared with a strategy that used the ARIAS as a filter prior to the primary human grader (figure 1 shows the pathway on which cost-effectiveness analyses were performed).6
This study provides a large prospective evaluation using an updated version of one of the ARIAS in three current DESPs conducted in North East London (NEL), South East London (SEL) and Gloucestershire (GS) during a different time period to the first study. Therefore, different populations with diabetes, different human grading teams and different cameras are contributing to this evaluation of a single algorithm. In its current form, the ARIAS is not considered precise enough to be a diagnostic test capable of replacing human grading altogether. The updated version of the ARIAS evaluated in this study has now received CE marking (Conformité Européenne marking indicates that it is compliant with European Union product legislation and meets safety, health or environmental requirements) and is in its final stages of Food and Drug Administration evaluation. However, no large independent evaluation on a UK or other screening population has yet been undertaken. The aim of this study is to evaluate whether the ARIAS is sensitive and specific enough to successfully triage patients into low- and medium-/high-risk DR cases, thereby potentially reducing the need for all screening episodes to receive human grading, allowing resources to be focused on high-risk screening episodes. Hence, the aim is to evaluate ARIAS in combination with manual grading, rather than to replace all manual grading.
Three screening programmes contributed to the study: the DESPs of NEL, GS and SEL. Programme activity and patient demographic characteristics for the 2017–2018 period are presented in table 1. The rate of uptake of the screening invitation ranged from 77% to 83%. The sex distribution was similar across centres, with more men than women screened in each centre, reflecting the predominance of males with diabetes. There was a higher proportion of Asian and black patients screened within the London centres, which reflects the local population with diabetes.
In the current screening pathway, all retinal images are reviewed by a primary grader, and any patients with mild or worse retinopathy or maculopathy (in addition to 10% of those graded ‘no retinopathy’) are reviewed by a secondary grader, with discrepancies between the primary and secondary grader reviewed by an arbitration grader (tertiary grader).7 A sample size commensurate with our previous work5 6 was sought. Hence, our target was to obtain retinal images from approximately 10 000 consecutive screening episodes at each centre that had been graded for retinopathy level by the local team of graders as part of standard care. At least two digital image fields were taken of each eye, one centred on the optic disc and the other on the macula, in accordance with NHS DESP protocol.8 Images that were of poor quality or classified as ungradable by human graders were included in the data set. The data set included all images captured for each eye for each episode, including partial retinal images, poor quality images and non-retinal images such as images of cataracts. Hence, this study analysed 30 000 real-life episodes with images captured as part of routine clinical care without any editing or selection of images prior to processing with the ARIAS, EyeArt. Research Governance approval for the study was obtained. Images were pseudonymised, and no change in the clinical pathway occurred.
The CE-marked software (compliant with relevant European legislation), EyeArt v2.1.0 (Eyenuk, Woodland Hills, CA), was used to assess retinal images for each person for the presence of DR. The software was installed on 3 January 2017 on two servers within a secure computer facility at the Homerton Hospital. Pseudonymised image data were brought to the secure computer facility from the three sites using encrypted disks and processed. Images from NEL were processed on 14 January 2017, from SEL on 6 February 2017 and from GS on 25 April 2017. The patient clinical pathway was not affected by the processing of images by the machine learning algorithm as this process was undertaken in parallel with usual care. The human grade for each image was not available while the images were being processed by the EyeArt software. EyeArt classification for the presence of retinopathy for each person was either test-positive or test-negative. Episodes with images found to be un-assessable, that is, technical failure by the software, were allocated as test-positive, on the basis that these cases would need further human assessment.
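The allocation rule described above can be written as a short sketch. The class and function names below are hypothetical and chosen only for illustration; they are not part of the EyeArt software or its interface.

```python
from enum import Enum


class SoftwareOutput(Enum):
    """Hypothetical labels for what the software returns for one episode."""
    NO_DISEASE = "no disease detected"
    DISEASE = "disease detected"
    TECHNICAL_FAILURE = "images could not be assessed"


def triage(output: SoftwareOutput) -> str:
    """Map a software output to the binary screening result used in the study.

    Technical failures are counted as test-positive, because those episodes
    would still need further human assessment.
    """
    if output is SoftwareOutput.NO_DISEASE:
        return "test-negative"
    return "test-positive"


def needs_human_grading(output: SoftwareOutput) -> bool:
    """Under a triage strategy, only test-positive episodes go to a human grader."""
    return triage(output) == "test-positive"
```

Treating a technical failure as test-positive is the conservative choice: an episode the software cannot assess is never released without human review.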
The human grading was undertaken in each programme using NHS DESP guidelines.8 9 The final human grade in the worst eye was used as the reference standard for all analyses presented. We have previously shown that the final human grade is a stable reference standard, since measures of screening performance and accuracy were not materially altered when the reference standard was further refined by arbitration by an expert reading centre.5 6 Human grading classifications were: no observable retinopathy (R0), mild-to-moderate non-proliferative retinopathy (R1), non-referable maculopathy (M0), ungradable images (U), referable maculopathy (M1), moderate-to-severe non-proliferative retinopathy (R2) and proliferative retinopathy (R3). The commensurate ETDRS retinopathy grade scores are as follows9 10: R0 equivalent to ‘no apparent retinopathy’; R1 ETDRS scores 20–35 inclusive; R2 ETDRS scores 43–53 inclusive; R3 ETDRS scores 61+. A more detailed description of NHS DESP grades alongside ETDRS final retinopathy grade is available.9 In the English DESP, retinopathy grades R0M0 and R1M0 are non-referable retinopathy and grades M1, R2 and R3 are referable retinopathy. Patients with ungradable images are referred for slit-lamp biomicroscopy within the DESP. Patients with referable retinopathy are then referred to the hospital eye service. Patients with non-referable retinopathy receive an invitation for re-screening within 1 year.
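As a minimal sketch of how these grades collapse into the binary reference standard, the snippet below groups ungradable (U), M1, R2 and R3 together, as is done for the sensitivity estimates in the results. The data structure and function names are illustrative assumptions, as is the simplification that an episode counts as referable when either eye does.

```python
from typing import NamedTuple, Optional, Sequence


class EyeGrade(NamedTuple):
    """Human grade for one eye; both fields are None when the eye is ungradable (U)."""
    retinopathy: Optional[str]  # "R0", "R1", "R2" or "R3"
    maculopathy: Optional[str]  # "M0" or "M1"


def eye_is_referable(grade: EyeGrade) -> bool:
    """True for U, M1, R2 or R3; False for R0M0 and R1M0."""
    if grade.retinopathy is None or grade.maculopathy is None:
        return True  # ungradable eyes are grouped with referable disease
    return grade.maculopathy == "M1" or grade.retinopathy in {"R2", "R3"}


def episode_is_referable(eyes: Sequence[EyeGrade]) -> bool:
    """Assumption: the episode takes the outcome of its worst eye."""
    return any(eye_is_referable(eye) for eye in eyes)
```

For example, an episode graded R0M0 in one eye and R1M1 in the other would count as referable, because the worse eye carries referable maculopathy.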
All analyses were performed using STATA 15.0 IC (StataCorp LP, College Station, TX, USA). Estimates of screening performance and diagnostic accuracy were determined (sensitivity/detection rate, false-positive rates, specificity) and the corresponding 95% CIs for each centre and overall were also determined. Screening performance estimates are given for each grade separately as well as for referable versus non-referable retinopathy. Logit-transformed 95% CIs were determined, with binomial exact CIs used where the detection rate was 100%. Systematic differences between centres in the likelihood of screening outcome being classified as test-positive versus test-negative, conditional on human retinopathy grade, were examined using a χ2 test.
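The two kinds of interval mentioned above can be illustrated in a few lines of Python. This is a sketch of the standard textbook formulas, not the authors' STATA code, and the function names are hypothetical. The logit interval is undefined when every case is detected, which is why an exact binomial (Clopper-Pearson) interval is needed at 100%; in that special case its lower limit has the closed form (α/2)^(1/n).

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def logit_ci(events: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """95% CI for a proportion, computed on the logit scale and back-transformed."""
    if not 0 < events < n:
        raise ValueError("logit interval is undefined at 0% or 100%")
    p = events / n
    logit = math.log(p / (1 - p))
    se = math.sqrt(1 / events + 1 / (n - events))  # standard error of the logit
    expit = lambda v: 1 / (1 + math.exp(-v))
    return expit(logit - z * se), expit(logit + z * se)


def exact_ci_all_detected(n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Clopper-Pearson interval when all n cases are detected (proportion = 100%)."""
    return (alpha / 2) ** (1 / n), 1.0


# Counts quoted in the text: 15 091 of 30 405 episodes were test-positive,
# and all 462 R2/R3 episodes combined were test-positive.
print(logit_ci(15091, 30405))
print(exact_ci_all_detected(462))
```

The second call pools R2 and R3 purely for illustration; the intervals reported in table 4 are per grade and are therefore wider.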
The number of consecutive screening episodes with complete human grading is given in table 2 for each centre and overall. Each centre contributed at least 10 000 consecutive screening episodes. The prevalence of different retinopathy grades was similar in NEL and SEL with marginally lower prevalence of retinopathy grades R1M1, R2 and R3 (and consequently a higher proportion of R1M0 and R0M0) in GS compared with the other two centres. The proportion of screening episodes that were ungradable ranged from 1.9% in GS DESP up to 3% at the NEL DESP.
Table 3 presents the EyeArt classification of test-negative and test-positive for each screening episode by centre and overall. For episodes manually graded as R0M0 (no retinopathy), the specificity (EyeArt classification of test-negative) overall is about two-thirds, ranging from 65% in SEL to 71% in GS. A high proportion (from 84% to 93%) of those with mild-to-moderate non-proliferative retinopathy and non-referable maculopathy (human grade R1M0) were classified as test-positive by EyeArt software. All cases of moderate-to-severe non-proliferative retinopathy and proliferative retinopathy (human grades R2 and R3, with or without macular involvement) were detected by the software as test-positive or ‘technical failure’, that is, a sensitivity of 100%.
Table 4 provides measures of diagnostic accuracy, with 95% CIs around the estimates, for each centre and overall for episodes classified as test-positive (including technical failures) by the software.
The sensitivity (95% CIs) of EyeArt was 95.7% (94.8% to 96.5%) for referable retinopathy (human graded ungradable, referable maculopathy, moderate-to-severe non-proliferative or proliferative: grades U, M1, R2 and R3). This figure is composed of sensitivities of 98.3% (97.3% to 98.9%) for mild-to-moderate non-proliferative retinopathy with referable maculopathy (R1M1), 100% (98.7% to 100%) for moderate-to-severe non-proliferative retinopathy (R2), and 100% (97.9% to 100%) for proliferative disease (R3) (table 4).
For non-referable retinopathy (human grade R1M0), 89.1% (88.4% to 89.9%) were classified as test-positive (this is the sensitivity/detection rate for non-referable retinopathy).
EyeArt agreed with the human grade of no retinopathy (R0M0) in 68% (67% to 69%). Hence, the specificity for no retinopathy (test-negative) is 68%; the equivalent expressed as a false-positive rate is 32.1% (31.5% to 32.7%) (table 4). If this group is considered in combination with non-referable retinopathy (R1M0), the specificity is 54% (53.4% to 54.5%) (equivalent to 46.0% (45.5% to 46.6%) expressed as a false-positive rate), which is the specificity corresponding to the detection rate of 95.7% for referable retinopathy given above.
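The relationship between specificity and false-positive rate, and the effect of pooling R0M0 with R1M0, can be seen in a small sketch. The counts below are hypothetical round numbers chosen only to make the arithmetic easy to follow; they are not the study counts from tables 3 and 4, and the names are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GradeCounts:
    """Software results for all episodes sharing one human grade."""
    test_negative: int
    test_positive: int

    @property
    def total(self) -> int:
        return self.test_negative + self.test_positive


def specificity(groups: Iterable[GradeCounts]) -> float:
    """Pooled proportion of test-negatives among grades treated as 'no disease'."""
    groups = list(groups)
    return sum(g.test_negative for g in groups) / sum(g.total for g in groups)


def false_positive_rate(groups: Iterable[GradeCounts]) -> float:
    """The false-positive rate is the complement of specificity."""
    return 1.0 - specificity(groups)


# Hypothetical counts, NOT taken from the study tables.
r0m0 = GradeCounts(test_negative=680, test_positive=320)
r1m0 = GradeCounts(test_negative=110, test_positive=890)

print(specificity([r0m0]))          # specificity for no retinopathy alone
print(specificity([r0m0, r1m0]))    # lower once R1M0 is pooled in
```

Because most R1M0 episodes are flagged as test-positive, adding them to the 'no disease' group necessarily lowers the pooled specificity, which is the pattern reported above (68% for R0M0 alone versus 54% for R0M0 and R1M0 combined).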
Approximately 50% (15 091/30 405) of all screening episodes would require further human grading after EyeArt classification and this percentage ranged from 47% to 51% across the three centres.
The EyeArt likelihood ratio for a test-positive result for referable disease was stable across centres. There was some evidence of variation across the three centres in the proportion of episodes classified as test-negative for screening episodes with human retinopathy grades no retinopathy (R0M0) or mild-to-moderate non-proliferative retinopathy with non-referable maculopathy (R1M0) (interaction p<0.001). However, the maximum absolute difference in specificity between centres was only 3.6 percentage points.
This prospective study has demonstrated the high sensitivity of an AI-enabled algorithm to detect referable DR across three different UK screening centres, with a diverse ethnic mix of individuals with diabetes. Among 30 405 screening episodes, all 462 cases of moderate-to-severe non-proliferative retinopathy (human grade R2) and proliferative DR (human grade R3) were classified by EyeArt as test-positive (including technical failures) and would have been sent for human grading. Using the EyeArt system to triage screening episodes (rather than replace manual grading altogether) could halve the workload for human graders. This report used a methodology framework for independent performance evaluation5 that could be applied to future evaluations of other algorithms developed for this purpose.
Strengths of the current study include the large sample size, based on patients from real-world screening environments within three current DESPs, and evaluated independently of any commercial partner. Although there was a small difference in the performance of the EyeArt software across the three centres, mainly in terms of specificity for non-referable retinopathy, this is not surprising given there will be systematic differences between the three DESPs in terms of the patient- and centre-specific characteristics, such as age, ethnicity profiles (fundus pigmentation may influence software performance)5, quality of image capture and variation in human grading (ie, random error or systematic grader-bias), which is more common with lower grades of retinopathy.11 Although our previous work has shown that screening performance of two ARIAS appeared to be robust to variations in age, ethnicity and camera type, this was only within one DESP, which is likely to show less variation than across different DESPs.5 6 It is noteworthy that detection rates were high and likelihood ratios for referable retinopathy were stable across centres. The sensitivity of EyeArt is high for detection of sight-threatening retinopathy and exceeds that of any published AI algorithm to date.12 Importantly, the specificity is sufficient to make an even greater cost-saving to the NHS than previously described.5
Automated screening software, including recent developments using machine learning, has been available for some time,13 but independent, large-scale validation of commercially available licences14 has been limited until recently.5 6 Population screening programmes have not routinely used automation for retinopathy detection, with the exception of the Scottish national programme, which uses a computer program to triage macula-centred images (one per eye) into presence or absence of disease.15 However, in our previous work, we showed that this system could not operate on disc-centred images as part of the English DESP.5
The methods previously described5 have been adopted by other groups, when validating new approaches to automated assessment of retinal images.16 This latter work tested the algorithm on a data set of over 10 000 retinal images. Our current work tests the software on over 120 000 retinal images (30 000 screening episodes with a minimum of four retinal images per episode and on average five images per patient).
A recent evaluation of this software on the EyePACS telescreening programme by the software developer achieved lower levels of sensitivity for referable DR of 91% vs 95.7% in the current study, but with higher levels of specificity of 91% for ‘non-referable retinopathy’.17 Differences in EyeArt performance between this study and ours are likely to result from differing software thresholding cut-offs being implemented as well as differences in image capture systems, image quality and human grading. Hence, there is an ongoing need to evaluate such systems locally to assess the impact on estimates of screening performance of any ARIAS.
Diabetes is recognised as a health challenge for every country. Good glycaemic and blood pressure control reduce the risk of incident diabetic eye disease, but regular screening for DR is necessary to detect sight-threatening diabetic eye disease so that appropriate treatment can be given to prevent vision loss. The UK has one of the largest, systematic DESPs in the world, with well-documented procedures to train and quality assure human graders. However, there are recognised variations in the ability of different countries and healthcare models to provide screening and treatment for sight-threatening complications of DR.
The majority of those with diabetes will have a very low risk of vision loss (either no retinopathy or mild non-proliferative retinopathy). Performance of human graders can vary at this end of the spectrum, without harming the patient. This is because an image with no retinopathy or an image with a single microaneurysm (mild retinopathy) will have the same outcome for the patient, that is, a routine review in 1 year (in the UK), or even less frequently in some countries with longer screening intervals for low-risk disease. Nonetheless, vast numbers of grading hours are devoted to this repetitive task, reducing the time available for grading high-risk images. Machines can address the problem of repeated grading. We have shown one such machine-learning software could halve the amount of DR screening images requiring human grading without missing sight-threatening disease.
Two other measures of screening performance worth mentioning are the positive predictive value and negative predictive value. Among all test-positive results from EyeArt, the probability of any retinopathy was 55% and 14% had referable retinopathy according to human grading. Among all test-negative results from EyeArt, 94.5% did not have any observable retinopathy and 99% did not have referable retinopathy according to human grading.
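Predictive values depend on how common the condition is, not only on sensitivity and specificity. The sketch below applies Bayes' theorem to the reported sensitivity (95.7%) and specificity (54.0%) for referable retinopathy. The prevalence of 7% is an illustrative assumption, not a figure reported in this text, and the function name is hypothetical.

```python
def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> tuple[float, float]:
    """Return (positive predictive value, negative predictive value)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity are those reported for referable retinopathy;
# the prevalence of 0.07 is an assumption made for this example only.
ppv, npv = predictive_values(sensitivity=0.957, specificity=0.540, prevalence=0.07)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

With a low prevalence of referable disease, a modest positive predictive value and a very high negative predictive value are expected even from a highly sensitive test, which is in line with the 14% and 99% quoted above.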
A previous version of the software (v1) with poorer specificity was shown to considerably reduce the cost of screening.5 6 Figure 1 demonstrates the clinical screening pathways that were in use at that time, and the associated cost-effectiveness analysis evaluated a potential strategy of implementing ARIAS into the screening pathway by replacing the primary grader. Hence, all test-positive results from ARIAS (~50% of all screened) would immediately pass to the secondary grader. This is unlikely to result in an increased workload for the secondary grader, because in the existing fully manual grading pathways we have observed, between 40% and 50% of all screening episodes pass from the primary to the secondary grader across the three NHS DESPs included in this study. In addition, about 10% pass to the tertiary grader for arbitration. Under the proposed pathway, only screening episodes classified as test-negative by the ARIAS but as referable retinopathy by the secondary grader would pass to the tertiary grader for arbitration. In reality, this will be a small proportion, since only 10% of test-negatives pass to the secondary grader, of which it is estimated that 1.7% (tables 3, 4) would include potentially referable disease. This would equate to approximately 50 cases per 30 000 screening episodes. Hence, the workload for the tertiary (arbitration) grader would reduce. The implementation of ARIAS as outlined in the figure could potentially save £0.5 million per 100 000 screening episodes, offering huge savings in relation to over 2.2 million screening episodes per year (2017–2018) in England alone, with numbers increasing.18 We also demonstrated that this was more cost-effective than using the ARIAS as a filter prior to human grading by level 1 graders.5 6 The performance of the software (v2.1.0) in this current study has improved since the original report and we expect, therefore, that the cost savings will also have improved.
The number of people living with diabetes globally is projected to reach 629 million by 2045,2 which would generate between 2 and 3 billion retinal images per year following an English screening programme approach. Current evaluations predict that diabetes-associated blindness is likely to rise dramatically in the developing world.19 Given the numbers of people with diabetes, costs and quality of eye care will become even more important. The use of AI represents a new avenue for DR screening,13 16–20 and the use of neural networks has also demonstrated promise in staging DR and triaging other retinal conditions.21 22 This technology could be extended to screening programmes in developed and developing nations. The potential for nearly instantaneous triage at the point of image capture has not yet been fully explored but is another area where this technology could enhance diabetes eye care. Integration of ARIAS with OCT (for detection of macular oedema) at image capture is another potential pathway. All systems should undergo rigorous independent evaluation before changes are made to existing DR screening pathways.
Thanks to Steve Chave for preparing the Gloucestershire data set.
Correction notice This paper has been amended since it was published online. We have corrected author Alain du Chemin’s first name.
Contributors CE, CGO, AT and ARR designed the study and raised funding. PH, CE, LB, RC, JA, SA, IS, PS, LW, SM, AdC and AT were involved in collection of data for the study. ARR undertook data management and analysed the data; the writing committee (PH, CE, CGO, AT and ARR) wrote the first draft of the report, which was critically appraised by all authors. The final draft was approved by all authors. ARR is responsible for data integrity.
Funding This research has received a proportion of its funding from the Department of Health’s NIHR Biomedical Research Centre for Ophthalmology at Moorfields Eye Hospital and UCL Institute of Ophthalmology. The views expressed in the publication are those of the authors and not necessarily those of the Department of Health. Diabetes prevention research at St George’s, University of London, is supported by the National Institute for Health Research (NIHR) Applied Research Collaboration South London (NIHR ARC South London) (grant reference NIHR200152).
Competing interests None to declare.
Ethics statement According to the UK Governance Arrangements for Research Ethics Committees (GAfREC), ethical review is not required for anonymised data. This was confirmed by the Caldicott Guardian, who is responsible for protecting the confidentiality of patients seen in the National Health Service (NHS).
Provenance and peer review Not commissioned; internally peer reviewed.
Data sharing statement For general data sharing inquiries, contact Professor Alicja Rudnicka.