Abstract
Aim To perform an independent validation of deep learning (DL) algorithms for automated scleral spur detection and measurement of scleral spur-based biometric parameters in anterior segment optical coherence tomography (AS-OCT) images.
Methods Patients receiving routine eye care underwent AS-OCT imaging using the ANTERION OCT system (Heidelberg Engineering, Heidelberg, Germany). Scleral spur locations were marked by three human graders (reference, expert and novice) and predicted using DL algorithms developed by Heidelberg Engineering that prioritise a false positive rate <4% (FPR4) or true positive rate >95% (TPR95). Performance of human graders and DL algorithms was evaluated based on agreement of scleral spur locations and biometric measurements with the reference grader.
Results 1308 AS-OCT images were obtained from 117 participants. Median differences in scleral spur locations from reference locations were significantly smaller (p<0.001) for the FPR4 (52.6±48.6 µm) and TPR95 (55.5±50.6 µm) algorithms compared with the expert (61.1±65.7 µm) and novice (79.4±74.9 µm) graders. Intergrader reproducibility of biometric measurements was excellent overall for all four (intraclass correlation coefficient range 0.918–0.997). Intergrader reproducibility of the expert grader (0.567–0.965) and DL algorithms (0.746–0.979) exceeded that of the novice grader (0.146–0.929) for images with narrow angles defined by OCT measurement of angle opening distance 500 µm anterior to the scleral spur (AOD500)<150 µm.
Conclusions DL algorithms on the ANTERION approximate expert-level measurement of scleral spur-based biometric parameters in an independent patient population. These algorithms could enhance clinical utility of AS-OCT imaging, especially for evaluating patients with angle closure and performing intraocular lens calculations.
- anterior chamber
- angle
- glaucoma
- diagnostic tests/investigation
Data availability statement
Data are available on reasonable request.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Deep learning (DL) algorithms can detect scleral spur locations in anterior segment optical coherence tomography (AS-OCT) images with expert-level performance; however, there is sparse information about the accuracy of AS-OCT measurements associated with these predicted scleral spur locations.
WHAT THIS STUDY ADDS
DL algorithms on the ANTERION OCT system (Heidelberg Engineering, Heidelberg, Germany) approximate expert-level detection of the scleral spur and measurement of anterior segment biometric parameters in a real-world clinical cohort. Performance of the algorithms generally exceeds that of a novice grader.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
The automation of scleral spur detection and quantitative biometric analysis overcomes the time-dependent and expertise-dependent nature of AS-OCT imaging in the clinical setting. This technology provides clinicians with convenient access to data that could enhance care of patients with angle closure disease or patients receiving intraocular lens implantation.
Introduction
The biometric properties of the anterior segment and its anatomical structures play an important role in the clinical care of patients with a range of ocular conditions. Specifically, anterior segment biometrics play an important role in the pathogenesis of primary angle closure disease (PACD), in which aqueous humour outflow is impaired by apposition of the iris and trabecular meshwork, and closure of the anterior chamber angle (ACA).1–3 This process leads to primary angle closure glaucoma, a major cause of visual morbidity worldwide that currently affects more than 20 million people.4 5 In addition, the surgical treatment of eyes with cataract and high refractive error benefits from accurate biometric measurements when calculating the power and size of intraocular lenses (IOLs). Incorrect lens power leads to poor visual outcomes, and incorrect lens sizing can lead to harmful complications such as hyphema, uveitis, glaucomatous optic neuropathy or corneal decompensation.6 7
There is growing evidence that supports the clinical utility of anterior segment optical coherence tomography (AS-OCT) for measuring anterior segment biometrics, many of which are based on scleral spur location. For example, angle opening distance (AOD) and trabecular iris space area (TISA) may find expanded roles in predicting progression of PACD and response to treatment with laser peripheral iridotomy (LPI).3 8 9 Quantitative OCT-based methods could complement gonioscopy, which remains the current standard for assessing the ACA despite being subjective, qualitative, variably reproducible and weakly correlated with AS-OCT measurements of angle width.10–16 In IOL selection, biometric parameters, including corneal curvature, anterior chamber depth (ACD) and lens thickness, are measured using optical or ultrasound methods and factored into modern IOL calculators.17 Anterior chamber width (ACW), also referred to as white-to-white distance, is important for sizing anterior chamber and phakic IOLs.18–21 Biometric parameters based on scleral spur location, such as lens vault (LV) and ACW, are potentially useful in IOL selection, but are difficult to measure and therefore rarely used in routine clinical practice.22 23
Full biometric analysis of AS-OCT images on commercial devices currently requires specialised software and manual marking of scleral spurs, which is expertise-dependent and time-consuming, thereby presenting a barrier to widespread implementation.24 25 Prior studies have established the accuracy of scleral spur detection automated using deep learning (DL), a form of artificial intelligence.26 27 In this study, we investigate if biometric measurements associated with scleral spur locations predicted by DL algorithms on the Heidelberg ANTERION V.1.4 swept-source OCT system (Heidelberg Engineering, Heidelberg, Germany) approximate interexpert reproducibility in an independent patient population and clinical environment.
Methods
Scleral spur detection algorithm
DL algorithms to automate scleral spur detection were developed and tested internally by Heidelberg Engineering (Heidelberg, Germany) prior to this study. While these algorithms are proprietary, some information was provided by Heidelberg Engineering about their development. In brief, a set of 4798 ANTERION AS-OCT images from one or both eyes of 360 patients was evaluated by an expert ophthalmologist to identify scleral spur locations. These images were divided into non-overlapping training (3810 images; 80%) and test (979 images; 20%) datasets. The training dataset was used to train a convolutional neural network (CNN) based on the M2U-Net architecture that predicts scleral spur location within a predefined region of interest (ROI).28 The ROI is a 256×256 pixel area around the ACA determined heuristically based on the posterior boundary of the cornea and the anterior boundary of the iris as defined by the ANTERION’s segmentation algorithms. Reference scleral spur locations were transformed into reference heatmaps containing a Gaussian function with SD of 10 pixels centred on the reference location. Data augmentation, including affine deformations, noising and blurring, was used to increase the robustness of the CNN. The subpixel-refined position and intensity of the strongest peak in the predicted heatmap were used to estimate the position and confidence (level of certainty ranging from 0 to 1) of the scleral spur. The test dataset was used to select two operating points along the receiver operating characteristic curve (online supplemental figure 1) for further analysis, one more conservative to limit the false positive rate (FPR; scleral spur marked by the algorithm but not the ophthalmologist) below 4% (FPR4 algorithm) and the other more aggressive to ensure a true positive rate (TPR; scleral spur marked by the algorithm and ophthalmologist) above 95% (TPR95 algorithm).
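The heatmap encoding and peak read-out described above can be illustrated with a minimal sketch. It is not the proprietary implementation: the function names, the example spur position and the threshold values are hypothetical, the subpixel refinement step is omitted because its method is not described, and treating an operating point as a cut-off on the peak confidence is an assumption.

```python
import math


def gaussian_heatmap(cx, cy, size=256, sd=10.0):
    """Build a size x size reference heatmap with a Gaussian centred on (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sd ** 2))
         for x in range(size)]
        for y in range(size)
    ]


def strongest_peak(heatmap):
    """Return ((x, y), intensity) of the largest pixel (no subpixel refinement)."""
    best_value, best_xy = -1.0, (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy, best_value


def accept_detection(confidence, threshold):
    """Assumed operating-point rule: report a spur only above the threshold."""
    return confidence >= threshold


# Hypothetical reference location inside the 256 x 256 pixel ROI.
heatmap = gaussian_heatmap(cx=120, cy=87)
position, confidence = strongest_peak(heatmap)
```

Under this assumed rule, a conservative operating point such as FPR4 would correspond to a higher threshold than an aggressive one such as TPR95; the actual thresholds used on the device are not reported.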
Acquisition and analysis of validation dataset
Patients 18 years of age and older undergoing routine eye examinations were prospectively and consecutively recruited from attending glaucoma clinics at the Roski Eye Institute at the University of Southern California (USC) and attending comprehensive ophthalmology clinics at the Doheny Eye Institute at the University of California Los Angeles. Each clinic had approximately 20 patients, the majority of whom were follow-ups. Recruitment occurred from March 2021 to August 2021. Exclusion criteria included corneal opacities that precluded AS-OCT imaging and prior history of ocular trauma.
AS-OCT imaging was performed using the ANTERION and Metrics Application. All images were obtained by trained technicians following a standardised imaging protocol. Imaging of both eyes was performed in the seated position prior to pupillary dilation in a dark room under standardised lighting conditions (<0.01 lux) at the imaging plane. Participants were instructed to maintain fixation on the internal fixation target with their eyelids open without retraction by the technician.
The scleral spur was identified as the inward projection at the junction of the sclera and cornea.29 Scleral spur locations in all six B-scans (separated by 30°, creating 12 angle sectors at 0°, 30°, 60°, 90°, 120°, 150°, 180°, 210°, 240°, 270°, 300° and 330°) were marked by three human graders: (1) an expert trained grader (AAP; reference grader) with experience marking over 40 000 scleral spurs after a 5-hour training period of marking 500 scleral spurs under the supervision of two glaucoma specialists; (2) an expert glaucoma specialist with experience marking over 10 000 scleral spurs (BYX; expert grader); (3) a novice trained grader (ASH; novice grader) with experience marking fewer than 100 scleral spurs. The reference grader previously demonstrated low intragrader variability in scleral spur locations and AS-OCT measurements of biometric parameters.27 30 Scleral spur locations were also predicted by the FPR4 and TPR95 algorithms.
The anterior and posterior boundaries of the cornea and lens, and the anterior boundary of the iris, were computed automatically by the ANTERION’s segmentation algorithm. The reference grader made minor segmentation adjustments of the angle recess, including the posterior cornea and anterior iris, in fewer than 15 images (1.1% of total) prior to obtaining biometric measurements. After the scleral spurs were marked, eight scleral spur-based biometric parameters were measured in an automated fashion by the ANTERION software: AOD, TISA and scleral spur angle (SSA) at 500 and 750 µm from the scleral spur, ACW, and LV. AOD500/750 was defined as the perpendicular distance from the trabecular meshwork (TM) at 500 or 750 µm anterior to the scleral spur to the anterior iris surface. TISA500/750 was defined as the area bounded anteriorly by AOD500/750; posteriorly by a line drawn from the scleral spur perpendicular to the plane of the inner scleral wall to the opposing iris; superiorly by the inner corneoscleral wall; and inferiorly by the iris surface. SSA500/750 was defined as the angle formed by lines originating at the scleral spur and terminating at the TM or anterior iris surface 500 or 750 µm anterior to the scleral spur. ACW was defined as the distance between scleral spurs. LV was defined as the perpendicular distance from the apex of the anterior lens surface to a line between scleral spurs.
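The two simplest definitions above, ACW and LV, reduce to elementary plane geometry once both scleral spurs are marked. The following is a minimal sketch under stated assumptions: coordinates are taken to be in millimetres within a single B-scan, the function names and example coordinates are hypothetical, and the returned LV is unsigned because the sign convention used by the device software is not described here.

```python
import math


def anterior_chamber_width(spur_a, spur_b):
    """ACW: straight-line distance between the two scleral spurs."""
    return math.dist(spur_a, spur_b)


def lens_vault(spur_a, spur_b, lens_apex):
    """LV: perpendicular distance from the lens apex to the spur-to-spur line."""
    (ax, ay), (bx, by), (px, py) = spur_a, spur_b, lens_apex
    # Twice the triangle area (cross product) divided by the base length.
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    return abs(cross) / math.dist(spur_a, spur_b)


# Hypothetical coordinates in mm: spurs 12 mm apart, apex 0.5 mm off the line.
acw = anterior_chamber_width((0.0, 0.0), (12.0, 0.0))
lv = lens_vault((0.0, 0.0), (12.0, 0.0), (6.0, 0.5))
```

Because both parameters depend directly on the spur coordinates, any error in spur placement propagates into ACW and LV, which is why agreement in spur location and agreement in measurements are assessed separately.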
A subset of images was classified as having narrow angles, defined as an AOD500 measurement <150 µm by the reference grader. This threshold was chosen because of its high sensitivity and specificity for detecting gonioscopic angle closure in a prior study.31 Narrow angles were not defined based on gonioscopy for several reasons: (1) the majority of patients were not glaucoma patients and, therefore, did not receive gonioscopy; (2) there is intergrader variability in the detection of gonioscopic angle closure and (3) angle widths associated with gonioscopic angle closure vary significantly by quadrant.10 31 32
Images with borderline or poor interpretability due to eyelid and other imaging artefacts were included in the analysis so that false negative rates (FNRs) and FPRs could be calculated for the expert and novice graders and both DL algorithms. In addition, human graders were not provided specific instruction about what constituted a gradable scleral spur; the decision to grade an image was left to the discretion of each grader. A reference false negative (FNref) was defined as a scleral spur identified by the reference grader but not by another grader or algorithm. A reference false positive (FPref) was defined as a scleral spur marked by another grader or algorithm but not by the reference grader. A consensus false negative was defined as a scleral spur marked by all three human graders but not by a DL algorithm. A consensus false positive was defined as a scleral spur marked by a DL algorithm but not any of the three human graders.
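The four error definitions above are set operations on the spurs each grader chose to mark, and can be sketched as follows. This is an illustrative sketch only: the function names and the toy spur identifiers are hypothetical, and dividing both reference counts by the number of spurs marked by the reference grader is an assumption, since the denominators are not stated explicitly.

```python
def reference_error_rates(reference_marks, other_marks):
    """Return (FNR_ref, FPR_ref) for one grader or algorithm versus the reference."""
    reference_marks, other_marks = set(reference_marks), set(other_marks)
    false_negatives = reference_marks - other_marks  # reference only
    false_positives = other_marks - reference_marks  # other grader only
    n_reference = len(reference_marks)  # assumed denominator for both rates
    return len(false_negatives) / n_reference, len(false_positives) / n_reference


def consensus_errors(human_marks, algorithm_marks):
    """Return (consensus false negatives, consensus false positives) as sets."""
    human_sets = [set(marks) for marks in human_marks]
    marked_by_all = set.intersection(*human_sets)
    marked_by_any = set.union(*human_sets)
    algorithm_marks = set(algorithm_marks)
    return marked_by_all - algorithm_marks, algorithm_marks - marked_by_any


# Toy example: integers stand in for individual scleral spur identifiers.
reference = {1, 2, 3, 4, 5}
expert = {1, 2, 3, 4, 5, 6}
novice = {1, 2, 3, 4, 6, 7}
algorithm = {1, 2, 3, 5, 8}

fnr_ref, fpr_ref = reference_error_rates(reference, algorithm)
con_fn, con_fp = consensus_errors([reference, expert, novice], algorithm)
```

In the toy example, spur 4 is both a reference and a consensus false negative, whereas spur 8 is a false positive under both definitions because no human grader marked it.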
Statistical analysis
Scleral spur location differences were calculated as the Euclidean distance between scleral spur locations by the reference grader and second human grader or DL algorithm. Normality testing was performed on scleral spur location differences using the Kolmogorov-Smirnov test. Medians and IQRs were calculated based on non-normality of the data. Scleral spur location differences were grouped by grader or algorithm and compared using the Kruskal-Wallis test. Pairwise comparisons of scleral spur location differences between groups (six comparisons in total) were performed using the post hoc Dunn’s test adjusted for multiple comparisons at a significance level of 0.05. Intraclass correlation coefficients (ICCs) were calculated for each biometric parameter measured in all AS-OCT images to assess the intergrader agreement between the reference grader and a second human grader (expert or novice) or DL algorithm (FPR4 and TPR95). ICCs were also calculated for each biometric parameter measured in a single sector (superior or temporal) of only one eye per participant to eliminate intraeye and intraparticipant correlations. Bland-Altman plots were generated for AOD500 to assess intergrader agreement across the entire range of angle widths. All analyses were performed using the R statistical package (V.4.0.3) at a significance level of 0.05.
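The descriptive parts of this analysis can be sketched in a few lines. The study itself used R; the Python below is only an illustration with hypothetical function names and made-up coordinates, it assumes locations are already expressed in µm, and it uses the conventional Bland-Altman limits of agreement (mean difference ± 1.96 SD), which the text does not spell out. The hypothesis tests (Kolmogorov-Smirnov, Kruskal-Wallis, Dunn) are not reproduced.

```python
import math
import statistics


def location_differences(reference_xy, other_xy):
    """Euclidean distance between paired scleral spur locations."""
    return [math.dist(r, o) for r, o in zip(reference_xy, other_xy)]


def median_and_iqr(values):
    """Median and interquartile range (Q3 minus Q1) of a list of values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1


def bland_altman_limits(reference, other):
    """Return (bias, lower limit, upper limit) of agreement for paired values."""
    diffs = [o - r for r, o in zip(reference, other)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up paired locations in µm (reference at the origin for simplicity).
ref_xy = [(0, 0)] * 5
alg_xy = [(30, 40), (60, 80), (0, 10), (0, 20), (90, 120)]
median_diff, iqr_diff = median_and_iqr(location_differences(ref_xy, alg_xy))

# Made-up paired AOD500 measurements in µm.
bias, lower, upper = bland_altman_limits([100, 200, 300, 400], [110, 190, 310, 390])
```

Quartile conventions differ between software packages, so an IQR computed this way may not match R's default exactly for small samples.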
Results
In total, 1308 AS-OCT images were obtained from 117 participants, which included 2616 potential scleral spurs; however, not all of the scleral spurs were gradable due to eyelid or other imaging artefacts. Mean age was 52.1±17.6 years with 59 males (50.4%) and 58 females (49.6%). Among all participants, 50 (42.7%) were Caucasian, 32 (27.4%) were Hispanic, 21 (17.9%) were Asian, 7 (6.0%) were black and 7 (6.0%) had unknown race/ethnicity.
In total, the reference grader marked 1504 spurs, the expert grader marked 1726 spurs, the novice grader marked 1622 spurs, the FPR4 algorithm marked 1459 spurs and the TPR95 algorithm marked 1722 spurs. Given that the reference grader detected fewer scleral spurs than other graders, the images marked by the expert grader but not reference grader were reviewed. Among these 237 images, the large majority included eyelid artefacts (N=222, 93.7%) or shadowing by eyelashes or pterygia (N=12, 5.1%) that partially obscured the angle recess. Distributions of scleral spur location differences compared with the reference grader varied by grader or algorithm (figures 1 and 2). Median and IQR of scleral spur location differences were 61.1±65.7 µm for the expert grader, 79.4±74.9 µm for the novice grader, 52.6±48.6 µm for the FPR4 algorithm and 55.5±50.6 µm for the TPR95 algorithm. There were significant differences (p<0.001) among the four groups of scleral spur location differences. Pairwise comparisons demonstrated a non-significant difference in scleral spur location differences between the DL algorithms (p=0.33) and significant differences between all other pairs of graders and algorithms (p<0.001).
There was a wide range of angle widths (mean 0.41±0.25 mm) based on the distribution of AOD500 measurements by the reference grader (online supplemental figure 2). Measurement agreement between the reference grader and the expert grader or either algorithm was excellent (ICC range 0.955–0.997) and similar for all parameters (table 1). Measurement agreement for the novice grader was lower but still excellent for all parameters (ICC range 0.918–0.994). Bland-Altman plots for AOD500 reflected consistent agreement across the entire range of AOD500 measurements for all four (figure 3). ICCs of measurements from only superior or temporal sectors from one eye per participant showed trends similar to those of the primary analysis (online supplemental tables 1 and 2).
Among the 1504 AS-OCT images graded by the reference grader, 198 (13.2%) had narrow angles (AOD500<150 µm). Among the subset of participants who received gonioscopy as part of their clinical examination, 9 of 36 (25%) had gonioscopic angle closure (inability to visualise the pigmented trabecular meshwork) in at least two quadrants. Among measurements from these images with narrow angles, ICCs for ACW and LV were similar to those for the overall study population (ICC range 0.856–0.979) whereas ICCs for angle width measurements tended to be lower (ICC range 0.146–0.878) (table 2). ICC is defined as intereye variance/(intereye variance+intraeye variance). Therefore, the lower ICC values likely reflect the lower intereye variance of angle width measurements associated with narrow angles. In contrast, the Bland-Altman plots demonstrate consistent limits of agreement for AOD500 measurements below and above the AOD500 threshold for narrow angles (figure 3). Bland-Altman plots for the TISA500, ACW and LV also showed good interexpert agreement across the range of measurements (online supplemental figures 3–5). The difference in ICC (agreement with the reference grader) between the expert and novice graders was more pronounced, favouring the expert grader, in the subset of narrow angle images compared with all images. The DL algorithms matched if not exceeded the agreement between the expert and reference graders in the subset of narrow angle images (table 2).
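The dependence of the ICC on intereye variance can be made concrete with a small numerical sketch. All values below are hypothetical and chosen only for illustration; they are not estimates from this study. The sketch holds the grader-related (intraeye) variance fixed and shrinks only the spread between eyes, as happens when the analysis is restricted to narrow angles.

```python
def icc(intereye_variance, intraeye_variance):
    """ICC as intereye variance / (intereye variance + intraeye variance)."""
    return intereye_variance / (intereye_variance + intraeye_variance)


# Hypothetical standard deviations in µm (illustrative, not study estimates).
grader_sd = 30.0         # disagreement between graders on the same eye
all_eyes_sd = 250.0      # spread of angle width across a full cohort
narrow_eyes_sd = 40.0    # spread within a subset restricted to narrow angles

icc_all = icc(all_eyes_sd ** 2, grader_sd ** 2)
icc_narrow = icc(narrow_eyes_sd ** 2, grader_sd ** 2)
```

With identical grader disagreement, the ICC falls from about 0.99 to 0.64 purely because the restricted subset spans a narrower range. This is why limits of agreement, which are expressed in measurement units, are a useful complement to the ICC in such subsets.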
Rates of FNref and FPref differed by grader and algorithm (figure 4). The expert and novice graders and TPR95 algorithm all had FNRref<3.0% and FPRref>10.0% whereas the FPR4 algorithm had FNRref=12.6% and FPRref=9.6%. Compared with the consensus, the FNRcon of the FPR4 algorithm (12.3%) was higher than that of the TPR95 algorithm (2.7%) whereas the difference in the FPRcon was smaller (FPR4 1.1% vs TPR95 4.1%). On visual inspection of images misclassified by the TPR95 algorithm, many had obvious lid, shadowing or motion artefacts that made scleral spur detection difficult (online supplemental figure 6).
Discussion
In this study, DL algorithms for the ANTERION OCT system achieved expert-level performance predicting scleral spur locations and measurements of scleral spur-based biometric parameters in a large set of AS-OCT images from an independent patient population and clinical environment. Both the conservative (FPR4) and aggressive (TPR95) algorithms generally approximated the performance of the expert grader and exceeded that of the novice grader, especially among images with narrow angles. The TPR95 algorithm more closely approximated the FNR and FPR of the human graders, while the FPR4 algorithm made substantially fewer predictions. These findings support the implementation of the TPR95 algorithm for scleral spur detection and automated biometric analysis of ANTERION images, which in turn could greatly enhance the accessibility and utility of quantitative AS-OCT imaging.
Measurements of scleral spur-based biometric parameters are dependent on accurate identification of scleral spur location, which is variable even among experienced graders.24 25 Both the FPR4 and TPR95 algorithms produced similar accuracy in predicting scleral spur locations relative to the reference grader, with median differences that were smaller than those of the expert and novice graders (<60 µm for both algorithms). This performance is comparable to that of a DL algorithm developed by Xu et al for the Tomey CASIA SS-1000, in which the mean human-machine scleral spur location difference was 73.08±52.06 µm.27 Pham et al developed a different DL algorithm for the CASIA SS-1000 and plots of human-human and human-machine differences are on a similar scale to those from this study.26 These findings suggest that the FPR4 and TPR95 algorithms achieve expert-level performance in scleral spur detection that approximates if not exceeds the agreement between two experienced graders.
Limited access to quantitative measurements of scleral spur-based biometric parameters has hindered the development and implementation of novel clinical methods for evaluating and treating a range of ocular conditions, including PACD, refractive error and cataract. Our findings suggest that biometric measurements associated with scleral spur predictions by both algorithms are highly correlated with measurements by the reference grader and approximate the agreement between two experienced human graders, including in eyes with narrow angles. An automated method that provides access to expert-level measurements of scleral spur-based biometric parameters could help modernise the clinical evaluation and management of patients with PACD. Measurements of AOD and TISA are associated with IOP and anatomical variations in PACD eyes and may predict a higher risk of PACD progression or poor angle widening after LPI.8 15 16 In addition, automated measurements of ACW and LV could be beneficial for IOL selection: ACW is helpful in sizing anterior chamber and phakic IOLs, and there is evidence that LV could play an important role in determining effective lens position and calculating IOL power.18–23
Our results demonstrate that rates of scleral spur detection are highly variable under real-world conditions without eyelid retraction during imaging, even among experienced graders. This point, which has not been previously studied, suggests there is differing confidence among graders when deciding whether to mark a scleral spur. Based on number of scleral spurs marked, the reference grader appeared the most conservative and the expert grader the most aggressive among human graders. This trend reflects the graders’ individual thresholds for identifying scleral spurs in the context of eyelid and other imaging artefacts that partially obscure the angle recess. The reference grader, having been trained to detect scleral spurs for scientific studies, marked fewer images with artefact, whereas the expert grader, a clinician, was less conservative in marking scleral spurs in the presence of imaging artefacts. The TPR95 algorithm approximated the FNR and FPR of the expert grader (2.9% and 17.4% vs 1.0% and 15.8%). While the more conservative FPR4 algorithm had a lower FPR compared with the TPR95 algorithm (9.6% vs 17.4%), this came at the expense of a higher FNR (12.6% vs 2.9%). Despite the greater number of scleral spurs identified by the TPR95 algorithm, measurement agreement with the reference grader was similar for both algorithms. In a busy clinical environment, the higher TPR of the TPR95 algorithm is likely of greater utility than the lower FPR of the FPR4 algorithm as it is more convenient to ignore a questionably marked scleral spur than to manually mark a more obvious one.
Our study has several strengths compared with prior studies on automated scleral spur detection.26 27 First, DL algorithms maintained expert-level performance in a real-world clinical environment, defined as a diverse cohort of patients of various ages and races who were recruited from comprehensive and glaucoma clinics during routine delivery of eye care. This validation cohort and setting were completely independent from the cohort and environment in which the algorithm was developed. These findings support the generalisability and widespread implementation of DL algorithms in diverse practice settings, while prior studies that used smaller and more homogeneous cohorts do not.26 27 Second, images with eyelid or other imaging artefacts were not omitted from the validation dataset. This approach allowed us to assess variability in human grader and algorithm confidence in scleral spur detection and evaluate its effect on detection rates and measurement agreement. It also avoids introducing biases associated with analysing only a subset of images and applying arbitrary definitions of image quality that may be difficult to apply in real-world practice environments. Third, all images were graded by a novice grader in addition to a second expert grader, which allowed us to determine that there is a benefit to using DL algorithms over a trained but inexperienced grader.
Our study also has several limitations. First, the reference grader was relatively conservative and marked fewer images than the other human graders and TPR95 algorithm. Post hoc analysis revealed eyelid or other shadowing artefacts in over 98% of these images. Second, while the overall number of images analysed was large, only 13.2% had narrow angles, which contributed to wider CIs in the ICC analysis of this subset of images. In the future, a larger cohort would be beneficial for more detailed study of narrow angles and individual sectors of the eye. Third, fewer than half of participants received gonioscopy; therefore, we were unable to assess algorithm performance based on gonioscopic angle status. Our OCT-based definition of narrow angles has relatively high (>80%) sensitivity and specificity for gonioscopic angle closure and our use of a quantitative OCT-based definition of narrow angles may be more appropriate for evaluating the performance of quantitative analysis algorithms; however, further study is needed to determine if human-human and human-machine limits of agreement are sufficient for detection and evaluation of narrow angles. Finally, the described algorithms are only available for images acquired on the ANTERION OCT system, and their expert-level performance would likely not generalise to images acquired on other AS-OCT devices.
In conclusion, DL algorithms provide expert-level scleral spur detection and biometric analysis in a large set of AS-OCT images from a diverse clinical cohort. There appears to be a benefit to using the TPR95 algorithm compared with grading by a novice in terms of the number of scleral spurs identified and the accuracy of biometric measurements. This study supports the implementation of the TPR95 algorithm in diverse patient populations and real-world practice settings. While this technology has the potential to expand the clinical utility of AS-OCT imaging and modernise the care of ocular conditions dependent on accurate anterior segment biometry, further studies are needed to help guide its use in routine clinical practice and decision-making processes.
Ethics statements
Patient consent for publication
Ethics approval
The study was approved by the University of Southern California (USC) Institutional Review Board (IRB number HS-17-00684). All study procedures adhered to the recommendations of the Declaration of Helsinki. Written informed consent was obtained from all participants.
References
Footnotes
X @BenXuLab
Contributors GAA, AAP, ASH and BYX evaluated participants and AS-OCT images. KB, GAA, MC, BB and BYX completed the statistical analysis. KB and BYX prepared the manuscript. MS contributed to DL algorithm design and its description in the Methods section. XX contributed background information and to study design. BYX is guarantor.
Funding This work was supported by grant K23 EY029763 from the National Eye Institute, National Institutes of Health, Bethesda, Maryland and an unrestricted grant to the Department of Ophthalmology from Research to Prevent Blindness, New York, New York, USA.
Competing interests BYX and ASH receive research support from Heidelberg Engineering. MS is employed by Heidelberg Engineering.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.