Abstract
Objectives To investigate inter-reader agreement on five severity levels of central vascular changes (none, mild, moderate, severe pre-plus disease, plus disease) and aggressive posterior retinopathy of prematurity (ROP), and to see whether an unintended shift in indication for treatment occurred.
Methods Four international ROP readers participated. Before the grading of the photographs, the readers were informed that a high proportion of advanced ROP cases were included. In total, 243 photographs/948 quadrants were available from 136 infants. As a standard series of photographs was available, grading was performed under optimised conditions.
Results The four readers agreed on the quadrant scores of only 70 (7.38%) of the 948 quadrants—that is, on 1, 5, 15, 4 and 45 quadrants for scores 0, 1, 2, 3 and 4, respectively. The mean scores differed systematically between the readers (permutation test, p<0.0001). Agreement on presence of aggressive posterior ROP from all four readers was not obtained for any of the photographs. Readers scored plus disease in at least two quadrants in 95.5% of the eyes for which treatment was indicated. All four readers agreed on the scoring of indication for treatment for 195 eyes (80.2%); however, treatment was only recommended in 18 (7.4%) eyes. One reader was found to differ systematically from the others in indicating treatment (Rasch analysis; p=0.0001). Finally, a significant shift in indication for treatment occurred between birth period 2000–2002 and 2003–2006 (Mann–Whitney rank sum test, p<0.001).
Conclusions Inter-reader agreement on central vascular changes is poor, especially when based on more than two rating categories. The subjective nature of diagnosing such vascular changes possibly resulted in earlier treatment of preterm infants in Denmark over the entire study period (1997–2006). The recent increased incidence of treated infants in Denmark is, at least in part, explained by a significant shift in indication for treatment.
- Preterm delivery
- retinopathy of prematurity
- plus disease
- pre-plus disease
- aggressive posterior retinopathy of prematurity
- RetCam photographs
- inter-reader agreement
- indication for treatment
- diagnostic tests/investigation
- epidemiology
- public health
- imaging
- telemedicine
- child health (paediatrics)
- embryology and development
- macula
Introduction
Plus disease is an important prognostic indicator in advanced retinopathy of prematurity (ROP).1 However, recognition of the pathology is highly subjective,2–5 possibly because plus disease was defined in the past according to a single photographic standard.6 7 The revisited International Classification of ROP (ICROP) attempted to reduce the subjective nature of the evaluation of central vascular changes by supplying additional photographs defining plus disease.8 9 However, as the presence of plus disease is now the main determinant for ROP treatment, these concerns are more serious and should be addressed.2 5 9 10 Inconsistencies in diagnosing plus disease can ultimately lead to over- or under-treatment of premature infants.2 5
Two new terms were defined by the revisited ICROP: pre-plus disease, a predictor of sight-threatening plus disease development; and aggressive posterior ROP (APROP), at high risk of rapidly developing into retinal detachment.9 11 Consistent recognition of both these pathologies is desirable.
The present study is based on a large series of high-quality RetCam images from a selected preterm population born during 1997–2006. A high proportion of these photographs were of eyes of infants treated for ROP. Four ROP experts, blinded to the infants' backgrounds, determined the severity of posterior vascular changes shown on the photographs. The aim of this study was to investigate the readers' consensus on five different levels of central vascular changes and of APROP, and also to discover whether an unintended shift in indication for treatment occurred during the study period. Such a shift may, at least partly, explain the increase in incidence of treatment-requiring ROP cases in Denmark.12
Materials and methods
Participants
The four ROP expert readers in this study were practising paediatric ophthalmologists or vitreoretinal specialists, from the UK, Sweden and eastern and western Denmark. They all had years of experience in ROP screening and treatment, but their experience in RetCam evaluation was somewhat variable. Before grading the photographs, they were blinded to any information on the infants, except that they knew that a high proportion of advanced ROP cases were included.
Study population
The National University Hospital, Rigshospitalet (RH), is in charge of all retinal ablative therapy for ROP in Denmark. All participating infants, born during 1997–2006, were referred with advanced ROP for evaluation. The treatment criterion was presence of classical threshold ROP (T-ROP).6 As part of the examination performed under general anaesthesia, fundus photographs were taken using a RetCam device (RetCam I-II; Clarity Medical Systems, Pleasanton, California, USA).
During the study period, 171 infants/342 eyes were referred for evaluation at RH; 161 infants (94.2%) received treatment. Photographs were available for 260 of the 342 eyes. Photographs were missing for the remaining 82 eyes: 51 of the 298 eyes (17.1%) of infants born in 2000–2006 and 31 of the 44 eyes (70.5%) of infants born in 1997–1999. One photograph was excluded because a more advanced ROP stage than T-ROP was present. Photographs of another 16 eyes were excluded because of poor quality. The final set to be graded therefore comprised photographs of 243 eyes from 136 infants.
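The sample flow above can be checked with a few lines of arithmetic. The sketch below only re-derives the counts quoted in the text; the variable names are illustrative and not part of the study.

```python
# Counts quoted in the Study population paragraph; names are illustrative.
referred_eyes = 342

# (eyes without photographs, eyes referred) per birth period
missing = {"2000-2006": (51, 298), "1997-1999": (31, 44)}

missing_total = sum(m for m, _ in missing.values())
available = referred_eyes - missing_total

# Exclusions: one eye with ROP more advanced than T-ROP, 16 eyes with poor-quality images
graded = available - 1 - 16

# Percentage of eyes without photographs within each birth period
missing_pct = {period: round(100 * m / n, 1) for period, (m, n) in missing.items()}

print(missing_total, available, graded, missing_pct)
```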
Photographs
Microsoft Office Picture Manager was used to crop the original photographs to present only central vascular changes. The photographs were arranged with the optic disc and foveal area in the horizontal plane and organised in a successive series to make their rating easier. The readers were supplied with a standard photograph series showing images considered to be representative of the changes needed for scoring presence of mild, moderate and severe pre-plus disease, plus disease and APROP. The standard photographs showing pre-plus disease (in this study labelled ‘severe pre-plus’), plus disease and APROP were originals from the revisited ICROP.9 As standard photographs presenting milder vascular incompensation than severe pre-plus were not available from the revisited ICROP, they were purchased from the local RH RetCam database. Mild disease was represented by a photograph showing close to normal changes, while moderate showed vascular incompensation between mild and severe pre-plus disease.
Rating procedure
The readers were asked to keep the standard photograph series on display during the rating procedure. A scoring spreadsheet was used, with columns to indicate: the series number of the image; the vessel score for each of the four quadrants; whether treatment for the eye was indicated; whether APROP was present; and image quality. The four quadrant columns were scored as follows: 0, no vascular changes; 1, mild pre-plus; 2, moderate pre-plus; 3, severe pre-plus; 4, plus disease; 9, cannot be scored. The two columns representing ‘indication for treatment’ and ‘presence of APROP’ for each eye were scored as: ‘y’ if yes, ‘n’ if no, or ‘9’ for cannot be scored. The readers were asked to ‘indicate treatment’ when plus disease was present in two or more quadrants (if sufficient peripheral pathology existed). The quality of the RetCam photographs was coded as follows: 1, the quality allowed precise scoring; 2, the quality was somewhat lacking, but sufficient for scoring; 3, the quality was too poor to allow scoring.
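The scoring scheme and the two-quadrant rule can be written down as a small sketch. It is an illustration only: the function name is hypothetical, peripheral pathology is assumed to be sufficient, and the handling of unscorable quadrants is an assumption that the text does not specify.

```python
PLUS = 4           # quadrant score for plus disease
CANNOT_SCORE = 9   # quadrant could not be scored

def indication_for_treatment(quadrant_scores):
    """Return 'y', 'n' or '9' for one eye from its four quadrant scores.

    Assumes sufficient peripheral pathology, which the photographs
    of central vessels alone cannot confirm.
    """
    if len(quadrant_scores) != 4:
        raise ValueError("expected one score per quadrant (four in total)")
    plus = sum(1 for s in quadrant_scores if s == PLUS)
    unscored = sum(1 for s in quadrant_scores if s == CANNOT_SCORE)
    if plus >= 2:
        return "y"
    # Assumption: if unscored quadrants could still bring the count of
    # plus quadrants to two, the eye is left as 'cannot be scored'.
    if plus + unscored >= 2:
        return "9"
    return "n"

print(indication_for_treatment([4, 4, 2, 1]))
print(indication_for_treatment([4, 3, 3, 3]))
```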
Data analysis
One-way analysis of variance was used to compare continuous variables involving three groups. The χ2 test was used to compare categorical variables. In order to investigate inter-reader differences in quadrant scores, mean scores and mean score differences were computed for all pairs of readers. A non-parametric permutation test was applied to assess the overall differences among those readers.13 Confidence intervals were based on 10 000 bootstrap samples.14 For the outcome ‘indication for treatment’, a random effects logistic regression model (the Rasch model) was applied.15 16 In this model, the probability that a particular eye is given a score of indication for treatment, is modelled according to a logistic regression depending on the identity of the reader (fixed effect) and the disease severity of the eye under consideration (a latent variable, ie, a random effect): $\log(\text{odds for treatment of T-ROP}) = \alpha_{\text{reader}} + \beta_{\text{severity of disease}}$. The model assumes that severity of disease is an unobserved variable which follows a normal distribution in the population of photographs and can thus be measured in standard deviations (SDs), with 0 SD corresponding to the average severity of disease (<−2 SD for the 2.5% least severe cases and >+2 SD for the 2.5% most severe cases). Inter-reader differences were reported as ORs corresponding to the differences in probability of the same eyes having positive scores for indication for treatment from two different readers. The assumption that the unobserved variable, severity of disease, is normally distributed in the sampled eyes is a fair one, as seen from the distribution of quadrant scores. Throughout this study a CI of 95% and a significance level of 5% were used. Rasch models were analysed with PROC GLIMMIX in SAS V.9.2. The descriptive statistics and non-parametric analysis were performed in R (R Development Core Team, Vienna, Austria).
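As an illustration of the non-parametric steps, the sketch below runs a permutation test for overall reader differences and a percentile bootstrap interval for one pairwise mean difference. It is not the analysis code used in the study (which was run in R). The permutation scheme (shuffling scores among readers within each quadrant), the test statistic, the percentile method and the toy data are all assumptions made for the example.

```python
import random
from statistics import mean

def reader_means(scores):
    """scores: one row per quadrant, each row holding one score per reader."""
    n_readers = len(scores[0])
    return [mean(row[r] for row in scores) for r in range(n_readers)]

def spread(means):
    """Test statistic: squared deviations of reader means from their grand mean."""
    grand = mean(means)
    return sum((m - grand) ** 2 for m in means)

def permutation_p_value(scores, n_perm=2000, seed=0):
    """Shuffle scores among readers within each quadrant (assumed null:
    readers are exchangeable) and compare the spread of reader means."""
    rng = random.Random(seed)
    observed = spread(reader_means(scores))
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in scores]
        if spread(reader_means(shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def bootstrap_ci(scores, a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean score difference between
    readers a and b, resampling quadrants with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    diffs = []
    for _ in range(n_boot):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        diffs.append(mean(row[a] - row[b] for row in sample))
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_boot)]
    upper = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

# Hypothetical scores (0-4) from four readers on eight quadrants;
# the second reader tends to score lower than the others.
toy = [
    [3, 2, 3, 3], [4, 2, 4, 4], [2, 1, 2, 3], [3, 2, 3, 2],
    [4, 3, 4, 4], [2, 2, 2, 2], [1, 0, 1, 1], [3, 2, 2, 3],
]
p_value = permutation_p_value(toy)
ci_low, ci_high = bootstrap_ci(toy, a=0, b=1)
print(p_value, ci_low, ci_high)
```

Shuffling within quadrants keeps each quadrant's set of scores fixed, so only the assignment of scores to readers is randomised.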
Results
The birth characteristics of the participating infants are presented in table 1. Only gestational age showed a borderline significant difference between the birth periods (ANOVA: p=0.057).
Few photographs of eyes were marked as ‘cannot be scored’ by the readers (table 2). Three readers were unable to grade 20 of the photos (8.3%). Of the 972 quadrants (243 eyes) included, 948 (97.5%) were scored by all four readers.
Reader 2 scored the central vascular changes as milder than the other readers. The scores of reader 4 were more widely distributed than those of readers 1 and 3 (figure 1). All four readers agreed on the scores of 70 (7.4%) of the 948 quadrants (ie, on 1, 5, 15, 4 and 45 quadrants for scores 0, 1, 2, 3 and 4, respectively). Three readers agreed on the scores of 311 (32.8%) of the 948 quadrants (ie, 10, 71, 146, 45 and 39 quadrants for scores 0, 1, 2, 3 and 4, respectively). Finally, two or fewer readers agreed on the scores in the remaining 567 (59.8%) of the 948 quadrants. The readers are ranked as reader 1, reader 3, reader 4, and reader 2 in terms of highest to lowest mean score value (table 3).
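The agreement categories used here (all four, three, or two or fewer readers giving the same score) can be tallied as in the sketch below. The toy quadrant scores are hypothetical; only the three reported counts are taken from the text.

```python
from collections import Counter

def largest_agreeing_group(scores):
    """Size of the largest group of readers giving the same score to a quadrant."""
    return Counter(scores).most_common(1)[0][1]

def agreement_summary(quadrants):
    """Count quadrants by how many readers agreed on a single score."""
    summary = Counter()
    for scores in quadrants:
        k = largest_agreeing_group(scores)
        if k == 4:
            summary["all four"] += 1
        elif k == 3:
            summary["three"] += 1
        else:
            summary["two or fewer"] += 1
    return summary

# Hypothetical scores from four readers on five quadrants
toy_quadrants = [(4, 4, 4, 4), (2, 2, 2, 3), (1, 2, 3, 4), (2, 2, 3, 3), (0, 1, 1, 1)]
toy_summary = agreement_summary(toy_quadrants)

# Counts reported in the text, converted to percentages of the 948 quadrants
reported = {"all four": 70, "three": 311, "two or fewer": 567}
reported_pct = {k: round(100 * v / 948, 1) for k, v in reported.items()}
print(toy_summary, reported_pct)
```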
This ranking is also reflected in the values of mean differences between each pair of readers (table 4).
The mean scores differ significantly between the readers (permutation test, p<0.0001).
The presence of APROP was identified in only a few cases (0.4–1.7%) by each reader (table 2). The presence of APROP was determined by all readers in photographs of 201 eyes (82.7%). Of these, no readers indicated APROP in 197 eyes (98.0%), one reader indicated APROP in four cases, and two or more readers indicated APROP in no cases. Of the 42 eyes that were only evaluated by three readers, no readers indicated APROP in 36 eyes, one reader indicated APROP in five eyes, and two readers indicated APROP in one eye.
Of the entire study sample, 94.2% of the eyes were treated for ROP during the neonatal stay. The corresponding numbers were 86.4% and 94.1% for the birth periods 2000–2002 and 2003–2006, respectively.
Treatment was indicated by no readers in the case of 177 eyes (72.8%), one reader in the case of 19 eyes (7.8%), two readers in the case of 12 eyes (4.9%), three readers in the case of 17 eyes (7.0%), and four readers in the case of 18 eyes (7.4%). Of the eyes for which treatment was indicated, 95.5% were scored as plus disease in at least two quadrants by the readers. The remaining cases were scored as severe pre-plus disease in at least one quadrant (29 cases), as arteriolar or venous asymmetry (10 cases), or plus disease in only one quadrant (four cases).
The four readers differed with respect to indicating ROP treatment (figure 2). The number of indications for treatment varied slightly among readers 1, 3 and 4; however, reader 2 was more reluctant to indicate treatment than the others (table 2). In fact, the ORs for indication of treatment of a given eye when comparing pairs of readers show that only reader 2 differed systematically from the others (p=0.0001, in each comparison) (table 5).
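To see how reader effects in the Rasch model translate into ORs, consider the sketch below. The reader effects are hypothetical values chosen for illustration, not estimates from the study, and the latent severity is expressed directly on the log-odds scale for simplicity.

```python
import math

def p_treat(alpha_reader, beta_severity):
    """Probability that a reader indicates treatment for an eye:
    log(odds) = alpha_reader + beta_severity."""
    return 1.0 / (1.0 + math.exp(-(alpha_reader + beta_severity)))

def odds(p):
    return p / (1.0 - p)

def odds_ratio(alpha_a, alpha_b):
    """OR for indicating treatment of the same eye, reader a versus reader b."""
    return math.exp(alpha_a - alpha_b)

# Hypothetical reader effects (log-odds scale); not values from the study
alpha = {"reader A": -1.0, "reader B": -3.0}

for severity in (-2.0, 0.0, 2.0):
    pa = p_treat(alpha["reader A"], severity)
    pb = p_treat(alpha["reader B"], severity)
    # The OR between the two readers is the same at every severity level
    print(severity, round(pa, 3), round(pb, 3), round(odds(pa) / odds(pb), 3))
```

Because the eye's severity enters both readers' log-odds additively, it cancels in the ratio, which is why a single OR per reader pair can summarise the difference across all eyes.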
The readers agreed that 28.1% and 82.0% of the eyes from the birth periods 2000–2002 and 2003–2006, respectively, should not be treated (figure 3). The readers indicated treatment significantly more often in the birth period 2000–2002 (mean 2.2, SD 1.7) than in 2003–2006 (mean 0.4, SD 0.9) (Mann–Whitney rank sum test, p<0.001). Photographs of eyes from infants born in 1997–1999 were sparse (available for only 29.5% of eyes), and therefore these infants were excluded from the above analysis.
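A rank sum comparison of this kind can be sketched as follows, assuming the outcome per eye is the number of readers (0–4) who indicated treatment. The sketch uses the normal approximation with a tie correction and no continuity correction, which may differ from the exact procedure used in the study, and the data are hypothetical.

```python
import math
from collections import Counter
from statistics import NormalDist

def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def mann_whitney(x, y):
    """Return (U statistic for x, two-sided p value from the normal approximation).

    Requires at least two distinct values in the pooled data."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    n = n1 + n2
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction for the variance of U
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p

# Hypothetical numbers of readers indicating treatment per eye
earlier_period = [4, 3, 4, 2, 0, 3, 4, 1]
later_period = [0, 0, 1, 0, 0, 2, 0, 0, 1, 0]
u_stat, p_val = mann_whitney(earlier_period, later_period)
print(u_stat, p_val)
```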
Discussion
We here show that, when tested with five categories of vascular changes, readers rarely agreed (7.4%) on the severity of central vascular changes. Previous studies have reported higher rates of inter-reader agreement, ranging from 21% to 73% for two severity levels, and from 12% to 69% for three severity levels.2 5 The variation in inter-reader agreement results from the subjective nature of vessel diagnosis. Both the appearance of mid-periphery vascular branching and asymmetry of vascular changes are possible contributing factors. Participation of more readers and availability of more scoring categories increase the inter-reader variability. In this study, the subjective nature of the plus diagnosis becomes very pronounced when a higher number of scoring categories is available.
Inter-reader agreement on the presence of APROP was also investigated. According to Danish experience, APROP is rare. In this study, only 0.4–1.7% of eyes were diagnosed with APROP by the readers, and inter-reader variability was so large that not once did three or more readers diagnose APROP in the same eye. No other reports on inter-reader agreement on APROP are available, and comparison with previous studies is therefore not possible. The lack of inter-reader agreement reflects the subjective nature of diagnosis, possibly resulting from the broad definition.9 17 This raises concerns, as there is a high risk of undetected APROP progressing rapidly to total retinal detachment and childhood blindness.9 10
The rate of agreement was much higher (80.2%) when the readers were asked whether treatment was indicated. In most cases (95.5%), readers scored plus disease in at least two quadrants; thus, all attempted to adhere to the international recommendations on treatment.10 For most infants (72.8%), the readers agreed on no treatment indication, in spite of knowing in advance that the study sample contained a high proportion of advanced ROP cases. A large proportion (94.2%) of these eyes was treated during the neonatal period. In spite of the high rate of agreement, one reader was systematically more reluctant to indicate treatment than the others, possibly because of geographical differences. If all readers had presented such restrictive scores, the above discrepancy would have been even larger. The same would happen if the indications were based on four quadrants of plus disease, instead of two, as was done by the treating clinicians. The relatively high rate of agreement among the readers suggests that several of the infants were treated on prethreshold disease. It is important to bear in mind that the readers and the treating clinicians worked with different levels of information, the former assessing central vascular changes, and the latter assessing the entire retina. Only 31.1% of eyes with stage 3 ROP developed plus disease. Therefore treatment of all infants with stage 3 ROP, with inconsistent recognition of plus disease, would result in more treated infants.18 Similarly, inconsistent recognition of peripheral changes per se may also have contributed to discrepancies between experts.19
The recent increased incidence of treatment-requiring ROP in Denmark during the birth period 1996–2005 was especially pronounced from 2003.12 Neither the hospital records nor the referral pattern across the country had suggested a shift in treatment indication.12 However, in the present study, based on blinded interpreters, a significant shift in treatment indication occurred from 2003. The appearance of the multicentre ETROP study in 2003, recommending early treatment, along with a new surgeon responsible for making treatment indications in the same year, may possibly have resulted in an unintended shift in treatment indication.10 Therefore a combination of treatment of prethreshold disease and inconsistent recognition of plus disease may explain the increase. A recent decline in the number of blind and visually impaired children registered due to ROP sequelae in Denmark supports these speculations.12
This study has certain strengths. First, the use of a series of high-quality photographs as the standard reference for scoring improved the rating procedure and possibly reduced inter-reader variability. Second, the use of images instead of indirect ophthalmoscopy for evaluation ensured that all readers scored the same images. This approach guaranteed that the precision of the evaluation was not determined by the cooperation of the infant with the examination.
This study also has some limitations. First, the luminance/resolution of the readers' computer monitor displays was not standardised. No instructions were given to the readers on how to rate vessel asymmetry in the quadrants/eyes. Both could have increased inter-reader variability. Second, being used to evaluating vascular changes with an indirect ophthalmoscope, readers may have had difficulty assessing the severity of the changes when presented on RetCam photographs. Finally, the revisited ICROP was the only standard reference to guide the readers, and standardised protocols with more precise quantification of central vascular changes are clearly warranted. In practice, it may be difficult to develop a more refined protocol. Instead, computer-automated and computer-assisted systems may be useful tools.20–25
In summary, we here show that inter-reader agreement on central vascular changes is poor, especially when based on more than two severity levels. The subjective nature of diagnosing such changes may have resulted in more preterm infants being treated in Denmark during the study period (1997–2006). In addition, a significant shift in indication for treatment occurred from 2003. This explains, at least in part, the significantly increased incidence of treated infants occurring during the latter period. Further refinement of the revisited ICROP guidelines, or even better, availability of a reliable and valid computer-based image analysis system to further standardise quantification of central vascular changes is urgently needed.
References
Footnotes
Funding This work was supported by grants from the Danish Eye Health Society, Bagenkop Nielsens Myopi- and Eye Foundation, VELUX Foundation, Aase and Ejnar Danielsens Foundation, Dagmar Marshalls Foundation, Direktør Jacob Madsen and Hustru Olga Madsens Foundation, P. A. Messerschmidt and Hustrus Foundation.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.