Aim: To examine the level of agreement between clinicians in assessing progressive deterioration in visual field series using two different methods of analysis.
Methods: Each visual field series satisfied the following criteria: more than 19 reliable fields, patient age over 40 years, macular threshold at least 30 dB. The first three fields in each series were excluded to minimise learning effects: the following 16 were studied. Five expert clinicians assessed the progression status of each series using both standard Humphrey printouts and pointwise linear regression (progressor). The level of agreement between the clinicians was evaluated using a weighted kappa statistic.
Results: A total of 432 tests comprising 27 visual field series of 16 tests each were assessed by the clinicians. The level of agreement on progression status between the clinicians was always higher when they used progressor (median kappa = 0.59) than when they used Humphrey printouts (median kappa = 0.32). This was statistically significant (p = 0.006, Wilcoxon matched pairs signed rank sum test).
Conclusions: Agreement between expert clinicians about visual field progression status is poor when standard Humphrey printouts are used, even when the field series studied are long and consist solely of reliable fields. Under these ideal conditions, clinicians agree more closely about patients’ visual field progression status when using progressor than when inspecting series of Humphrey printouts.
- visual field
- progressor program
“At present, there is no generally accepted technique for detecting change in the visual field over time using automated perimetry … most clinicians who use automated perimetry are probably employing simple visual inspection of the perimetric data to diagnose progressive visual field loss in patients with glaucoma.”1
Although these statements were made over a decade ago they are probably still true today, even though the knowledge of whether a patient’s glaucoma is progressive or stable remains central to the management of the condition. Several scoring systems have been devised in order to identify visual field progression for the purposes of research2–5 but none has found widespread acceptance in general clinical practice. However, unaided clinical judgment is inconsistent: even expert observers show considerable disagreement about whether a given visual field series signifies progression or stability.1 One possible reason for this is that the standard output of most automated perimeters provides inadequate information relating to progression or stability. Thus, when the clinician is attempting to decide whether or not a given series of outputs constitutes progressive disease, the task involves manually comparing the decibel sensitivity values (or processed versions thereof) or graphical plots for all fields in the series. This task is further complicated by the contributions of “within test” variability (short term fluctuation)6 and “between test” variability (long term fluctuation),7 both of which are known to be increased above normal in glaucoma.8 For these reasons, and because the grids of numbers produced by automated perimeters are easily amenable to numerical analysis, a great variety of software and statistical approaches have been taken to aid in the determination of visual field progression in glaucoma. 
One set of methods relies on estimates of change in summary measures of the field such as regression analysis of the mean defect value,9 mean deviation,10 other global measures,10 measurement of whole field and quadrantic sensitivity losses,11 and trend and regression analysis of various estimates of the sensitivity of the whole field or parts of it.12–14 However, the analysis of summary measures, whether based on the whole field or on clusters of points within it, has been found to be “remarkably poor”15 and “of little value”16 in detecting glaucomatous change. Summary measures largely or completely ignore the detailed spatial information contained within computerised field tests and are insensitive to early localised change.17 Furthermore, different regions of the visual field may deteriorate at different rates.14,18,19
progressor, which has been described fully previously in the BJO,20 avoids these problems by performing a point by point linear regression analysis of sensitivity on time for the whole visual field series. This technique has been used for several years to investigate glaucomatous visual field change21,22 and has recently been re-examined.10 The pointwise linear model has been demonstrated to provide a valid framework for detecting and forecasting glaucomatous loss.23 Using equivalent progression criteria, progressor has been found to compare favourably with other pointwise analyses such as the statpac 2 glaucoma change probability analysis24 for the Humphrey field analyser (Humphrey Instruments Inc, Dublin, CA, USA).25 Pointwise linear regression has been found to agree more closely with expert clinical judgment about progression status than glaucoma change probability analysis.26
Since there is no general consensus on what pointwise value of regression slope and p value constitutes progression, or whether there should be a requirement for contiguous points to show this behaviour and whether it should be maintained in subsequent fields, at present progressor remains a subjective analysis. For this reason, and because there is no universally accepted gold standard for visual field progression, this study investigates the usefulness of progressor by determining the level of agreement between expert observers using both progressor and standard clinical techniques (manually comparing serial printouts from an automated perimeter).
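To make the pointwise approach concrete, the following is a minimal sketch of a regression of sensitivity on time at each test location. It is not the progressor implementation: the function names, the slope limit of −1 dB per year, and the t statistic limit (−2.145, the two sided 5% critical value for 14 degrees of freedom, that is, 16 fields) are assumptions chosen only for illustration, and the two location series are invented data.

```python
import math

def pointwise_regression(times, fields):
    """Regress sensitivity (dB) on time (years) separately at each location.

    times  : test times in years, one per field
    fields : one list of dB sensitivities per field, same locations throughout
    Returns one (slope in dB/year, t statistic) tuple per location.
    """
    n = len(times)
    mean_t = sum(times) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    results = []
    for loc in range(len(fields[0])):
        y = [field[loc] for field in fields]
        mean_y = sum(y) / n
        sxy = sum((t - mean_t) * (v - mean_y) for t, v in zip(times, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_t
        rss = sum((v - (intercept + slope * t)) ** 2 for t, v in zip(times, y))
        se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
        if se > 0:
            t_stat = slope / se
        else:
            t_stat = 0.0 if slope == 0 else math.copysign(math.inf, slope)
        results.append((slope, t_stat))
    return results

def flag_locations(results, slope_limit=-1.0, t_limit=-2.145):
    """Indices of locations meeting the (hypothetical) progression criteria."""
    return [i for i, (slope, t_stat) in enumerate(results)
            if slope <= slope_limit and t_stat <= t_limit]

# Invented example: 16 fields at roughly 4 monthly intervals, two locations.
times = [i / 3 for i in range(16)]
noise = [1.0 if i % 2 == 0 else -1.0 for i in range(16)]
fields = [[28.0 + e, 30.0 - 1.5 * t + e] for t, e in zip(times, noise)]

results = pointwise_regression(times, fields)
flagged = flag_locations(results)  # only the declining location is flagged
```

Because the choice of slope and significance criteria is not settled, both limits are left as parameters rather than fixed values.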
Selection of visual field data
The visual field series presented to the five clinicians in the study for evaluation were drawn from the clinical visual field database of the glaucoma service at Moorfields Eye Hospital, a tertiary referral centre also serving the local community. At the time of the study this database contained Humphrey visual field tests performed between 16 January 1985 and 6 August 1997. All data were the result of standard clinical testing on Humphrey model 630 perimeters. The following filters were applied to the database in order to obtain a group of visual field series such as those commonly encountered in clinical practice when the degree of visual field deterioration must be estimated, but without specifying the presence, absence, or nature of progression in the visual field series to be studied based on any a priori assumptions of what constitutes progression:
Patient age greater than 40 years
Each field series consisted solely of tests using the Humphrey 24-2 and 30-2 test grid patterns with standard 4–2 dB staircase thresholding strategy
A white stimulus of Goldmann size III was used throughout
Each field test was required to meet reliability criteria as set by the Humphrey perimeter: fewer than 20% fixation losses, fewer than 33% false positive and fewer than 33% false negative responses
The macular threshold of each field test was required to be at least 30 decibels. This criterion was introduced so as to rule out visual field tests with defects arising from significant media opacities or macular disease
Each field series included in the study contained at least 19 fields satisfying the foregoing criteria. The first three fields in each series were ignored to obviate learning effects27: the following 16 fields were presented to the clinicians for analysis
If both eyes of a particular patient fulfilled the inclusion criteria, one eye was selected at random.
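The reliability, macular threshold, and series length filters listed above can be sketched as follows. The record type and field names are hypothetical, the age and test pattern filters are omitted for brevity, and the sketch assumes that tests failing a criterion are dropped from a series before the fields are counted, which the description above does not state explicitly.

```python
from dataclasses import dataclass

@dataclass
class FieldTest:
    # Hypothetical record; rates are fractions between 0 and 1.
    fixation_loss_rate: float
    false_positive_rate: float
    false_negative_rate: float
    macular_threshold_db: float

def is_eligible(test):
    """Apply the reliability and macular threshold criteria to one test."""
    return (test.fixation_loss_rate < 0.20
            and test.false_positive_rate < 0.33
            and test.false_negative_rate < 0.33
            and test.macular_threshold_db >= 30)

def select_study_fields(series, skip=3, keep=16):
    """Return the fields to analyse from a chronological series, or None.

    The first `skip` eligible fields are discarded (learning effects) and
    the following `keep` fields are retained.
    """
    eligible = [t for t in series if is_eligible(t)]
    if len(eligible) < skip + keep:
        return None  # fewer than 19 qualifying fields: series excluded
    return eligible[skip:skip + keep]

good = FieldTest(0.05, 0.10, 0.10, 33.0)
unreliable = FieldTest(0.40, 0.10, 0.10, 33.0)

long_series = [good] * 10 + [unreliable] + [good] * 10
short_series = [good] * 18

selected = select_study_fields(long_series)
excluded = select_study_fields(short_series)
```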
Analysis of visual field data
Each of the visual field series was assessed by all five clinicians (AMcN, MW, DK, DG-H, RH). All five clinicians were glaucoma specialists. They were all experienced in the interpretation of series of standard Humphrey visual field printouts (raw sensitivity values, the grey scale plot, total deviation values and plot, pattern deviation values and plot, global indices, and glaucoma hemifield test) in order to determine progression status, both from clinical practice and from research into visual field deterioration. At the time of the study two of the clinicians (AMcN and RH) were familiar with progressor: the others had a working knowledge from the published literature. Initially, the clinicians were asked to examine the visual field series presented as standard Humphrey printouts. Although each patient’s visual fields were, of course, presented in chronological order within a field series, the field series themselves were presented in a random order determined by a software random number generator. Clinicians were asked to use their judgment to assess each field series and to assign it to one of four categories: definitely stable, probably stable, probably progressing, or definitely progressing. Then, the clinicians were asked to use their judgment to rate the field series using progressor. For this analysis the field series were presented in a different random order from that used for the Humphrey printouts. (Naturally, each patient’s visual fields remained in chronological order within a given field series.)
The clinicians were allowed to use any of the options within progressor which they considered were necessary for an accurate analysis. These included the cumulative graphical output, a choice of progression criteria for slope and p value, grey scale plots, animation analysis, and Gaussian filtering. The latter has been shown to reduce long term fluctuation28 without delay in the detection of visual field deterioration.29 The clinicians were asked to categorise the field series using progressor into the same four categories previously described for use with the Humphrey printouts. All the clinicians evaluating the visual field series were aware of the purpose of the study.
In order to measure the level of intraobserver agreement two of the clinicians were asked to re-examine all the field series using both Humphrey printouts and progressor 3 months after their original analysis. These clinicians were given no warning that intraobserver variation was to be measured as part of the study. Visual field series were evaluated independently by each clinician. All field series were anonymised. For the intraobserver agreement study clinicians were masked to their previous evaluations.
The level of agreement among five individuals (A, B, C, D, E) may be expressed as the levels of agreement of the 10 possible pairs (A&B, A&C, A&D, A&E, B&C, B&D, B&E, C&D, C&E, D&E). The level of interobserver agreement was measured using the weighted kappa statistic30 for all 10 possible pairs of clinicians. Weighted kappa is an appropriate chance adjusted measure of agreement between two observers when there are more than two ordered categories of classification. The statistic, which ranges from 0 (agreement no better than chance) to 1 (perfect agreement), gives partial credit for partial agreement in accordance with a linear weighting scheme. Weighted kappa is obtained by giving weights to the frequencies in each cell of the table according to their distance from the diagonal that indicates agreement. A simple linear weighting scheme was adopted. For the cell in row $i$ and column $j$, with observed frequency $f_{ij}$, the weight was calculated as $w_{ij} = 1 - |i - j|/(g - 1)$, where $g$ is the number of categories (four in this study).
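As an illustration of the calculation, the sketch below computes a linearly weighted kappa from a square table of agreement between two observers, assuming the usual linear weights 1 − |i − j|/(g − 1) for g ordered categories. The function name and the 4 × 4 table are invented for the example; they are not the study data.

```python
def weighted_kappa(table):
    """Linearly weighted kappa for a square table of counts.

    table[i][j] is the number of series placed in category i by the first
    observer and in category j by the second.
    """
    g = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(g)) for j in range(g)]
    observed = 0.0
    expected = 0.0
    for i in range(g):
        for j in range(g):
            weight = 1 - abs(i - j) / (g - 1)  # 1 on the diagonal, 0 at the corners
            observed += weight * table[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / (n * n)
    return (observed - expected) / (1 - expected)

# Invented table for 27 series; categories run from definitely stable
# (row/column 0) to definitely progressing (row/column 3).
example_table = [
    [4, 2, 1, 0],
    [1, 3, 2, 1],
    [0, 2, 4, 2],
    [0, 0, 1, 4],
]
example_kappa = weighted_kappa(example_table)
```

With only two categories the weights reduce to 1 on the diagonal and 0 elsewhere, so the function then returns the ordinary unweighted kappa.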
Additionally, the level of the observer concordance was assessed: the proportion of the observers agreeing about the category of a particular field series was measured. Similar statistics were used to measure the level of intraobserver agreement.
All statistical analysis was performed using the software package s-plus 3.2 for Windows (StatSci Europe, MathSoft Inc, Oxford, UK).
Twenty seven visual field series were assessed by the clinicians. Since each series consisted of 16 visual field tests each clinician assessed a total of 432 visual field tests. The median age of the patients used in this study at the first visual field in the series analysed was 61 years (range 44–72 years). The median Humphrey mean deviation (MD) of the first visual field in each series analysed was −7.7 dB (range −0.1 to −14.8 dB). The median length of follow up for the visual field series was 5.7 years (range 3.3–7.7 years).
Tables of agreement were produced for all 10 possible pairs of the clinicians. For example, Table 1 shows the classification of the 27 field series by one pair of the clinicians (A and B) when using the Humphrey printouts. The weighted kappa value for the agreement exhibited is 0.28 (SE 0.08). Table 2 shows the results for the same pair of clinicians when using progressor analysis. The weighted kappa value for the agreement between the two clinicians has increased substantially to 0.63 (SE 0.13). Approximate 95% confidence intervals may be calculated as kappa ± 2 × SE (kappa) according to Fleiss.30
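Applying the ± 2 × SE rule to the two values quoted for clinicians A and B gives the following approximate intervals. The helper name is an assumption; the kappa and SE values are those reported above.

```python
def approx_ci(kappa, se):
    """Approximate 95% confidence interval as kappa plus or minus 2 SE."""
    return (kappa - 2 * se, kappa + 2 * se)

humphrey_ci = approx_ci(0.28, 0.08)    # roughly 0.12 to 0.44
progressor_ci = approx_ci(0.63, 0.13)  # roughly 0.37 to 0.89

# The intervals overlap if the upper Humphrey limit reaches the lower
# progressor limit.
intervals_overlap = humphrey_ci[1] >= progressor_ci[0]
```

For this pair the two intervals overlap between about 0.37 and 0.44.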
The weighted kappa values for all the pairs of clinicians using the different methods of analysis are shown graphically in Figure 1. All pairs of clinicians demonstrated greater weighted kappa values for agreement when progressor was used compared to Humphrey printouts. The median weighted kappa value for all pairs of clinicians using progressor was 0.59 compared to a median of 0.32 when using Humphrey printouts. Qualitative interpretation of kappa levels of agreement is imprecise but guidelines have been published.31 Generally a value less than 0.40 indicates only “slight” agreement and a value above 0.60 indicates “substantial” agreement. The approximate 95% confidence intervals for the weighted kappa values of progressor and Humphrey printouts for individual pairs of clinicians overlap, suggesting that statistical significance has not been achieved for these individual cases. However, the finding that all pairs of clinicians consistently demonstrated higher levels of agreement when using progressor compared to Humphrey printouts is statistically significant at the 5% level (p = 0.006, Wilcoxon matched pairs signed rank sum test).
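The individual kappa values appear only in Figure 1, so the sketch below uses just the stated fact that all 10 pairwise differences favoured progressor. It computes a two sided p value for the signed rank statistic using the normal approximation with a continuity correction, assuming no tied or zero differences. Whether the original analysis used this approximation is an assumption; it is shown because it gives a value close to the one reported.

```python
from math import erfc, sqrt

def signed_rank_p_normal(n_pairs, w_plus):
    """Two sided p value for the Wilcoxon signed rank statistic.

    Uses the normal approximation with continuity correction and assumes
    no ties and no zero differences.
    """
    mean = n_pairs * (n_pairs + 1) / 4
    sd = sqrt(n_pairs * (n_pairs + 1) * (2 * n_pairs + 1) / 24)
    z = (abs(w_plus - mean) - 0.5) / sd
    return erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))

n_pairs = 10
w_plus = n_pairs * (n_pairs + 1) // 2  # all differences positive: ranks 1..10 sum to 55
p_approx = signed_rank_p_normal(n_pairs, w_plus)

# Exact two sided p for the same outcome: only 2 of the 2**10 sign
# patterns are this extreme.
p_exact = 2 / 2 ** n_pairs
```

Under these assumptions the approximation rounds to 0.006, while the exact calculation gives about 0.002; both are below the 5% level.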
Tables 3 and 4 show the concordance between the clinicians in classifying the visual field series using the different methods. One hundred per cent concordance was defined as all five clinicians either agreeing a series was progressing (either probably or definitely) or all five clinicians agreeing a series was stable (either probably or definitely). This rate of concordance was achieved in only 10 of the 27 series (37%) when the clinicians used Humphrey printouts. In comparison, the 100% rate of concordance was doubled when the clinicians used progressor: all five clinicians had identical opinions on 20 of the 27 series (74%).
Tables 3 and 4 also show that 13 of the 27 series (48%) were classified as progressing by the majority (at least three or 60%) of the clinicians using Humphrey printouts. In contrast, 19 of the 27 series (70%) were classified as progressing by the majority of the clinicians when using progressor.
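The concordance and majority figures can be reproduced from per series ratings once the four categories are collapsed into stable and progressing. In the sketch below the function names and the example ratings are invented; only the counts out of 27 are taken from the text.

```python
PROGRESSING = {"probably progressing", "definitely progressing"}

def concordance(ratings):
    """Fraction of observers in the larger group after collapsing the four
    categories into stable versus progressing."""
    n_progressing = sum(r in PROGRESSING for r in ratings)
    return max(n_progressing, len(ratings) - n_progressing) / len(ratings)

def majority_progressing(ratings):
    """True if more than half of the observers rated the series as progressing."""
    return 2 * sum(r in PROGRESSING for r in ratings) > len(ratings)

# Invented ratings for one series from five observers.
ratings = [
    "definitely progressing",
    "probably progressing",
    "probably progressing",
    "probably stable",
    "definitely progressing",
]
series_concordance = concordance(ratings)       # 4 of 5 observers agree
series_majority = majority_progressing(ratings)

# Reported counts expressed as percentages of the 27 series.
full_concordance_humphrey = round(100 * 10 / 27)    # 37
full_concordance_progressor = round(100 * 20 / 27)  # 74
majority_humphrey = round(100 * 13 / 27)            # 48
majority_progressor = round(100 * 19 / 27)          # 70
```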
Intraobserver agreement was assessed in two of the clinicians as described in the Methods section. The weighted kappa statistic for intraobserver agreement for clinician A was 0.43 (SE 0.11) using Humphrey printouts and 0.71 (SE 0.13) when using progressor. The weighted kappa statistic for intraobserver agreement for clinician B was 0.60 (SE 0.12) using Humphrey printouts and 0.83 (SE 0.09) when using progressor.
It is not particularly surprising that only slight interobserver agreement was found when standard Humphrey printouts were used as a basis for a decision about glaucomatous visual field progression (median weighted kappa value 0.32). In a previous study, Werner and colleagues measured the level of agreement between six experienced observers in rating the progression status of automated visual field series from 30 glaucoma patients.1 Although the weighted kappa value was not reported in that paper, it has subsequently been calculated32 as 0.402. This is comparable to our findings. It is interesting that Werner and colleagues found a slightly higher level of agreement than ours, even though in their study the mean number of visual fields in each series was 6.3 whereas in the present study it was fixed at 16. This suggests that the clinicians in the present study did not benefit from being asked to analyse relatively more visual field data per subject. In fact, longer periods of visual field follow up may actually make the task of deciding about progression status more difficult and complex, when this task is based on standard automated visual field printouts. Werner and colleagues found that all the observers agreed about progression status in 11 of their 30 subjects. The present study found that there was complete agreement about 10 of the 27 subjects. Although Werner and colleagues used six observers whereas the present study used five, these results appear comparable. Once again, the level of agreement between observers in rating visual field progression based on standard automated perimeter output does not seem to have been influenced by the relatively longer follow up in the present study.
A consistently higher level of interobserver and intraobserver agreement was found when progressor was used to rate progression status rather than standard perimeter output. This suggests that clinicians are better able to make meaningful, systematic decisions about visual field progression status when using progressor rather than standard automated perimeter output. Chauhan and colleagues conducted a similar study to the present one in which five observers rated the progression status of 32 visual field series using a computer animated graphics technique which corrected for test-retest variability in order to aid the recognition of progression.32 They found a weighted kappa value of 0.572 which is very similar to the figure of 0.59 obtained in the present study. Chauhan and colleagues found that at least four out of five observers (80% or greater concordance) agreed on progression status in 27 out of 32 visual field series (84%). This is remarkably similar to our findings of 80% or better concordance on 23 out of 27 field series (85%). The level of complete agreement between all five observers (100% concordance) is, however, greater in the present study: 74% compared to Chauhan and colleagues’ figure of 56%. Both these studies suggest that the level of agreement about progression status rises when clinicians analyse displays of visual field series which highlight information pertaining to progression which is not obvious in the standard perimeter outputs.
The number of visual field tests in each series in our study was fixed at 16. This filter was applied to the database of visual field series in order both to optimise the chances of agreement between observers and to produce a similar number of visual field series to those used by Werner and colleagues (30 series) and Chauhan and colleagues (32 series) without resorting to random or other potentially biased methods of subselection. As a consequence of this procedure the visual field series used in the present study were longer than those used by Werner and colleagues (mean series length 6.3) or Chauhan and colleagues (median series length 7.0). As discussed previously, the difference in series length does not appear to have greatly influenced the level of agreement when observers are considering standard Humphrey printouts. Levels of agreement also seem similar between progressor and the technique used by Chauhan and colleagues despite the difference in series length, but it is difficult to speculate what effect shorter series length might have on the level of agreement using progressor. As a further consequence of the filtering procedure, field series with relatively frequent (approximately 4 monthly) tests were obtained. This frequency of testing is undertaken when progression is suspected33 and for the routine follow up of normal tension glaucoma patients in our department: these groups of patients are likely to be relatively over-represented in the study sample.
The requirement for the macular threshold of each visual field test in a series to be at least 30 dB almost certainly affected the results of this study. Had this criterion not been imposed, the clinicians might have shown less agreement on visual field series progression status, or they might have shown higher levels of agreement over visual field deterioration which was not in fact truly glaucomatous in nature. Media opacities in particular are an important source of error in studies which specify visual field deterioration in glaucoma as an outcome: various strategies have been employed to mitigate the effects of media opacities, including those reported in the Collaborative Normal Tension Glaucoma Study34 and the Early Manifest Glaucoma Trial.3 In the present study, however, visual field progression was not an outcome measure and the inclusion of visual field series with significant contributions from media and macular pathology would have presented our clinicians with an inappropriate task: in the clinical scenario, the task of judging progression in these cases is informed by clinical examination of the patient. The 30 dB macular threshold criterion may have resulted in visual field tests showing advanced glaucomatous damage being excluded from the study. However, many of the test locations in such cases fall outside the dynamic range of the perimeter and thus yield little useful information about progression status. Strictly, though, the results of this study may only be generalised to similar clinical scenarios (that is, learning effects accounted for, all tests reliable, long series with good macular sensitivity threshold).
The clinicians in the present study were given no explicit guidelines upon which to base the diagnosis of stability or progression for each visual field series. They were merely asked to use their clinical judgment based on the perimetric output. Thus, it is possible that a different group of observers might have used different “personal progression criteria” which would have led to different results. In the absence of a gold standard for the diagnosis of glaucomatous visual field progression, however, the analytical task presented to the clinicians in the present study resembles the conditions encountered in clinical practice more closely than if predetermined progression criteria had been imposed. Although the level of agreement between the observers was similar for progressor and for the glaucoma change analysis described by Chauhan and colleagues, speculation about the personal progression criteria used by the observers in the two studies is difficult, since the analytical task of diagnosing progression or stability using progressor is quite different from that using the technique described by Chauhan and colleagues.
The use of progressor led to the clinicians classifying a higher proportion of the field series as progressing than when the standard printouts were used. In the absence of an external gold standard for visual field progression, this may indicate either an increased sensitivity in the detection of progression or alternatively a reduction in specificity. An approach using software simulation of visual field data35 has suggested that pointwise linear regression has a sensitivity and specificity of over 90% to detect significant rates of deterioration of 2.5 dB per year or worse, when 10 fields per series are available for study. However, if the number of fields available is limited to five, the sensitivity is reduced to around 25% (though the specificity is maintained owing to the requirement for a statistically significant deterioration). This lack of sensitivity when few fields are available for analysis is common to all forms of progression assessment. There is some evidence that the reliability of progressor in detecting visual field deterioration in glaucoma compares favourably with other automated analyses25 such as statpac 2. Furthermore, more visual field series were classified as stable with 100% concordance using progressor than using standard Humphrey printouts: in only one of the eight field series classified as stable using progressor was there less than 100% concordance. It is also possible that the higher levels of agreement found with progressor are partly explained by the higher proportion of series classified as progressing using progressor, since this in itself will tend to generate more agreement. It is difficult to estimate the size of this effect since agreement and decisions about progression status cannot be studied separately.
It is likely that new developments and refinements in the area of computer assisted diagnosis of visual field progression in glaucoma will yield higher levels of agreement between clinicians in the future. However, the interpretation of any algorithm and its application to patient management within the clinical context will remain a subjective matter of clinical expertise.
Supported in part by grants from the International Glaucoma Association, the Royal National Institute for the Blind and the Medical Research Council.
Disclosure of interest: ACV, FWF and RAH are developers of the progressor software used in this study. The other authors have no commercial interest.