Factors affecting the test-retest variability of Heidelberg retina tomograph and Heidelberg retina tomograph II measurements
  1. N G Strouthidis (1, 2)
  2. E T White (1)
  3. V M F Owen (3)
  4. T A Ho (1)
  5. C J Hammond (4)
  6. D F Garway-Heath (1, 2)

  1. Glaucoma Research Unit, Moorfields Eye Hospital, London, UK
  2. Institute of Ophthalmology, London, UK
  3. School of Biomedical and Natural Sciences, Nottingham Trent University, Nottingham, UK
  4. West Kent Eye Centre, Princess Royal Hospital, Orpington, Kent, UK

Correspondence to: D F Garway-Heath MD, FRCOphth, Glaucoma Research Unit, Moorfields Eye Hospital, 162 City Road, London EC1V 2PD, UK


Aims: To evaluate the test-retest variability of stereometric parameter measurements made with the Heidelberg retina tomograph (HRT) and Heidelberg retina tomograph II (HRT-II), and to establish which parameter(s) provided the most repeatable and reliable measurements with both devices. The factors affecting the repeatability of measurements of the selected parameter(s) were then investigated.

Methods: 43 ocular hypertensive and 31 glaucoma subjects were recruited to a test-retest study. One eye from each subject underwent HRT and HRT-II imaging by two observers on each of two occasions within 6 weeks of each other. Lens grading was carried out by LOCS III grading and Scheimpflug camera generated densitogram analysis.

Results: Rim area (RA) and mean cup depth measurements were found to be least variable. Both inter-test reference height difference and image quality had a strong relation (R2>0.5, p<0.0001) with inter-test RA difference and, together, are responsible for 70% of RA measurement variability. Image quality was influenced by lens opacity, cylindrical error, and age. Inter-test RA measurement differences were unrelated to the observer or visit interval.

Conclusions: RA represents an appropriate measure for monitoring glaucomatous progression. Reference height difference and image quality were the factors that most influenced RA measurement variability. Image analysis strategies that address these factors may reduce test-retest variability.


The ability to evaluate disease progression is of great importance in the management of patients with primary open angle glaucoma (POAG). Likewise, the detection of progression in subjects with ocular hypertension (OHT) could enable early intervention and delay the onset of visual field (VF) loss.1,2 POAG is a chronic, progressive disorder in which damage to retinal ganglion cell axons results in characteristic VF defects. However, there is evidence to suggest that structural changes to the optic nerve head (ONH) may occur before identifiable VF changes are detected by standard perimetry.3–7 Clinical examination of the ONH is of limited value in detecting progression because of interobserver variation.8 An objective approach, longitudinal quantitative imaging of the ONH, may be more useful for the detection of glaucomatous progression.

One imaging method is scanning laser tomography performed with the Heidelberg retina tomograph (HRT) and, more recently, the Heidelberg retina tomograph II (HRT-II). The HRT-II is intended for use in the clinical milieu by various operators, some of whom may not possess the level of experience required for the successful use of the previous model.9 Both devices generate three dimensional mean topography images from which a range of stereometric parameter values can be calculated. These parameters can be measured over time to detect progression. It is useful to estimate the test-retest repeatability of stereometric parameter measurements so that changes caused by disease progression and not by measurement error can be identified correctly.

The purpose of this study was to evaluate the test-retest variability of stereometric parameter measurements made with the HRT and HRT-II, and to establish which parameter(s) provided the most repeatable and reliable measurements. The factors that may affect the repeatability of measurements of the selected parameter(s) were then investigated.


Subject selection

A total of 74 (43 OHT; 31 POAG) subjects were recruited from the ocular hypertension clinic at Moorfields Eye Hospital. OHT was defined as intraocular pressure (IOP) more than 21 mm Hg on two or more occasions and a baseline Humphrey 24–2 full threshold Advanced Glaucoma Intervention Study (AGIS) score of zero.10 POAG was defined as a consistent AGIS VF score greater than 0, and a pretreatment IOP greater than 21 mm Hg on two or more occasions. All subjects had previous experience of scanning laser tomography. For each subject, one eye was selected on the basis that it had a refractive error less than 12 dioptres of spherical power and no history of previous intraocular surgery. In subjects with lens opacity, the eye with the greater degree of opacity was preferentially selected, although the presence of lens opacity itself was not a criterion for subject selection. This study adhered to the tenets of the Declaration of Helsinki and had local ethics committee approval and the subjects’ informed consent.

Testing protocol

Image acquisition was carried out by two experienced observers (ETW and NGS) at each of two visits within 6 weeks of each other. The testing sequences were performed using both the HRT (Version 2.01b; Heidelberg Engineering, Heidelberg, Germany) and HRT-II (Eye Explorer Version 1.7.0). In each subject, the scan focus (HRT and HRT-II) and depth of focus (HRT) that were used in the first visit were also used in the second visit. A series of three scans was acquired by each observer at a 10° field of view for the HRT, and 15° for the HRT-II (the two different scanning angles have the same degree of resolution). The following imaging protocol was adhered to for all subjects:

  • Visit 1: ETW then NGS then ETW

  • Visit 2: ETW then NGS.

Following IOP measurement at visit 1, the eyes were dilated using tropicamide 1% to enable one observer (NGS) to carry out lens grading. Subjective grading was carried out using the Lens Opacity Classification System III (LOCS III).11 Nuclear opalescence (NO, range 0.1–6.9), nuclear colour (NC, range 0.1–6.9), posterior subcapsular (PS, range 0.1–5.9) and cortical (C, range 0.1–5.9) scores were graded against a standardised transparency. Scheimpflug photography was performed using the Case 2000 series (Marcher Diagnostics, Hereford, UK). The central nuclear dip (CND) value, derived from digitised densitograms, was used as an objective lens score.12

Image analysis

Heidelberg Eye Explorer (Version 1.7.0), the operating system of the HRT-II, was used to generate mean topography images and to perform image analysis. The term “HRT-II Explorer” is used to indicate when Explorer was used to analyse the HRT-II mean topographies. The HRT topographies were imported into the HRT-II operating platform as HRT-Port files. HRT mean topographies were generated and analysed using the same Explorer software as the HRT-II images (termed “HRT Explorer”). HRT mean topographies were also generated and analysed using an option on the Explorer software called “HRT Classic”, which is derived from the older MS-DOS HRT software. HRT and HRT-II images may be examined interchangeably, and therefore longitudinally, using Explorer; HRT-II images are not compatible with HRT Classic. Contour lines were drawn by one observer (NGS) on the baseline mean topographies and exported to the subsequent images. Four different image sequences were analysed for both imaging devices:

  • intraobserver/intravisit (ETW then ETW, same visit)

  • interobserver/intravisit (ETW then NGS, same visit)

  • intraobserver/intervisit (ETW then ETW, different visits)

  • interobserver/intervisit (ETW then NGS, different visits).

The Standard reference plane was used for all analyses in this study. The mean pixel height standard deviation (MPHSD) was recorded for each mean topography as a proxy measure of image quality.
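MPHSD reflects how much the pixel height measurements vary across the three single topographies that are combined into the mean image. The sketch below illustrates that idea only. It assumes a sample standard deviation per pixel followed by a plain average over pixels; the function name and the data are hypothetical, and the instrument software may compute the value differently.

```python
from statistics import mean, stdev


def mean_pixel_height_sd(topographies):
    """Average, over all pixels, of the SD of height across the single images.

    `topographies` is a list of images; each image is a flat list of pixel
    heights, aligned so that index i is the same location in every image.
    ASSUMPTION: sample SD per pixel and an unweighted mean over pixels.
    This is an illustration, not the manufacturer's algorithm.
    """
    per_pixel_sd = [stdev(heights) for heights in zip(*topographies)]
    return mean(per_pixel_sd)


# Hypothetical 4-pixel "images" (heights in micrometres), three scans each.
steady = [[100, 200, 300, 400], [101, 199, 300, 401], [99, 201, 300, 399]]
noisy = [[100, 200, 300, 400], [120, 180, 310, 430], [80, 220, 290, 370]]

print(mean_pixel_height_sd(steady))  # small value: scans agree closely
print(mean_pixel_height_sd(noisy))   # larger value: scans disagree
```

A lower value indicates that the three scans agree more closely, which is why it can serve as a proxy for image quality.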

Statistical methods

The within subject coefficient of variation (CVw) was used to examine the repeatability of each stereometric parameter. It was calculated with the following equations:

[Equation image not recovered: definition of sw]

CVw = 100×sw/mean of all repeated measurements

where sw is the common standard deviation of repeated measurements (within subject standard deviation) and CVw is the within subject coefficient of variation.
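As a rough illustration of the CVw calculation, the sketch below takes sw to be the square root of the mean of the per-subject sample variances. That is a common way of estimating a within subject standard deviation, but it is an assumption here, as are the function names, the equal number of repeats per subject, and the rim area values.

```python
import math
from statistics import mean, variance


def within_subject_sd(measurements):
    """Common (within subject) SD of repeated measurements.

    `measurements` is a list of per-subject lists of repeated values.
    ASSUMPTION: sw = sqrt(mean of per-subject sample variances), with the
    same number of repeats for every subject.
    """
    return math.sqrt(mean([variance(reps) for reps in measurements]))


def within_subject_cv(measurements):
    """CVw = 100 x sw / mean of all repeated measurements."""
    sw = within_subject_sd(measurements)
    grand_mean = mean([v for reps in measurements for v in reps])
    return 100.0 * sw / grand_mean


# Hypothetical rim area values (mm2): two repeats for each of three eyes.
rim_area = [[1.20, 1.30], [0.90, 0.80], [1.50, 1.50]]

sw = within_subject_sd(rim_area)
cvw = within_subject_cv(rim_area)
print(f"sw = {sw:.4f} mm2, CVw = {cvw:.2f}%")
```

Because CVw divides by the mean, a parameter whose mean is close to zero will show a large CVw even when sw is small.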

Test-retest repeatability was also assessed by constructing Bland-Altman plots and by estimating the repeatability coefficient (RC) as:

RC = √2 × 1.96 × sw

This statistical method was applied when no relation was observed between observation magnitude and observation difference, and when the observation differences were normally distributed13 (JM Bland, personal communication, 2004).
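The repeatability coefficient and the quantities plotted in a Bland-Altman analysis can be sketched as follows. The function names and data are hypothetical, and the limits of agreement use the usual construction (mean difference ± 1.96 × SD of the differences), which the text does not spell out.

```python
import math
from statistics import mean, stdev


def repeatability_coefficient(sw):
    """RC = sqrt(2) x 1.96 x sw, where sw is the within subject SD."""
    return math.sqrt(2) * 1.96 * sw


def bland_altman(first, second):
    """Return (mean difference, lower limit, upper limit) for paired tests.

    ASSUMPTION: 95% limits of agreement are taken as the mean difference
    +/- 1.96 x sample SD of the paired differences.
    """
    diffs = [b - a for a, b in zip(first, second)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


# Hypothetical rim area values (mm2) for four eyes at two tests.
test1 = [1.20, 0.90, 1.50, 1.10]
test2 = [1.30, 0.80, 1.50, 1.20]

bias, lower, upper = bland_altman(test1, test2)
rc = repeatability_coefficient(0.1)  # sw of 0.1 mm2 chosen for illustration
print(f"bias = {bias:.3f}, limits = ({lower:.3f}, {upper:.3f}), RC = {rc:.3f}")
```

In a Bland-Altman plot, each difference is plotted against the mean of the two tests; a mean difference near zero with no trend across the measurement range supports using a single RC.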

Intraclass correlation coefficient (ICC) was used to estimate the reliability of the parameters generated.
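The text does not state which form of ICC was used. Purely as an illustration, the sketch below computes a one-way random effects ICC from the between subject and within subject mean squares; the choice of this form, the function name, and the data are all assumptions.

```python
from statistics import mean


def icc_one_way(measurements):
    """One-way random effects ICC for n subjects with k repeats each.

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and MSW are the
    between subject and within subject mean squares.
    ASSUMPTION: this ICC form is chosen for illustration only.
    """
    n = len(measurements)
    k = len(measurements[0])
    grand = mean([v for reps in measurements for v in reps])
    subject_means = [mean(reps) for reps in measurements]
    msb = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    msw = sum(
        (v - m) ** 2
        for reps, m in zip(measurements, subject_means)
        for v in reps
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical rim area values (mm2): two repeats for each of three eyes.
rim_area_repeats = [[1.20, 1.30], [0.90, 0.80], [1.50, 1.50]]
icc = icc_one_way(rim_area_repeats)
print(f"ICC = {icc:.3f}")
```

The between subject mean square appears in both numerator and denominator, so the same measurement error yields a higher ICC in a sample with a wider spread of true values.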

Scatter plots and regression lines were constructed to identify which factors influenced test-retest measurement variability with significant associations assumed at p<0.05. The factors evaluated were age, refractive error (spherical and cylindrical power), IOP, lens score (NO, NC, PS, C, and CND), MPHSD, inter-test reference height difference, disc area, and baseline rim area.
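A single-factor regression of this kind, and the R2 value it yields, can be sketched with ordinary least squares. The function name and the data below are hypothetical, and the sketch does not compute the p value.

```python
from statistics import mean


def linear_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r2), where r2 is the proportion of the
    variance in y explained by x.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


# Hypothetical data: inter-test reference height difference (mm) against
# inter-test rim area difference (mm2) for five eyes.
ref_height_diff = [-0.04, -0.01, 0.00, 0.02, 0.05]
rim_area_diff = [-0.15, -0.02, 0.01, 0.05, 0.18]

slope, intercept, r2 = linear_fit(ref_height_diff, rim_area_diff)
print(f"slope = {slope:.2f}, intercept = {intercept:.3f}, R2 = {r2:.2f}")
```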

All statistical analyses were performed using Medcalc Version (Medcalc Software, Mariakerke, Belgium) and SPSS Version 11.5 (SPSS Inc, Chicago, IL, USA).


The male:female ratio of the subjects was 41:33 and the right:left eye ratio was 43:31. The baseline subject characteristics are summarised in table 1.

Table 1

 Summary of baseline characteristics of test-retest subjects

Judged by the coefficient of variation, RA and mean cup depth had the highest measurement repeatability (table 2).

Table 2

 Coefficient of variation values for the stereometric parameters generated in the test-retest study

Judged by the ICC, mean cup depth, cup volume, cup area and RA were the most reliable parameters (table 3). There is no significant difference between the coefficients generated for these parameters in the situation most likely to be encountered in the longitudinal setting (interobserver/intervisit, IV).

Table 3

 Intraclass correlation coefficients for the stereometric parameters generated in the test-retest study

RA and mean cup depth measurements were the most consistently repeatable and reliable. As RA represents the more clinically meaningful measure, subsequent analyses were performed on this parameter.

RA interobserver/intervisit CVw (%) values were 10.3% (HRT Classic), 10.2% (HRT Explorer), and 7.8% (HRT-II Explorer). RA repeatability coefficients were similar for HRT Classic, HRT Explorer, and HRT-II Explorer, and were not affected by the observer or test interval (table 4). There was a tendency towards more repeatable measurements with HRT Classic, compared with HRT Explorer.

Table 4

 Rim area repeatability coefficients obtained with different types of HRT at different image sequences

Bland-Altman plots (fig 1) show similar interobserver/intervisit repeatability for the three HRT software platforms. The mean inter-test difference in all cases approximates zero.

Figure 1

 Bland-Altman plots of interobserver/intervisit rim area obtained with HRT Classic (A), HRT Explorer (B), and HRT-II Explorer (C).

Factors affecting repeatability

Inter-test reference height difference and mean image MPHSD were the two factors which consistently had a strong relation (R2>0.5) with inter-test RA difference for all testing permutations. Figures 2 and 3 show scatter plots, with a regression line, of intraobserver/intravisit RA difference against inter-test reference height difference and against mean MPHSD, respectively. Weaker relations (R2<0.5) of inter-test RA difference were observed with CND, LOCS III NC, and NO scores, and cylindrical power. Table 5 summarises these relations for intraobserver/intravisit RA.

Table 5

 Association (R2) between sources of variability and intraobserver/intravisit rim area differences (mm2)

Figure 2

 Scatter plot of intraobserver/intravisit rim area difference (mm2) against reference height difference (mm) using HRT Explorer. The regression line is also shown (R2 = 0.7, p<0.0001).

Figure 3

 Scatter plot of intraobserver/intravisit rim area difference (mm2) against mean image quality (MPHSD) using HRT Explorer. The regression line is also shown (R2 = 0.4, p<0.0001).

A weak relation (R2 = 0.1) with age was also found with HRT Classic and HRT-II in the interobserver/intravisit and interobserver/intervisit settings. No relation was observed with disc area, baseline RA, spherical power, spherical equivalent (spherical power + cylindrical power/2), IOP, time between visits, LOCS III PS, or LOCS III C scores.

A multiple regression was performed, with RA difference as the dependent variable, and the factors identified as influencing RA difference (that is, reference height difference, MPHSD, CND, NO, NC, age, and cylindrical power) as independent variables. Reference height difference (p<0.0001) and MPHSD (p = 0.05) were the only two significant variables (R2 = 0.7).

To elucidate which factors influenced image quality (as determined by MPHSD), a multiple regression was carried out for intraobserver/intravisit HRT Explorer, with MPHSD as the dependent variable and CND (as a single, objective measure of lens opacity), age, and cylindrical power as independent variables (R2 = 0.5). CND and cylindrical power displayed a highly significant (p<0.0001) relation with image quality, and age showed a weaker but significant (p = 0.03) relation.
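Both multiple regressions described above have the same structure: one dependent variable, several independent variables, and an overall R2. A minimal sketch using only the standard library is given below. The function names and data are hypothetical, the normal equations are solved by Gaussian elimination for simplicity, and p values for individual variables are not computed.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def multiple_regression(predictors, y):
    """Least squares fit of y on several predictors plus an intercept.

    `predictors` is a list of columns, one list per independent variable.
    Returns (coefficients, r2); coefficients[0] is the intercept.
    """
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return coef, 1.0 - ss_res / ss_tot


# Hypothetical data constructed so that y = 1 + 2*x1 + 3*x2 exactly.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
y = [4.0, 3.0, 11.0, 10.0, 18.0]

coef, r2_multi = multiple_regression([x1, x2], y)
print(coef, r2_multi)
```

With real data, the independent variables would be columns such as reference height difference and MPHSD, and R2 would fall below 1.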


Scanning laser tomography is a well established technique that provides reproducible ONH measurements.14,15 The topographic measures produced by the HRT and its predecessor, the laser tomographic scanner (LTS, Heidelberg Engineering, Heidelberg, Germany), have been demonstrated to be repeatable,16,17 and to have less variation compared with other techniques such as computer assisted planimetry.18 Little has been published about the reproducibility of the HRT-II.19,20 As the HRT-II is intended as a “clinical” instrument, its reproducibility under clinical conditions needs to be established. The profile of the subjects in this study was heterogeneous in terms of demographics, disease stage, refractive error, media opacity, and image quality, and therefore simulates the patient profile encountered in clinic. Image quality has previously been shown to be associated with pupil size and the degree of lens opacity (both objective scoring and LOCS III grading). Image quality was seen to improve with pupillary dilation but the improvements were often small.21 Pupil size was therefore not taken into consideration in this study. None of the subjects was taking miotic medications at the time of the study, although this was not a recruitment criterion.

Since the publication of the original reproducibility studies of the HRT,17,22 the Windows based Explorer platform has been introduced. From this perspective, this study is the first to examine the repeatability of HRT defined morphometric parameters using the newer software.

RA and mean cup depth were the most repeatable parameters for both devices. Some caution is required when interpreting coefficients of variation, as some parameters, such as cup area and cup shape measure, have mean values of low magnitude (approaching zero), which inflates the coefficient. Another difficulty is the interpretation of differences between ICC values of a similar magnitude. It is therefore unlikely that there is any real difference in measurement reliability between mean cup depth, cup area, cup volume, and RA. It should also be noted that the ICC values depend on the variability of the sample population. As our sample was enriched with eyes with lenticular opacity, the ICC values may not be applicable to other populations with less cataract.

Overall, RA and mean cup depth were consistently the most repeatable and reliable of the parameters measured. This concurs with previous findings.23 The findings from another study showed that mean cup depth and cup area were the least variable parameters measured with the HRT-II.19 There is, however, no advantage in measuring cup area as it is merely the difference between disc area (kept constant in Explorer) and RA. As it contains the retinal ganglion cell axons, RA is a meaningful parameter for physicians. It has also been shown to discriminate between normal, glaucoma, and OHT subjects,24–26 and is therefore an appropriate candidate for the assessment of progression.

In this study, the repeatability of RA measurements was similar with both devices (RC = 0.2–0.3 mm2), irrespective of observer or test interval. Similar repeatability between imaging performed at the same visit or at different visits is consistent with a previous study, where no difference was identified in the short term and long term variability of topographic measurements.27 The HRT-II performs at least as well as the HRT. The similar level of RA repeatability between HRT-II and HRT Explorer analyses indicates that the two methods could theoretically be used interchangeably in a longitudinal setting.

The sources of variability for the HRT have been documented and include patient/scanner misalignment,28 and interobserver differences in optic disc contour line drawing.18,29,30 The present study identifies inter-test reference height difference to be the most consistent factor related to test-retest variability. It has previously been reported that the use of a 320 μm reference plane reduced RA variability, compared with the Standard reference plane.31 Image quality, as recorded by MPHSD, is another factor consistently found to influence variability. MPHSD is a gauge of the variability of pixel height measurements across the three topographic images used to construct the mean image.14 This study shows that image quality was, in turn, influenced by lens opacity, age, and degree of astigmatism. Sihota et al also found a significant correlation between the test-retest variability of the HRT-II and age and degree of astigmatism.19 Our results suggest that MPHSD may be an appropriate summary measure for the effect of these factors. It is therefore possible to predict repeatability coefficients for various levels of image quality without having to measure the patient’s age, degree of media opacity, or astigmatism.

In conclusion, this study indicates that RA may be an appropriate measure when monitoring glaucoma progression, as its measurements consistently showed excellent repeatability and reliability. Repeatability was similar with both the HRT and HRT-II, irrespective of observer or test interval. Reference height difference and image quality were found to be the factors that influenced RA variability most. The findings of this study will be used as the basis for suggesting strategies for improving test-retest repeatability. Once this is achieved, strategies for monitoring stereometric parameter progression can be devised and tested.



  • Competing interests: NGS was funded by a Friends of Moorfields research fellowship and through an unrestricted grant from Heidelberg Engineering.
