Repeatability and reproducibility of upper eyelid measurements
  1. K Boboridis (1),
  2. A Assi (1),
  3. A Indar (1),
  4. C Bunce (2),
  5. A G Tyers (1)
  1. (1) Adnexal Service, Moorfields Eye Hospital, London EC1V 2PD, UK; (2) Glaxo Department of Epidemiology, Moorfields Eye Hospital, London EC1V 2PD, UK
  1. Mr A G Tyers, Salisbury District Hospital, Salisbury SP2 8BJ, UK


AIM The aim of this study was to assess the repeatability and reproducibility by physicians of upper lid measurements and to investigate the influence of clinical experience on the learning curve effect.

METHODS Both eyes of 22 outpatients were assessed for three basic measures of ptosis: marginal reflex distance (MRD) for upper and lower lids, upper lid skin crease (SC), and levator function (LF). Patients with variable eyelid positions were excluded. The patients were measured twice by a consultant and once by each of a clinical fellow, a specialist registrar, and a senior house officer in random order. Each observer was masked to their colleagues' results and followed a standard measurement protocol. Data were analysed using Bland-Altman plots.

RESULTS Consultant repeatability was high and consistent, the median difference between measures being 0 for each of the four parameters. Clinically acceptable reproducibility was shown in all measurements for even the least experienced physician and was particularly consistent for extreme observations. There was evidence of a learning curve effect.

CONCLUSIONS These results suggest that interobserver and intraobserver variability in the assessment of upper lid ptosis is low and clinically acceptable when a standard measurement protocol is followed.

  • repeatability
  • reproducibility
  • marginal reflex distance
  • skin crease
  • levator function
  • eyelid
  • measurement

Fundamental to the assessment of the patient and the choice of operation in ophthalmic plastic surgery is the measurement of certain parameters of the eyelids. These include function in the levator muscle of the upper eyelid, the position of the eyelids with the eyes in the primary position, and the level of the upper lid skin crease. Incorrect values may lead to an incorrect diagnosis or an inappropriate operation. It is our perception that clinicians inexperienced in ophthalmic plastic surgery frequently make errors in these measurements. The purpose of this study was to test interobserver and intraobserver variability in such assessment. Clinicians of varying experience were included to allow examination of any learning curve effect.

Patients and methods

Both eyes of 22 patients attending the ophthalmic plastic surgery clinic were assessed for the levator function (LF), the position of all four eyelids using the margin reflex distance method (MRD), and the level of the upper lid skin crease (SC).

A number of clinical conditions were included. Patients with variable eyelid position (for example, myasthenia gravis), inflamed eyelids, or photophobia were excluded. For each measurement the clinician sat in front of the patient at the same level and both looked in the primary position.

To measure LF the patient's eyebrow was stabilised by pressure exerted with the examiner's thumb. The patient was requested to look fully up, then fully down, while the excursion of the eyelid was measured against a ruler. This was repeated three times and the average measurement was recorded.

To measure MRD the patient was requested to look at a light source (pen torch) and the distances from the corneal light reflex to the upper eyelid and to the lower eyelid were recorded.

To measure the level of the SC the patient was asked to look down, the upper lid skin fold was gently raised if necessary, and the distance of the skin crease from the eyelid margin was recorded. The values were estimated to the nearest full millimetre.

Four clinicians took part in the study: a consultant, a fellow in ophthalmic plastic surgery, a specialist registrar, and a senior house officer. Both junior clinicians were attached to the Adnexal Service.

Each clinician assessed each patient in random order, the consultant measuring patients twice. Observers were masked to their fellows' assessments. The collection of data was spread over a 3 month period so that it would be possible to assess any trend in the accuracy of the readings obtained by any of the observers.


Table 1 presents the range of measures on each parameter made by each clinician. The data were analysed with the graphical method described by Bland and Altman for method comparison studies. To test the intraobserver accuracy the difference between the two sets of measurements recorded by the consultant was plotted against the average of the two for each parameter. The average of the two sets of measurements made by the consultant was regarded as the best estimate of the true value. To test the interobserver accuracy the average value and the difference between this value and the results of the other observers were plotted as above. To assess whether there was any learning curve the position of the patient in the series and the time from the beginning of the study were noted.
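The plotting quantities described above can be sketched in a few lines of Python. This is an illustrative sketch only: the readings are invented for the example and are not data from the study, the function and variable names are hypothetical, and the sign convention (other observer minus consultant average) is an assumption, since the text does not state the direction of subtraction.

```python
from statistics import mean


def bland_altman_points(first, second):
    """Return (average, difference) pairs for two paired sets of readings."""
    if len(first) != len(second):
        raise ValueError("paired readings must have the same length")
    return [((a + b) / 2, a - b) for a, b in zip(first, second)]


# Invented levator function readings in mm (NOT data from the study).
consultant_1 = [12, 15, 8, 14, 5]
consultant_2 = [12, 14, 8, 15, 5]
registrar = [14, 16, 10, 17, 6]

# Intraobserver: difference between the consultant's two readings,
# paired with the average of the two.
intra = bland_altman_points(consultant_1, consultant_2)

# Interobserver: the consultant average is taken as the best estimate of
# the true value; the other observer's difference from it is paired with it.
reference = [(a + b) / 2 for a, b in zip(consultant_1, consultant_2)]
inter = [(ref, obs - ref) for ref, obs in zip(reference, registrar)]

# Average difference (bias) of the other observer relative to the reference.
bias = mean([diff for _, diff in inter])
```

Each pair gives one point on the plot, with the average on the horizontal axis and the difference on the vertical axis.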

Table 1

Range of eyelid measurements against clinician

Tables 2 and 3 provide information on repeatability and reproducibility, respectively. The right and left eyes were analysed separately.

Table 2

Consultant repeatability in eyelid assessment

Table 3

Clinician reproducibility in eyelid assessment


Table 1 shows that the measurements lay within the expected ranges for the parameters measured.


Table 2 illustrates that consultant repeatability was high for all four parameters. On average, the second measure was the same as the first (shown by the median difference being 0 in all cases) and never differed by more than 2 mm. There was little evidence that repeatability varied with the size of the measurement (Fig 1) or with time (Fig 2).
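The two summary figures quoted for repeatability, the median difference and the largest absolute difference, could be computed along the following lines. The readings below are invented for illustration and are not the study data; the function name is hypothetical.

```python
from statistics import median


def repeatability_summary(first, second):
    """Summarise paired repeat readings by median and largest absolute difference."""
    diffs = [b - a for a, b in zip(first, second)]
    return {
        "median_difference": median(diffs),
        "max_abs_difference": max(abs(d) for d in diffs),
    }


# Invented MRD readings in mm for six eyes (NOT data from the study).
first_reading = [4, 3, 2, 5, 1, 4]
second_reading = [4, 4, 2, 5, 0, 4]

summary = repeatability_summary(first_reading, second_reading)
```

A median difference of 0 indicates no systematic shift between the two sessions, while the largest absolute difference bounds the worst single disagreement.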

Figure 1

Bland-Altman plot of consultant repeatability for measurement of marginal reflex distance for upper eyelid.

Figure 2

Repeatability of consultant measures of levator function against time.


Table 3 illustrates high reproducibility between the clinical fellow and the consultant. On average, there was no difference between their measures of MRD and a difference of just 0.5 mm for SC and 1 mm for LF. There was only slightly less reproducibility between the specialist registrar and the consultant with, on average, no difference between their MRD measures and a difference of 0.5 mm for SC. Table 3 does show, however, that the specialist registrar tended to record LF at 2 mm greater than the consultant and recorded a value 5 mm greater for one patient. There was slightly poorer reproducibility between the consultant and the senior house officer with, on average, differences of 0.5 mm, 1 mm, and 1.5 mm for MRD (upper), SC, and LF (right eye), respectively. While greater absolute differences between measures were seen in LF assessment, the range of acceptable values was greater for LF than for SC and MRD, so the differences were proportionately of similar clinical significance.

Figure 3 illustrates the slight increase in SC reproducibility over time. While in general there seemed little indication from these data of any variability in reproducibility over the parameter ranges, there is some suggestion that agreement between clinicians is greater at the extremes of LF (Fig 4).
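One simple way to put a number on such a trend is the least squares slope of the absolute difference against the patient's position in the series. This is offered only as a sketch: the study assessed the trend graphically and does not report a fitted slope, the numbers below are invented for illustration, and the function name is hypothetical.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least squares line of ys on xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented example (NOT data from the study): position of the patient in
# the series and the absolute SC difference from the consultant average, in mm.
patient_order = [1, 2, 3, 4, 5, 6]
abs_difference = [2, 2, 1, 1, 0, 0]

slope = least_squares_slope(patient_order, abs_difference)
```

A negative slope would mean that disagreement with the consultant shrinks as the series progresses, which is the pattern expected from a learning curve.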

Figure 3

Reproducibility of skin crease measurement between senior house officer (SHO) and consultant with time.

Figure 4

Reproducibility of levator function measurement between clinical fellow and consultant.


This study suggests that interobserver and intraobserver variability in upper eyelid ptosis assessment, when conducted in a standardised fashion, is modest and clinically acceptable, particularly in clinicians of greater experience. We have found some evidence of learning curve effects, both short and long term, and there is some suggestion of greater agreement at extremes of LF. This seems intuitive as a clinician may well check unusual observations more thoroughly than those that fall within typically encountered ranges.