Black magic or medical statistics?
Sample size prediction is an essential topic in the documentation of biometrical considerations for grant applications and in trial protocols that will be submitted to drug authorities; missing information on intended sample sizes and the underlying statistical power often results in severe amendments or even rejection of the submission. Therefore, this editorial intends to increase the flexibility of clinical investigators in communicating with biometrical consultants and administrative authorities about these planning aspects of clinical trials.
The most important determinants of sample size are the study design and the clinical end point's scale level. This text will consider two different design strategies: the paired data approach and the two sample approach. Paired data designs refer to intraindividual comparisons. Consider, for example, the comparison of the activity duration of two licensed mydriatic agents. For each study subject the application of drug I is randomised to one eye, drug II to the other. Because the substances can be compared within each individual, differentiating between them becomes more feasible than in an unpaired study design. The latter applies drug I to one group of people and drug II to a different group, and therefore suffers from additional interindividual variation between the study subjects. The paired design eliminates this additional data variation in the first place and therefore allows for an immediate treatment comparison. This reduction in data variation results in a considerable sample size reduction. The second topic to be determined during the planning phase of studies is the scale of the clinical end point. In principle, one can distinguish between continuous parameters (for example, intraocular pressure) and categorical ones (for example, “occurrence of postoperative subjective photic phenomena: none/slight/severe”). An important special case of categorical parameters is the binary end point (“therapeutic success: yes/no”); the best known continuous parameters are the normally distributed ones. The latter can be characterised via mean and standard deviation, whereas binary parameters are characterised via success or event frequencies.
INGREDIENTS FOR SAMPLE SIZE DETERMINATION
The following will assume that the clinical end point is normally distributed; that is, mean and standard deviation will suffice for its characterisation. Accordingly, group differences can be represented as differences between the treatment groups' mean values. The two sample Welch t test can then be applied to test the existence of a “mean difference” between the therapy groups. Four parameters have to be fixed in advance:

The significance level α denotes the maximum tolerable probability of falsely transferring group differences from a study onto its underlying patient population. Common values of α are 0.01 and 0.05.

The statistical power 1 − β refers to the probability of being able to detect existing differences between groups based on the patient number at hand. Accordingly, β denotes the probability of failing to detect group differences within a study. Common power values are 0.80 and 0.90.

The third and most important parameter is the “minimum detectable difference” between the treatment groups under consideration. Large differences between groups will be detected with fewer patients than very small differences. On the other hand, researchers should ask themselves whether a very small group difference is of clinical relevance at all. Therefore, the minimum group difference that would represent a clinically relevant distinction between the groups has to be determined. This mean difference is usually denoted δ and depends on the clinical end point's range and unit.

Finally, one has to specify the standard deviation of the clinical parameter under consideration, or at least a possible range for it. The more variation in the data, the less “precise” the study results will turn out. Accordingly, large variation may prevent studies from detecting existing clinical differences between the treatment groups. Whereas δ results from clinical consideration, it would be best to derive information on the standard deviation either from similar studies in the literature or from an internal pilot study.
In general, the required sample size increases when the significance level α is decreased, when the statistical power 1 − β is raised, when a smaller minimum detectable difference δ is demanded, or when the data variation is larger. A minimum necessary group size based on these considerations, however, ensures that a mean group difference δ can be detected at the significance level α with a statistical power of at least 1 − β.
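The direction of each of these effects can be read off the familiar normal approximation to the two sample t test sample size. The formula below is a textbook approximation added for illustration and is not necessarily the method behind the tables of this editorial; the symbols z_q (the q quantile of the standard normal distribution) and σ (the common standard deviation in both groups) are notation assumed here.

```latex
n_{\text{per group}} \;\approx\; \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}\,\sigma^{2}}{\delta^{2}}
```

A smaller α or a larger power increases the quantile sum in the numerator, while the group size grows with the square of the ratio σ/δ: halving δ, or doubling σ, roughly quadruples the number of patients per group.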
A significance test for paired group comparisons is the paired t test, which can be used for the detection of intraindividual mean differences. All of the above assertions therefore remain valid for the paired data scenario, except that δ now denotes an intraindividual mean difference.
Instead of using mean and standard deviation, binary end points should be characterised by the frequency of the clinical outcome of interest (“success frequencies”). Sample size prediction in the two sample scenario is therefore based on information on the treatment groups' success frequencies.
Example 1: continuous end point, two sample design
The following summarises the sample size prediction for a controlled trial on the comparison of trabeculotomy and trabeculectomy; the clinical end point of primary interest is the percentage reduction in intraocular pressure 8 weeks after surgery. Only one eye of each patient is included in the study; that is, a two sample design is considered. Study investigators expect a mean decrease to 75% of the initial pressure before surgery (SD 20%–30%) for patients undergoing trabeculectomy; for the trabeculotomy a mean percentage reduction of 80% or even only 85% is expected (SD 20%–30%). Therefore, the minimum detectable difference δ ranges between 5% and 10% and is rather small. Statistical parameters are α = 0.05 and 1 − β = 0.90 (and, in addition, 0.80 for the sake of illustration). Table 1 shows that the detection of a mean difference of 75% versus 80% under an assumed standard deviation of 20% will require 338 patients per therapy arm if a statistical power of 0.90 is demanded. Reducing the power to 0.80 yields a group size of 253. If, however, the mean group difference δ is expected to be 10% instead of only 5%, then the group size reduces to 86 patients (power 0.90) or 64 (power 0.80), respectively. Note that assuming a larger standard deviation (30% instead of 20% in both groups) more than doubles the above sample sizes, since the group size grows with the square of the standard deviation. This clearly illustrates how sensitive predicted sample sizes are to their clinical input parameters.
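The group sizes quoted above can be approximated with a few lines of Python. This is a minimal sketch under stated assumptions: it uses the normal approximation to the two-sided two sample t test with equal standard deviations, plus Guenther's small-sample correction, which may differ from the software actually used for Table 1. The function name `two_sample_group_size` and its parameters are hypothetical names introduced for illustration.

```python
from math import ceil
from statistics import NormalDist


def two_sample_group_size(delta, sd, alpha=0.05, power=0.90):
    """Approximate patients per arm for a two-sided two sample t test.

    Assumptions (not taken from the editorial): equal SD in both arms,
    normal approximation, plus Guenther's correction z_{1-alpha/2}^2 / 4.
    """
    z = NormalDist().inv_cdf          # standard normal quantile function
    z_alpha = z(1 - alpha / 2)        # two-sided significance quantile
    z_beta = z(power)                 # power quantile
    n = 2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2
    return ceil(n + z_alpha ** 2 / 4)


# Scenarios of Example 1: delta of 5% or 10%, SD 20%, power 0.90 or 0.80
for delta in (5, 10):
    for power in (0.90, 0.80):
        n = two_sample_group_size(delta, sd=20, power=power)
        print(f"delta={delta}%, SD=20%, power={power:.2f}: {n} per arm")
```

Under these assumptions the sketch returns 338, 253, 86 and 64 patients per arm, in line with the values cited from Table 1. Because the result scales with (SD/δ)², calling the function with `sd=30` multiplies each group size by roughly (30/20)² = 2.25.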
Example 2: continuous end point, paired design
A controlled trial is intended to compare the effect of a single surgical glaucoma therapy versus glaucoma surgery combined with an additional cataract intervention (phacoemulsification) with regard to the postsurgical laser flare meter values. Since this clinical end point is subject to considerable interindividual variation, investigators proposed to recruit only patients in whom both eyes have to undergo surgical intervention and who therefore allow for intraindividual randomisation of the surgical therapies. The intraindividual difference in laser flare meter values will be a suitable clinical end point. Table 2 provides the predicted sample sizes if a significance level of α = 0.05 and a statistical power of 0.90 are demanded. The mean intraindividual flare difference δ was varied between 40 and 70 photonic counts per ms (pc/ms) and the standard deviation (SD) of the differences between 20 and 60 pc/ms. For a mean difference of 60 pc/ms (SD 40 pc/ms) a study size of n = 7 is found sufficient; if, however, the smaller mean difference of 40 pc/ms (SD 40 pc/ms) has to be detected, the sample size nearly doubles to n = 13 patients. Again an increase in data variation, as measured by the SD, results in a marked increase in sample sizes. Note that Table 2 also illustrates the considerable reduction in sample size if paired study designs are used instead of the analogous two sample trials. The corresponding ethical benefit and the gain in cost efficiency are obvious. Study duration, however, may be considerably increased in the above setting, since the recruitment of patients whose eyes allow for intraindividual randomisation may turn out to be difficult.
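The paired sample sizes can be approximated in the same way. The sketch below assumes a two-sided paired t test, the normal approximation and Guenther's correction for the one sample case; this is an illustrative choice and not necessarily the method behind Table 2. The name `paired_sample_size` and the parameter `sd_diff` (standard deviation of the intraindividual differences) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def paired_sample_size(delta, sd_diff, alpha=0.05, power=0.90):
    """Approximate number of patients for a two-sided paired t test.

    Assumptions (not taken from the editorial): normal approximation
    plus Guenther's one sample correction z_{1-alpha/2}^2 / 2.
    sd_diff is the SD of the intraindividual differences.
    """
    z = NormalDist().inv_cdf          # standard normal quantile function
    z_alpha = z(1 - alpha / 2)
    z_beta = z(power)
    n = (z_alpha + z_beta) ** 2 * (sd_diff / delta) ** 2
    return ceil(n + z_alpha ** 2 / 2)


# Two scenarios of Example 2, both with an SD of 40 pc/ms
print(paired_sample_size(delta=60, sd_diff=40))
print(paired_sample_size(delta=40, sd_diff=40))
```

With these assumptions the two calls give 7 and 13 patients, matching the figures quoted above. Note that the leading factor 2 of the two sample formula is absent here and that `sd_diff` is usually smaller than the between-patient standard deviation, which together explain the smaller sample sizes of paired designs.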
Example 3: binary end point, two sample design
A prospective trial on cataract incision techniques is designed to compare the influence of sclerocorneal versus clear corneal incision on the binary clinical outcome “increase in visus of at least two stages 8 weeks after surgery.” Fisher's exact test is indicated for the comparison of these two therapies' success frequencies p (sclerocorneal incision) and q (clear corneal incision). Table 3 provides the predicted group sizes if α = 0.01 and 1 − β = 0.90 (in parentheses 0.80) are specified. Again group sizes increase with a decreasing difference between p and q. If a therapy difference of p = 40% versus q = 20% is regarded as clinically relevant, Table 3 proposes a group size of 150–160 per treatment. This number of patients also seems recruitable within a reasonable time, and therefore a monocentric trial has been submitted for review by the authorities.
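For orientation, the order of magnitude of these group sizes can be checked with the classical normal approximation for comparing two proportions. This is a sketch under a clearly different assumption from the text: it approximates a two-sided test of two proportions and does not implement the sample size calculation for Fisher's exact test, which typically requires somewhat more patients. The function name `two_proportion_group_size` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_proportion_group_size(p, q, alpha=0.01, power=0.90):
    """Approximate patients per group for comparing two success frequencies.

    Assumption (not taken from the editorial): two-sided normal
    approximation with pooled variance under the null hypothesis;
    no continuity or exact-test correction is applied.
    """
    z = NormalDist().inv_cdf                 # standard normal quantile function
    z_alpha = z(1 - alpha / 2)
    z_beta = z(power)
    p_bar = (p + q) / 2                      # pooled success frequency
    root_null = sqrt(2 * p_bar * (1 - p_bar))
    root_alt = sqrt(p * (1 - p) + q * (1 - q))
    n = (z_alpha * root_null + z_beta * root_alt) ** 2 / (p - q) ** 2
    return ceil(n)


# Scenario of Example 3: p = 40% versus q = 20%, alpha = 0.01
print(two_proportion_group_size(0.40, 0.20, power=0.90))
print(two_proportion_group_size(0.40, 0.20, power=0.80))
```

For p = 40% versus q = 20% this approximation gives 154 patients per group at a power of 0.90, which lies within the 150–160 range cited from Table 3, and 122 at a power of 0.80.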
CONCLUSION
This tutorial text summarises the principal strategies of planning controlled trials in ophthalmology concerning a priori computation of sample sizes. The central aspects of this issue were illustrated in terms of recent trials, where sample size prediction was performed as documentation for reviewers of grant applications and for drug administration. The paper intended to increase the flexibility of clinical investigators in designing their trials based on software packages for sample size determination and in communicating with medical biometricians on this issue. In general, study designs for intraindividual comparison require much smaller sample sizes than the corresponding interindividual approaches; the decrease in sample size can amount to 60% or more. However, recruitment times for patients whose eyes allow for intraindividual randomisation of the concurrent therapies may become excessive. Nevertheless, designs for the intraindividual comparison of therapeutic regimens appear quite attractive in ophthalmology. They should, however, only be considered after ensuring recruitment of enough patients who are suitable for intraindividual randomisation. For both the interindividual and the intraindividual comparison, the sample sizes will be primarily determined by the magnitude of the expected difference between the therapeutic regimens under consideration.