Medical statistics plays a key role in clinical research by helping to avoid errors of interpretation due to the play of chance. It is, however, critical to understand the limits of what statistical analysis provides and to interpret the findings accordingly. Statistical analysis can summarise the statistical evidence, but it cannot tell us whether a difference is important per se.

First, when a statistical hypothesis test is performed, we are usually implicitly testing for evidence of a difference; the absence of statistical significance is not, in itself, evidence that no difference exists.

Second, the p value is not a direct measure of the strength of evidence for the null hypothesis. Instead, it is the probability of obtaining a result as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. It is therefore, at best, a partial and indirect measure of the evidence regarding the truth of the null hypothesis. In particular, it is not the probability that the null hypothesis is true given the data.
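To make this definition concrete, a p value can be approximated by simulating trials under the null hypothesis and counting how often a difference at least as extreme as the observed one arises by chance alone. The sketch below is our own illustrative Python (the analyses reported in the table were carried out in Stata); the function name and the Monte Carlo approach are assumptions for exposition, not the method used in the paper.

```python
import random

def simulated_p_value(x_a, n_a, x_b, n_b, n_sims=20000, seed=1):
    """Monte Carlo p value for a difference in two proportions:
    the probability, under a null hypothesis of a common event rate,
    of a difference at least as extreme as the one observed.
    Illustrative only; a real analysis would use an exact or
    asymptotic test."""
    rng = random.Random(seed)
    pooled = (x_a + x_b) / (n_a + n_b)   # common event rate under H0
    observed = abs(x_a / n_a - x_b / n_b)
    def draw(n):                          # one Binomial(n, pooled) draw
        return sum(rng.random() < pooled for _ in range(n))
    extreme = sum(
        abs(draw(n_a) / n_a - draw(n_b) / n_b) >= observed
        for _ in range(n_sims)
    )
    return extreme / n_sims

# Scenario 1 from the table: 1/30 adverse events on drug A vs 8/30 on drug B.
p = simulated_p_value(1, 30, 8, 30)
```

With the scenario 1 data, the simulated tail probability is small, broadly in keeping with the p value of 0.026 reported in the table.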

Third, the criterion used for statistical significance, typically a p value of less than 0.05 (ie, 5% significance level), is arbitrary. It is based on convention rather than statistical theory. It guards against making one type of error too often (rejecting the null hypothesis when it is true), but gives no protection against making the other type of error (not rejecting the null hypothesis when it is false).
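The first type of error can be demonstrated directly: if the null hypothesis is true and the test statistic behaves as assumed, a 5% threshold produces spurious 'significant' results in about 5% of trials, while saying nothing about the second type of error. A minimal sketch (our own illustration, assuming a standard normal test statistic under the null):

```python
import random

rng = random.Random(0)
n_trials, cutoff = 100_000, 1.96   # 1.96 is the two-sided 5% cut-off
# Under the null hypothesis the test statistic is standard normal,
# so roughly 5% of simulated trials are "significant" purely by chance.
false_positives = sum(abs(rng.gauss(0, 1)) > cutoff for _ in range(n_trials))
rate = false_positives / n_trials
```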

Fourth, even once we meet our criterion we must take into consideration the magnitude of the effect. It is all very well to say that we have found a significant difference, but it is crucial to consider how big this difference is and whether we view it as clinically important. We should design our study such that a difference which would be statistically significant would also be considered clinically important. Related points are covered in an earlier BJO statistics note.

Wherever possible, a statistical method should be used that can estimate the magnitude of the observed effect (an effect size) along with the associated uncertainty around this estimate, rather than only carrying out a statistical (hypothesis) test and presenting the corresponding p value. In the commonest situation of comparing two independent groups, a t test and the corresponding CI for the difference in means is preferred, provided its underlying assumptions hold, over a non-parametric alternative such as the Mann–Whitney U test (it should be noted that this non-parametric test compares the two distributions, not the group means or medians). Effect measures differ according to the type of outcome, often with several alternatives available. Common effect measures for binary outcomes include differences in proportions, risk ratios and odds ratios (ORs). A hazard ratio (HR), which may be interpreted similarly to a relative risk, can be used for time-to-event data. These effect measures are often natural products of a particular statistical analysis (eg, a Cox regression analysis for time-to-event data estimates HRs; logistic regression for a dichotomous outcome estimates ORs). We note in passing that a Bayesian statistical approach offers a theoretically appealing alternative to conventional statistical testing which avoids the use of p values, although it is not without difficulties, both in principle and in practice.
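For a binary outcome, the effect measures mentioned above can all be computed from the same 2×2 table. The sketch below uses standard large-sample (Wald) formulas and is our own illustration; the function name is hypothetical, and published results (such as those in the table, produced in Stata) may use different CI methods and so differ slightly.

```python
import math

def binary_effect_measures(x_a, n_a, x_b, n_b, z=1.96):
    """Risk difference (RD), risk ratio (RR) and odds ratio (OR),
    each with an approximate Wald 95% CI, for a two-group binary
    outcome.  Requires non-zero cell counts for RR and OR.
    A sketch using large-sample formulas only."""
    p_a, p_b = x_a / n_a, x_b / n_b
    # Risk difference and its standard error
    rd = p_a - p_b
    se_rd = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Risk ratio: the CI is built on the log scale
    rr = p_a / p_b
    se_log_rr = math.sqrt(1/x_a - 1/n_a + 1/x_b - 1/n_b)
    # Odds ratio: CI also on the log scale
    odds_ratio = (x_a / (n_a - x_a)) / (x_b / (n_b - x_b))
    se_log_or = math.sqrt(1/x_a + 1/(n_a - x_a) + 1/x_b + 1/(n_b - x_b))
    def ci(est, se, log=False):
        if log:  # back-transform a log-scale interval
            return (math.exp(math.log(est) - z * se),
                    math.exp(math.log(est) + z * se))
        return (est - z * se, est + z * se)
    return {"RD": (rd, ci(rd, se_rd)),
            "RR": (rr, ci(rr, se_log_rr, log=True)),
            "OR": (odds_ratio, ci(odds_ratio, se_log_or, log=True))}

# Scenario 3 from the table: 28/300 vs 45/300 adverse events.
m = binary_effect_measures(28, 300, 45, 300)
```

For scenario 3, the risk difference and its Wald CI come out close to the −6 (−11 to −0.4) percentage points shown in the table.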

The importance of considering the effect size is illustrated in the table below.

Ocular adverse event rates from hypothetical randomised controlled trial scenarios comparing a 12-week course of two IOP-lowering drugs for open-angle glaucoma

Scenario | Drug A n/N (%) | Drug B n/N (%) | Difference in percentages (95% CI) | p value
---|---|---|---|---
1 | 1/30 (3) | 8/30 (27) | −23 (−41 to −5) | 0.026
2 | 4/30 (13) | 10/30 (33) | −20 (−40 to 2) | 0.125
3 | 28/300 (9) | 45/300 (15) | −6 (−11 to −0.4) | 0.045
4 | 6/300 (2) | 9/300 (3) | −1 (−4 to 2) | 0.603

Calculations were carried out in Stata.

CI, confidence interval; IOP, intraocular pressure; n, number of adverse events; N, number of patients.

The trials in scenarios 3 and 4 are much larger (10 times the size) and therefore have greater statistical precision and are able to detect substantially smaller effects. If we were only to consider the p value in scenario 1, we would conclude that the drugs are different, but this provides no information about the magnitude of the difference, which in this instance is considerable. If we compare scenarios 1 and 3, we see that the p values are similar but the effect sizes are very different (23% vs 6%). If we compare scenarios 3 and 4, we see that both have much tighter CIs; in scenario 4, a difference in the rate of adverse events large enough to be viewed as 'clinically important' is likely to be ruled out.

Interpreting the result of a statistical analysis relies upon more than statistical understanding. It is helpful to differentiate between ‘statistical significance’ and ‘clinical importance’. If there is statistical evidence of an effect, we wish to know if it matters. Conversely, where there is no statistical evidence of an effect, we wish to know if a clinically important effect has been ruled out. Calculating an effect size and a corresponding 95% CI provides the information upon which such judgements can be made. It can be remarkably difficult to know what is clinically important in a particular context. How easily clinical importance can be assessed varies by the setting, perspective adopted and outcome analysed. Quality of life measures are particularly difficult to interpret, whereas mortality is more straightforward. A variety of approaches of varying complexity have been proposed (including minimal clinically important difference approaches) which seek to provide a transparent, reproducible and robust way to achieve this.
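One simple way to operationalise this distinction is to compare the 95% CI with a prespecified minimal clinically important difference (MCID). The sketch below is our own illustration; the function name and the MCID of 5 percentage points are hypothetical, and the MCID itself must come from clinical judgement, not from the statistics.

```python
def interpret(ci_lower, ci_upper, mcid):
    """Classify a difference using its 95% CI against a (hypothetical)
    minimal clinically important difference.  Assumes a difference of
    at least `mcid` in either direction would be clinically important."""
    if ci_lower > -mcid and ci_upper < mcid:
        # The whole CI lies inside the zone of unimportant differences.
        return "clinically important difference ruled out"
    if ci_upper < -mcid or ci_lower > mcid:
        # The whole CI lies beyond the MCID in one direction.
        return "clinically important difference demonstrated"
    return "inconclusive: CI includes both important and unimportant values"

# Scenario 4 from the table: difference −1% (95% CI −4 to 2),
# judged against a hypothetical MCID of 5 percentage points.
print(interpret(-4, 2, 5))   # → clinically important difference ruled out
```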

The magnitude of an observed effect and the uncertainty regarding it should always be considered when interpreting statistical evidence. Whenever possible, a statistical analysis which produces an effect size estimate with associated uncertainty (eg, difference in means and its CI) rather than solely a p value should be used and reported.

The post of CB is partly funded by the NIHR Biomedical Research Centre based at Moorfields Eye Hospital NHS Foundation Trust and UCL Institute of Ophthalmology. The views expressed in this article are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.

Valentina Cipriani, David Crabb, Phillippa Cumberland, Gabriela Czanner, Paul Donachie, Andrew Elders, Marta Garcia Finana, Neil O'Leary, Rachel Nash, Ana Quartilho, Luke Saunders, Selvaraj Sivasubramaniam, Chris Rogers, Simon Skene, Irene Stratton, Wen X.

JAC drafted the paper; CB reviewed and revised the paper; CJD and NF conducted an internal peer review and provided additional comments.

Competing interests: None.

Provenance and peer review: Not commissioned; externally peer reviewed.