Original article
Adjusting for multiple testing: when and how?
Introduction
Many trials in biomedical research generate a multiplicity of data, hypotheses, and analyses, leading to the performance of multiple statistical tests. At least in the setting of confirmatory clinical trials, the need for multiple test adjustments is generally accepted [1, 2] and incorporated in the corresponding biostatistical guidelines [3]. However, there seems to be a lack of knowledge about statistical procedures for multiple testing. Recently, some authors have tried to establish that the statistical approach of adjusting for multiple testing is unnecessary or even inadequate [4, 5, 6, 7]. However, the main arguments against multiplicity adjustments are based upon fundamental errors in the understanding of simultaneous statistical inference [8, 9]. For instance, multiple test adjustments have been equated with the Bonferroni procedure [7], which is the simplest, but frequently also an inefficient, method to adjust for multiple testing.
The purpose of this article is to describe the main concept of multiple testing, several kinds of significance levels, and the various situations in which multiple test problems in biomedical research may occur. A nontechnical overview is given to summarize in which cases and how adjustments for multiple hypothesis tests should be made.
Section snippets
Significance tests, multiplicity, and error rates
If one significance test at level α is performed, the probability of the type 1 error (i.e., rejecting the individual null hypothesis although it is in fact true) is the comparisonwise error rate (CER) α, also called individual level or individual error rate. Hence, the probability of not rejecting the true null hypothesis is 1 − α. If k independent tests are performed, the probability of not rejecting all k null hypotheses when in fact all are true is (1 − α)^k. Hence, the probability of
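The arithmetic above can be illustrated with a minimal Python sketch. It assumes k independent tests, each performed at the same level α, with all k null hypotheses true; under those assumptions the probability of at least one false rejection is the complement of (1 − α)^k. The function name is hypothetical and chosen only for this illustration.

```python
def prob_any_false_rejection(alpha: float, k: int) -> float:
    """Probability of rejecting at least one of k true null hypotheses.

    Assumes k independent tests, each performed at level alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be a positive integer")
    # (1 - alpha) ** k is the probability of no false rejection at all.
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, round(prob_any_false_rejection(0.05, k), 3))
```

With α = 0.05, this probability grows from 0.05 for a single test to roughly 0.40 for 10 tests and 0.64 for 20 tests, which is why unadjusted multiple testing inflates the chance of a false-positive finding.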
When are adjustments for multiple tests necessary?
A simple answer to this question is: If the investigator only wants to control the CER, an adjustment for multiple tests is unnecessary; if the investigator wants to control the experimentwise error rate (EER) or the maximum experimentwise error rate (MEER), an adjustment for multiple tests is strictly required. Unfortunately, there is no simple and unique answer to when it is appropriate to control which error rate. Different people may have different but nevertheless reasonable opinions [11, 12]. In addition to the problem of deciding which error rate should
General procedures based upon P values
The simplest multiple test procedure is the well-known Bonferroni method [17]. Of k significance tests, those accepted as statistically significant have P values smaller than α/k, where α is the MEER. Adjusted P values are calculated as k × P_i (capped at 1), where P_i for i = 1, … , k are the individual unadjusted P values. In the same manner, Bonferroni adjusted confidence intervals can be constructed by dividing the multiple level α by the number of confidence intervals, so that each interval has confidence level 1 − α/k. The Bonferroni method is
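The Bonferroni rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the example P values are hypothetical, and the adjusted P values are capped at 1, a common convention so that they remain valid probabilities.

```python
from typing import List, Sequence, Tuple


def bonferroni(p_values: Sequence[float],
               alpha: float = 0.05) -> Tuple[List[float], List[bool]]:
    """Return Bonferroni adjusted P values and rejection decisions.

    alpha is the multiple level to be controlled over all k tests.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one P value is required")
    # Adjusted P value: k times the unadjusted value, capped at 1.
    adjusted = [min(1.0, k * p) for p in p_values]
    # A test is declared significant if its unadjusted P value is
    # smaller than alpha / k.
    rejected = [p < alpha / k for p in p_values]
    return adjusted, rejected


if __name__ == "__main__":
    # Hypothetical unadjusted P values from k = 4 tests.
    adjusted, rejected = bonferroni([0.001, 0.012, 0.04, 0.3], alpha=0.05)
    print(adjusted)
    print(rejected)
```

In this hypothetical example the per-test threshold is 0.05/4 = 0.0125, so only the first two tests remain significant, although three of the four unadjusted P values lie below 0.05.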
Special procedures for multiple test adjustments
One main advantage of the general multiple test procedures based upon P values is that they are universally applicable to different types of data (continuous, categorical, censored) and different test statistics (e.g., t, χ², Fisher, logrank). Naturally, these procedures are unspecific, and special adjustment procedures have been developed for certain questions in specific multiplicity situations.
Discussion
The problem of multiple hypothesis testing in biomedical research is quite complex and involves several difficulties. Firstly, one must define which significance tests belong to one experiment, that is, which tests should be used to make one final conclusion. Secondly, the particular error rate to be controlled must be chosen. Thirdly, an appropriate method for multiple test adjustment has to be found that is applicable and feasible in the considered situation. Many multiple test
Acknowledgments
We thank Dr. Gernot Wassmer (Cologne, Germany) for his careful reading of the manuscript and his valuable comments.
References (65)
Clinical trials with multiple outcomes: a statistical perspective on their design, analysis, and interpretation. Contr Clin Trials (1997)
et al. Some statistical methods for multiple endpoints in clinical trials. Contr Clin Trials (1997)
P-value interpretation and alpha allocation in clinical trials. Ann Epidemiol (1998)
The problem of cogent subgroups: a clinicostatistical tragedy. J Clin Epidemiol (1998)
Repeated significance tests on accumulating survival data. J Clin Epidemiol (1999)
et al. The UK Prospective Diabetes Study. Lancet (1998)
et al. Statistical considerations for multiplicity in confirmatory protocols. Drug Inf J (1996)
et al. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Stat Med (1997)
Biostatistical methodology in clinical trials in applications for marketing authorizations for medical products. Stat Med (1995)
No adjustments are needed for multiple comparisons. Epidemiology (1990)
Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol
Describing data requires no adjustment for multiple comparisons: a reply from Savitz and Olshan. Am J Epidemiol
What's wrong with Bonferroni adjustments. BMJ
Other method for adjustment of multiple testing exists. BMJ
Multiple test procedures other than Bonferroni's deserve wider use. BMJ
Multiple testing in clinical trials. Stat Med
Invited commentary: Re: “Multiple comparisons and related issues in the interpretation of epidemiologic data.” Am J Epidemiol
Multiple comparisons, explained. Am J Epidemiol
The appropriateness of analysis of variance and multiple comparison procedures. Biometrics
Simultaneous statistical inference
Multiple comparison procedures
Adjusted p-values for simultaneous inference. Biometrics
Multiple significance tests: the Bonferroni method. BMJ
Annotation: on the Holm, Simes, and Hochberg multiple test procedures. Am J Public Health
A simple sequentially rejective multiple test procedure. Scand J Stat
Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am J Public Health
On closed testing procedures with special reference to ordered analysis of variance. Biometrika
Resampling-based multiple testing
Reader reaction: on adjusting P-values for multiplicity. Biometrics
Comparing several groups using analysis of variance. BMJ
Comparing means of several groups. N Engl J Med
Pairwise multiple comparison procedures: a review. Psychol Bull