Original article
Adjusting for multiple testing—when and how?

https://doi.org/10.1016/S0895-4356(00)00314-0

Abstract

Multiplicity of data, hypotheses, and analyses is a common problem in biomedical and epidemiological research. Multiple testing theory provides a framework for defining and controlling appropriate error rates in order to protect against wrong conclusions. However, the corresponding multiple test procedures are underutilized in biomedical and epidemiological research. In this article, the existing multiple test procedures are summarized for the most important multiplicity situations. It is emphasized that adjustments for multiple testing are required in confirmatory studies whenever results from multiple tests have to be combined in one final conclusion and decision. In case of multiple significance tests a note on the error rate that will be controlled for is desirable.

Introduction

Many trials in biomedical research generate a multiplicity of data, hypotheses, and analyses, leading to the performance of multiple statistical tests. At least in the setting of confirmatory clinical trials, the need for multiple test adjustments is generally accepted [1,2] and incorporated in corresponding biostatistical guidelines [3]. However, there seems to be a lack of knowledge about statistical procedures for multiple testing. Recently, some authors tried to establish that the statistical approach of adjusting for multiple testing is unnecessary or even inadequate [4,5,6,7]. However, the main arguments against multiplicity adjustments are based upon fundamental errors in the understanding of simultaneous statistical inference [8,9]. For instance, multiple test adjustments have been equated with the Bonferroni procedure [7], which is the simplest, but frequently also an inefficient, method to adjust for multiple testing.

The purpose of this article is to describe the main concept of multiple testing, several kinds of significance levels, and the various situations in which multiple test problems in biomedical research may occur. A nontechnical overview is given to summarize in which cases and how adjustments for multiple hypotheses tests should be made.

Section snippets

Significance tests, multiplicity, and error rates

If one significance test at level α is performed, the probability of the type 1 error (i.e., rejecting the individual null hypothesis although it is in fact true) is the comparisonwise error rate (CER) α, also called individual level or individual error rate. Hence, the probability of not rejecting the true null hypothesis is 1 − α. If k independent tests are performed, the probability of rejecting none of the k null hypotheses when in fact all are true is (1 − α)^k. Hence, the probability of
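
The arithmetic in this paragraph can be illustrated with a minimal sketch. If the probability of rejecting none of k independent true null hypotheses is (1 − α)^k, then the probability of at least one false rejection is its complement, 1 − (1 − α)^k. The function name below is hypothetical, and the calculation assumes that all k tests are independent and that all k null hypotheses are true.

```python
def prob_at_least_one_false_rejection(alpha: float, k: int) -> float:
    """Probability of at least one type 1 error among k tests at level alpha.

    Assumes the k tests are independent and all k null hypotheses are true.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be a positive integer")
    # (1 - alpha) ** k is the probability that no true null hypothesis is rejected;
    # the complement is the probability of one or more false rejections.
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, f"{prob_at_least_one_false_rejection(0.05, k):.3f}")
    # 1 0.050
    # 5 0.226
    # 10 0.401
    # 20 0.642
```

With α = 0.05, the chance of at least one false rejection already exceeds 40% for ten independent tests, which is why the individual level α says little about the error rate of the experiment as a whole.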

When are adjustments for multiple tests necessary?

A simple answer to this question is: If the investigator only wants to control the CER, an adjustment for multiple tests is unnecessary; if the investigator wants to control the EER or MEER, an adjustment for multiple tests is strictly required. Unfortunately, there is no simple and unique answer to when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions [11,12]. In addition to the problem of deciding which error rate should

General procedures based upon P values

The simplest multiple test procedure is the well-known Bonferroni method [17]. Of k significance tests, those accepted as statistically significant have P values smaller than α/k, where α is the MEER. Adjusted P values are calculated by k × P_i, where P_i for i = 1, …, k are the individual unadjusted P values. In the same manner, Bonferroni-adjusted confidence intervals can be constructed by dividing the multiple error level α by the number of confidence intervals, that is, by using the confidence level 1 − α/k for each interval. The Bonferroni method is
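
The Bonferroni rule described in this paragraph can be sketched in a few lines. This is an illustration only: the function name and the example P values are hypothetical, and capping the adjusted P values at 1 is a common convention assumed here rather than something stated in the text.

```python
from typing import List, Sequence, Tuple


def bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> Tuple[List[float], List[bool]]:
    """Return Bonferroni-adjusted P values and the rejection decisions.

    A test counts as significant when its unadjusted P value is smaller
    than alpha / k, where k is the number of tests and alpha is the
    multiple level to be controlled.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one P value is required")
    # Adjusted P value: k * P_i, capped at 1 (the cap is an assumption of this sketch).
    adjusted = [min(1.0, k * p) for p in p_values]
    # Decision rule from the text: compare each unadjusted P value with alpha / k.
    rejected = [p < alpha / k for p in p_values]
    return adjusted, rejected


if __name__ == "__main__":
    # Hypothetical unadjusted P values from four tests.
    adjusted, rejected = bonferroni([0.001, 0.012, 0.03, 0.2], alpha=0.05)
    for p_adj, is_rejected in zip(adjusted, rejected):
        print(f"{p_adj:.3f}", is_rejected)
```

In this example the threshold is 0.05 / 4 = 0.0125, so only the first two tests remain significant, although three of the four unadjusted P values are below 0.05.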

Special procedures for multiple test adjustments

One main advantage of the general multiple test procedures based upon P values is that they are universally applicable to different types of data (continuous, categorical, censored) and different test statistics (e.g., t, χ², Fisher, logrank). Naturally, these procedures are unspecific, and special adjustment procedures have been developed for certain questions in specific multiplicity situations.

Discussion

The problem of multiple hypotheses testing in biomedical research is quite complex and involves several difficulties. Firstly, it is required to define which significance tests belong to one experiment, that is, which tests should be used to make one final conclusion. Secondly, the particular error rate to be controlled must be chosen. Thirdly, an appropriate method for multiple test adjustment has to be found that is applicable and feasible in the considered situation. Many multiple test

Acknowledgments

We thank Dr. Gernot Wassmer (Cologne, Germany) for his careful reading of the manuscript and his valuable comments.

References (65)

  • D.A. Savitz et al.

    Multiple comparisons and related issues in the interpretation of epidemiologic data

    Am J Epidemiol

    (1995)
  • D.A. Savitz et al.

    Describing data requires no adjustment for multiple comparisons: a reply from Savitz and Olshan

    Am J Epidemiol

    (1998)
  • T.V. Perneger

    What's wrong with Bonferroni adjustments

    BMJ

    (1998)
  • M. Aickin

    Other method for adjustment of multiple testing exists

    BMJ

    (1999)
  • R. Bender et al.

    Multiple test procedures other than Bonferroni's deserve wider use

    BMJ

    (1999)
  • P. Bauer

    Multiple testing in clinical trials

    Stat Med

    (1991)
  • J.R. Thompson

    Invited commentary: Re: “Multiple comparisons and related issues in the interpretation of epidemiologic data.”

    Am J Epidemiol

    (1990)
  • S.N. Goodman

    Multiple comparisons, explained

    Am J Epidemiol

    (1998)
  • P.C. O'Brien

    The appropriateness of analysis of variance and multiple comparison procedures

    Biometrics

    (1983)
  • R.G. Miller

    Simultaneous statistical inference

    (1966)
  • Y. Hochberg et al.

    Multiple comparison procedures

    (1987)
  • S.P. Wright

    Adjusted p-values for simultaneous inference

    Biometrics

    (1992)
  • J.M. Bland et al.

    Multiple significance tests: the Bonferroni method

    BMJ

    (1995)
  • B. Levin

    Annotation: on the Holm, Simes, and Hochberg multiple test procedures

    Am J Public Health

    (1996)
  • S. Holm

    A simple sequentially rejective multiple test procedure

    Scand J Stat

    (1979)
  • M. Aickin et al.

    Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods

    Am J Public Health

    (1996)
  • R. Marcus et al.

    On closed testing procedures with special reference to ordered analysis of variance

    Biometrika

    (1976)
  • P.H. Westfall et al.

    Resampling-based multiple testing

    (1993)
  • P.H. Westfall et al.

    Reader reaction: on adjusting P-values for multiplicity

    Biometrics

    (1993)
  • D.G. Altman et al.

    Comparing several groups using analysis of variance

    BMJ

    (1996)
  • K. Godfrey

    Comparing means of several groups

    N Engl J Med

    (1985)
  • J. Jaccard et al.

    Pairwise multiple comparison procedures: a review

    Psychol Bull

    (1984)