    Comparison of Methods for Estimating the Number of True Null Hypotheses in Multiplicity Testing
    Citations: 110 | References: 20 | Related Papers: 10
    Abstract:
    When a large number of statistical tests is performed, the chance of false positive findings could increase considerably. The traditional approach is to control the probability of rejecting at least one true null hypothesis, the familywise error rate (FWE). To improve the power of detecting treatment differences, an alternative approach is to control the expected proportion of errors among the rejected hypotheses, the false discovery rate (FDR). When some of the hypotheses are not true, the error rate from either the FWE- or the FDR-controlling procedure is usually lower than the designed level. This paper compares five methods used to estimate the number of true null hypotheses over a large number of hypotheses. The estimated number of true null hypotheses is then used to improve the power of FWE- or FDR-controlling methods. Monte Carlo simulations are conducted to evaluate the performance of these methods. The lowest slope method, developed by Benjamini and Hochberg (2000) on the adaptive control of the FDR in multiple testing with independent statistics, and the mean of differences method appear to perform the best. These two methods control the FWE properly when the number of nontrue null hypotheses is small. A data set from a toxicogenomic microarray experiment is used for illustration.
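    To make the adaptive idea concrete, here is a minimal Python sketch, assuming independent two-sided z-tests: an LSL-style estimate of the number of true null hypotheses, following the common description of the Benjamini and Hochberg (2000) lowest slope method, fed into a BH step-up procedure run at the adjusted level i*alpha/m0. Function names and the simulated data are illustrative, not taken from the paper.

    import numpy as np
    from scipy.stats import norm

    def lsl_m0(pvals):
        """Lowest-slope (LSL) style estimate of the number of true nulls m0:
        for ordered p-values p_(1) <= ... <= p_(m), compute slopes
        S_i = (1 - p_(i)) / (m + 1 - i) and stop at the first i where the
        slope decreases; m0 is then estimated as min(floor(1/S_i) + 1, m)."""
        p = np.sort(np.asarray(pvals, dtype=float))
        m = p.size
        slopes = (1.0 - p) / (m + 1 - np.arange(1, m + 1))
        for i in range(1, m):
            if slopes[i] < slopes[i - 1]:            # first decrease in slope
                return int(min(np.floor(1.0 / slopes[i]) + 1, m))
        return m                                      # slopes never decrease

    def adaptive_bh(pvals, alpha=0.05, m0=None):
        """BH step-up procedure at the adjusted level: reject the k smallest
        p-values, where k is the largest i with p_(i) <= i * alpha / m0."""
        p = np.asarray(pvals, dtype=float)
        m = p.size
        m0 = m if m0 is None else max(int(m0), 1)
        order = np.argsort(p)
        below = p[order] <= alpha * np.arange(1, m + 1) / m0
        k = below.nonzero()[0].max() + 1 if below.any() else 0
        reject = np.zeros(m, dtype=bool)
        reject[order[:k]] = True
        return reject

    # Toy example: 900 true nulls and 100 shifted alternatives.
    rng = np.random.default_rng(0)
    z = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(3.0, 1.0, 100)])
    p = 2 * norm.sf(np.abs(z))
    m0_hat = lsl_m0(p)
    rejected = adaptive_bh(p, alpha=0.05, m0=m0_hat)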
    Keywords:
    False Discovery Rate
    Multiple comparisons problem
    Null hypothesis
    Statistical power
    Alternative hypothesis
    In hypothesis testing, statistical significance is typically based on calculations involving p-values and Type I error rates. A p-value calculated from a single statistical hypothesis test can be used to determine whether there is statistically significant evidence against the null hypothesis. The upper threshold applied to the p-value in making this determination (often 5% in the scientific literature) determines the Type I error rate; i.e., the probability of making a Type I error when the null hypothesis is true. Multiple hypothesis testing is concerned with testing several statistical hypotheses simultaneously. Defining statistical significance is a more complex problem in this setting. A longstanding definition of statistical significance for multiple hypothesis tests involves the probability of making one or more Type I errors among the family of hypothesis tests, called the family-wise error rate. However, there exist other well established formulations of statistical significance for multiple hypothesis tests. The Bayesian framework for classification naturally allows one to calculate the probability that each null hypothesis is true given the observed data (Efron et al. 2001, Storey 2003), and several frequentist definitions of multiple hypothesis testing significance are also well established (Shaffer 1995). Soric (1989) proposed a framework for quantifying the statistical significance of multiple hypothesis tests based on the proportion of Type I errors among all hypothesis tests called statistically significant. He called statistically significant hypothesis tests discoveries and proposed that one be concerned about the rate of false discoveries when testing multiple hypotheses. This false discovery rate is robust to the false positive paradox and is particularly useful in exploratory analyses, where one is more concerned with having mostly true findings among a set of statistically significant discoveries rather than guarding against one or more false positives. Benjamini & Hochberg (1995) provided the first implementation of false discovery rates with known operating characteristics. The idea of quantifying the rate of false discoveries is directly related to several pre-existing ideas, such as Bayesian misclassification rates and the positive predictive value (Storey 2003).
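    The connection to Bayesian misclassification rates and the positive predictive value can be written out explicitly. Under a standard two-group mixture model (notation not taken from this page: prior null probability pi_0, significance level alpha, power 1 - beta), Storey's (2003) identity is:

    % Two-group mixture sketch: each null is true with prior probability \pi_0,
    % tests are called significant at level \alpha, and the alternatives are
    % detected with power 1-\beta.
    \mathrm{pFDR} = \Pr(H_0 \mid \text{significant})
                  = \frac{\pi_0\,\alpha}{\pi_0\,\alpha + (1-\pi_0)(1-\beta)}
                  = 1 - \mathrm{PPV}.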
    p-value
    Alternative hypothesis
    False Discovery Rate
    Multiple comparisons problem
    Statistical power
    Null hypothesis
    Citations (61)
    Testing for significance with gene expression data from DNA microarray experiments involves simultaneous comparisons of hundreds or thousands of genes. In common exploratory microarray experiments, most genes are not expected to be differentially expressed. The family-wise error (FWE) rate and false discovery rate (FDR) are two common approaches used to account for multiple hypothesis tests to identify differentially expressed genes. When the number of hypotheses is very large and some null hypotheses are expected to be true, the power of an FWE or FDR procedure can be improved if the number of true null hypotheses is known. The mean of differences (MD) of ranked p-values has been proposed to estimate the number of true null hypotheses under the independence model. This article proposes to incorporate the MD estimate into an FWE or FDR approach for gene identification. Simulation results show that the procedure appears to control the FWE and FDR well at the FWE=0.05 and FDR=0.05 significance levels; it exceeds the nominal level at FDR=0.01 when the null hypotheses are highly correlated (correlation 0.941). The proposed approach is applied to a public colon tumor data set for illustration.
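    As a rough illustration of a spacing-based estimate of the number of true nulls (a sketch of the general idea only, not necessarily the exact MD formula of the article): among true nulls the ordered p-values behave like uniform order statistics whose adjacent spacings average about 1/(m0 + 1), so inverting the mean spacing in the upper part of the sorted p-values gives a crude m0 estimate that can then be plugged into an FWE or FDR procedure such as the adaptive BH sketch shown earlier. The function name and the upper_frac parameter are illustrative.

    import numpy as np

    def spacing_m0(pvals, upper_frac=0.5):
        """Crude spacing-based estimate of the number of true nulls m0:
        average the gaps between consecutive ordered p-values in the upper
        part of the sorted list (where most alternatives have dropped out)
        and invert, since uniform spacings average about 1/(m0 + 1)."""
        p = np.sort(np.asarray(pvals, dtype=float))
        m = p.size
        start = int(np.floor(m * (1.0 - upper_frac)))   # keep the largest p-values
        gaps = np.diff(p[start:])
        mean_gap = gaps.mean() if gaps.size else 1.0 / m
        return int(min(m, max(1, round(1.0 / mean_gap - 1))))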
    False Discovery Rate
    Multiple comparisons problem
    Null hypothesis
    Statistical power
    Citations (5)
    Multiple testing refers to situations in which more than one hypothesis is tested simultaneously. When many tests are conducted at the same time, the chance that at least one true null hypothesis is rejected increases. If individual decisions are based on unadjusted p-values, some of the true null hypotheses are likely to be rejected. To address the multiple testing problem, various studies have sought to increase power while accounting for the family-wise error rate or the false discovery rate and the statistics required for testing the hypotheses. This article discusses methods that account for the multiplicity issue and introduces various statistical techniques.
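    For reference, a minimal Python sketch of three classical adjustments from this literature, expressed as adjusted p-values: Bonferroni and Holm (FWER control) and Benjamini-Hochberg (FDR control). These are textbook formulas, not the specific techniques of the article above.

    import numpy as np

    def bonferroni(p):
        """Bonferroni-adjusted p-values: p * m, capped at 1 (controls the FWER)."""
        p = np.asarray(p, dtype=float)
        return np.minimum(p * p.size, 1.0)

    def holm(p):
        """Holm step-down adjusted p-values: running max of (m - i + 1) * p_(i),
        capped at 1 (controls the FWER, uniformly better than Bonferroni)."""
        p = np.asarray(p, dtype=float)
        m = p.size
        order = np.argsort(p)
        stepdown = (m - np.arange(m)) * p[order]
        adj_sorted = np.minimum(np.maximum.accumulate(stepdown), 1.0)
        adj = np.empty(m)
        adj[order] = adj_sorted
        return adj

    def bh(p):
        """Benjamini-Hochberg adjusted p-values: reverse running min of
        m * p_(i) / i, capped at 1 (controls the FDR under independence)."""
        p = np.asarray(p, dtype=float)
        m = p.size
        order = np.argsort(p)
        stepup = m * p[order] / np.arange(1, m + 1)
        adj_sorted = np.minimum(np.minimum.accumulate(stepup[::-1])[::-1], 1.0)
        adj = np.empty(m)
        adj[order] = adj_sorted
        return adj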
    False Discovery Rate
    Multiple comparisons problem
    Alternative hypothesis
    Biostatistics
    Null hypothesis
    Statistical power
    p-value
    In hypothesis testing, the p value is routinely used as a tool for making statistical decisions: it gathers evidence against the null hypothesis. Although the test is supposed to reject the null hypothesis when it is false and fail to reject it when it is true, there is a potential to err by incorrectly rejecting a true null hypothesis or by failing to reject a false one. These are termed type I and type II errors, respectively. The type I error rate (α error) is chosen arbitrarily by the researcher before the start of the experiment and serves as a cutoff that splits the quantitative results into two qualitative groups, 'significant' and 'insignificant'; this is known as the level of significance (α level). The type II error rate (β error) is also predetermined so that the statistical test has enough statistical power (1 − β) to detect a statistically significant difference. To achieve adequate statistical power, the minimum sample size required for the study is determined in advance. This approach is potentially flawed because the level of significance is an arbitrary cutoff and because the statistical power for detecting a difference depends on the sample size. Moreover, the p value says nothing about the magnitude of the difference. One must therefore be aware of these errors and their role in making statistical decisions.
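    A worked example of the sample-size step described above, using the standard normal-approximation formula for comparing two means; the helper name and default values are illustrative.

    import math
    from scipy.stats import norm

    def n_per_group(delta, sigma, alpha=0.05, power=0.80):
        """Approximate per-group sample size for comparing two means:
        n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2.
        Normal-approximation sketch; exact t-based calculations give
        slightly larger n."""
        z_alpha = norm.ppf(1 - alpha / 2)     # two-sided critical value
        z_beta = norm.ppf(power)              # quantile matching the target power
        return math.ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

    # Detecting a difference of half a standard deviation with 80% power at
    # alpha = 0.05 requires roughly n_per_group(0.5, 1.0) = 63 subjects per group.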
    Statistical power
    p-value
    Null hypothesis
    Alternative hypothesis
    Cut-off
    Sample size
    Multiple comparisons problem
    Value (mathematics)
    Citations (0)
    Replicability is a fundamental quality of scientific discoveries. While meta-analysis provides a framework to evaluate the strength of signals across multiple studies accounting for experimental variability, it does not investigate replicability. A single, possibly non-reproducible, study can be enough to produce significance. In contrast, the partial conjunction (PC) alternative hypothesis stipulates that for a chosen number $r$ ($r > 1$), at least $r$ out of $n$ related individual hypotheses are non-null, making it a useful measure of replicability. Motivated by genetics problems, we consider settings where a large number $M$ of partial conjunction null hypotheses are tested, using an $n\times M$ matrix of $p$-values where $n$ is the number of studies. Applying multiple testing adjustments directly to PC $p$-values can be very conservative. Here we introduce AdaFilter, a new procedure that, mindful of the fact that the PC null is a composite hypothesis, increases power by filtering out unlikely candidate PC hypotheses using the whole $p$-value matrix. We prove that appropriate versions of AdaFilter control the familywise error rate and the per family error rate under independence. We show that these error rates and the false discovery rate can be controlled under independence and a within-study local dependence structure while achieving much higher power than existing methods. We illustrate the effectiveness of the AdaFilter procedures with three different case studies.
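    For orientation, a minimal sketch of a partial-conjunction p-value in the Bonferroni-combination style (in the spirit of Benjamini and Heller, 2008); the AdaFilter filtering step itself is not reproduced here, and the matrix P in the closing comment is hypothetical.

    import numpy as np

    def pc_pvalue_bonferroni(pvals, r):
        """Bonferroni-style partial-conjunction p-value for the PC null
        "fewer than r of the n individual hypotheses are non-null": apply
        Bonferroni to the n - r + 1 largest p-values, i.e.
        p_PC = min(1, (n - r + 1) * p_(r)), with p_(r) the r-th smallest."""
        p = np.sort(np.asarray(pvals, dtype=float))
        n = p.size
        return min(1.0, (n - r + 1) * p[r - 1])

    # For an n x M p-value matrix P (studies in rows), column-wise PC p-values
    # such as [pc_pvalue_bonferroni(P[:, j], r=2) for j in range(P.shape[1])]
    # can be fed to a standard FWER/FDR procedure; doing so directly tends to
    # be conservative, which is the inefficiency AdaFilter is designed to reduce.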
    False Discovery Rate
    Multiple comparisons problem
    Null hypothesis
    Statistical power
    Independence
    Matrix (mathematics)
    Citations (3)
    When testing a single hypothesis, it is common knowledge that increasing the sample size after nonsignificant results and repeating the hypothesis test several times at unadjusted critical levels inflates the overall Type I error rate severely. In contrast, if a large number of hypotheses are tested controlling the False Discovery Rate, such “hunting for significance” has asymptotically no impact on the error rate. More specifically, if the sample size is increased for all hypotheses simultaneously and only the test at the final interim analysis determines which hypotheses are rejected, a data dependent increase of sample size does not affect the False Discovery Rate. This holds asymptotically (for an increasing number of hypotheses) for all scenarios but the global null hypothesis where all hypotheses are true. To control the False Discovery Rate also under the global null hypothesis, we consider stopping rules where stopping before a predefined maximum sample size is reached is possible only if sufficiently many null hypotheses can be rejected. The procedure is illustrated with several datasets from microarray experiments.
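    A toy sketch of the kind of stopping rule described above (the threshold and names are illustrative, not the paper's exact rule): at an interim analysis, early stopping is allowed only if the BH procedure already rejects sufficiently many hypotheses; otherwise the sample size is increased for all hypotheses up to the predefined maximum.

    import numpy as np

    def continue_sampling(interim_pvals, alpha=0.05, min_rejections=10):
        """Return True if sampling should continue: apply the BH step-up test
        to the interim p-values and permit early stopping only when at least
        `min_rejections` hypotheses can be rejected."""
        p = np.sort(np.asarray(interim_pvals, dtype=float))
        m = p.size
        thresh = alpha * np.arange(1, m + 1) / m
        hits = np.nonzero(p <= thresh)[0]
        n_rej = hits.max() + 1 if hits.size else 0       # BH rejection count
        return n_rej < min_rejections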
    False Discovery Rate
    Multiple comparisons problem
    Alternative hypothesis
    Null hypothesis
    Sample size
    Interim analysis
    Citations (16)
    It is a typical feature of high dimensional data analysis, for example a microarray study, that a researcher performs thousands of statistical tests at a time. All inferences for the tests are determined using the p-values; a p-value smaller than the α-level of the test signifies a statistically significant test. As the number of tests increases, the chance of observing some small p-values is very high even when all null hypotheses are true. Consequently, we draw wrong conclusions about the hypotheses. This type of problem frequently arises when several hypotheses are tested simultaneously, i.e., the multiple testing problem. Adjustment of the p-values can redress the problem that arises in multiple hypothesis testing. P-value adjustment methods control the error rates [type I error (i.e., false positive) and type II error (i.e., false negative)] for each hypothesis in order to achieve high statistical power while keeping the overall family-wise error rate (FWER) no larger than α, where α is most often set to 0.05. However, researchers also consider the false discovery rate (FDR), or positive false discovery rate (pFDR), instead of the type I error rate in multiple comparison problems for microarray studies. The methods that control the FDR generally provide higher statistical power than the methods that control the family-wise type I error rate while keeping the type II error rate low. In practice, microarray studies involve dependent test statistics (or p-values) because genes can be highly dependent on each other in a complicated biological structure. However, some of the p-value adjustment methods only deal with independent test statistics. Thus, we carry out a simulation study with several methods used in multiple hypothesis testing.
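    A small simulation in the spirit of the study described above, assuming equicorrelated normal test statistics (all parameter values illustrative): it contrasts the empirical FWER of Bonferroni with the empirical FDR of BH under dependence.

    import numpy as np
    from scipy.stats import norm

    def simulate_error_rates(m=1000, m0=900, rho=0.5, effect=3.0,
                             alpha=0.05, n_sim=200, seed=1):
        """Toy simulation: equicorrelated z-statistics (correlation rho) with
        m0 true nulls; returns the empirical FWER of Bonferroni and the
        empirical FDR of the BH step-up procedure."""
        rng = np.random.default_rng(seed)
        fwer_bonf = 0.0
        fdr_bh = 0.0
        for _ in range(n_sim):
            shared = rng.normal()
            z = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.normal(size=m)
            z[m0:] += effect                          # shift the non-null statistics
            p = 2 * norm.sf(np.abs(z))
            fwer_bonf += np.any(p[:m0] <= alpha / m)  # any false rejection?
            order = np.argsort(p)
            below = p[order] <= alpha * np.arange(1, m + 1) / m
            k = below.nonzero()[0].max() + 1 if below.any() else 0
            false_disc = np.sum(order[:k] < m0)       # false discoveries among rejections
            fdr_bh += false_disc / max(k, 1)
        return fwer_bonf / n_sim, fdr_bh / n_sim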
    False Discovery Rate
    Multiple comparisons problem
    p-value
    Statistical power
    Error rate
    False positive rate
    Nominal level
    Citations (0)
    In this chapter we discuss the problem of identifying differentially expressed genes from a set of microarray experiments. Statistically speaking, this task falls under the heading of "multiple hypothesis testing." In other words, we must perform hypothesis tests on all genes simultaneously to determine whether each one is differentially expressed. Recall that in statistical hypothesis testing, we test a null hypothesis vs an alternative hypothesis. In this example, the null hypothesis is that there is no change in expression levels between experimental conditions. The alternative hypothesis is that there is some change. We reject the null hypothesis if there is enough evidence in favor of the alternative. This amounts to rejecting the null hypothesis if its corresponding statistic falls into some predetermined rejection region. Hypothesis testing is also concerned with measuring the probability of rejecting the null hypothesis when it is really true (called a false positive), and the probability of rejecting the null hypothesis when the alternative hypothesis is really true (called power).
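    As a concrete version of "testing all genes simultaneously", a minimal sketch using row-wise two-sample t-tests on an expression matrix; the helper and array names are hypothetical.

    import numpy as np
    from scipy.stats import ttest_ind

    def per_gene_pvalues(expr_a, expr_b):
        """Row-wise two-sample t-tests for genes x samples expression arrays
        from two experimental conditions; returns one two-sided p-value per
        gene (illustrative helper only)."""
        _, pvals = ttest_ind(expr_a, expr_b, axis=1)
        return pvals

    # These per-gene p-values are then the input to the multiplicity adjustments
    # sketched earlier (Bonferroni/Holm for FWE control, BH for FDR control).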
    Alternative hypothesis
    Null hypothesis
    Multiple comparisons problem
    Statistical power
    False Discovery Rate
    p-value
    Citations (257)
    A primary concern with testing differential item functioning (DIF) using a traditional point-null hypothesis is that a statistically significant result does not imply that the magnitude of DIF is of practical interest. Similarly, for a given sample size, a non-significant result does not allow the researcher to conclude the item is free of DIF. To address these weaknesses, two types of range-null hypotheses utilizing Lord's χ2 DIF statistic were presented. The first type tests a null hypothesis whose rejection implies the item exhibits a meaningful magnitude of DIF, while the second type tests a null hypothesis whose rejection implies the item is effectively free of DIF. A simulation study was performed to evaluate the empirical Type I error rate and power of both types of range-null hypothesis tests under two crossed factors: test length (20 and 60 items) and sample size per group (2,500, 5,000, 10,000, and 20,000 examinees). The proposed statistic controlled the Type I error rates over all conditions and demonstrated acceptable power for sample sizes of 5,000 and larger. The implications of using the range-null hypothesis approach in practice are discussed. Keywords: differential item functioning, range-null hypothesis, equivalence test, good-enough principle, Rasch model
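    A hedged sketch of the general range-null logic built on a chi-square DIF statistic (illustrative only, not necessarily the article's exact formulation): with lam0 the noncentrality corresponding to the smallest DIF magnitude considered practically meaningful, the two tests compare the observed statistic with quantiles of a noncentral chi-square distribution.

    from scipy.stats import ncx2

    def range_null_tests(observed_chi2, df, lam0, alpha=0.05):
        """Two one-sided range-null decisions from a chi-square DIF statistic.
        Test 1 (meaningful DIF): H0: lambda <= lam0 vs H1: lambda > lam0;
        reject when the statistic exceeds the upper alpha quantile of a
        noncentral chi-square with noncentrality lam0.
        Test 2 (effectively DIF-free): H0: lambda >= lam0 vs H1: lambda < lam0;
        reject when the statistic falls below the lower alpha quantile."""
        meaningful_dif = observed_chi2 > ncx2.ppf(1 - alpha, df, lam0)
        negligible_dif = observed_chi2 < ncx2.ppf(alpha, df, lam0)
        return meaningful_dif, negligible_dif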
    Differential item functioning
    Null hypothesis
    Alternative hypothesis
    Statistic
    Statistical power
    Null model
    Sample size
    Citations (5)