Powering Reproducible Research
Keywords: null hypothesis, statistical power, alternative hypothesis, p-value, null model
The classical definition of statistical significance is p <= 0.05, meaning there is a 1-in-20 chance that the observed test statistic is due to normal variation under the null hypothesis. This definition of statistical significance does not represent the likelihood that the alternative hypothesis is true. Hypothesis testing can be evaluated using a 2x2 table (shown below). Box "a" = true positives: p <= 0.05 and the alternative hypothesis is true; this is the study's power. A rule of thumb is that study power should be at least 80% (the statistical test is positive 80% of the time when the alternative hypothesis is true), therefore a = 0.80. Box "b" = false positives: p <= 0.05 but the alternative hypothesis is false. By definition, with a cut-off of 0.05 the test statistic has a 5% probability of occurring by chance when the null hypothesis is true, therefore b = 0.05. Box "c" = false negatives: p > 0.05 but the alternative hypothesis is true; this occurs 20% of the time when the study's power is 80%, therefore c = 0.20. Box "d" = true negatives: p > 0.05 and the null hypothesis is true; this occurs 95% of the time when the null hypothesis is true, therefore d = 0.95.

Test result    Alternative hypothesis true    Null hypothesis true
p <= 0.05      a = 0.80 (true positive)       b = 0.05 (false positive)
p > 0.05       c = 0.20 (false negative)      d = 0.95 (true negative)

From this table we derive: sensitivity = power = a/(a+c) = 80%; specificity = (1 - p) = d/(b+d) = 95%; positive predictive value = power/(power + p-value) = a/(a+b) = 94%; negative predictive value = d/(c+d) = 83%. The classical definition of statistical significance is (1 - specificity) and does not take power into consideration. The proposed new definition of statistical significance is a positive predictive value of the test statistic of 95% or greater. To arrive at this, the cut-off p-value representing statistical significance needs to be corrected for study power so that (p-value)/(p-value + power) < 0.05. To achieve a 95% predictive confidence, it can be derived that statistical significance is a p-value <= power/19.
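A minimal sketch of this arithmetic in Python, assuming power = 0.80, the classical cut-off of 0.05, and a target positive predictive value of 95% (function names and example values are illustrative, not from the paper):

# 2x2 arithmetic from the abstract (assumed values: power = 0.80, alpha = 0.05).
def predictive_values(power, alpha):
    a = power          # true positives:  significant test, alternative hypothesis true
    b = alpha          # false positives: significant test, null hypothesis true
    c = 1.0 - power    # false negatives: non-significant test, alternative hypothesis true
    d = 1.0 - alpha    # true negatives:  non-significant test, null hypothesis true
    sensitivity = a / (a + c)   # equals the power
    specificity = d / (b + d)   # equals 1 - alpha
    ppv = a / (a + b)           # positive predictive value
    npv = d / (c + d)           # negative predictive value
    return sensitivity, specificity, ppv, npv

def power_corrected_alpha(power, target_ppv=0.95):
    # Require power / (power + alpha) >= target_ppv, i.e. alpha <= power * (1 - target_ppv) / target_ppv.
    return power * (1.0 - target_ppv) / target_ppv   # equals power / 19 when target_ppv = 0.95

print(predictive_values(0.80, 0.05))   # (0.80, 0.95, ~0.94, ~0.83)
print(power_corrected_alpha(0.80))     # ~0.042, i.e. 0.80 / 19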
Abstract: One of the most common statistical procedures in the behavioral and social sciences is testing the hypothesis that treatments or interventions have no effect, or that the correlation between two variables is equal to zero—that is, tests of the null hypothesis (H0). There are two ways to make errors when testing a null hypothesis: (1) Type I error, which consists of rejecting the null hypothesis when it is in fact true, and (2) Type II error, which consists of failing to reject the null hypothesis when it is in fact false. Statistical power is defined as 1 minus the conditional probability of making a Type II error. That is, power is the probability of rejecting the null hypothesis when it is in fact false.
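To make these definitions concrete, a small Monte Carlo sketch (assumed setup: two-sample t-test, n = 30 per group, alpha = 0.05, and a true standardized difference of 0.5 when the null is false) estimates both error rates by simulation:

# Estimate the Type I error rate and the power of a two-sample t-test by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_rate(mean_diff, n=30, alpha=0.05, n_sims=10_000):
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n)          # control group
        y = rng.normal(mean_diff, 1.0, n)    # treatment group
        _, p = stats.ttest_ind(x, y)
        rejections += p <= alpha
    return rejections / n_sims

print("Type I error rate (null true):     ", rejection_rate(0.0))   # close to alpha = 0.05
print("Power (true difference of 0.5 SD): ", rejection_rate(0.5))   # 1 - beta, roughly 0.5 here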
The comparison of allocated treatments is a foundation for many studies in soil science. We often want to determine if a treatment applied to a soil has a significant effect compared with not applying that treatment (control). Often, a statistical test is performed to establish whether there is a true difference in the means of the treatment and control. The null hypothesis states that there is no difference in the means. There are two types of error that can be made when a statistical test is conducted. These are presented in Table 1. In soil science research, there is an overwhelming focus on controlling the Type I error (i.e., the value of α). When a comparison test yields a non-significant result, the usual reaction is to assume that there was no true effect of applying the treatment to the soil. What is often ignored is the risk of Type II error; that is, a true difference is not detected. In this case, the power was not sufficient to detect the difference that existed. Power is defined as the probability of rejecting the null hypothesis when it is false. This probability depends on the magnitude of the true difference in the treatment means: a larger difference is easier to detect and will thus have higher power. A Type II error occurs when there is a failure to reject a false null hypothesis, and the risk of committing this error is higher when the number of treatment replicates is low (Barker Bausell and Li 2002). Soils are highly spatially variable at multiple scales, both laterally and vertically. There are also analytical errors which can impose variation in the laboratory. The number of treatment replicates must be sufficiently large so that the effect of the treatment can override this inherent variability. Without sufficient replication to lower the variation, the null hypothesis will not be rejected.

An example of a research area in soil science where Type II error is common is the assessment of management effects on soil organic carbon (SOC) content. Typically, long-term agroecosystem experiments (LTAEs) have been used to assess management effects on SOC (e.g., Janzen et al. 1998; VandenBygaart et al. 2011). The LTAEs were not usually initiated for the purpose of assessing SOC differences, but for agronomic purposes. They are typically randomized block designs with relatively small plot sizes, such that blocking effects can yield large variations in SOC between replicated plots (Ellert et al. 2007). Expected changes in SOC due to management practices can be small, and often require decades to detect (Janzen et al. 1998). This small effect size, coupled with large between-replicate plot variation, causes the power to decrease and increases the chances of Type II error. Indeed, in some instances there can be significant effects of a certain management practice on SOC with low power to detect a difference in LTAEs. Yet, where no difference is detected between treatments, the researcher may be inclined to accept that there was no difference and be done with it. However, by reaching such a conclusion without conducting adequate statistical power analysis, there is a risk of not observing a difference that was actually present. Recently, there has been a concern that such errors can lead to erroneous interpretations of data, which can have implications for agricultural policy (e.g., VandenBygaart 2009).
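A practical first step is to examine, before the experiment, how power changes with the number of replicate plots. The sketch below uses statsmodels' TTestIndPower; the standardized effect size of 0.8 and the replicate counts are hypothetical values chosen only for illustration:

# Power of a two-sample comparison as a function of the number of replicates
# (hypothetical standardized effect size d = 0.8, alpha = 0.05).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_reps in (3, 4, 6, 8, 12, 16):
    power = analysis.power(effect_size=0.8, nobs1=n_reps, alpha=0.05)
    print(f"{n_reps:2d} replicates per treatment -> power = {power:.2f}")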
Kravchenko and Robertson (2011) highlight some examples from the recent literature where lack of statistical power can lead to erroneous interpretation of results. There are several approaches that soil scientists can take to ensure that Type II errors are reduced to acceptable levels. The first is to effectively control experimental variation amongst plots. This is achieved by effective blocking in the design of the experiment and by the use of covariates in the statistical analysis to control variation not removed by blocking. The second step is to determine the required number of replicate plots needed to ensure that power is acceptably high, after having reduced the experimental error variance to a minimum through effective blocking and covariate analysis. One may also exploit "hidden replication", through the use of factorial experiments, to increase power (e.g., Astatkie et al. 2006). Blocking is very widely used in field experiments. Commonly used blocked designs include both complete and incomplete blocks, Latin squares and split plots. Many excellent references exist to guide the choice of experimental design (for example, Kuehl 2000). Suffice it to say that the goal in laying out blocks is always to minimize the variation amongst plots within a block and hence to maximize the variation amongst blocks, prior to applying the treatments. Of course, the plots within a
We thank Dr McCulloch1 for his interest in our recent Statistical Minute on sample size and power in clinical research.2 We completely agree that α is not the probability of committing a type I error if a study is positive, and that β is not the probability of committing a type II error when a study is negative. However, it is not correct that we are restating a common misunderstanding regarding the interpretation of these parameters. In fact, in this Statistical Minute, we have made neither of the claims that Dr McCulloch1 correctly criticizes as being incorrect. As we described in this Statistical Minute, α is the predetermined probability of rejecting the null hypothesis when it is in fact true, and β is the probability of not rejecting the null hypothesis when it is in fact false.2,3 This italicized part of the sentence is crucial. It means that the probabilities are conditional on the null hypothesis being either true or false, respectively, in the population of interest (not in the sample data). Therefore, they cannot be the probabilities that some observed test result is a false-positive or false-negative result: when the null hypothesis is true in the population, any observed "significant" result must be a type I error. Conversely, when the alternative hypothesis is true in the population of interest (ie, the null is false), any nonsignificant test result must be a type II error. In the analogy to diagnostic testing used by Dr McCulloch,1 1 − α (not 1 − P, as stated in the letter by Dr McCulloch1) corresponds to the specificity. While P and α are often confused, it is important to clearly distinguish between them. We refer to previous literature for detail on what P values actually represent4—in a nutshell, a P value is the probability of observing a result as extreme as or more extreme than the one observed under the scenario that the null hypothesis is actually true. So a P value depends on the data, whereas α is set in the design of a study, independent of the data. Likewise, power is analogous to the sensitivity of the test (not the specificity, as stated in the letter by Dr McCulloch1). We refer to Figure 1 of a recent statistical tutorial on diagnostic testing in Anesthesia & Analgesia, which clearly shows the analogy.5 Dr McCulloch1 is correct that it can be helpful to consider the prior probability or belief about the research hypothesis when interpreting study results, as done using Bayesian analysis, and it is important to realize that the probability of a particular "significant" study result being a false positive can be considerably higher or lower than α. There are also many situations in which there is no reliable "prior" for a research hypothesis, and in which the frequentist approach is, therefore, most appropriate. However, we have not claimed, nor was it our intention to claim, that 1 − α and 1 − β refer to positive or negative predictive values. If this was not clear enough, we thank Dr McCulloch1 for the opportunity to explain in more detail. Patrick Schober, MD, PhD, MMedStat, Department of Anesthesiology, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands, [email protected]. Thomas R. Vetter, MD, MPH, Department of Surgery and Perioperative Care, Dell Medical School at the University of Texas at Austin, Austin, Texas.
This article explores basic statistical concepts of clinical trial design and diagnostic testing: how one starts with a question, formulates it into a hypothesis on which a clinical trial is then built, and integrates it with statistics and probability, such as determining the probability of rejecting the null hypothesis when it is actually true (type I error) and the probability of failing to reject the null hypothesis when it is false (type II error). There are a variety of tests for different types of data, and the appropriate test must be chosen for which the sample data meet the assumptions. Correcting for type I error in the presence of multiple testing is needed to control its inflation. Within diagnostic testing, identifying false-positive and false-negative results is critical to understanding the performance of a test. These are used to determine the sensitivity and specificity of a test, along with its negative and positive predictive values. These quantities, specifically sensitivity and specificity, are used to determine the accuracy of a diagnostic test using receiver-operating-characteristic curves. These concepts are briefly introduced to provide a basic understanding of clinical trial design and analysis, with references to allow the reader to explore various concepts at a more detailed level if desired.
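The diagnostic-test quantities mentioned here follow directly from a 2x2 table of test results against true status; a minimal sketch with made-up counts:

# Sensitivity, specificity, PPV and NPV from a 2x2 diagnostic table (illustrative counts).
def diagnostic_summary(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)   # P(test positive | condition present)
    specificity = tn / (tn + fp)   # P(test negative | condition absent)
    ppv = tp / (tp + fp)           # P(condition present | test positive)
    npv = tn / (tn + fn)           # P(condition absent  | test negative)
    return sensitivity, specificity, ppv, npv

# Hypothetical counts: 90 true positives, 30 false positives, 10 false negatives, 870 true negatives.
print(diagnostic_summary(tp=90, fp=30, fn=10, tn=870))   # (0.90, ~0.97, 0.75, ~0.99)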
In hypothesis testing, statistical significance is typically based on calculations involving p-values and Type I error rates. A p-value calculated from a single statistical hypothesis test can be used to determine whether there is statistically significant evidence against the null hypothesis. The upper threshold applied to the p-value in making this determination (often 5% in the scientific literature) determines the Type I error rate; i.e., the probability of making a Type I error when the null hypothesis is true. Multiple hypothesis testing is concerned with testing several statistical hypotheses simultaneously. Defining statistical significance is a more complex problem in this setting. A longstanding definition of statistical significance for multiple hypothesis tests involves the probability of making one or more Type I errors among the family of hypothesis tests, called the family-wise error rate. However, there exist other well established formulations of statistical significance for multiple hypothesis tests. The Bayesian framework for classification naturally allows one to calculate the probability that each null hypothesis is true given the observed data (Efron et al. 2001, Storey 2003), and several frequentist definitions of multiple hypothesis testing significance are also well established (Shaffer 1995). Soric (1989) proposed a framework for quantifying the statistical significance of multiple hypothesis tests based on the proportion of Type I errors among all hypothesis tests called statistically significant. He called statistically significant hypothesis tests discoveries and proposed that one be concerned about the rate of false discoveries when testing multiple hypotheses. This false discovery rate is robust to the false positive paradox and is particularly useful in exploratory analyses, where one is more concerned with having mostly true findings among a set of statistically significant discoveries rather than guarding against one or more false positives. Benjamini & Hochberg (1995) provided the first implementation of false discovery rates with known operating characteristics. The idea of quantifying the rate of false discoveries is directly related to several pre-existing ideas, such as Bayesian misclassification rates and the positive predictive value (Storey 2003).
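As one concrete example of an FDR procedure, the Benjamini-Hochberg step-up method can be sketched in a few lines (the p-values in the example are made up):

# Benjamini-Hochberg step-up procedure controlling the false discovery rate at level q.
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest rank k with p_(k) <= (k/m) * q; reject all hypotheses up to that rank.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])
        rejected[order[: k + 1]] = True
    return rejected

pvals = [0.0001, 0.004, 0.019, 0.03, 0.041, 0.20, 0.34, 0.62]
print(benjamini_hochberg(pvals, q=0.05))   # the two smallest p-values are called discoveries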
It is well recognised that low statistical power increases the probability of type II error; that is, it reduces the probability of detecting a difference between groups where a difference exists. Paradoxically, low statistical power also increases the likelihood that a statistically significant finding is actually falsely positive (for a given p-value). Hence, ethical concerns regarding studies with low statistical power should include the increased risk of type I error in such studies reporting statistically significant effects. This paper illustrates the effect of low statistical power by comparing hypothesis testing with diagnostic test evaluation, using concepts familiar to clinicians such as positive and negative predictive values. We also note that, where there is a high probability that the null hypothesis is true, statistically significant findings are even more likely to be falsely positive.
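The paradox can be made concrete by conditioning on a prior probability that the alternative hypothesis is true; the priors, alpha and power values below are illustrative, not taken from the paper:

# Probability that a "significant" finding is a false positive,
# given the prior probability that the alternative hypothesis is true.
def false_positive_risk(prior_h1, alpha=0.05, power=0.80):
    true_pos = prior_h1 * power            # significant result with the alternative true
    false_pos = (1.0 - prior_h1) * alpha   # significant result with the null true
    return false_pos / (true_pos + false_pos)

for prior in (0.5, 0.2, 0.05):
    print(f"P(H1) = {prior:.2f} -> false positive risk = {false_positive_risk(prior):.2f}")
# Lower power makes this worse; with power = 0.20 and P(H1) = 0.20:
print(false_positive_risk(0.20, power=0.20))   # ~0.50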
In hypothesis testing, the p-value is in routine use as a tool for making statistical decisions: it gathers evidence against the null hypothesis. Although the test is supposed to reject the null hypothesis when it is false and fail to reject it when it is true, there is a potential to err by incorrectly rejecting a true null hypothesis or by failing to reject a false one. These are termed type I and type II errors, respectively. The type I error (α error) is chosen arbitrarily by the researcher before the start of the experiment and serves as an arbitrary cut-off that divides the quantitative result into two qualitative groups, 'significant' and 'non-significant'. This is known as the level of significance (α level). The type II error (β error) is also predetermined so that the statistical test has enough statistical power (1 − β) to detect a statistically significant difference. In order to achieve adequate statistical power, the minimum sample size required for the study is determined. This approach is potentially flawed, and contributes to the precision crisis, because the level of significance is an arbitrary cut-off and because the statistical power to detect the difference depends on the sample size. Moreover, the p-value says nothing about the magnitude of the difference. Therefore, one must be aware of these errors and their role in making statistical decisions.
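The sample size calculation alluded to here can be sketched with the usual normal-approximation formula for comparing two means; the effect size, standard deviation, alpha and power below are illustrative values:

# Approximate sample size per group for a two-sample comparison of means:
# n ~ 2 * (z_{1-alpha/2} + z_{1-beta})^2 * (sigma / delta)^2
import math
from scipy.stats import norm

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.ppf(power)            # quantile corresponding to 1 - beta
    n = 2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2
    return math.ceil(n)

# Illustrative: detect a difference of 5 units with a standard deviation of 10.
print(sample_size_per_group(delta=5, sigma=10))   # about 63 per group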
It is a typical feature of high-dimensional data analysis, for example a microarray study, that a researcher performs thousands of statistical tests at a time. All inferences for the tests are determined using the p-values; a p-value smaller than the α-level of the test signifies a statistically significant test. As the number of tests increases, the chance of observing some small p-values is very high even when all null hypotheses are true. Consequently, we draw wrong conclusions about the hypotheses. This type of problem frequently happens when we test several hypotheses simultaneously, i.e., the multiple testing problem. Adjustment of the p-values can redress the problem that arises in multiple hypothesis testing. P-value adjustment methods control error rates [type I error (i.e., false positive) and type II error (i.e., false negative)] for each hypothesis in order to achieve high statistical power while keeping the overall family-wise error rate (FWER) no larger than α, where α is most often set to 0.05. However, researchers also consider the false discovery rate (FDR), or positive false discovery rate (pFDR), instead of the type I error in multiple comparison problems for microarray studies. The methods that control the FDR always provide higher statistical power than the methods that control the type I error rate while keeping the type II error rate low. In practice, microarray studies involve dependent test statistics (or p-values) because genes can be strongly dependent on one another within a complicated biological structure. However, some of the p-value adjustment methods only deal with independent test statistics. Thus, we carry out a simulation study with several methods used in multiple hypothesis testing.
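The motivation for adjustment is visible in the family-wise error rate of m independent tests each run at level alpha, together with the effect of a simple Bonferroni correction (alpha = 0.05 is used for illustration):

# Family-wise error rate for m independent tests, each at level alpha:
# FWER = 1 - (1 - alpha)^m.  Bonferroni tests each hypothesis at alpha / m instead.
def fwer(alpha, m):
    return 1 - (1 - alpha) ** m

for m in (1, 10, 100, 1000):
    print(f"m = {m:4d}: unadjusted FWER = {fwer(0.05, m):.3f}, "
          f"Bonferroni FWER <= {fwer(0.05 / m, m):.3f}")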
Thus, a rational approach to hypothesis testing will seek to reject a hypothesis if it is false and accept a hypothesis when it is true. The two types of error are rejecting the null hypothesis when it is true (Type I) and accepting the null hypothesis when it is false (Type II). In the Neyman-Pearson theory, it is usual to fix the Type I error probability (α) at some constant (often at 0.05, but not necessarily), and then choose a test which minimises the Type II error probability (β), conditional on α. The (null) hypothesis is then either rejected when the associated p-value for the test is less than α, or otherwise accepted.