Fundamental Statistical Concepts in Clinical Trials and Diagnostic Testing
Citations: 8 | References: 35 | Related Papers: 10
Abstract:
This article explores basic statistical concepts of clinical trial design and diagnostic testing: how one starts with a question, formulates it into a hypothesis on which a clinical trial is then built, and integrates it with statistics and probability, such as determining the probability of rejecting the null hypothesis when it is actually true (type I error) and the probability of failing to reject the null hypothesis when it is false (type II error). There are a variety of tests for different types of data, and the appropriate test must be chosen based on whether the sample data meet its assumptions. Correction for type I error is needed in the presence of multiple testing to control its inflation. Within diagnostic testing, identifying false-positive and false-negative results is critical to understanding the performance of a test. These are used to determine the sensitivity and specificity of a test, along with the test's negative predictive value and positive predictive value. These quantities, specifically sensitivity and specificity, are used to determine the accuracy of a diagnostic test using receiver-operating-characteristic curves. These concepts are briefly introduced to provide a basic understanding of clinical trial design and analysis, with references to allow the reader to explore various concepts at a more detailed level if desired.
Keywords: p-value, statistical power, null hypothesis, alternative hypothesis, multiple comparisons
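To make the diagnostic-testing quantities above concrete, here is a minimal Python sketch that computes sensitivity, specificity, and the predictive values from a 2x2 confusion matrix, and approximates an ROC curve by sweeping a decision threshold over simulated test scores; all counts and score distributions are invented for illustration and are not taken from the article.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix for a diagnostic test
# (all counts are illustrative, not from the article).
TP, FN = 85, 15    # diseased subjects: true positives, false negatives
FP, TN = 20, 180   # healthy subjects: false positives, true negatives

sensitivity = TP / (TP + FN)   # P(test positive | disease present)
specificity = TN / (TN + FP)   # P(test negative | disease absent)
ppv = TP / (TP + FP)           # positive predictive value
npv = TN / (TN + FN)           # negative predictive value
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"PPV={ppv:.2f}, NPV={npv:.2f}")

# ROC idea: sweep a decision threshold over simulated test scores
# (assumed score distributions) and trace sensitivity vs 1 - specificity.
rng = np.random.default_rng(0)
scores_disease = rng.normal(2.0, 1.0, 500)
scores_healthy = rng.normal(0.0, 1.0, 500)
thresholds = np.linspace(-4, 6, 200)
tpr = [(scores_disease > t).mean() for t in thresholds]   # sensitivity
fpr = [(scores_healthy > t).mean() for t in thresholds]   # 1 - specificity
auc = np.trapz(tpr[::-1], fpr[::-1])                      # area under the ROC curve
print(f"approximate AUC = {auc:.2f}")
```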
Summary: During the last two decades, tobacco product reporting requirements from regulators, such as those in Europe, Canada, and the USA, have increased. However, the ability to accurately compare and discriminate between two products is affected by the number of constituents used for the comparison. Performing a large number of simultaneous independent hypothesis tests increases the probability of rejecting the null hypothesis when it should not be rejected, which virtually guarantees the presence of type I errors among the findings. Correction methods, such as the Bonferroni and Benjamini & Hochberg procedures, have been developed to overcome this issue. The performance of these methods was assessed by comparing identical tobacco products with data sets of different sizes. Results showed that multiple comparisons lead to erroneous conclusions if the risk of type I error is not corrected. Unfortunately, reducing the type I error reduces the statistical power of the tests. Consequently, strategies for dealing with multiplicity of data should provide a reasonable balance between testing requirements and the statistical power of differentiation. Multiple testing for product comparison is less of a problem if studies are restricted to the most relevant parameters for comparison.
Keywords: Bonferroni correction, multiple comparisons, statistical power, null hypothesis, product type, alternative hypothesis
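As a rough illustration of the two correction methods named in this summary, the sketch below applies the Bonferroni adjustment and the Benjamini & Hochberg step-up procedure to a hypothetical vector of p-values; the values and the significance level are arbitrary choices, not data from the tobacco study.

```python
import numpy as np

# Hypothetical p-values from m simultaneous constituent comparisons.
pvals = np.array([0.001, 0.008, 0.012, 0.041, 0.049, 0.20, 0.34, 0.52, 0.74, 0.90])
m = len(pvals)
alpha = 0.05

# Bonferroni: control the family-wise error rate by testing each
# hypothesis at alpha / m.
bonferroni_reject = pvals < alpha / m

# Benjamini-Hochberg: control the false discovery rate. Sort the p-values,
# find the largest k with p_(k) <= (k / m) * alpha, and reject the k smallest.
order = np.argsort(pvals)
sorted_p = pvals[order]
thresholds = (np.arange(1, m + 1) / m) * alpha
below = np.nonzero(sorted_p <= thresholds)[0]
bh_reject = np.zeros(m, dtype=bool)
if below.size:
    k = below.max()                  # index of the largest p-value under its threshold
    bh_reject[order[: k + 1]] = True

print("Bonferroni rejections:", bonferroni_reject.sum())   # 1 for these p-values
print("Benjamini-Hochberg rejections:", bh_reject.sum())   # 3 for these p-values
```

The example shows the trade-off described above: the Bonferroni correction guards the family-wise error rate but flags fewer differences, while the Benjamini & Hochberg procedure retains more power at the cost of a controlled proportion of false discoveries.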
Abstract One of the most common statistical procedures in the behavioral and social sciences is testing the hypothesis that treatments or interventions have no effect, or that the correlation between two variables is equal to zero; that is, tests of the null hypothesis (H0). There are two ways to make errors when testing a null hypothesis: (1) Type I error, which consists of rejecting the null hypothesis when it is in fact true, and (2) Type II error, which consists of failing to reject the null hypothesis when it is in fact false. Statistical power is defined as 1 minus the conditional probability of making a Type II error. That is, power is the probability of rejecting the null hypothesis when it is in fact false.
Keywords: statistical power, null hypothesis, alternative hypothesis, p-value
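A small numerical sketch of this definition of power, assuming a one-sided two-sample z-test with a hypothetical effect size and sample size:

```python
from scipy.stats import norm

alpha = 0.05   # type I error rate
delta = 0.5    # assumed true standardized effect size (hypothetical)
n = 40         # subjects per group (hypothetical)

# One-sided two-sample z-test: under the alternative, the test statistic
# is shifted by delta * sqrt(n / 2); beta is the probability it still falls
# below the critical value, and power = 1 - beta.
z_crit = norm.ppf(1 - alpha)
shift = delta * (n / 2) ** 0.5
beta = norm.cdf(z_crit - shift)
print(f"beta = {beta:.2f}, power = {1 - beta:.2f}")   # roughly 0.28 and 0.72
```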
We thank Dr McCulloch [1] for his interest in our recent Statistical Minute on sample size and power in clinical research [2]. We completely agree that α is not the probability of committing a type I error if a study is positive, and that β is not the probability of committing a type II error when a study is negative. However, it is not correct that we are restating a common misunderstanding regarding the interpretation of these parameters. In fact, in this Statistical Minute, we have made neither of the claims that Dr McCulloch [1] correctly criticizes as being incorrect. As we described in this Statistical Minute, α is the predetermined probability of rejecting the null hypothesis when it is in fact true, and β is the probability of not rejecting the null hypothesis when it is in fact false [2,3]. This italicized part of the sentence is crucial. It means that the probabilities are conditional on the null hypothesis being either true or false, respectively, in the population of interest (not in the sample data). Therefore, they cannot be the probabilities that some observed test result is a false-positive or false-negative result: when the null hypothesis is true in the population, any observed "significant" result must be a type I error. Conversely, when the alternative hypothesis is true in the population of interest (ie, the null is false), any nonsignificant test result must be a type II error. In the analogy to diagnostic testing used by Dr McCulloch [1], 1 − α (not 1 − P, as stated in the letter by Dr McCulloch [1]) corresponds to the specificity. While P and α are often confused, it is important to clearly distinguish between them. We refer to previous literature for detail on what P values actually represent [4]; in a nutshell, a P value is the probability of observing a result as extreme as or more extreme than the one observed under the scenario that the null hypothesis is actually true. So a P value depends on the data, whereas α is set in the design of a study, independent of the data. Likewise, power is analogous to the sensitivity of the test (not the specificity, as stated in the letter by Dr McCulloch [1]). We refer to Figure 1 of a recent statistical tutorial on diagnostic testing in Anesthesia & Analgesia, which clearly shows the analogy [5]. Dr McCulloch [1] is correct that it can be helpful to consider the prior probability or belief about the research hypothesis to interpret study results, as done using Bayesian analysis, and it is important to realize that the probability of a particular "significant" study result being a false positive can be considerably higher or lower than α. There are also many situations in which there is no reliable "prior" for a research hypothesis, and in which the frequentist approach is, therefore, most appropriate. However, we have not claimed, nor was it our intention to claim, that 1 − α and 1 − β refer to positive or negative predictive values. If this was not clear enough, we thank Dr McCulloch [1] for the opportunity to explain in more detail.
Patrick Schober, MD, PhD, MMedStat, Department of Anesthesiology, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands, [email protected]
Thomas R. Vetter, MD, MPH, Department of Surgery and Perioperative Care, Dell Medical School at the University of Texas at Austin, Austin, Texas
Keywords: statistical power, alternative hypothesis, null hypothesis, p-value, conditional probability
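The point that α is conditional on the null being true, and is therefore not the probability that an observed significant result is false, can be illustrated with a short calculation; the prior probability used below is an arbitrary assumption for illustration.

```python
alpha = 0.05    # P(significant | H0 true)
power = 0.80    # P(significant | H0 false), analogous to sensitivity
prior = 0.10    # assumed prior probability that the research hypothesis is true

# Among many hypothetical studies, the fraction of "significant" results
# that are false positives depends on the prior, not on alpha alone.
p_sig = power * prior + alpha * (1 - prior)
false_positive_risk = alpha * (1 - prior) / p_sig
print(f"P(false positive | significant) = {false_positive_risk:.2f}")  # 0.36 here, not 0.05
```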
In hypothesis testing, statistical significance is typically based on calculations involving p-values and Type I error rates. A p-value calculated from a single statistical hypothesis test can be used to determine whether there is statistically significant evidence against the null hypothesis. The upper threshold applied to the p-value in making this determination (often 5% in the scientific literature) determines the Type I error rate; i.e., the probability of making a Type I error when the null hypothesis is true. Multiple hypothesis testing is concerned with testing several statistical hypotheses simultaneously. Defining statistical significance is a more complex problem in this setting. A longstanding definition of statistical significance for multiple hypothesis tests involves the probability of making one or more Type I errors among the family of hypothesis tests, called the family-wise error rate. However, there exist other well established formulations of statistical significance for multiple hypothesis tests. The Bayesian framework for classification naturally allows one to calculate the probability that each null hypothesis is true given the observed data (Efron et al. 2001, Storey 2003), and several frequentist definitions of multiple hypothesis testing significance are also well established (Shaffer 1995). Soric (1989) proposed a framework for quantifying the statistical significance of multiple hypothesis tests based on the proportion of Type I errors among all hypothesis tests called statistically significant. He called statistically significant hypothesis tests discoveries and proposed that one be concerned about the rate of false discoveries when testing multiple hypotheses. This false discovery rate is robust to the false positive paradox and is particularly useful in exploratory analyses, where one is more concerned with having mostly true findings among a set of statistically significant discoveries rather than guarding against one or more false positives. Benjamini & Hochberg (1995) provided the first implementation of false discovery rates with known operating characteristics. The idea of quantifying the rate of false discoveries is directly related to several pre-existing ideas, such as Bayesian misclassification rates and the positive predictive value (Storey 2003).
Keywords: p-value, alternative hypothesis, false discovery rate, multiple comparisons, statistical power, null hypothesis
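The family-wise error rate discussed here grows quickly with the number of independent tests; the sketch below computes the probability of at least one Type I error for m independent true nulls tested at level α, together with the per-test Bonferroni and Sidak levels that would hold it at α (the values of m are arbitrary).

```python
alpha = 0.05
for m in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** m          # P(one or more Type I errors) across m true nulls
    bonferroni = alpha / m               # per-test level under the Bonferroni correction
    sidak = 1 - (1 - alpha) ** (1 / m)   # per-test level under the Sidak correction
    print(f"m={m:4d}  FWER={fwer:.3f}  Bonferroni={bonferroni:.5f}  Sidak={sidak:.5f}")
```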
When many tests of significance are examined in a research investigation with procedures that limit the probability of making at least one Type I error--the so-called familywise techniques of control--the likelihood of detecting effects can be very low. That is, when familywise error controlling methods are adopted to assess statistical significance, the size of the critical value that must be exceeded in order to obtain statistical significance can be extremely large when the number of tests to be examined is also very large. In our investigation we examined three methods for increasing the sensitivity to detect effects when family size is large: the false discovery rate method of error control presented by Benjamini and Hochberg (1995), a modified false discovery rate method presented by Benjamini and Hochberg (2000), which estimates the number of true null hypotheses prior to adopting false discovery rate control, and a familywise method modified to control the probability of committing two or more Type I errors in the family of tests examined--not one, as is the case with the usual familywise techniques. Our results indicated that the level of significance for the two-or-more familywise method of Type I error control varied with the testing scenario and needed to be set on occasion at values in excess of 0.15 in order to control the two-or-more rate at a reasonable value of 0.01. In addition, the false discovery rate methods typically resulted in substantially greater power to detect non-null effects even though their levels of significance were set at the standard 0.05 value. Accordingly, we recommend the Benjamini and Hochberg (1995, 2000) methods of Type I error control when the number of tests in the family is large.
Keywords: false discovery rate, multiple comparisons, statistical power, p-value, family-wise error rate, null hypothesis, error control, false positive rate
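A hedged Monte Carlo sketch of the power comparison described above, using statsmodels' multipletests to apply Bonferroni (family-wise) and Benjamini-Hochberg (false discovery rate) control to simulated one-sided tests; the number of tests, the fraction of true effects, and the effect size are all invented settings, not parameters from the study.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
m, m_true, effect, alpha, n_sim = 1000, 100, 3.0, 0.05, 200

bonf_power, bh_power = [], []
for _ in range(n_sim):
    means = np.zeros(m)
    means[:m_true] = effect                       # the first m_true nulls are false
    z = rng.normal(means, 1.0)
    p = norm.sf(z)                                # one-sided p-values

    bonf_reject = multipletests(p, alpha=alpha, method="bonferroni")[0]
    bh_reject = multipletests(p, alpha=alpha, method="fdr_bh")[0]

    bonf_power.append(bonf_reject[:m_true].mean())  # fraction of true effects detected
    bh_power.append(bh_reject[:m_true].mean())

print(f"average per-test power, Bonferroni (FWER): {np.mean(bonf_power):.2f}")
print(f"average per-test power, Benjamini-Hochberg (FDR): {np.mean(bh_power):.2f}")
```

With a large family of tests and a modest fraction of real effects, the false discovery rate procedure detects noticeably more of the non-null effects than the familywise procedure, mirroring the recommendation in the abstract.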
In hypothesis testing, the p value is in routine use as a tool for making statistical decisions: it gathers evidence against the null hypothesis. Although a test is supposed to reject the null hypothesis when it is false and fail to reject it when it is true, there is a potential to err by incorrectly rejecting a true null hypothesis or by wrongly failing to reject a null hypothesis that is false. These are termed type I and type II errors, respectively. The type I error rate (α error) is chosen arbitrarily by the researcher before the start of the experiment and serves as an arbitrary cutoff that bifurcates the quantitative results into two qualitative groups, 'significant' and 'insignificant'. This is known as the level of significance (α level). The type II error rate (β error) is also predetermined so that the statistical test has enough statistical power (1 − β) to detect a statistically significant difference. To achieve adequate statistical power, the minimum sample size required for the study is determined. This approach is potentially flawed because an arbitrary cutoff is chosen as the level of significance and because the statistical power to detect a difference depends on the sample size. Moreover, the p value says nothing about the magnitude of the difference. Therefore, one must be aware of these errors and their role in making statistical decisions.
Keywords: statistical power, p-value, null hypothesis, alternative hypothesis, cutoff, sample size, multiple comparisons
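The sample size determination mentioned here can be sketched with the usual normal-approximation formula for comparing two means; the effect size and error rates below are arbitrary choices for illustration.

```python
import math
from scipy.stats import norm

alpha = 0.05   # two-sided type I error rate
beta = 0.20    # type II error rate, i.e. a target power of 0.80
d = 0.5        # assumed standardized effect size (hypothetical)

# Normal-approximation sample size per group for a two-sample comparison of means:
# n = 2 * ((z_{1-alpha/2} + z_{1-beta}) / d)^2
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(1 - beta)
n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2
print(f"required sample size per group: {math.ceil(n_per_group)}")   # about 63
```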
It is a typical feature of high dimensional data analysis, for example a microarray study, that a researcher performs thousands of statistical tests at a time. All inferences for the tests are determined using the p-values; a p-value smaller than the α-level of the test signifies a statistically significant test. As the number of tests increases, the chance of observing some small p-values is very high even when all null hypotheses are true. Consequently, we may draw wrong conclusions about the hypotheses. This type of problem frequently happens when we test several hypotheses simultaneously, i.e., the multiple testing problem. Adjustment of the p-values can redress the problem that arises in multiple hypothesis testing. P-value adjustment methods control the error rates [type I error (i.e., false positive) and type II error (i.e., false negative)] for each hypothesis in order to achieve high statistical power while keeping the overall family-wise error rate (FWER) no larger than α, where α is most often set to 0.05. However, researchers also consider the false discovery rate (FDR), or positive false discovery rate (pFDR), instead of the type I error rate in multiple comparison problems for microarray studies. The methods that control the FDR always provide higher statistical power than the methods that control the type I error rate while keeping the type II error rate low. In practice, microarray studies involve dependent test statistics (or p-values), because genes can be fully dependent on each other in a complicated biological structure. However, some of the p-value adjustment methods only deal with independent test statistics. Thus, we carry out a simulation study with several methods involved in multiple hypothesis testing.
Keywords: false discovery rate, multiple comparisons, p-value, statistical power, family-wise error rate, false positive rate, nominal level
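To illustrate the central point of this abstract, that many tests on pure noise will yield some small p-values, the sketch below simulates test statistics with no real effects and compares unadjusted significance counts with Bonferroni, Holm, and Benjamini-Hochberg adjustments via statsmodels; the number of tests and the random seed are arbitrary.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
m = 5000                                  # e.g. one test per gene; all nulls true here
z = rng.normal(size=m)                    # test statistics with no real effects
p = 2 * norm.sf(np.abs(z))                # two-sided p-values

print("unadjusted p < 0.05:", (p < 0.05).sum())           # roughly 0.05 * m false positives
for method in ("bonferroni", "holm", "fdr_bh"):
    reject = multipletests(p, alpha=0.05, method=method)[0]
    print(f"{method:>10s} rejections:", reject.sum())      # typically zero when all nulls are true
```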
Thus, a rational approach to hypothesis testing will seek to reject a hypothesis when it is false and to accept it when it is true. The two types of error are rejecting the null hypothesis when it is true (Type I) and accepting the null hypothesis when it is false (Type II). In the Neyman-Pearson theory, it is usual to fix the Type I error probability (α) at some constant (often 0.05, but not necessarily) and then choose a test that minimises the Type II error probability (β), conditional on α. The null hypothesis is then rejected when the associated p-value for the test is less than α, and otherwise accepted.
Keywords: p-value, alternative hypothesis, statistical power, null hypothesis
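A minimal numerical sketch of this Neyman-Pearson recipe for a one-sided test of a normal mean; the hypothesised means, the variance, and the simulated data are all invented for illustration.

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
mu0, mu1, sigma, n = 0.0, 0.5, 1.0, 30     # hypothetical null and alternative means

# Fix alpha, then use the test that minimises beta: here the one-sided z-test.
z_crit = norm.ppf(1 - alpha)               # reject H0 if z > z_crit
beta = norm.cdf(z_crit - (mu1 - mu0) * np.sqrt(n) / sigma)
print(f"critical value = {z_crit:.3f}, beta at mu1 = {beta:.3f}")

# Decision for a hypothetical sample: reject H0 when p < alpha, otherwise accept.
rng = np.random.default_rng(3)
x = rng.normal(mu1, sigma, n)
z = (x.mean() - mu0) * np.sqrt(n) / sigma
p = norm.sf(z)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```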