Bayesian jackknife tests with a small number of subsets: application to HERA 21 cm power spectrum upper limits
Michael J. Wilensky, Fraser Kennedy, Philip Bull, Joshua S. Dillon, Zara Abdurashidova, Tyrone Adams, James E. Aguirre, Paul Alexander, Zaki S. Ali, Rushelle Baartman, Yanga Balfour, Adam P. Beardsley, G. Bernardi, Tashalee S. Billings, Judd D. Bowman, Richard F. Bradley, Jacob Burba, Steven Carey, C. L. Carilli, Carina Cheng, David R. DeBoer, Eloy de Lera Acedo, Matt Dexter, Nico Eksteen, John Ely, Aaron Ewall-Wice, Nicolas Fagnoni, Randall Fritz, Steven R. Furlanetto, Kingsley Gale-Sides, Brian Glendenning, Deepthi Gorthi, Bradley Greig, Jasper Grobbelaar, Ziyaad Halday, B. J. Hazelton, Jacqueline N. Hewitt, J. Hickish, Daniel Jacobs, Austin Julius, MacCalvin Kariseb, Nicholas S. Kern, Joshua Kerrigan, Piyanat Kittiwisit, Saul A. Kohn, Matthew Kolopanis, Adam Lanman, Paul La Plante, Adrian Liu, Anita Loots, David H. E. MacMahon, Lourence Malan, Cresshim Malgas, Keith Malgas, Bradley Marero, Zachary E. Martinot, Andrei Mesinger, Mathakane Molewa, M. F. Morales, Tshegofalang Mosiane, Steven Murray, Abraham R. Neben, Bojan Nikolic, Hans Nuwegeld, Aaron R. Parsons, Nipanjana Patra, Samantha Pieterse, N. Razavi-Ghods, James Robnett, Kathryn Rosie, Peter Sims, Hilton Swarts, Nithyanandan Thyagarajan, Pieter van Wyngaarden, Peter K. G. Williams, Haoxuan Zheng
Abstract:
We present a Bayesian jackknife test for assessing the probability that a data set contains biased subsets, and, if so, which of the subsets are likely to be biased. The test can be used to assess the presence and likely source of statistical tension between different measurements of the same quantities in an automated manner. Under certain broadly applicable assumptions, the test is analytically tractable. We also provide an open-source code, chiborg, that performs both analytic and numerical computations of the test on general Gaussian-distributed data. After exploring the information-theoretic aspects of the test and its performance with an array of simulations, we apply it to data from the Hydrogen Epoch of Reionization Array (HERA) to assess whether different sub-seasons of observing can justifiably be combined to produce a deeper 21 cm power spectrum upper limit. We find that, with a handful of exceptions, the HERA data in question are statistically consistent and this decision is justified. We conclude by pointing out the wide applicability of this test, including to CMB experiments and the H0 tension.

Keywords: jackknife resampling; statistical power; null hypothesis; alternative hypothesis; p-value; null model
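The abstract does not reproduce chiborg's interface, so the following is a minimal from-scratch sketch of the kind of Gaussian model comparison it describes: each hypothesis flags a pattern of biased subsets, the bias and the shared mean get Gaussian priors, and the analytic marginal likelihood ranks the hypotheses. All names, priors, and numbers here are illustrative assumptions, not chiborg's actual API.

```python
# A minimal sketch of a Bayesian "biased subset" (jackknife) test for
# Gaussian data, in the spirit of the method described in the abstract.
# NOT the chiborg API; the model details below are illustrative assumptions.
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def hypothesis_posteriors(d, noise_cov, mu0=0.0, var_mu=1.0, var_bias=25.0):
    """Posterior probability of each 'which subsets are biased' hypothesis.

    d           : (n,) subset estimates of the same underlying quantity
    noise_cov   : (n, n) noise covariance of the estimates
    mu0, var_mu : Gaussian prior on the common (shared) mean
    var_bias    : prior variance of an additive bias on a flagged subset
    """
    n = len(d)
    ones = np.ones((n, n))
    posts = {}
    # Enumerate all 2^n bias configurations (feasible for a small n).
    for flags in itertools.product([0, 1], repeat=n):
        # Marginal covariance: noise + shared-mean prior + bias prior on flags
        cov = noise_cov + var_mu * ones + var_bias * np.diag(flags)
        evidence = multivariate_normal(mean=np.full(n, mu0), cov=cov).pdf(d)
        posts[flags] = evidence  # uniform prior over hypotheses
    norm = sum(posts.values())
    return {k: v / norm for k, v in posts.items()}

# Example: four subsets of the same measurement, one (index 2) biased high.
rng = np.random.default_rng(0)
d = rng.normal(0.0, 1.0, 4); d[2] += 5.0
posts = hypothesis_posteriors(d, noise_cov=np.eye(4))
best = max(posts, key=posts.get)
print("most probable bias pattern:", best, f"p = {posts[best]:.3f}")
```

Because every ingredient is Gaussian, the evidence for each hypothesis is a closed-form multivariate normal density, which is what makes the test analytically tractable; the 2^n enumeration is only practical for the small numbers of subsets the title refers to.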
Related papers:
Calculations of the power of statistical tests are important in planning research studies (including meta-analyses) and in interpreting situations in which a result has not proven to be statistically significant. The authors describe procedures to compute statistical power of fixed- and random-effects tests of the mean effect size, tests for heterogeneity (or variation) of effect size parameters across studies, and tests for contrasts among effect sizes of different studies. Examples are given using 2 published meta-analyses. The examples illustrate that statistical power is not always high in meta-analysis.
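As a concrete illustration of the fixed-effect calculation described above (a sketch under assumed inputs, not the authors' code), the power of the two-sided test that the mean effect is zero follows from the standard error of the weighted mean effect:

```python
# Hedged sketch of a fixed-effect meta-analysis power calculation of the
# kind the paragraph describes; the effect size and variances are invented.
import numpy as np
from scipy.stats import norm

def fixed_effect_power(effect, variances, alpha=0.05):
    """Power of the two-sided fixed-effect test that the mean effect is zero.

    effect    : assumed true common effect size
    variances : per-study sampling variances of the effect estimates
    """
    weights = 1.0 / np.asarray(variances)
    se = 1.0 / np.sqrt(weights.sum())   # SE of the weighted mean effect
    lam = effect / se                   # noncentrality under the alternative
    c = norm.ppf(1 - alpha / 2)         # two-sided critical value
    return (1 - norm.cdf(c - lam)) + norm.cdf(-c - lam)

# Five small studies with modest per-study precision: power ~ 0.5,
# illustrating that power is not always high in meta-analysis.
print(f"power = {fixed_effect_power(0.2, [0.05] * 5):.2f}")
```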
Although courts have incorporated statistical hypothesis testing into their evaluation of numerical evidence in a variety of cases, they have primarily focused on one aspect of a statistical analysis: whether or not the result is 'statistically significant' at the 0.05 or 'two-standard-deviation' level. The theory underlying hypothesis testing is also concerned with the power of the test to detect a meaningful difference. This article shows that using the insights provided by power calculations should help courts better interpret and evaluate the statistical analyses submitted into evidence. In particular, the concept of power should help in assessing whether a sample is too small to provide reliable inferences. On the other hand, very large samples can classify minor differences as statistically significant; this occurs when the power of the test at the standard 0.05 level is very high. Requiring significance at a more stringent level, e.g., 0.005, which can be determined from a power calculation, often resolves this problem.
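A hedged numerical illustration of that argument: with an assumed trivially small true difference, a two-sample z-test's power climbs toward 1 as the sample grows, and moving the significance level from 0.05 to 0.005 raises the bar. The effect size and sample sizes below are invented for illustration.

```python
# Power of a two-sided two-sample z-test for a tiny difference, showing how
# very large samples make it "significant" at 0.05 but less so at 0.005.
from scipy.stats import norm

def two_sample_power(delta, sigma, n, alpha):
    """Approximate power of a two-sided two-sample z-test (n per group)."""
    se = sigma * (2.0 / n) ** 0.5
    c = norm.ppf(1 - alpha / 2)
    lam = delta / se
    return (1 - norm.cdf(c - lam)) + norm.cdf(-c - lam)

for n in (100, 10_000, 1_000_000):
    p05 = two_sample_power(0.02, 1.0, n, 0.05)    # trivially small difference
    p005 = two_sample_power(0.02, 1.0, n, 0.005)
    print(f"n={n:>9,}  power@0.05={p05:.3f}  power@0.005={p005:.3f}")
```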
Statistical significance, effect size and statistical power carry great weight in quantitative educational research, and this chapter addresses these three key factors. It explains what they are, what they mean and how to work with them. The chapter cautions against over-reliance on significance testing and raises several concerns about null hypothesis significance testing (NHST). In addressing significance testing, and as an introduction to effect size, the chapter introduces correlational analysis and advocates complementing significance testing with the use of effect size. It describes the different measures of effect size used with different statistics and how to calculate them, since measures of effect size, whether in standardized units, original units or unit-free measures, vary according to the statistical tests used to calculate them. The chapter introduces statistical power and, in doing so, clarifies Type I and Type II errors, what they mean and how to avoid them.
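For instance, a standardized mean difference (Cohen's d) can be reported next to the significance test it complements. This sketch uses made-up samples:

```python
# Complementing a t-test with an effect size, per the chapter's advice.
import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.7])
b = np.array([4.2, 4.6, 4.1, 4.9, 4.3, 4.5])

t, p = stats.ttest_ind(a, b)  # significance test
# Pooled-SD standardized mean difference (Cohen's d)
sp = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
             / (len(a) + len(b) - 2))
d = (a.mean() - b.mean()) / sp
print(f"t = {t:.2f}, p = {p:.3f}, Cohen's d = {d:.2f}")
```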
In hypothesis testing, the p-value is in routine use as a tool for making statistical decisions: it gathers evidence with which to reject the null hypothesis. Although the test is supposed to reject the null hypothesis when it is false and fail to reject it when it is true, there is a potential to err by incorrectly rejecting a true null hypothesis or by failing to reject a false one. These are known as type I and type II errors, respectively. The type I error rate (α) is chosen arbitrarily by the researcher before the start of the experiment and serves as a cutoff that bifurcates the quantitative results into two qualitative groups, 'significant' and 'insignificant'; this is known as the level of significance (α level). The type II error rate (β) is also predetermined, so that the statistical test has enough statistical power (1 − β) to detect a statistically significant difference. To achieve adequate statistical power, the minimum sample size required for the study is determined. This approach is potentially flawed, owing to the arbitrary cutoff chosen as the level of significance and to the dependence of statistical power on sample size. Moreover, the p-value says nothing about the magnitude of the difference. One must therefore be aware of these errors and their role in making statistical decisions.
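The sample-size step described above can be made concrete. In this hedged sketch, alpha and the target power fix the normal quantiles, which then determine the minimum per-group n for an assumed detectable difference:

```python
# Minimum sample size per group for a two-sided two-sample z-test, given
# alpha, desired power (1 - beta), and an assumed detectable difference.
import math
from scipy.stats import norm

def min_n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Smallest n per group achieving the desired power."""
    z_a = norm.ppf(1 - alpha / 2)   # type I error cutoff
    z_b = norm.ppf(power)           # controls type II error beta
    return math.ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# Classic textbook case: delta = 0.5 SD at 80% power needs ~63 per group.
print(min_n_per_group(delta=0.5, sigma=1.0))
```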
In comparing or assessing methods of treatment it is vital that the appropriate number of patients is selected in order to ensure that the conclusions drawn are statistically viable. This annotation describes the relevance of a statistical power analysis in the context of hypothesis testing to the determination of the optimal sample size of a study. The power of the test indicates how likely it is that the test will correctly produce a statistically significant result.
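One common way to carry out such a power analysis is by simulation; the sketch below (with an invented effect size and candidate sample sizes, not taken from the annotation) estimates power as the fraction of simulated trials in which the test correctly reaches significance:

```python
# Monte Carlo power estimate: simulate many trials at a candidate sample
# size and count how often the t-test correctly detects the true effect.
import numpy as np
from scipy import stats

def simulated_power(n, delta=0.5, sigma=1.0, alpha=0.05, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        a = rng.normal(0.0, sigma, n)    # control group
        b = rng.normal(delta, sigma, n)  # treated group, true effect delta
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / trials

for n in (20, 40, 80):
    print(f"n per group = {n:>2}: power ~ {simulated_power(n):.2f}")
```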
Statistical hypothesis testing has recently come under fire, as invalid and inadequately powered tests have become more frequent and the results of statistical hypothesis testing have become harder to interpret. Following a review of the classic definition of a statistical test of a hypothesis, common ways in which this approach has gone awry are discussed. The review concludes with a description of the structure needed to support valid and powerful statistical testing, and of the context in which power calculations are properly done.
The Bonferroni procedure controls the risk of rejecting one or more true null hypotheses no matter how many significance tests are performed, but permits the risk of failing to reject false null hypotheses to grow with the number of tests. The loss of statistical power associated with the use of this procedure is demonstrated, and two options for alleviating the problem are explored. Setting a less stringent significance level for the set of tests is shown to be less effective than increasing the sample size.
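A short sketch of the power loss being demonstrated, under assumed numbers: holding the per-test effect and sample size fixed, the Bonferroni per-test level alpha/m drags per-test power down as the number of tests m grows:

```python
# Per-test power under a Bonferroni correction as the number of tests grows.
from scipy.stats import norm

def z_test_power(delta, sigma, n, alpha):
    """Two-sided one-sample z-test power at significance level alpha."""
    c = norm.ppf(1 - alpha / 2)
    lam = delta / (sigma / n ** 0.5)
    return (1 - norm.cdf(c - lam)) + norm.cdf(-c - lam)

# Same effect and sample size throughout; only the per-test level changes.
for m in (1, 5, 20, 100):
    p = z_test_power(delta=0.3, sigma=1.0, n=100, alpha=0.05 / m)
    print(f"m = {m:>3} tests: per-test power = {p:.2f}")
```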