Summary In multiple-testing problems, where a large number of hypotheses are tested simultaneously, false discovery rate (FDR) control can be achieved with the well-known Benjamini–Hochberg procedure, which adapts to the amount of signal in the data, under certain distributional assumptions. Many modifications of this procedure have been proposed to improve power in scenarios where the hypotheses are organized into groups or into a hierarchy, as well as other structured settings. Here we introduce the ‘structure-adaptive Benjamini–Hochberg algorithm’ (SABHA) as a generalization of these adaptive testing methods. The SABHA method incorporates prior information about any predetermined type of structure in the pattern of locations of the signals and nulls within the list of hypotheses, to reweight the p-values in a data-adaptive way. This raises the power by making more discoveries in regions where signals appear to be more common. Our main theoretical result proves that the SABHA method controls the FDR at a level that is at most slightly higher than the target FDR level, as long as the adaptive weights are constrained sufficiently so as not to overfit too much to the data; interestingly, the excess FDR can be related to the Rademacher complexity or Gaussian width of the class from which we choose our data-adaptive weights. We apply this general framework to various structured settings, including ordered, grouped and low total variation structures, and obtain the bounds on the FDR for each specific setting. We also examine the empirical performance of the SABHA method on functional magnetic resonance imaging activity data and on gene–drug response data, as well as on simulated data.
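As a point of reference for the procedure named above, the following is a minimal sketch of the standard Benjamini–Hochberg step-up rule in plain Python. The function name `benjamini_hochberg` and the optional `weights` argument are illustrative assumptions: multiplying each p-value by a fixed, user-supplied weight is only one simple way to "reweight" and stands in for SABHA's data-adaptive weights, whose estimation is not described in the abstract.

```python
def benjamini_hochberg(pvalues, alpha=0.1, weights=None):
    """Return the sorted indices rejected by the BH step-up rule.

    `weights` is a hypothetical, fixed reweighting (assumption): each
    p-value is multiplied by its weight before the usual BH rule is
    applied. With weights=None this is the ordinary BH procedure.
    """
    n = len(pvalues)
    if weights is None:
        weights = [1.0] * n
    adjusted = [w * p for w, p in zip(weights, pvalues)]

    # Sort hypotheses by their (reweighted) p-values.
    order = sorted(range(n), key=lambda i: adjusted[i])

    # Find the largest rank k with p_(k) <= alpha * k / n.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if adjusted[idx] <= alpha * rank / n:
            k_max = rank

    # Reject the k_max hypotheses with the smallest p-values.
    return sorted(order[:k_max])


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, alpha=0.05))  # [0, 1]
```

Down-weighting a p-value (weight below 1) makes it easier to reject, which is the mechanism by which a structure-aware method can gain power in regions where signals are believed to be common.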
Given a family of pretrained models and a hold-out set, how can we construct a valid conformal prediction set while selecting a model that minimizes the width of the set? If we use the same hold-out data set both to select a model (the model that yields the smallest conformal prediction sets) and then to construct a conformal prediction set based on that selected model, we suffer a loss of coverage due to selection bias. Alternatively, we could further split the data to perform selection and calibration separately, but this comes at a steep cost if the size of the dataset is limited. In this paper, we address the challenge of constructing a valid prediction set after efficiency-oriented model selection. Our novel methods can be implemented efficiently and admit finite-sample validity guarantees without invoking additional sample-splitting. We show that our methods yield prediction sets with asymptotically optimal size under a certain notion of continuity for the model class. The improved efficiency of the prediction sets constructed by our methods is further demonstrated through applications to synthetic datasets in various settings and a real data example.
Knockoffs is a new framework for controlling the false discovery rate (FDR) in multiple hypothesis testing problems involving complex statistical models. While there has been great emphasis on Type-I error control, Type-II errors have been far less studied. In this paper we analyze the false negative rate or, equivalently, the power of a knockoff procedure associated with the Lasso solution path under an i.i.d. Gaussian design, and find that knockoffs asymptotically achieve close to optimal power with respect to an omniscient oracle. Furthermore, we demonstrate that for sparse signals, performing model selection via knockoff filtering achieves nearly ideal prediction errors as compared to a Lasso oracle equipped with full knowledge of the distribution of the unknown regression coefficients. The i.i.d. Gaussian design is adopted to leverage results concerning the empirical distribution of the Lasso estimates, which makes power calculation possible for both knockoff and oracle procedures.
The proposed spectral CT method solves the constrained one-step spectral CT reconstruction (cOSSCIR) optimization problem to estimate basis material maps while modeling the nonlinear X-ray detection process and enforcing convex constraints on the basis map images. In order to apply the optimization-based reconstruction approach to experimental data, the presented method empirically estimates the effective energy-window spectra using a calibration procedure. The amplitudes of the estimated spectra were further optimized as part of the reconstruction process to reduce ring artifacts. A validation approach was developed to select constraint parameters. The proposed spectral CT method was evaluated through simulations and experiments with a photon-counting detector. Basis material map images were successfully reconstructed using the presented empirical spectral modeling and cOSSCIR optimization approach. In simulations, the cOSSCIR approach accurately reconstructed the basis map images (<1% error). In experiments, the proposed method estimated the low-density polyethylene region of the basis maps with 0.5% error in the PMMA image and 4% error in the aluminum image. For the Teflon region, the experimental results demonstrated 8% and 31% error in the PMMA and aluminum basis material maps, respectively, compared with -24% and 126% error without estimation of the effective energy window spectra, with residual errors likely due to insufficient modeling of detector effects. The cOSSCIR algorithm estimated the material decomposition angle to within 1.3 degree error, where, for reference, the difference in angle between PMMA and muscle tissue is 2.1 degrees. The joint estimation of spectral-response scaling coefficients and basis material maps was found to reduce ring artifacts in both a phantom and tissue specimen. The presented validation procedure demonstrated feasibility for the automated determination of algorithm constraint parameters.
We propose the group knockoff filter, a method for false discovery rate control in a linear regression setting where the features are grouped, and we would like to select a set of relevant groups which have a nonzero effect on the response. By considering the set of true and false discoveries at the group level, this method gains power relative to sparse regression methods. We also apply our method to the multitask regression problem where multiple response variables share similar sparsity patterns across the set of possible features. Empirically, the group knockoff filter successfully controls false discoveries at the group level in both settings, with substantially more discoveries made by leveraging the group structure.
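To make the selection step concrete, here is a minimal sketch of the data-dependent threshold commonly used in knockoff filters (the "knockoff+" rule from the earlier knockoff literature). It assumes that one statistic has already been computed per group, where a large positive value suggests a real effect and the sign is equally likely to be positive or negative for null groups. The input list `W` and the function names are hypothetical; how the group statistics are constructed is not specified in the abstract and is not shown here.

```python
import math


def knockoff_plus_threshold(W, q=0.1):
    """Smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    W is a list of per-group knockoff statistics (hypothetical input).
    Returns math.inf if no candidate threshold meets the target level q.
    """
    # Only the nonzero magnitudes of W need to be tried as thresholds.
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)  # proxy for false discoveries
        positives = sum(1 for w in W if w >= t)   # number of selections at t
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return math.inf


def knockoff_select(W, q=0.1):
    """Indices of groups whose statistic is at or above the threshold."""
    t = knockoff_plus_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= t]


W = [5, 4, 3, -1, 2, 0.5, -0.3]
print(knockoff_select(W, q=0.5))  # [0, 1, 2, 4, 5]
```

The count of large negative statistics serves as an estimate of the number of false discoveries among the large positive ones, so the ratio inside the loop is an estimated false discovery proportion at threshold `t`.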
The field of distribution-free predictive inference provides tools for provably valid prediction without any assumptions on the distribution of the data, which can be paired with any regression algorithm to provide accurate and reliable predictive intervals. The guarantees provided by these methods are typically marginal, meaning that predictive accuracy holds on average over both the training data set and the test point that is queried. However, it may be preferable to obtain a stronger guarantee of training-conditional coverage, which would ensure that most draws of the training data set result in high predictive accuracy on future test points. This property is known to hold for the split conformal prediction method. In this work, we examine the training-conditional coverage properties of several other distribution-free predictive inference methods, and find that training-conditional coverage is achieved by some methods but is impossible to guarantee without further assumptions for others.
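The split conformal method mentioned above can be sketched in a few lines. This is a minimal illustration, assuming absolute-residual scores and a model that has already been fitted on separate training data; the function name and its arguments (`cal_preds`, `cal_targets`, `test_pred`) are hypothetical and chosen only for this example.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    """Prediction interval from held-out calibration residuals.

    cal_preds / cal_targets: model predictions and true responses on a
    calibration set that was not used to fit the model (assumption).
    Returns (lower, upper) with marginal coverage at least 1 - alpha
    when calibration and test points are exchangeable.
    """
    # Conformity scores: absolute residuals on the calibration set.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)

    # Rank of the score used as the interval half-width.
    k = math.ceil((1 - alpha) * (n + 1))
    if k > n:
        # Too few calibration points for this alpha: trivial interval.
        return (-math.inf, math.inf)

    q = scores[k - 1]
    return (test_pred - q, test_pred + q)


cal_preds = [0.0] * 9
cal_targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(split_conformal_interval(cal_preds, cal_targets, 10.0, alpha=0.5))
# (5.0, 15.0)
```

The `(n + 1)` in the rank accounts for the test point itself, which is what yields a finite-sample marginal guarantee rather than an asymptotic one.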
Permutation tests date back nearly a century to Fisher's randomized experiments, and remain an immensely popular statistical tool, used for testing hypotheses of independence between variables and other common inferential questions. Much of the existing literature has emphasized that, for the permutation p-value to be valid, one must first pick a subgroup $G$ of permutations (which could equal the full group) and then recalculate the test statistic on permuted data using either an exhaustive enumeration of $G$, or a sample from $G$ drawn uniformly at random. In this work, we demonstrate that the focus on subgroups and uniform sampling are both unnecessary for validity -- in fact, a simple random modification of the permutation p-value remains valid even when using an arbitrary distribution (not necessarily uniform) over any subset of permutations (not necessarily a subgroup). We provide a unified theoretical treatment of such generalized permutation tests, recovering all known results from the literature as special cases. Thus, this work expands the flexibility of the permutation test toolkit available to the practitioner.
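For orientation, the classical setting described above (permutations drawn uniformly at random from the full permutation group) can be sketched as follows. This is only the standard Monte Carlo permutation p-value for a two-sample comparison, not the generalized procedure the abstract proposes; the test statistic `mean_difference` and the function names are assumptions made for this example.

```python
import random


def mean_difference(x, y):
    """Absolute difference in sample means (illustrative test statistic)."""
    return abs(sum(x) / len(x) - sum(y) / len(y))


def permutation_pvalue(x, y, num_permutations=199, seed=0):
    """Monte Carlo permutation p-value with uniformly sampled permutations.

    Under the null that the pooled observations are exchangeable, the
    returned value (1 + count) / (1 + num_permutations) is a valid p-value.
    """
    rng = random.Random(seed)
    observed = mean_difference(x, y)
    pooled = list(x) + list(y)

    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # uniform random permutation of the pooled data
        stat = mean_difference(pooled[:len(x)], pooled[len(x):])
        if stat >= observed:
            count += 1

    # The "+1" terms count the observed (identity) arrangement itself.
    return (1 + count) / (1 + num_permutations)


print(permutation_pvalue([1, 2, 3], [1, 2, 3]))  # 1.0
```

With identical groups the observed statistic is zero, so every permuted statistic is at least as large and the p-value equals 1.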
We develop tools for selective inference in the setting of group sparsity, including the construction of confidence intervals and p-values for testing selected groups of variables. Our main technical result gives the precise distribution of the magnitude of the projection of the data onto a given subspace, and enables us to develop inference procedures for a broad class of group-sparse selection methods, including the group lasso, iterative hard thresholding, and forward stepwise regression. We give numerical results to illustrate these tools on simulated data and on health record data.
Algorithm evaluation and comparison are fundamental questions in machine learning and statistics -- how well does an algorithm perform at a given modeling task, and which algorithm performs best? Many methods have been developed to assess algorithm performance, often based around cross-validation type strategies, retraining the algorithm of interest on different subsets of the data and assessing its performance on the held-out data points. Despite the broad use of such procedures, the theoretical properties of these methods are not yet fully understood. In this work, we explore some fundamental limits for answering these questions with limited amounts of data. In particular, we make a distinction between two questions: how good is an algorithm $A$ at the problem of learning from a training set of size $n$, versus, how good is a particular fitted model produced by running $A$ on a particular training data set of size $n$? Our main results prove that, for any test that treats the algorithm $A$ as a ``black box'' (i.e., we can only study the behavior of $A$ empirically), there is a fundamental limit on our ability to carry out inference on the performance of $A$, unless the number of available data points $N$ is many times larger than the sample size $n$ of interest. (On the other hand, evaluating the performance of a particular fitted model is easy as long as a holdout data set is available -- that is, as long as $N-n$ is not too small.) We also ask whether an assumption of algorithmic stability might be sufficient to circumvent this hardness result. Surprisingly, we find that this is not the case: the same hardness result still holds for the problem of evaluating the performance of $A$, aside from a high-stability regime where fitted models are essentially nonrandom. Finally, we also establish similar hardness results for the problem of comparing multiple algorithms.