We consider the estimation of densities in multiple subpopulations, where the available sample size varies greatly across subpopulations. This problem occurs in epidemiology, for example, where different diseases may share similar pathogenic mechanisms but differ in their prevalence. Without specifying a parametric form, our proposed method pools information across the population and estimates the density in each subpopulation in a data-driven fashion. Drawing from functional data analysis, low-dimensional approximating density families in the form of exponential families are constructed from the principal modes of variation of the log-densities. Subpopulation densities are subsequently fitted within the approximating families based on likelihood principles and shrinkage. The approximating families become more flexible as the number of components increases and can approximate arbitrary infinite-dimensional densities. We also derive convergence results for the density estimates under discrete observations. The proposed methods are shown to be interpretable and efficient in simulations as well as in applications to electronic medical record and rainfall data.
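The construction above can be sketched in a few lines: pool a collection of densities, extract the leading modes of variation of their log-densities, and fit one subpopulation sample by maximum likelihood within the resulting exponential family. This is a minimal illustration under entirely synthetic Beta-density data, not the paper's implementation (which also involves shrinkage and discretely observed data).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta

rng = np.random.default_rng(0)
grid = np.linspace(0.01, 0.99, 99)   # evaluation grid on (0, 1)
dx = grid[1] - grid[0]

# Pooled collection of densities (here: Beta densities, purely illustrative).
dens = np.array([beta.pdf(grid, a, 3.0) for a in rng.uniform(2, 5, size=20)])

# Principal modes of variation of the log-densities.
log_dens = np.log(dens)
mean_log = log_dens.mean(axis=0)
_, _, Vt = np.linalg.svd(log_dens - mean_log, full_matrices=False)
K = 2
phi = Vt[:K]                          # leading K modes evaluated on the grid

def family_density(theta):
    """Member of the K-parameter exponential family, normalized on the grid."""
    unnorm = np.exp(mean_log + theta @ phi)
    return unnorm / (unnorm.sum() * dx)

# Fit one subpopulation sample by maximum likelihood within the family.
x = beta.rvs(4.0, 3.0, size=30, random_state=rng)
idx = np.clip(np.searchsorted(grid, x), 0, len(grid) - 1)

def neg_log_lik(theta):
    return -np.log(family_density(theta)[idx]).sum()

theta_hat = minimize(neg_log_lik, np.zeros(K)).x
fitted = family_density(theta_hat)    # integrates to one by construction
```

Increasing `K` enlarges the family, mirroring the growing flexibility described in the abstract.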
The infinite-dimensional Hilbert sphere $S^\infty$ has been widely employed to model density functions and shapes, extending its finite-dimensional counterpart. We consider the Fr\'echet mean as an intrinsic summary of the central tendency of data lying on $S^\infty$. To lay a path for sound statistical inference, we derive properties of the Fr\'echet mean on $S^\infty$ by establishing its existence and uniqueness as well as a root-$n$ central limit theorem (CLT) for the sample version, overcoming obstructions posed by the infinite dimensionality and non-compactness of $S^\infty$. Intrinsic CLTs for the estimated tangent vectors and covariance operator are also obtained. Asymptotic and bootstrap hypothesis tests for the Fr\'echet mean based on projection and norm are then proposed and are shown to be consistent. The proposed two-sample tests are applied to make inference for daily taxi demand patterns over Manhattan modeled as densities, whose square roots are analyzed on the Hilbert sphere. Numerical properties of the proposed hypothesis tests, which utilize the spherical geometry, are studied in the real data application and in simulations, where we demonstrate that the tests based on the intrinsic geometry compare favorably to those based on an extrinsic or flat geometry.
Numerous quantitative trait loci (QTL) have been mapped in tetraploid and hexaploid wheat and wheat relatives, mostly with simple sequence repeat (SSR) or single nucleotide polymorphism (SNP) markers. Meta-analysis of QTL requires projecting them onto a common genomic framework, either a consensus genetic map or a genomic sequence; the latter strategy is pursued here. Of the 774 QTL mapped in wheat and wheat relatives found in the literature, 585 (75.6%) were successfully projected onto the Aegilops tauschii pseudomolecules. QTL mapped with SNP markers were projected more successfully (92.2%) than those mapped with SSR markers (66.2%). The QTL were not distributed homogeneously along the chromosome arms: their frequencies increased in the proximal-to-distal direction but declined in the most distal regions, and they were weakly correlated with recombination rates along the chromosome arms. Databases for the projected SSR markers and QTL were constructed and incorporated into the Ae. tauschii JBrowse. To facilitate meta-QTL analysis, eight clusters of QTL were used to estimate the standard deviations ($\sigma$) of independently mapped QTL projected onto the Ae. tauschii genome sequence. The standard deviations $\sigma$ were modeled as an exponentially decaying function of the recombination rate along the Ae. tauschii chromosomes. We implemented four hypothesis tests for determining the membership of a query QTL. The hypothesis tests and the estimation procedure for $\sigma$ were implemented in a web portal for meta-analysis of projected QTL. Twenty-one QTL for Fusarium head blight resistance mapped on wheat chromosomes 3A, 3B, and 3D were analyzed to illustrate the use of the portal for meta-QTL analyses.
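The exponential-decay model for $\sigma$ as a function of recombination rate can be fitted with a standard nonlinear least-squares routine. The numbers below are entirely made up for illustration and do not come from the study; only the functional form $\sigma(r) = a\,e^{-br} + c$ mirrors the modeling idea.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical cluster summaries: recombination rate (cM/Mb) versus the
# standard deviation (Mb) of projected QTL positions; values are synthetic.
rate = np.array([0.2, 0.5, 1.0, 1.5, 2.2, 3.0, 4.1, 5.0])
sd = np.array([9.5, 7.8, 5.9, 4.6, 3.2, 2.5, 1.9, 1.6])

def decay(r, a, b, c):
    """sigma(r) = a * exp(-b * r) + c, an exponential decay with offset."""
    return a * np.exp(-b * r) + c

(a, b, c), _ = curve_fit(decay, rate, sd, p0=(8.0, 0.5, 1.0))
```

A positive fitted $b$ corresponds to the reported pattern of projection uncertainty shrinking in regions of higher recombination.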
Inference for functional linear models in the presence of heteroscedastic errors has received insufficient attention given its practical importance; in fact, even a central limit theorem has not been established in this case. The difficulty is that conditional mean (projection of the slope function) estimates have complicated sampling distributions due to the infinite-dimensional regressors, which create truncation bias and scaling problems that are compounded by non-constant variance under heteroscedasticity. As a foundation for distributional inference, we establish a central limit theorem for the estimated projection under general dependent errors, and subsequently develop a paired bootstrap method to approximate sampling distributions. The proposed paired bootstrap does not follow the standard bootstrap algorithm for finite-dimensional regressors, as that version fails outside a narrow window of implementation with functional regressors; the reason is a bias that arises with functional regressors in a naive bootstrap construction. Our bootstrap proposal incorporates debiasing and thereby attains much broader validity and flexibility in the choice of truncation parameters for inference under heteroscedasticity; even when the naive approach is valid, the proposed bootstrap method performs better numerically. The bootstrap is applied to construct confidence intervals for projections and to conduct hypothesis tests for the slope function. Our theoretical results on bootstrap consistency are demonstrated through simulation studies and illustrated with real data examples.
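To fix ideas, the following sketch estimates a projection $\langle\beta, x_0\rangle$ by FPCA-truncated regression and forms a percentile interval by resampling $(X_i, Y_i)$ pairs. It is the naive paired bootstrap on a discretized grid (Euclidean inner product standing in for the $L^2$ inner product), without the debiasing step that the paper develops; all model components are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 50)
n = 150

# Functional regressors from a two-component expansion (grid values).
phi1, phi2 = np.sin(np.pi * t), np.sin(2 * np.pi * t)
scores = rng.normal(size=(n, 2)) * np.array([1.0, 0.5])
X = scores @ np.vstack([phi1, phi2])
beta_fn = 0.2 * phi1 - 0.1 * phi2
# Heteroscedastic errors: the noise level depends on the regressor.
Y = X @ beta_fn + rng.normal(size=n) * (0.5 + 0.5 * np.abs(scores[:, 0]))

def proj_hat(Xs, Ys, x0, K=2):
    """FPCA-truncated estimate of the projection <beta, x0>."""
    Xc, Yc = Xs - Xs.mean(0), Ys - Ys.mean()
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    S = Xc @ Vt[:K].T                    # estimated FPC scores
    b = (S * Yc[:, None]).mean(0) / (S ** 2).mean(0)
    return (b @ Vt[:K]) @ x0             # plug-in projection estimate

x0 = phi1
est = proj_hat(X, Y, x0)
# Paired bootstrap: resample (X_i, Y_i) jointly, re-estimate, take quantiles.
boot = np.array([proj_hat(X[i], Y[i], x0)
                 for i in (rng.choice(n, n) for _ in range(500))])
ci = np.quantile(boot, [0.025, 0.975])   # naive percentile interval
```

Resampling pairs (rather than residuals) is what allows the interval to reflect the non-constant error variance.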
We develop a representation of Gaussian-distributed, sparsely sampled longitudinal data whereby the data for each subject are mapped to a multivariate Gaussian distribution; this map is entirely data-driven. The proposed method utilizes functional principal component analysis and is nonparametric, assuming no prior knowledge of the covariance or mean structure of the longitudinal data. This approach naturally connects with a deeper investigation of the behavior of the functional principal component scores obtained for longitudinal data as the number of observations per subject increases from sparse to dense. We show how this is reflected in the shrinkage of the distribution of the conditional scores given noisy longitudinal observations towards a point mass located at the true but unobservable functional principal component scores. Mapping each subject's sparse observations to the corresponding conditional score distribution leads to useful visualizations and representations of sparse longitudinal data. Asymptotic rates of convergence as the sample size increases are obtained for the 2-Wasserstein metric between the true and estimated conditional score distributions, both for a $K$-truncated functional principal component representation and for the case where $K=K(n)$ diverges with the sample size $n\to\infty$. We apply these ideas to construct predictive distributions aimed at predicting outcomes given sparse longitudinal data.
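Under the Gaussian assumption, the conditional score distribution given a subject's sparse noisy observations is multivariate Gaussian with closed-form mean and covariance. The sketch below computes it for one subject with the mean function, eigenfunctions, and eigenvalues taken as known; in practice these are estimated by FPCA from the pooled data, which is the data-driven part of the method.

```python
import numpy as np

# Assumed-known model components (in practice estimated by FPCA).
t = np.linspace(0, 1, 51)
mu = np.sin(2 * np.pi * t)
phi = np.vstack([np.sqrt(2) * np.cos(np.pi * t),
                 np.sqrt(2) * np.cos(2 * np.pi * t)])
lam = np.array([1.0, 0.25])           # eigenvalues of the covariance operator
sig2 = 0.1                            # measurement-error variance

rng = np.random.default_rng(1)
xi = rng.normal(0.0, np.sqrt(lam))    # true (unobservable) scores
obs = np.sort(rng.choice(len(t), size=4, replace=False))  # sparse time points
y = mu[obs] + xi @ phi[:, obs] + rng.normal(0, np.sqrt(sig2), size=len(obs))

# Gaussian conditional distribution of the scores given the observations.
Phi = phi[:, obs].T                   # (n_obs, K) design matrix
Sigma_y = Phi @ np.diag(lam) @ Phi.T + sig2 * np.eye(len(obs))
A = np.diag(lam) @ Phi.T @ np.linalg.inv(Sigma_y)
cond_mean = A @ (y - mu[obs])         # best prediction of the scores
cond_cov = np.diag(lam) - A @ Phi @ np.diag(lam)
```

As the number of observation times per subject grows, `cond_cov` shrinks towards zero, which is precisely the convergence to a point mass at the true scores described above.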
We propose a nonparametric method to explicitly model and represent the derivatives of smooth underlying trajectories for longitudinal data. This representation is based on a direct Karhunen--Lo\`eve expansion of the unobserved derivatives and leads to the notion of derivative principal component analysis, which complements functional principal component analysis, one of the most popular tools of functional data analysis. The proposed derivative principal component scores can be obtained for irregularly spaced and sparsely observed longitudinal data, as typically encountered in biomedical studies, as well as for functional data that are densely measured. Novel consistency results and asymptotic convergence rates for the proposed estimates of the derivative principal component scores and other model components are derived under a unified scheme for sparse or dense observations and mild conditions. We compare the proposed representations for derivatives with alternative approaches in simulation settings and in a wallaby growth curve application. It emerges that representations using the proposed derivative principal component analysis recover the underlying derivatives more accurately than principal component analysis-based approaches, especially in settings where the functional data are represented by only a very small number of components or are densely sampled. In a second example on wheat spectra classification, derivative principal component scores were found to be more predictive of the protein content of wheat than the conventional functional principal component scores.
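The core idea, expanding the derivatives directly rather than differentiating an expansion of the curves, can be sketched as follows for densely observed trajectories. Finite differences stand in for the smoothing-based derivative estimates, and the sparse case requires the conditioning machinery developed in the paper; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 101)
n = 60

# Smooth trajectories built from two random components (densely observed).
scores = rng.normal(size=(n, 2)) * np.array([1.0, 0.4])
X = scores @ np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])

# Estimate derivative curves, then perform PCA directly on the derivatives:
# the eigenvectors are the derivative principal components.
dX = np.gradient(X, t, axis=1)        # per-curve derivative estimates
dXc = dX - dX.mean(axis=0)
_, svals, Vt = np.linalg.svd(dXc, full_matrices=False)
d_components = Vt[:2]                 # leading modes of the derivatives
d_scores = dXc @ d_components.T       # derivative principal component scores
```

Differentiating the ordinary principal components of `X` instead would amplify whatever roughness those components carry, which is why the direct expansion of the derivatives tends to recover them more accurately.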
We develop a novel exploratory tool for non-Euclidean object data based on data depth, extending Tukey's celebrated halfspace depth for Euclidean data. The proposed metric halfspace depth, applicable to data objects in a general metric space, assigns to data points depth values that characterize the centrality of these points with respect to the distribution and provides an interpretable center-outward ranking. Desirable theoretical properties that generalize standard depth properties postulated for Euclidean data are established for the metric halfspace depth. The depth median, defined as the deepest point, is shown to have high robustness as a location descriptor both in theory and in simulation. We propose an efficient algorithm to approximate the metric halfspace depth and illustrate its ability to adapt to the intrinsic data geometry. The metric halfspace depth was applied to an Alzheimer's disease study, revealing group differences in brain connectivity, modeled as covariance matrices, between subjects in different stages of dementia. Based on phylogenetic trees of seven pathogenic parasites, the proposed metric halfspace depth was also used to construct a meaningful consensus estimate of the evolutionary history and to identify potential outlier trees. Supplementary materials for this article are available online.
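A crude Monte-Carlo approximation conveys the idea: metric halfspaces are sets of the form $\{z : d(z,p) \le d(z,q)\}$ for anchor pairs $(p,q)$, and the depth of a point is the smallest probability mass among the halfspaces containing it. The sketch below uses the Euclidean metric on synthetic data and random anchor pairs; it is not the efficient algorithm proposed in the paper.

```python
import numpy as np

def metric_halfspace_depth(x, data, n_pairs=2000, rng=None):
    """Monte-Carlo approximation of metric halfspace depth.

    Halfspaces are {z : d(z, p) <= d(z, q)} for anchor pairs (p, q) sampled
    from the data; depth(x) is the minimal empirical mass of a halfspace
    containing x. Euclidean distance is used, but any metric d would do.
    """
    if rng is None:
        rng = np.random.default_rng(7)
    depth = 1.0
    for _ in range(n_pairs):
        i, j = rng.choice(len(data), size=2, replace=False)
        p, q = data[i], data[j]
        if np.linalg.norm(x - p) <= np.linalg.norm(x - q):
            # This halfspace contains x; record its empirical mass.
            mass = np.mean(np.linalg.norm(data - p, axis=1)
                           <= np.linalg.norm(data - q, axis=1))
            depth = min(depth, mass)
    return depth

rng = np.random.default_rng(7)
data = rng.normal(size=(200, 2))
center_depth = metric_halfspace_depth(np.zeros(2), data)
outlier_depth = metric_halfspace_depth(np.array([5.0, 5.0]), data)
```

Central points lie only in halfspaces of substantial mass, while an outlier is contained in some nearly empty halfspace, producing the center-outward ranking.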