Analysis of variance when both input and output sets are high-dimensional

2020 
Motivation: Modern genomic data sets often involve multiple data layers (e.g., DNA sequence, gene expression), each of which can itself be high-dimensional. The biological processes underlying these data layers can lead to intricate multivariate association patterns.

Results: We propose and evaluate two methods for analysis of variance when both input and output sets are high-dimensional. Our approach uses random effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (EigenANOVA) and one that uses random vectors (Monte Carlo ANOVA, MCANOVA) in the linear span of the output set. We used simulations to assess the bias and variance of each method and to compare them with those of Partial Least Squares (PLS), an approach commonly used in multivariate high-dimensional regressions. The MCANOVA method gave nearly unbiased estimates in all the simulation scenarios considered. Estimates produced by EigenANOVA and PLS had noticeable biases. Finally, we demonstrate the insight that can be obtained with MCANOVA and EigenANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken genomes and to the assessment of interdependencies between gene expression, methylation and copy number variants in data from breast cancer tumors.

Availability: The Supplementary data includes an R implementation of each of the proposed methods as well as the scripts used in the simulations and in the real-data analyses.
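The Monte Carlo idea described in the Results can be illustrated with a minimal Python sketch. This is not the authors' R implementation from the Supplementary data. It assumes that each random vector is a Gaussian-weighted combination of the centered output columns, and it substitutes a simple Haseman-Elston style moment estimator for whatever random effects fitting procedure the paper uses, which the abstract does not specify. The function names `he_variance_proportion` and `mc_anova_sketch` and all parameters are hypothetical.

```python
import numpy as np


def he_variance_proportion(K, y):
    """Moment (Haseman-Elston style) estimate of the proportion of variance
    of y explained by a random effect with covariance proportional to K.

    Assumption: this stands in for a likelihood-based random effects fit.
    """
    y = (y - y.mean()) / y.std()            # standardize the response
    off = ~np.eye(len(y), dtype=bool)       # mask selecting off-diagonal pairs
    yy = np.outer(y, y)                     # cross-products y_i * y_j
    # Regress y_i * y_j on K_ij through the origin over all pairs i != j.
    return float((K[off] * yy[off]).sum() / (K[off] ** 2).sum())


def mc_anova_sketch(X, Y, n_draws=200, seed=0):
    """Estimate, for random vectors in the linear span of Y, the proportion
    of variance explained by regression on X.

    X: (n, p) input set; Y: (n, q) output set.
    Assumption: no column of X is constant (its standard deviation is nonzero).
    Returns an array with one estimate per random draw.
    """
    rng = np.random.default_rng(seed)
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize input columns
    K = Xc @ Xc.T / X.shape[1]                  # n x n similarity matrix from X
    Yc = Y - Y.mean(axis=0)                     # center output columns
    estimates = np.empty(n_draws)
    for b in range(n_draws):
        w = rng.standard_normal(Y.shape[1])     # random weights (assumed Gaussian)
        v = Yc @ w                              # random vector in the span of Y
        estimates[b] = he_variance_proportion(K, v)
    return estimates
```

Under these assumptions, the mean of the returned array summarizes the average proportion of variance explained across the span of the output set, and the spread of the draws gives a rough sense of how that proportion varies with direction.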