On stability of Canonical Correlation Analysis and Partial Least Squares with application to brain-behavior associations

2020 
Associations between high-dimensional datasets, each comprising many features, can be discovered through multivariate statistical methods such as Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). Both methods are widely used to identify which features carry the association. Despite the longevity and popularity of CCA/PLS approaches, their application to high-dimensional datasets raises critical questions about the reliability of the resulting solutions. In particular, overfitting can produce solutions that are not stable across datasets, which severely hinders their interpretability and generalizability. To study these issues, we developed a generative model to simulate synthetic datasets with multivariate associations, parameterized by feature dimensionality, data variance structure, and assumed latent association strength. We found that the resulting CCA/PLS associations could be highly inaccurate when the number of samples per feature is relatively small. For PLS, the profiles of feature weights exhibit a detrimental bias toward leading principal component axes. We confirmed these model trends in state-of-the-art datasets containing neuroimaging and behavioral measurements in large numbers of subjects, namely the Human Connectome Project (n ≈ 1000) and UK Biobank (n = 20000), where we found that only the latter comprised enough samples to obtain stable estimates. An analysis of the neuroimaging literature that uses CCA to map brain-behavior relationships revealed that commonly employed sample sizes yield unstable CCA solutions. Our generative modeling framework provides a calculator of the dataset properties required for stable estimates. Collectively, our study characterizes the dataset properties needed to limit the potentially detrimental effects of overfitting on the stability of CCA/PLS solutions, and provides practical recommendations for future studies.
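The dependence of CCA accuracy on samples per feature can be illustrated with a minimal sketch. It is not the generative model of the study: it assumes a much simpler toy setup in which all features are standard normal and a single pair of features (one in each dataset) carries an association of known strength. The names `simulate` and `first_canonical_correlation`, the parameter values, and the definition of samples per feature as the sample size divided by the total number of features are all assumptions made for this illustration.

```python
import numpy as np


def simulate(n_samples, n_features, true_corr, rng):
    """Toy data (an assumption, not the paper's model): X and Y are
    i.i.d. standard normal except that Y[:, 0] correlates with X[:, 0]
    at true_corr, so the population first canonical correlation
    equals true_corr."""
    X = rng.standard_normal((n_samples, n_features))
    Y = rng.standard_normal((n_samples, n_features))
    Y[:, 0] = true_corr * X[:, 0] + np.sqrt(1.0 - true_corr**2) * Y[:, 0]
    return X, Y


def first_canonical_correlation(X, Y):
    """In-sample first canonical correlation.

    The canonical correlations are the singular values of Qx.T @ Qy,
    where Qx and Qy are orthonormal bases of the centered data."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))  # clip tiny numerical overshoot


rng = np.random.default_rng(0)
n_features = 10   # features per dataset (hypothetical value)
true_corr = 0.3   # assumed latent association strength

estimates = {}
for samples_per_feature in (2, 10, 200):
    n_samples = samples_per_feature * 2 * n_features
    X, Y = simulate(n_samples, n_features, true_corr, rng)
    estimates[samples_per_feature] = first_canonical_correlation(X, Y)

for spf, est in estimates.items():
    print(f"samples/feature={spf:4d}  estimated r={est:.2f}  true r={true_corr}")
```

With few samples per feature, the in-sample estimate should lie well above the true value, and it should approach the true value as the sample size grows. This sketch only looks at the in-sample correlation; it does not assess the accuracy of the feature weights or the PLS bias toward principal component axes described above.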