Principal component approximation and interpretation in health survey and biobank data

2018 
Background: Surveys and administrative databases contain increasing numbers of variables. Principal component analysis (PCA) is important for summarizing data and reducing dimensionality. However, one disadvantage of PCA is the limited interpretability of the principal components (PCs), especially in a high-dimensional database. By analyzing the variance distribution according to PCA loadings and approximating PCs with input variables, we aim to demonstrate the importance of variables based on the proportions of total variance contributed or explained by input variables.
Methods: Five data sets of various sizes were used to assess the performance of PC approximation: Hitters, the SF-12v2 subset of the 2004 to 2011 Medical Expenditure Panel Survey (MEPS), and the full set of 1996 to 2011 MEPS data, along with two data sets derived from the Canadian Health Measures Survey (CHMS): a spirometry subset with the measures from the first trial of spirometry and a full data set that contained non-redundant variables. The variables in each data set were centered and scaled before PCA. PC approximation was studied with two approaches. First, the PCA loadings were squared to estimate the variance contributed by each variable to each PC. Second, forward-stepwise regression was used to approximate each PC with the input variables.
Results: The first few PCs had large variances in each data set. In these data sets, approximating PCs with stepwise regression identified the input variables that explain large portions of PC variance more efficiently than approximating according to PCA loadings. Fewer variables were required to explain more than 80% of the PC variances through stepwise regression.
Conclusion: Approximating and interpreting PCs with stepwise regression is highly feasible. PC approximation is useful to 1) interpret PCs with input variables, 2) understand the major sources of variance in data sets, 3) select unique sources of information, and 4) search and rank input variables according to the proportions of PC variance explained. This can be an approach to systematically understand databases and identify the variables that are most important within them.
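The two approximation approaches described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the function names (`standardize`, `pca`, `forward_select`), the greedy selection rule, and the 0.8 stopping threshold are assumptions chosen to mirror the abstract's description, using only NumPy.

```python
import numpy as np

def standardize(X):
    # Center and scale each variable, as done before PCA in the abstract.
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(Z):
    # PCA via SVD of the standardized data; columns of `loadings` are unit-norm.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt.T
    scores = Z @ loadings
    variances = s ** 2 / (Z.shape[0] - 1)
    return loadings, scores, variances

def r_squared(Z_sub, y):
    # R^2 of a least-squares fit; no intercept is needed because Z and y are centered.
    coef, *_ = np.linalg.lstsq(Z_sub, y, rcond=None)
    resid = y - Z_sub @ coef
    return 1.0 - (resid @ resid) / (y @ y)

def forward_select(Z, y, threshold=0.8):
    # Hypothetical greedy forward selection: at each step add the variable that
    # raises R^2 the most, and stop once `threshold` of the PC variance is explained.
    selected, r2 = [], 0.0
    remaining = list(range(Z.shape[1]))
    while remaining and r2 < threshold:
        r2, best = max((r_squared(Z[:, selected + [j]], y), j) for j in remaining)
        selected.append(best)
        remaining.remove(best)
    return selected, r2

# Synthetic, correlated data (an assumption; not one of the paper's data sets).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(200, 6))

Z = standardize(X)
loadings, scores, variances = pca(Z)

# Approach 1: squared loadings give each variable's share of a PC's variance.
share_pc1 = loadings[:, 0] ** 2

# Approach 2: forward-stepwise regression of the first PC on the input variables.
selected, r2 = forward_select(Z, scores[:, 0], threshold=0.8)
print("squared loadings for PC1:", np.round(share_pc1, 3))
print("variables selected for PC1:", selected, "R^2:", round(r2, 3))
```

Because the loading vectors have unit norm, the squared loadings of each PC sum to 1 and can be read as proportions. The regression on all input variables reproduces a PC exactly, so the selection loop is guaranteed to reach the threshold.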