Integrative Sufficient Dimension Reduction Methods for Multi-Omics Data Analysis
2017
With the advent of high-throughput genome-wide assays, it has become possible to measure multiple types of genomic data simultaneously. Several projects, such as TCGA, ICGC, and NCI-60, have generated comprehensive, multi-dimensional maps of key genomic changes (miRNA, mRNA, proteomics, etc.) in cancer samples [2,4]. These genomic data can be used to classify tumor types [5]. Integrative analysis of data from multiple sources can potentially provide additional biological insights, but methods for such analysis are lacking. A widely used way to handle high-dimensional data is to remove redundant information from the integrated sample. Most expressed genes carry overlapping information, so the data can be projected onto a lower-dimensional space and then used to classify tumor types with little or no loss of information. Sufficient dimension reduction (SDR) [1], a supervised dimension reduction approach, is well suited to this goal. In this paper, we propose a novel integrative SDR method that reduces the dimensions of multiple data types simultaneously while sharing common latent structures to improve prediction and interpretation. In particular, we extend sliced inverse regression (SIR), a major SDR method, to integrate multiple omics data types for simultaneous dimension reduction. SIR is a supervised dimension reduction method that assumes the outcome variable Y depends on the predictor X only through d unknown linear combinations of the predictor [3]. The predictor is replaced by its projection onto a lower-dimensional subspace of the predictor space without loss of information. The aim is to find the central subspace (CS), defined as the intersection of all subspaces δ of the predictor space satisfying Y ⫫ X | P_δ X, where P_δ denotes the projection onto δ. To integrate multiple types of data, we propose and implement a new integrative sufficient dimension reduction method that extends SIR [3], called integrative SIR.
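As a rough illustration of the conventional SIR step that integrative SIR extends, the following Python sketch estimates d directions by whitening X, averaging the whitened predictors within slices of Y, and taking the leading eigenvectors of the weighted covariance of the slice means. It is not the integrative method proposed here. The function name `sir_directions`, the choice of one slice per class label, and the small ridge term are assumptions made for this example, and the sketch presumes more samples than predictors.

```python
import numpy as np

def sir_directions(X, y, d, ridge=1e-8):
    """Sketch of conventional SIR for a categorical outcome.

    Assumptions (not from the source): one slice per class label in y,
    and a small ridge term added to the covariance for stability.
    Returns a (p, d) basis matrix B and the column means of X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu

    # Sample covariance of the predictors, lightly regularized.
    sigma = Xc.T @ Xc / n + ridge * np.eye(p)

    # Inverse square root of the covariance via eigen-decomposition.
    evals, evecs = np.linalg.eigh(sigma)
    sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

    # Standardized (whitened) predictors.
    Z = Xc @ sigma_inv_sqrt

    # Weighted covariance of the slice means of Z.
    M = np.zeros((p, p))
    for label in np.unique(y):
        in_slice = y == label
        slice_mean = Z[in_slice].mean(axis=0)
        M += in_slice.mean() * np.outer(slice_mean, slice_mean)

    # Leading d eigenvectors of M, mapped back to the original scale.
    w, v = np.linalg.eigh(M)
    top = v[:, np.argsort(w)[::-1][:d]]
    B = sigma_inv_sqrt @ top
    return B, mu
```

The reduced predictors are then `(X - mu) @ B`, an n-by-d matrix that can be passed to any classifier.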
The main idea is to use the information in all omics data types simultaneously while finding a basis matrix for each data type, with some latent structures shared across data types. The result is d-dimensional data, where d is much smaller than the original data dimension. The reduced dimension d was chosen by cross-validation. To demonstrate the integrated analysis of multi-omics data, we applied conventional SIR and integrative SIR to the mRNA, miRNA, and proteomics expression profiles of a subset of cell lines from the NCI-60 panel and compared the results. The data are taken from [6]. The outcomes to classify are the CNS, leukemia, and melanoma tumor types. We pre-screened 400 variables from each data type using a high-variance criterion. To estimate classification error, we applied random forest classification to the output of each method, with leave-one-out cross-validation. Integrative SIR yielded lower classification error than conventional SIR. To summarize, we proposed integrative SIR, a new supervised dimension reduction technique for the integrative analysis of multi-omics data. Unlike conventional SDR methods, the new approach reduces the dimensions of multiple omics data types simultaneously while sharing common latent structures across data types, without losing predictive information. By efficiently capturing the common information, integrative SIR classified tumor types more accurately than conventional SDR methods in our numerical study.
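The evaluation steps described above (variance pre-screening, then leave-one-out cross-validation of a classifier) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the classifier is a simple nearest-centroid stand-in rather than the random forest used in the study. The abstract does not specify whether dimension reduction is refit inside each fold, so the sketch simply evaluates a fixed feature matrix.

```python
import numpy as np

def prescreen_by_variance(X, k):
    """Return sorted indices of the k columns of X with the largest variance."""
    variances = np.var(np.asarray(X, dtype=float), axis=0, ddof=1)
    return np.sort(np.argsort(variances)[::-1][:k])

def nearest_centroid_predict(X_train, y_train, x_new):
    """Stand-in classifier (an assumption; the study used random forest)."""
    labels = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in labels])
    distances = ((centroids - x_new) ** 2).sum(axis=1)
    return labels[np.argmin(distances)]

def loocv_error(X, y, predict=nearest_centroid_predict):
    """Leave-one-out misclassification rate for a predict(X_tr, y_tr, x) function."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i  # hold out sample i
        wrong += int(predict(X[keep], y[keep], X[i]) != y[i])
    return wrong / n
```

For example, `X[:, prescreen_by_variance(X, 400)]` keeps the 400 highest-variance variables of one data type, and `loocv_error` can then be called on the reduced features from each method to compare their error rates.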