Identification of Triple-Negative Breast Cancer Genes and a Novel High-Risk Breast Cancer Prediction Model Development Based on PPI Data and Support Vector Machines

2019 
Triple-negative breast cancers (TNBC) are a highly heterogeneous group of cancers that are difficult to treat. It is crucial to identify breast cancer-related genes, which could provide new biomarkers for breast cancer diagnosis as well as potential treatment targets. To develop our new high-risk breast cancer prediction model, we used seven raw gene expression data sets from the NCBI GEO database (GSE31519, GSE9574, GSE20194, GSE20271, GSE32646, GSE45255, and GSE15852). We selected significant genes using the minimum redundancy-maximum relevance (mRMR) method. We then mapped the transcripts of these genes onto the protein-protein interaction (PPI) network retrieved from the STRING database and traced the shortest path between each pair of proteins. Genes with higher betweenness values were selected from the proteins on these shortest paths. To ensure validity and precision, a permutation test was performed: we randomly selected 248 proteins from the PPI network for shortest path tracing and repeated the procedure 100 times. Genes that appeared more frequently in the randomized results were removed. As a result, 54 genes were selected as potential TNBC-related genes. Using 14 of these 54 genes as features, we developed a novel support vector machine (SVM) model to predict high-risk breast cancer. The prediction accuracy for normal versus TNBC tissues reached 95.394%, and that for Stage II versus Stage III TNBC reached 86.598%, indicating that these genes play important roles in distinguishing breast cancers and that the method could be promising in practical use. Some of the 54 genes we identified from the PPI network have been reported in the literature to be related to breast cancer. Several other genes have not yet been reported but show greater functional similarity to known cancer genes; these may be novel breast cancer-related genes and need further experimental validation. Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed on the 54 identified genes. The results indicate that cellular response to organic cyclic compound plays a role in breast cancer and that most of the genes may be related to viral carcinogenesis.
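The shortest-path tracing and permutation filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy network, the function names (`build_graph`, `shortest_path`, `path_counts`, `permutation_filter`), and the `max_fraction` cutoff are all assumptions made for illustration. It also traces a single breadth-first shortest path per seed pair and uses a simple occurrence count as a stand-in for the betweenness value, where the paper works on the STRING network with 248 proteins and 100 repetitions.

```python
import random
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_path(graph, src, dst):
    """Return one shortest path from src to dst (BFS), or None if unreachable."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph[node]):  # sorted for reproducible tie-breaking
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def path_counts(graph, seeds):
    """Count how often each node lies strictly inside a seed-to-seed shortest path."""
    counts = {}
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(graph, a, b)
        if path is None:
            continue
        for node in path[1:-1]:
            counts[node] = counts.get(node, 0) + 1
    return counts


def permutation_filter(graph, seeds, n_rounds=100, max_fraction=0.05, rng=None):
    """Keep candidates that random seed sets rarely reproduce.

    A candidate is dropped when, in more than max_fraction of the rounds,
    a random seed set of the same size gives it a count at least as high
    as the observed one. max_fraction is an assumed, illustrative cutoff.
    """
    rng = rng or random.Random(0)
    observed = path_counts(graph, seeds)
    candidates = {n: c for n, c in observed.items() if n not in seeds}
    nodes = sorted(graph)
    exceed = {n: 0 for n in candidates}
    for _ in range(n_rounds):
        random_seeds = rng.sample(nodes, len(seeds))
        random_counts = path_counts(graph, random_seeds)
        for n, count in candidates.items():
            if random_counts.get(n, 0) >= count:
                exceed[n] += 1
    return {n: c for n, c in candidates.items()
            if exceed[n] / n_rounds <= max_fraction}


# Hypothetical toy network: three seed proteins all connected through hub "H".
toy = build_graph([("S1", "H"), ("S2", "H"), ("S3", "H"), ("H", "P"), ("P", "Q")])
seeds = {"S1", "S2", "S3"}
observed = path_counts(toy, seeds)
kept = permutation_filter(toy, seeds, n_rounds=20)
```

In this toy network every seed-to-seed path runs through `H`, so `observed` contains only that hub. Whether it survives `permutation_filter` depends on how often random seed sets also route through it, which is the point of the permutation step: proteins that sit on many shortest paths regardless of the seed set are removed as uninformative.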