Effective Disease Prediction on Gene Family Abundance Using Feature Selection and Binning Approach

Thanh-Hai Nguyen, Tan Tai Phan, Cong-Tinh Dao, Dang-Vinh-Phuc Ta, Thi-Ngoc-Cham Nguyen, Nguyen-Minh-Thao Phan, Huynh-Ngoc Pham

2021

Metagenomics is now a novel source for supporting the diagnosis and prognosis of human diseases. Numerous studies have pointed to the crucial role of metagenomics in personalized medicine approaches. In recent years, machine learning has been widely deployed in metagenomic research. Gene family data are usually characterized by very high dimensionality, which can reach millions of features, while the number of samples obtained is rather small compared to the number of attributes. As a result, models often perform poorly on validation sets even though they reach high accuracy during the training phase. Moreover, the very large number of features in each gene family dataset makes processing and learning time-consuming. In this study, we propose a feature selection method using Ridge Regression on gene family datasets; the selected features are then binned with an equal-width binning approach and fed into either a Linear Regression model or a One-Dimensional Convolutional Neural Network (CNN1D) to perform the prediction tasks. The experiments are conducted on more than 1000 samples from gene family abundance datasets related to Liver Cirrhosis, Colorectal Cancer, Inflammatory Bowel Disease, Obesity and Type 2 Diabetes. The results show that the proposed method, which combines feature selection with binning, significantly improves both prediction performance and execution time compared to state-of-the-art methods.
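The two preprocessing steps named in the abstract, Ridge-based feature selection followed by equal-width binning, can be sketched roughly as follows. This is a minimal illustration in plain Python and not the authors' implementation: the function names (`ridge_coefficients`, `select_top_k`, `equal_width_bin`), the gradient-descent solver, the regularization strength, the ranking by absolute coefficient, the number of bins and the value range are all assumptions made for the example, since the abstract does not specify them. The downstream Linear Regression or CNN1D model is omitted.

```python
def ridge_coefficients(X, y, alpha=0.1, lr=0.1, epochs=2000):
    """Fit ridge regression (no intercept) by batch gradient descent.

    Minimizes mean squared error plus alpha * ||w||^2.
    Solver and hyperparameters are illustrative assumptions.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        # Residuals are computed once per epoch from the current weights.
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - yi
                     for row, yi in zip(X, y)]
        for j in range(d):
            grad = 2.0 * sum(r * row[j] for r, row in zip(residuals, X)) / n
            grad += 2.0 * alpha * w[j]
            w[j] -= lr * grad
    return w


def select_top_k(weights, k):
    """Return the indices of the k features with the largest |coefficient|."""
    ranked = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return sorted(ranked[:k])


def equal_width_bin(values, n_bins, lo, hi):
    """Map each value to the index of one of n_bins equal-width bins on [lo, hi]."""
    binned = []
    for v in values:
        v = min(max(v, lo), hi)  # clip values outside the assumed range
        idx = int((v - lo) / (hi - lo) * n_bins)
        binned.append(min(idx, n_bins - 1))  # v == hi falls in the last bin
    return binned


# Toy data (hypothetical): feature 0 matches the label, the others are noise.
X = [[1.0, 0.2, 0.0],
     [0.0, 0.1, 1.0],
     [1.0, 0.3, 1.0],
     [0.0, 0.2, 0.0]]
y = [1.0, 0.0, 1.0, 0.0]

w = ridge_coefficients(X, y)
kept = select_top_k(w, k=1)
X_binned = [equal_width_bin([row[j] for j in kept], n_bins=4, lo=0.0, hi=1.0)
            for row in X]
```

On the toy data, the feature that matches the label receives the largest coefficient and is the one kept; its values are then replaced by bin indices before being passed to a classifier.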
