Comparing feature selection methods for high-dimensional imbalanced data: identifying rheumatoid arthritis cohorts from routine data

2015 
Linking routine and administrative databases from multiple sources offers an advantageous way to understand chronic diseases such as arthropathies. Data mining classification algorithms can be a cost-effective approach to identifying cohorts of patients with particular disorders within these complex databases. However, selecting good candidate predictors for a given condition from a patient's history, which may span a very large number of health records, can be challenging, particularly when prevalence is low; the result is a high-dimensional, imbalanced data space. A Feature Selection (FS) methodology is proposed to overcome this problem, providing a fast way to find relevant predictors and potentially improving classifier performance. This study compared the performance of five FS methods (Binomial distribution, Chi-square, Information Gain, GINI and DKM), using as an exemplar a dataset of routine data from the Abertawe Bro Morgannwg University Health Board (UK) linked to a specialized rheumatology database (CELLMA) to identify Rheumatoid Arthritis (RA) patients. Preliminary results showed that an initial list of 36243 possible predictors could be reduced to fewer than 200 while still achieving desirable performance in identifying RA patients. Chi-square and GINI selected combinations of predictors with the highest accuracy and positive predictive values earlier than the other methods.
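The abstract does not give the scoring formulas, so the following is only a minimal sketch of how four of the five criteria are commonly defined for a binary predictor (for example, presence or absence of a clinical code) against a binary class. It assumes the textbook 2x2 Chi-square statistic and the usual impurity-gain forms for Information Gain (entropy), GINI and DKM; the study's own definitions may differ. The Binomial distribution method is omitted because the abstract does not describe it. The feature names and counts are hypothetical toy values, not data from the study.

```python
import math

# Each feature is summarised by a 2x2 table of counts:
#   tp: cases with the feature      fp: non-cases with the feature
#   fn: cases without the feature   tn: non-cases without the feature

def entropy(p):
    """Binary entropy impurity (basis of Information Gain)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    """Gini impurity for a binary class."""
    return 2 * p * (1 - p)

def dkm(p):
    """DKM impurity, assumed here to be 2 * sqrt(p * (1 - p))."""
    return 2 * math.sqrt(p * (1 - p))

def impurity_gain(tp, fp, fn, tn, impurity):
    """Parent impurity minus the size-weighted impurity of the two splits."""
    n = tp + fp + fn + tn
    n_with, n_without = tp + fp, fn + tn
    children = 0.0
    if n_with:
        children += n_with / n * impurity(tp / n_with)
    if n_without:
        children += n_without / n * impurity(fn / n_without)
    return impurity((tp + fn) / n) - children

def chi_square(tp, fp, fn, tn):
    """Pearson Chi-square statistic for a 2x2 table (no continuity correction)."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0

def rank_features(counts, score, top_k=None):
    """Return feature names sorted by descending score, optionally truncated."""
    ranked = sorted(counts, key=lambda name: score(*counts[name]), reverse=True)
    return ranked[:top_k]

# Hypothetical toy counts: 10 cases among 100 patients (imbalanced classes).
toy_counts = {
    "code_A": (8, 2, 2, 88),    # mostly seen in cases
    "code_B": (5, 45, 5, 45),   # equally common in cases and non-cases
    "code_C": (10, 0, 0, 90),   # seen in every case and no non-case
}

scorers = {
    "chi_square": chi_square,
    "info_gain": lambda *c: impurity_gain(*c, impurity=entropy),
    "gini": lambda *c: impurity_gain(*c, impurity=gini),
    "dkm": lambda *c: impurity_gain(*c, impurity=dkm),
}

if __name__ == "__main__":
    for name, score in scorers.items():
        print(name, rank_features(toy_counts, score))
```

In a setting like the one described, each of the candidate predictors would be scored once per criterion and only the top-ranked ones passed to the classifier; `top_k` stands in for that cut-off.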