Investigating the Role of Simpson’s Paradox in the Analysis of Top-Ranked Features in High-Dimensional Bioinformatics Datasets
2019
An important problem in bioinformatics is identifying the most important features (or predictors)
among a large number of features in a given classification dataset. This problem is often addressed by using
a machine learning-based feature ranking method to identify a small set of top-ranked predictors (i.e. the
most relevant features for classification). The many studies in this area share, however, an
important limitation: they ignore the possibility that the top-ranked predictors are involved in an instance of
Simpson’s paradox, where the positive or negative association between a predictor and a class variable
reverses sign upon conditioning on each of the values of a third (confounder) variable. In this work, we review
and investigate the role of Simpson’s paradox in the analysis of top-ranked predictors in high-dimensional
bioinformatics datasets, in order to avoid the potential danger of misinterpreting an association between a
predictor and the class variable. We perform computational experiments using four well-known feature
ranking methods from the machine learning field and five high-dimensional datasets of ageing-related genes,
where the predictors are Gene Ontology terms. The results show that occurrences of Simpson’s paradox
involving top-ranked predictors are much more common for one of the feature ranking methods.
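
The sign reversal that defines Simpson’s paradox can be illustrated with a minimal sketch. It is not the procedure used in this work: it assumes a binary predictor x (for example, a Gene Ontology term being present or absent), a binary class y and a categorical confounder z, and it measures association as the sign of P(y=1 | x=1) - P(y=1 | x=0). The function names and the counts are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def association_sign(pairs):
    """Sign (+1, -1 or 0) of P(y=1 | x=1) - P(y=1 | x=0) for binary (x, y) pairs."""
    n = {0: 0, 1: 0}    # number of instances with x=0 and x=1
    pos = {0: 0, 1: 0}  # number of those instances with y=1
    for x, y in pairs:
        n[x] += 1
        pos[x] += y
    if n[0] == 0 or n[1] == 0:
        return 0  # association is undefined if one predictor value is absent
    # Cross-multiply instead of dividing, to compare the two rates exactly.
    diff = pos[1] * n[0] - pos[0] * n[1]
    return (diff > 0) - (diff < 0)


def is_simpsons_paradox(rows):
    """rows: (x, y, z) triples. True if the overall association is non-zero
    and every stratum of z shows the opposite sign."""
    rows = list(rows)
    overall = association_sign((x, y) for x, y, _ in rows)
    strata = defaultdict(list)
    for x, y, z in rows:
        strata[z].append((x, y))
    signs = [association_sign(pairs) for pairs in strata.values()]
    return overall != 0 and all(s == -overall for s in signs)


def expand(x, z, positives, total):
    """Turn one contingency-table cell into individual (x, y, z) rows."""
    return [(x, 1, z)] * positives + [(x, 0, z)] * (total - positives)


# Illustrative counts (assumed, not taken from the datasets in this work).
rows = (
    expand(1, "a", 81, 87) + expand(0, "a", 234, 270)
    + expand(1, "b", 192, 263) + expand(0, "b", 55, 80)
)
overall_sign = association_sign((x, y) for x, y, _ in rows)
paradox = is_simpsons_paradox(rows)
```

With these counts the predictor is negatively associated with the class overall (273/350 versus 289/350 positives) but positively associated within each value of z, so a ranking based on the pooled data alone would suggest the wrong direction of association.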