The Influence of Sampling on Imbalanced Data Classification

2019 
Class imbalance alone does not make a classification task challenging. When the classes are linearly separable, a regular classification algorithm usually induces predictive models able to distinguish the classes properly. Imbalanced data becomes a problem for the minority class when the classes in the training set overlap or the decision boundary is complex. Assessing these characteristics is fundamental to understanding the difficulty of the classification task and to choosing adequate pre-processing techniques for imbalanced data. Measures that quantify the complexity of a classification task for a given dataset have been proposed; they use different criteria to estimate how difficult it is to induce a classifier from that dataset. In this paper, we investigate the use of data complexity measures to estimate the best sample size for data imbalance pre-processing techniques. To this end, we assess the predictive performance and the data complexity of real datasets after applying pre-processing techniques with different sample sizes. The experiments indicate that data complexity measures can help in choosing a sample size that improves the predictive performance of the classifiers. We also observe that the difficulty of predicting the minority class alone is not a sufficient criterion when dealing with sampling. To address this deficiency, we suggest combining the data complexity of both classes.
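The procedure outlined in the abstract (resample at several sample sizes, then measure the complexity of each class) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's method: the complexity measure is a leave-one-out 1-nearest-neighbour error rate computed per class, the pre-processing technique is random undersampling of the majority class, the combination of both classes is a plain mean, and the toy data and the helper names `loo_1nn_error_by_class`, `undersample_majority`, and `make_toy_data` are hypothetical.

```python
import math
import random


def loo_1nn_error_by_class(points, labels):
    """Leave-one-out 1-NN error rate, reported separately for each class."""
    errors = {c: 0 for c in set(labels)}
    counts = {c: 0 for c in set(labels)}
    for i, (p, y) in enumerate(zip(points, labels)):
        best_d, best_y = math.inf, None
        for j, (q, yq) in enumerate(zip(points, labels)):
            if i == j:
                continue  # leave the query point itself out
            d = math.dist(p, q)
            if d < best_d:
                best_d, best_y = d, yq
        counts[y] += 1
        errors[y] += best_y != y
    return {c: errors[c] / counts[c] for c in counts}


def undersample_majority(points, labels, majority, keep_fraction, rng):
    """Randomly keep a fraction of the majority class; keep all other points."""
    maj = [i for i, y in enumerate(labels) if y == majority]
    rest = [i for i, y in enumerate(labels) if y != majority]
    n_keep = max(1, round(keep_fraction * len(maj)))
    kept = sorted(rng.sample(maj, n_keep) + rest)
    return [points[i] for i in kept], [labels[i] for i in kept]


def make_toy_data(rng, n_major=200, n_minor=20):
    """Hypothetical overlapping 2-D Gaussians: class 0 majority, class 1 minority."""
    major = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_major)]
    minor = [(rng.gauss(1.5, 1), rng.gauss(1.5, 1)) for _ in range(n_minor)]
    return major + minor, [0] * n_major + [1] * n_minor


rng = random.Random(0)
X, y = make_toy_data(rng)

results = {}
for frac in (1.0, 0.5, 0.25, 0.1):
    Xs, ys = undersample_majority(X, y, majority=0, keep_fraction=frac, rng=rng)
    per_class = loo_1nn_error_by_class(Xs, ys)
    # Assumed combination: unweighted mean of the two class-wise values.
    combined = sum(per_class.values()) / len(per_class)
    results[frac] = (len(Xs), per_class, combined)
    print(f"keep={frac:.2f} n={len(Xs)} per_class={per_class} combined={combined:.3f}")
```

Under these assumptions, one would look for the sample size at which the combined value is lowest, rather than the one that minimizes the minority-class value only, since aggressive undersampling can make the minority class easier while making the majority class harder.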