Classification in Imbalanced Datasets

Thomas Debray, Evgueni N. Smirnov

2009

In this thesis we study the classification task in the presence of class-imbalanced data. This task arises in many applications where the classes of interest are the under-represented (minority) classes. Examples include fraud detection, medical diagnosis and monitoring, text categorization, risk management, and information retrieval and filtering. Although many standard approaches to the classification task exist, most of them generalise poorly on the minority class. This thesis studies well-known approaches to classification with class-imbalanced data, such as Cost-Sensitivity, Bagging for Imbalanced Datasets, MetaCost and SMOTE. The main contribution of the thesis is a new generative approach to the problem that we call Naive Bayes Sampling. It generates new instances of the minority class by bootstrapping the values of each feature present in the training data. Experiments on four UCI datasets and a medical dataset provided by KULeuven show the superiority of our approach.
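
The per-feature bootstrapping idea can be illustrated with a minimal sketch. This is only one reading of the abstract, not the thesis's implementation: it assumes that each feature value of a synthetic instance is drawn independently, with replacement, from the values that feature takes among the minority-class training instances (the "naive" independence assumption). The function name `naive_bayes_sample` and its parameters are hypothetical.

```python
import random


def naive_bayes_sample(minority_rows, n_new, seed=None):
    """Generate n_new synthetic minority-class instances (illustrative sketch).

    Assumption: every feature is resampled independently of the others,
    by drawing with replacement from that feature's observed values in
    the minority-class training instances.
    """
    rng = random.Random(seed)
    n_features = len(minority_rows[0])
    # Collect the observed values of each feature (one list per column).
    columns = [[row[j] for row in minority_rows] for j in range(n_features)]
    # Build each new instance by bootstrapping one value per feature.
    return [
        [rng.choice(columns[j]) for j in range(n_features)]
        for _ in range(n_new)
    ]


# Hypothetical minority-class instances with one numeric and one nominal feature.
minority = [[1.0, "a"], [2.0, "b"], [3.0, "c"]]
synthetic = naive_bayes_sample(minority, n_new=5, seed=0)
```

Because the features are sampled independently, a synthetic instance may combine feature values that never occurred together in the training data, which is what distinguishes this sketch from simply duplicating minority instances.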
