A post-processing strategy for SVM learning from unbalanced data
Abstract:
Standard learning algorithms may perform poorly when learning from unbalanced datasets. Based on Fisher's discriminant analysis, a post-processing strategy is introduced to deal with datasets that exhibit significant imbalance in the data distribution. A new bias is defined, which reduces the skew towards the minority class. Empirical results from experiments with a learned SVM model on twelve UCI datasets indicate that the proposed solution improves on the original SVM, and also on the results reported for the z-SVM, in terms of g-mean and sensitivity.
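The abstract does not spell out the new bias, but the Fisher connection suggests re-thresholding the trained SVM's decision scores. Below is a hypothetical sketch of such a post-processing step, not the paper's actual formula: the helper names (fisher_bias, predict_with_bias) and the spread-weighted cut rule are assumptions.

```python
# Hypothetical sketch of a Fisher-style post-processed SVM threshold.
# The weighting rule below is an assumption, not the paper's exact formula.
import numpy as np
from sklearn.svm import SVC

def fisher_bias(svm, X, y, minority_label=1):
    """Place the decision cut between the two classes' mean decision
    scores, weighted by their spreads (a 1-D Fisher-style compromise)."""
    scores = svm.decision_function(X)
    s_min = scores[y == minority_label]   # minority-class scores
    s_maj = scores[y != minority_label]   # majority-class scores
    # The cut sits closer to the class with the tighter score spread.
    return (s_min.mean() * s_maj.std() + s_maj.mean() * s_min.std()) / (
        s_maj.std() + s_min.std())

def predict_with_bias(svm, X, t, minority_label=1, majority_label=0):
    # Assumes positive decision scores correspond to the minority label.
    scores = svm.decision_function(X)
    return np.where(scores > t, minority_label, majority_label)
```

In this reading, only the threshold changes after training; the support vectors and weights are left untouched, which is what makes the strategy a post-processing step.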
Related Papers:

The class imbalance problem in classification has been recognized as a significant research problem in recent years, and a number of methods have been introduced to improve classification results. Rebalancing class distributions (such as over-sampling or under-sampling of learning datasets) has been popular due to its ease of implementation and relatively good performance. For the Support Vector Machine (SVM) classification algorithm, research efforts have focused on reducing the size of learning sets because of the algorithm's sensitivity to the size of the dataset. In this dissertation, we propose a metaheuristic approach (a Genetic Algorithm) to under-sampling of an imbalanced dataset in the context of an SVM classifier. The goal of this approach is to find an optimal learning set from imbalanced datasets without the empirical studies that are normally required to find an optimal class distribution. Experimental results using real datasets indicate that this metaheuristic under-sampling performs well in rebalancing class distributions. Furthermore, an iterative sampling methodology was used to produce smaller learning sets by removing redundant instances. It incorporates informative and representative under-sampling mechanisms to speed up the learning procedure for imbalanced data learning with an SVM. When compared with existing rebalancing methods and the metaheuristic approach to under-sampling, this iterative methodology not only provides good performance but also enables an SVM classifier to learn from very small learning sets. For large-scale imbalanced datasets, this methodology provides an efficient and effective solution for imbalanced data learning with an SVM.
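The abstract leaves the GA encoding open; one standard setup, sketched below under stated assumptions, uses a boolean chromosome over the majority samples and the g-mean of an SVM trained on the selected subset as fitness. All helper names (gmean, fitness, evolve) and the GA operators are illustrative, not taken from the thesis.

```python
# Minimal sketch of GA-based under-sampling for an SVM. Chromosome = boolean
# mask over majority samples; fitness = g-mean on a validation split.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

def gmean(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sens = tp / max(tp + fn, 1)        # minority-class recall (sensitivity)
    spec = tn / max(tn + fp, 1)        # majority-class recall (specificity)
    return np.sqrt(sens * spec)

def fitness(mask, X_maj, X_min, X_val, y_val):
    if mask.sum() == 0:                # empty subset: invalid individual
        return 0.0
    X = np.vstack([X_maj[mask], X_min])
    y = np.hstack([np.zeros(mask.sum(), int), np.ones(len(X_min), int)])
    clf = SVC(kernel="rbf").fit(X, y)
    return gmean(y_val, clf.predict(X_val))

def evolve(X_maj, X_min, X_val, y_val, pop=20, gens=30, p_keep=0.3):
    masks = rng.random((pop, len(X_maj))) < p_keep          # init population
    for _ in range(gens):
        scores = np.array([fitness(m, X_maj, X_min, X_val, y_val)
                           for m in masks])
        order = np.argsort(scores)[::-1]
        parents = masks[order[:pop // 2]]                   # truncation selection
        cuts = rng.integers(1, len(X_maj), size=pop // 2)   # one-point crossover
        kids = np.array([np.hstack([a[:c], b[c:]])
                         for a, b, c in zip(parents, parents[::-1], cuts)])
        kids ^= rng.random(kids.shape) < 0.01               # bit-flip mutation
        masks = np.vstack([parents, kids])
    return masks[0]   # best mask from the final evaluated generation
```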
A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class, resulting in poor accuracy in the minority class prediction. A class-imbalanced classifier typically modifies a standard classifier by a correction strategy, or by incorporating a new strategy in the training phase to account for differential class sizes. This article reviews and evaluates some of the most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered: three standard classification algorithms, each coupled with an ensemble correction strategy, and one support vector machine (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (iii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte-Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform the best when the class imbalance is not too severe. The SVM-THR performs well if the imbalance is severe and predictors are highly correlated. The DLDA with a feature selection can perform well without using the ensemble correction.
Topics: Quadratic classifier, Margin classifier, Linear classifier, Ensemble Learning
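SVM threshold adjustment (SVM-THR) boils down to moving the cut on the decision scores after training. A minimal sketch, assuming the threshold is tuned to maximize g-mean on held-out data (the review's exact tuning criterion may differ, and best_threshold is a hypothetical helper):

```python
# Sketch of SVM threshold adjustment (SVM-THR): scan candidate cuts on
# held-out decision scores and keep the one maximizing g-mean.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def best_threshold(svm, X_val, y_val):
    scores = svm.decision_function(X_val)
    best_t, best_g = 0.0, -1.0
    for t in np.unique(scores):            # candidate cuts between scores
        y_pred = (scores > t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()
        g = np.sqrt((tp / max(tp + fn, 1)) * (tn / max(tn + fp, 1)))
        if g > best_g:
            best_t, best_g = t, g
    return best_t
```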
Data mining classification methods are affected when the data is imbalanced, that is, when one class is much larger than the other in a two-class problem. Many new methods have been developed to handle imbalanced datasets. For binary classification tasks, the Support Vector Machine (SVM) is one of the methods reported to give high accuracy in predictive modeling compared to techniques such as Logistic Regression and Discriminant Analysis. The strength of the SVM is the robustness of its algorithm and its ability to integrate with kernel-based learning, which results in a more flexible analysis and an optimized solution. Another popular way to handle imbalanced data is random sampling, such as random under-sampling, random over-sampling and synthetic sampling. Applying Nearest Neighbours techniques within the sampling approach is seen as advantageous compared to other methods, as it can handle both structured and non-structured data. Some studies implement an ensemble of both SVM and Nearest Neighbours with good results. This paper discusses the various methods for handling imbalanced data and illustrates the use of SVM and k-Nearest Neighbours (k-NN) on a real data set.
Topics: Oversampling, Robustness, Boosting, Binary classification, Kernel (algebra)
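As a concrete illustration of the sampling recipe described above, the sketch below pairs plain random under-sampling with SVM and k-NN classifiers; random_undersample is an illustrative helper, and the data variables are placeholders rather than the paper's experimental setup.

```python
# Illustrative random under-sampling followed by SVM and k-NN.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def random_undersample(X, y, majority=0, seed=0):
    """Keep all minority samples and an equal-size random majority subset."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]

# X_train, y_train, X_test are assumed to exist (placeholders):
# Xb, yb = random_undersample(X_train, y_train)
# for clf in (SVC(kernel="rbf"), KNeighborsClassifier(n_neighbors=5)):
#     clf.fit(Xb, yb)
#     print(type(clf).__name__, clf.predict(X_test)[:10])
```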
During the process of knowledge discovery in data, imbalanced learning data often emerges and presents a significant challenge for data mining methods. In this paper, we investigate the influence of class-imbalanced data on the classification results of artificial intelligence methods, i.e., neural networks and support vector machines, and on the results of classical classification methods, represented by RIPPER and the Naive Bayes classifier. All experiments are conducted on 30 different imbalanced datasets obtained from the KEEL (Knowledge Extraction based on Evolutionary Learning) repository. To measure the quality of classification, accuracy and the area under the ROC curve (AUC) are used. The results indicate that the neural network and the support vector machine improve on the AUC measure when applied to balanced data but, at the same time, deteriorate in terms of classification accuracy. RIPPER behaves similarly, though the changes are of smaller magnitude, while the Naive Bayes classifier deteriorates overall on balanced distributions. The number of instances in the presented highly imbalanced datasets has a significant additional impact on the classification performance of the SVM classifier. The results show the potential of the SVM classifier for ensemble creation on imbalanced datasets.
Topics: Margin classifier, Quadratic classifier
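The accuracy/AUC trade-off reported above is easy to reproduce: on skewed data, always predicting the majority class yields high accuracy while the ranking quality stays at chance. A small synthetic demonstration (the data is invented for this example):

```python
# Why the study tracks both accuracy and AUC: a majority-vote model can
# score high accuracy on skewed data while its scores are chance-level.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.05).astype(int)   # 5% minority class
scores = rng.random(1000)                        # uninformative scores
y_pred = np.zeros(1000, int)                     # always predict majority

print(accuracy_score(y_true, y_pred))            # ~0.95, looks great
print(roc_auc_score(y_true, scores))             # ~0.5, chance level
```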
In conventional classes with many students, learning materials are often not well absorbed, and consequently student learning outcomes fall short of the maximum. The success rate of students should therefore be predicted as early as possible in order to reduce the impact of the problem. The prediction is done using various data mining methods based on the classification pattern of the dataset to be processed. In practice, these datasets often have an unbalanced class distribution, which can be a serious constraint for many classification algorithms. This study therefore discusses handling dataset imbalance with a combination of the SMOTE and OSS methods. SMOTE and OSS work by balancing the class distribution of the dataset, which increases the g-mean value achieved by various classification algorithms. In the experiments, the classification algorithms used are k-NN, Naïve Bayes, and SVM. The test results show that the combination of SMOTE and OSS increases the g-mean value of the k-NN algorithm from 85.519% to 89.367%, of the Naïve Bayes algorithm from 82.482% to 85.416%, and of the SVM algorithm from 85.829% to 96.503%. This suggests that the combination of the SMOTE and OSS methods can be a solution for addressing unbalanced class distributions in data mining processes.
Topics: Resampling
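A minimal sketch of the SMOTE + OSS combination, written against the imbalanced-learn API and assuming SMOTE runs first with One-Sided Selection cleaning afterwards; the study's exact ordering and parameters are not stated, and smote_oss_svm is an illustrative helper.

```python
# Sketch: SMOTE over-sampling, then One-Sided Selection cleaning, then an
# SVM scored by g-mean (the study's exact pipeline may differ).
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import OneSidedSelection
from imblearn.metrics import geometric_mean_score
from sklearn.svm import SVC

def smote_oss_svm(X_train, y_train, X_test, y_test, seed=0):
    X_s, y_s = SMOTE(random_state=seed).fit_resample(X_train, y_train)
    X_r, y_r = OneSidedSelection(random_state=seed).fit_resample(X_s, y_s)
    clf = SVC(kernel="rbf").fit(X_r, y_r)
    return geometric_mean_score(y_test, clf.predict(X_test))
```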
It is shown that imbalanced datasets can pose serious problems for many real-world classification tasks when a support vector machine is used as the learning machine. To solve this problem, we propose a modified method based on biased empirical feature mapping. In the new method, biased discriminant analysis is applied to push all majority samples away from the center of the minority samples in the empirical feature space, so that the classifier's generalization ability for minority samples is improved. Through theoretical analysis and an empirical study on synthetic and UCI datasets, we show that our method effectively improves classification accuracy.
Topics: Feature vector, Empirical Research, Feature (linguistics)
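The biased discriminant criterion alluded to here is commonly written as a ratio of scatters around the minority-class mean; the formulation below is one common form and may differ in detail from the paper's empirical-feature-space version.

```latex
% Common form of the biased discriminant criterion (an assumption; the
% paper works in empirical feature space and may use a variant).
% x_i^- : majority samples, x_j^+ : minority samples, m^+ : minority mean.
\[
  W^{*} = \arg\max_{W}
  \frac{\operatorname{tr}\left( W^{\top} S_{\mathrm{maj}} W \right)}
       {\operatorname{tr}\left( W^{\top} S_{\mathrm{min}} W \right)},
  \qquad
  S_{\mathrm{maj}} = \sum_{i} (x_i^{-} - m^{+})(x_i^{-} - m^{+})^{\top},
  \quad
  S_{\mathrm{min}} = \sum_{j} (x_j^{+} - m^{+})(x_j^{+} - m^{+})^{\top}.
\]
```

Maximizing this ratio pushes the majority samples away from the minority center relative to the minority scatter, which matches the behavior the abstract describes.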