A post-processing strategy for SVM learning from unbalanced data
Abstract:
Standard learning algorithms may perform poorly when learning from unbalanced datasets. Based on Fisher's discriminant analysis, a post-processing strategy is introduced to deal with datasets that exhibit significant imbalance in the data distribution. A new bias is defined, which reduces the skew towards the minority class. Empirical results from experiments with a learned SVM model on twelve UCI datasets indicate that the proposed solution improves on the original SVM, and also on the results reported for the z-SVM, in terms of g-mean and sensitivity.
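The abstract does not spell out the new bias, but the Fisher connection suggests re-thresholding the trained SVM's decision scores. Below is a hypothetical sketch of such a post-processing step, not the paper's actual formula: the helper names (fisher_bias, predict_with_bias) and the spread-weighted cut rule are assumptions.

```python
# Hypothetical sketch of a Fisher-style post-processed SVM threshold.
# The weighting rule below is an assumption, not the paper's exact formula.
import numpy as np
from sklearn.svm import SVC

def fisher_bias(svm, X, y, minority_label=1):
    """Place the decision cut between the two classes' mean decision
    scores, weighted by their spreads (a 1-D Fisher-style compromise)."""
    scores = svm.decision_function(X)
    s_min = scores[y == minority_label]   # minority-class scores
    s_maj = scores[y != minority_label]   # majority-class scores
    # The cut sits closer to the class with the tighter score spread.
    return (s_min.mean() * s_maj.std() + s_maj.mean() * s_min.std()) / (
        s_maj.std() + s_min.std())

def predict_with_bias(svm, X, t, minority_label=1, majority_label=0):
    # Assumes positive decision scores correspond to the minority label.
    scores = svm.decision_function(X)
    return np.where(scores > t, minority_label, majority_label)
```

In this reading, only the threshold changes after training; the support vectors and weights are left untouched, which is what makes the strategy a post-processing step.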
Related Papers:

The class imbalance problem in classification has been recognized as a significant research problem in recent years, and a number of methods have been introduced to improve classification results. Rebalancing class distributions (such as over-sampling or under-sampling of learning datasets) has been popular due to its ease of implementation and relatively good performance. For the Support Vector Machine (SVM) classification algorithm, research efforts have focused on reducing the size of learning sets because of the algorithm's sensitivity to the size of the dataset. In this dissertation, we propose a metaheuristic approach (a Genetic Algorithm) to under-sampling of an imbalanced dataset in the context of an SVM classifier. The goal of this approach is to find an optimal learning set from imbalanced datasets without the empirical studies that are normally required to find an optimal class distribution. Experimental results using real datasets indicate that this metaheuristic under-sampling performs well in rebalancing class distributions. Furthermore, an iterative sampling methodology was used to produce smaller learning sets by removing redundant instances. It incorporates informative and representative under-sampling mechanisms to speed up the learning procedure for imbalanced data learning with an SVM. When compared with existing rebalancing methods and the metaheuristic approach to under-sampling, this iterative methodology not only provides good performance but also enables an SVM classifier to learn from very small learning sets. For large-scale imbalanced datasets, this methodology provides an efficient and effective solution for imbalanced data learning with an SVM.
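The abstract leaves the GA encoding open; one standard setup, sketched below under stated assumptions, uses a boolean chromosome over the majority samples and the g-mean of an SVM trained on the selected subset as fitness. All helper names (gmean, fitness, evolve) and the GA operators are illustrative, not taken from the thesis.

```python
# Minimal sketch of GA-based under-sampling for an SVM. Chromosome = boolean
# mask over majority samples; fitness = g-mean on a validation split.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

def gmean(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sens = tp / max(tp + fn, 1)        # minority-class recall (sensitivity)
    spec = tn / max(tn + fp, 1)        # majority-class recall (specificity)
    return np.sqrt(sens * spec)

def fitness(mask, X_maj, X_min, X_val, y_val):
    if mask.sum() == 0:                # empty subset: invalid individual
        return 0.0
    X = np.vstack([X_maj[mask], X_min])
    y = np.hstack([np.zeros(mask.sum(), int), np.ones(len(X_min), int)])
    clf = SVC(kernel="rbf").fit(X, y)
    return gmean(y_val, clf.predict(X_val))

def evolve(X_maj, X_min, X_val, y_val, pop=20, gens=30, p_keep=0.3):
    masks = rng.random((pop, len(X_maj))) < p_keep          # init population
    for _ in range(gens):
        scores = np.array([fitness(m, X_maj, X_min, X_val, y_val)
                           for m in masks])
        order = np.argsort(scores)[::-1]
        parents = masks[order[:pop // 2]]                   # truncation selection
        cuts = rng.integers(1, len(X_maj), size=pop // 2)   # one-point crossover
        kids = np.array([np.hstack([a[:c], b[c:]])
                         for a, b, c in zip(parents, parents[::-1], cuts)])
        kids ^= rng.random(kids.shape) < 0.01               # bit-flip mutation
        masks = np.vstack([parents, kids])
    return masks[0]   # best mask from the final evaluated generation
```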
A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class, resulting in poor accuracy in the minority class prediction. A class-imbalanced classifier typically modifies a standard classifier by a correction strategy, or by incorporating a new strategy in the training phase to account for differential class sizes. This article reviews and evaluates some of the most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered: three standard classification algorithms, each coupled with an ensemble correction strategy, and one support vector machine (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (iii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte-Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform the best when the class imbalance is not too severe. The SVM-THR performs well if the imbalance is severe and predictors are highly correlated. The DLDA with a feature selection can perform well without using the ensemble correction.
Topics: Quadratic classifier, Margin classifier, Linear classifier, Ensemble Learning
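SVM threshold adjustment (SVM-THR) boils down to moving the cut on the decision scores after training. A minimal sketch, assuming the threshold is tuned to maximize g-mean on held-out data (the review's exact tuning criterion may differ, and best_threshold is a hypothetical helper):

```python
# Sketch of SVM threshold adjustment (SVM-THR): scan candidate cuts on
# held-out decision scores and keep the one maximizing g-mean.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def best_threshold(svm, X_val, y_val):
    scores = svm.decision_function(X_val)
    best_t, best_g = 0.0, -1.0
    for t in np.unique(scores):            # candidate cuts between scores
        y_pred = (scores > t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()
        g = np.sqrt((tp / max(tp + fn, 1)) * (tn / max(tn + fp, 1)))
        if g > best_g:
            best_t, best_g = t, g
    return best_t
```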
Data mining classification methods are affected when the data is imbalanced, that is, when one class is much larger than the other in a two-class problem. Many new methods have been developed to handle imbalanced datasets. For binary classification tasks, the Support Vector Machine (SVM) is one of the methods reported to give high accuracy in predictive modeling compared to techniques such as Logistic Regression and Discriminant Analysis. The strength of the SVM is the robustness of its algorithm and its ability to integrate with kernel-based learning, which results in a more flexible analysis and an optimized solution. Another popular way to handle imbalanced data is random sampling, such as random under-sampling, random over-sampling and synthetic sampling. Applying Nearest Neighbours techniques within the sampling approach is seen as advantageous compared to other methods, as it can handle both structured and non-structured data. Some studies implement an ensemble of both SVM and Nearest Neighbours with good results. This paper discusses the various methods for handling imbalanced data and illustrates the use of SVM and k-Nearest Neighbours (k-NN) on a real data set.
Topics: Oversampling, Robustness, Boosting, Binary classification, Kernel (algebra)
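As a concrete illustration of the sampling recipe described above, the sketch below pairs plain random under-sampling with SVM and k-NN classifiers; random_undersample is an illustrative helper, and the data variables are placeholders rather than the paper's experimental setup.

```python
# Illustrative random under-sampling followed by SVM and k-NN.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def random_undersample(X, y, majority=0, seed=0):
    """Keep all minority samples and an equal-size random majority subset."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]

# X_train, y_train, X_test are assumed to exist (placeholders):
# Xb, yb = random_undersample(X_train, y_train)
# for clf in (SVC(kernel="rbf"), KNeighborsClassifier(n_neighbors=5)):
#     clf.fit(Xb, yb)
#     print(type(clf).__name__, clf.predict(X_test)[:10])
```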
During the process of knowledge discovery in data, imbalanced learning data often emerges and presents a significant challenge for data mining methods. In this paper, we investigate the influence of class-imbalanced data on the classification results of artificial intelligence methods, i.e., neural networks and support vector machines, and on the results of classical classification methods, represented by RIPPER and the Naive Bayes classifier. All experiments are conducted on 30 different imbalanced datasets obtained from the KEEL (Knowledge Extraction based on Evolutionary Learning) repository. To measure the quality of classification, accuracy and the area under the ROC curve (AUC) are used. The results indicate that the neural network and the support vector machine improve on the AUC measure when applied to balanced data but, at the same time, deteriorate in terms of classification accuracy. RIPPER behaves similarly, though the changes are of smaller magnitude, while the Naive Bayes classifier deteriorates overall on balanced distributions. The number of instances in the presented highly imbalanced datasets has a significant additional impact on the classification performance of the SVM classifier. The results show the potential of the SVM classifier for ensemble creation on imbalanced datasets.
Topics: Margin classifier, Quadratic classifier
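The accuracy/AUC trade-off reported above is easy to reproduce: on skewed data, always predicting the majority class yields high accuracy while the ranking quality stays at chance. A small synthetic demonstration (the data is invented for this example):

```python
# Why the study tracks both accuracy and AUC: a majority-vote model can
# score high accuracy on skewed data while its scores are chance-level.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.05).astype(int)   # 5% minority class
scores = rng.random(1000)                        # uninformative scores
y_pred = np.zeros(1000, int)                     # always predict majority

print(accuracy_score(y_true, y_pred))            # ~0.95, looks great
print(roc_auc_score(y_true, scores))             # ~0.5, chance level
```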
In conventional classes with many students, learning materials are often not well absorbed, and consequently student learning outcomes fall short of the maximum. The success rate of students should therefore be predicted as early as possible in order to reduce the impact of the problem. The prediction is done using various data mining methods based on the classification pattern of the dataset to be processed. In practice, these datasets often have an unbalanced class distribution, which can be a serious constraint for many classification algorithms. This study therefore discusses handling dataset imbalance with a combination of the SMOTE and OSS methods. SMOTE and OSS work by balancing the class distribution of the dataset, which increases the g-mean value achieved by various classification algorithms. In the experiments, the classification algorithms used are k-NN, Naïve Bayes, and SVM. The test results show that the combination of SMOTE and OSS increases the g-mean value of the k-NN algorithm from 85.519% to 89.367%, of the Naïve Bayes algorithm from 82.482% to 85.416%, and of the SVM algorithm from 85.829% to 96.503%. This suggests that the combination of the SMOTE and OSS methods can be a solution for addressing unbalanced class distributions in data mining processes.
Topics: Resampling
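A minimal sketch of the SMOTE + OSS combination, written against the imbalanced-learn API and assuming SMOTE runs first with One-Sided Selection cleaning afterwards; the study's exact ordering and parameters are not stated, and smote_oss_svm is an illustrative helper.

```python
# Sketch: SMOTE over-sampling, then One-Sided Selection cleaning, then an
# SVM scored by g-mean (the study's exact pipeline may differ).
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import OneSidedSelection
from imblearn.metrics import geometric_mean_score
from sklearn.svm import SVC

def smote_oss_svm(X_train, y_train, X_test, y_test, seed=0):
    X_s, y_s = SMOTE(random_state=seed).fit_resample(X_train, y_train)
    X_r, y_r = OneSidedSelection(random_state=seed).fit_resample(X_s, y_s)
    clf = SVC(kernel="rbf").fit(X_r, y_r)
    return geometric_mean_score(y_test, clf.predict(X_test))
```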
It is shown that imbalanced datasets can pose serious problems for many real-world classification tasks when a support vector machine is used as the learning machine. To solve this problem, we propose a modified method based on biased empirical feature mapping. In the new method, biased discriminant analysis is applied to push all majority samples away from the center of the minority samples in the empirical feature space, so that the classifier's generalization ability for minority samples is improved. Through theoretical analysis and an empirical study on synthetic and UCI datasets, we show that our method effectively improves classification accuracy.
Topics: Feature vector, Empirical Research, Feature (linguistics)
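The biased discriminant criterion alluded to here is commonly written as a ratio of scatters around the minority-class mean; the formulation below is one common form and may differ in detail from the paper's empirical-feature-space version.

```latex
% Common form of the biased discriminant criterion (an assumption; the
% paper works in empirical feature space and may use a variant).
% x_i^- : majority samples, x_j^+ : minority samples, m^+ : minority mean.
\[
  W^{*} = \arg\max_{W}
  \frac{\operatorname{tr}\left( W^{\top} S_{\mathrm{maj}} W \right)}
       {\operatorname{tr}\left( W^{\top} S_{\mathrm{min}} W \right)},
  \qquad
  S_{\mathrm{maj}} = \sum_{i} (x_i^{-} - m^{+})(x_i^{-} - m^{+})^{\top},
  \quad
  S_{\mathrm{min}} = \sum_{j} (x_j^{+} - m^{+})(x_j^{+} - m^{+})^{\top}.
\]
```

Maximizing this ratio pushes the majority samples away from the minority center relative to the minority scatter, which matches the behavior the abstract describes.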