A topological data analysis based classification method for multiple measurements
Abstract:
Machine learning models for repeated measurements are limited. Using topological data analysis (TDA), we present a classifier for repeated measurements that samples from the data space and builds a network graph based on the data topology. When applied to two case studies, its accuracy exceeds that of alternative models, with additional benefits such as reporting data subsets of high purity along with their feature values. For 300 examples of 3 tree species, accuracy reached 80% after 30 datapoints and improved to 90% when sampling was increased to 400 datapoints. Using data from 100 examples of each of 6 point processes, the classifier achieved 96.8% accuracy. In both datasets, the TDA classifier outperformed an alternative model. This algorithm and software can be beneficial for the repeated-measurement data common in the biological sciences, both as an accurate classifier and as a feature selection tool.
Keywords: Topological data analysis
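As a rough illustration of the sampling-and-graph idea described in the abstract (not the authors' actual algorithm or software), the sketch below samples landmark points from the data, labels each landmark node by the majority class and purity of the points it covers, and connects nodes whose member sets overlap. All names and parameters here are hypothetical:

```python
import numpy as np

def build_network_classifier(X, y, n_landmarks=20, radius=1.0, seed=0):
    """Illustrative sketch: sample landmarks from the data space, assign
    points to nearby landmarks, and label each node by majority class
    together with a purity score (fraction of the majority class)."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=n_landmarks, replace=False)]
    members = [np.where(np.linalg.norm(X - l, axis=1) < radius)[0]
               for l in landmarks]
    nodes = []
    for m in members:
        classes, counts = np.unique(y[m], return_counts=True)
        nodes.append((classes[np.argmax(counts)], counts.max() / counts.sum()))
    # Edge between landmarks whose member sets overlap (shared topology).
    edges = {(i, j) for i in range(n_landmarks) for j in range(i + 1, n_landmarks)
             if len(np.intersect1d(members[i], members[j])) > 0}
    return landmarks, nodes, edges

def predict(x, landmarks, nodes):
    """Classify a new measurement by the label of its nearest landmark node."""
    return nodes[np.argmin(np.linalg.norm(landmarks - x, axis=1))][0]
```

The per-node purity score is what would let such a classifier report high-purity data subsets along with their feature values, as the abstract describes.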
Uncertainty-based active learning has been well studied for selecting informative samples to improve classifier performance. One of the simplest strategies is to always query the samples with the largest uncertainties. However, the selected samples may be very similar to each other, adding little information when the classifier is updated; in other words, similar samples should be avoided when training the classifier. This paper addresses the problem with a novel uncertainty-based active learning algorithm that enforces a diversity constraint via sparse selection. First, uncertainty scores of unlabeled samples are obtained from previously trained support vector machine (SVM) classifiers. Sample selection is then cast as a sparse modeling problem, and optimal samples up to a pre-defined batch size are selected for a query. Two approximate approaches are proposed to solve the sparse problem, via greedy search and quadratic programming (QP), respectively. After selection, the SVM classifiers are re-trained on the newly labeled data and evaluated on the test set. Experiments on three image datasets for image classification show that the proposed method outperforms four other methods and achieves promising performance.
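A minimal sketch of batch selection combining uncertainty with diversity, in the spirit of the greedy variant described above (not the paper's exact sparse formulation; the scoring rule and function name are assumptions):

```python
import numpy as np

def select_batch(scores, X, batch_size, lam=1.0):
    """Greedy uncertainty sampling with a diversity penalty: repeatedly
    pick the unlabeled sample maximizing
        uncertainty - lam * (max cosine similarity to already-selected).
    `scores` holds uncertainty values, e.g. inverse |SVM decision value|."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # assumes nonzero rows
    selected = []
    for _ in range(batch_size):
        best, best_val = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            sim = max((float(Xn[i] @ Xn[j]) for j in selected), default=0.0)
            val = scores[i] - lam * sim
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
    return selected
```

With `lam=0` this degenerates to plain top-uncertainty selection, which can pick near-duplicate samples; a positive `lam` forces the batch to spread out.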
We describe new methodology for supervised learning with sparse data, i.e., when the number (d) of input features is (much) larger than the number (n) of training samples. Under the proposed approach, all d input features are split into t subsets, effectively resulting in a larger number (t*n) of labeled training samples in a lower-dimensional input space (of dimensionality d/t). This modified training data is then used to estimate a classifier for making predictions in the lower-dimensional space; here, standard SVM is used for training. During testing (prediction), the group of t predictions made by the SVM classifier must be combined via intelligent post-processing rules in order to make a prediction for a test input in the original d-dimensional space. The novelty of our approach is in the design and empirical validation of these post-processing rules under the Group Learning setting. We demonstrate that such rules effectively reflect general (common-sense) a priori knowledge about the application data. Specifically, we propose two different post-processing schemes and demonstrate their effectiveness in two real-life application domains: handwritten digit recognition and seizure prediction from iEEG signals. The empirical results show superior performance of the Group Learning approach for sparse data under both balanced and unbalanced classification settings.
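The split-and-combine scheme can be sketched as follows, using a trivial nearest-centroid base learner in place of SVM and a plain majority vote as the post-processing rule (the paper's actual rules are more elaborate; all names here are illustrative):

```python
import numpy as np
from collections import Counter

def split_features(X, t):
    """Split d input features into t subsets, turning n samples in d
    dimensions into t*n samples of dimensionality d/t (requires t | d)."""
    return np.split(X, t, axis=1)

class CentroidClassifier:
    """Stand-in base learner for the sketch (the paper uses standard SVM)."""
    def fit(self, X, y):
        self.labels = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.labels])
        return self
    def predict_one(self, x):
        return self.labels[np.argmin(np.linalg.norm(self.centroids - x, axis=1))]

def group_predict(models, x_groups):
    """One possible post-processing rule: majority vote over the t group
    predictions for a test input split the same way as the training data."""
    votes = [m.predict_one(g) for m, g in zip(models, x_groups)]
    return Counter(votes).most_common(1)[0][0]
```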
High-dimensional biomedical data are becoming common in predictive models developed for disease diagnosis and prognosis. Extracting knowledge from high-dimensional data with a large number of features and a small sample size presents intrinsic challenges for classification models. Genetic Algorithms can efficiently search high-dimensional spaces, and multivariate classification methods can evaluate combinations of features for constructing optimized predictive models. This paper proposes a framework for building prediction models for high-dimensional biomedical data, comprising three main phases: a feature filtering phase, which filters out noisy features; a feature selection phase, which uses multivariate machine learning techniques and the Genetic Algorithm to evaluate the filtered features and select the most informative subsets for maximum classification performance; and a predictive modeling phase, in which machine learning algorithms are trained on the selected features to construct a reliable prediction model. Experiments were conducted on four high-dimensional biomedical datasets, including protein and gene expression data. The results revealed optimistic performance for the multivariate selection approaches that utilize classification measurements based on implicit assumptions.
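A toy version of the Genetic Algorithm phase might look like the following: binary masks encode feature subsets, and any wrapper score (e.g. cross-validated accuracy of a multivariate classifier) can serve as the fitness function. This is a generic GA sketch, not the paper's implementation:

```python
import numpy as np

def ga_feature_select(n_features, fitness, pop=20, gens=30, seed=0):
    """Toy genetic algorithm over binary feature masks: tournament
    selection, uniform crossover, and bit-flip mutation. `fitness(mask)`
    scores a feature subset (e.g. cross-validated accuracy)."""
    rng = np.random.default_rng(seed)
    masks = rng.integers(0, 2, size=(pop, n_features))
    for _ in range(gens):
        scores = np.array([fitness(m) for m in masks])
        children = []
        for _ in range(pop):
            a, b = rng.integers(0, pop, 2)          # tournament for parent 1
            p1 = masks[a] if scores[a] >= scores[b] else masks[b]
            a, b = rng.integers(0, pop, 2)          # tournament for parent 2
            p2 = masks[a] if scores[a] >= scores[b] else masks[b]
            cross = rng.integers(0, 2, n_features).astype(bool)
            child = np.where(cross, p1, p2)          # uniform crossover
            flip = rng.random(n_features) < 0.05     # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        masks = np.array(children)
    scores = np.array([fitness(m) for m in masks])
    return masks[np.argmax(scores)]
```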
Estimating the required effort is one of the most important activities in software project management. This work presents an estimation approach based on various feature selection and machine learning techniques for non-quantitative data, carried out in two phases. The first phase selects an optimal feature set from high-dimensional data on past projects, using a quantitative analysis based on Rough Set Theory and Information Gain. The second phase estimates the effort from the optimal feature set obtained in the first phase, applying various Artificial Neural Networks and classification techniques separately. The feature selection process in the first phase uses public-domain data (USP05). The effectiveness of the proposed approach is evaluated with parameters such as Mean Magnitude of Relative Error (MMRE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the confusion matrix. Machine learning methods (Feed Forward neural network, Radial Basis Function network, Functional Link neural network, Levenberg-Marquardt neural network, Naive Bayes classifier, Classification and Regression Tree, and Support Vector classification), combined with various feature selection techniques, are compared with each other to find an optimal pair. Among the neural networks, the Functional Link neural network achieves better results, and among the classification techniques the Naive Bayes classifier performs better for estimation.
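The Information Gain criterion used in the first phase is standard and can be computed directly; the sketch below handles one categorical feature (ranking features by this score and keeping the top ones is the usual filter approach):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(feature) = H(labels) - sum_v p(v) * H(labels | feature = v),
    for one categorical feature over the same set of examples."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond
```

A feature identical to the labels attains the maximum gain H(labels); a constant feature attains zero.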
During the past decade, multiple classifier systems have been developed as a practical and effective solution for a variety of challenging applications, and many techniques and methodologies for combining classifiers have been proposed in the literature. In our work we present a new approach to multiple classifier systems that uses rough sets to construct classifier ensembles; rough set methods provide various useful techniques for data classification.
In the paper, we also present a method for reducing the data set with the use of multiple classifiers. The reduction is performed on attributes and decreases the number of conditional attributes in the decision table, with only a small loss of classification accuracy.
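The rough-set notion behind attribute reduction can be sketched concretely: an attribute is redundant if dropping it leaves the positive region (the set of objects classified consistently by the remaining attributes) unchanged. This greedy sketch illustrates the idea only; it is not the paper's ensemble-based reduction method:

```python
def positive_region_size(rows, attrs, decision):
    """Size of the positive region: objects whose equivalence class under
    the indiscernibility relation on `attrs` is consistent on the decision."""
    classes = {}
    for r in rows:
        classes.setdefault(tuple(r[a] for a in attrs), []).append(r[decision])
    return sum(len(v) for v in classes.values() if len(set(v)) == 1)

def greedy_reduct(rows, attrs, decision):
    """Greedy attribute reduction: drop each conditional attribute whose
    removal leaves the dependency degree (positive region) unchanged."""
    full = positive_region_size(rows, attrs, decision)
    reduct = list(attrs)
    for a in attrs:
        trial = [x for x in reduct if x != a]
        if trial and positive_region_size(rows, trial, decision) == full:
            reduct = trial
    return reduct
```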
To address the inability of traditional classification algorithms to cope with the unbounded size and concept drift of data streams, real-time data stream mining algorithms have emerged. Handling discrete and continuous attributes separately, the proposed Bayesian classifier for data streams compresses the chunks of different time windows: only a few samples are preserved, and simple statistics are kept for the others, so that historical data are used effectively in limited space. Experiments show that the algorithm achieves high classification accuracy and is superior to similar algorithms in classification capability, accuracy, and speed.
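A minimal sketch of the keep-statistics-not-samples idea for discrete attributes: a naive Bayes model that absorbs each time-window chunk as counts and decays old windows so it can follow concept drift. The decay scheme and class names are assumptions, not the paper's algorithm:

```python
from collections import defaultdict

class StreamNB:
    """Per-chunk stream classifier: keeps only class/value counts
    (simple statistics) per time window, decaying history for drift."""
    def __init__(self, decay=0.9):
        self.decay = decay
        self.class_counts = defaultdict(float)
        self.value_counts = defaultdict(float)  # (class, attr_idx, value) -> count

    def update_chunk(self, X, y):
        # Down-weight history before absorbing the new time window.
        for k in self.class_counts:
            self.class_counts[k] *= self.decay
        for k in self.value_counts:
            self.value_counts[k] *= self.decay
        for row, c in zip(X, y):
            self.class_counts[c] += 1
            for i, v in enumerate(row):
                self.value_counts[(c, i, v)] += 1

    def predict(self, row):
        best, best_p = None, -1.0
        total = sum(self.class_counts.values())
        for c, cc in self.class_counts.items():
            p = cc / total
            for i, v in enumerate(row):
                p *= (self.value_counts[(c, i, v)] + 1) / (cc + 2)  # Laplace
            if p > best_p:
                best, best_p = c, p
        return best
```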
Data mining is regarded as one of the ten key techniques for the challenging problems of oil exploration and development. A practical approach for evaluating complex formations using predictive data mining techniques is presented. Both feature selection and parameter optimization are performed with a genetic algorithm, and an unbiased estimate of the generalization error is computed with repeated cross-validation; the final optimal model is selected from the results obtained with multiple learning algorithms. The water-flooded interval in the Lower Kelamayi Reservoir of the Liuzhong area in the Karamay Oilfield was evaluated using eight feature subsets and twelve models obtained from five distinct kinds of classification methods: Decision Tree (DT), Artificial Neural Network, Support Vector Machines (SVM), Bayesian Network, and an Ensemble Learning method. The results show that SVM is superior to the others in prediction accuracy (91.47%) and can be used as the final classification model, while the easily interpretable DT can serve as an assistant model for knowledge discovery. The results suggest that high-quality classification models can be obtained with this data mining approach, effectively improving the precision of well-log interpretation for problems such as identifying oil-bearing formations and lithologic discrimination.
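Repeated cross-validation, used above to get a low-variance estimate of generalization error, is straightforward to sketch; `fit_predict` stands for any of the candidate learners (the helper name is illustrative):

```python
import numpy as np

def repeated_cv_error(fit_predict, X, y, k=5, repeats=10, seed=0):
    """Repeated k-fold cross-validation: reshuffle the data each repeat,
    split into k folds, and average the held-out error over all
    repeats * k folds to reduce the variance of the error estimate."""
    rng = np.random.default_rng(seed)
    n = len(X)
    errors = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        for fold in np.array_split(idx, k):
            mask = np.ones(n, dtype=bool)
            mask[fold] = False
            preds = fit_predict(X[mask], y[mask], X[fold])
            errors.append(float(np.mean(preds != y[fold])))
    return float(np.mean(errors))
```

Averaging over multiple reshuffles is what makes the estimate less dependent on any one fold assignment, which matters when comparing twelve models on the same data.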
In this paper, we propose a novel approach to dynamic feature selection for classifier ensembles. The method selects the best attribute subsets for an individual instance, or a group of instances, of an input dataset; hence, each testing instance is classified using a unique feature subset. The main aim of this paper is to extend a dynamic feature selection method originally proposed for single classifiers so that it can be used in classifier ensembles. To validate the proposed method, an empirical analysis investigates its effectiveness compared to existing ensemble methods, and our findings indicate performance gains over those methods.
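The per-instance idea can be illustrated with a very simple local scoring rule (this is an invented stand-in, not the paper's selection criterion): rank features by how well they separate the classes among the test instance's nearest training neighbours, then classify with only the top-ranked features:

```python
import numpy as np

def dynamic_feature_predict(Xtr, ytr, x, n_features=1, k=5):
    """Sketch of instance-level dynamic feature selection: among the k
    nearest training neighbours of `x`, rank features by a simple local
    discrimination score (between-class mean gap / overall spread), then
    classify `x` by 1-NN using only the top-ranked features."""
    d = Xtr.shape[1]
    nn = np.argsort(np.linalg.norm(Xtr - x, axis=1))[:k]
    Xl, yl = Xtr[nn], ytr[nn]
    scores = np.zeros(d)
    for j in range(d):
        col = Xl[:, j]
        means = [col[yl == c].mean() for c in np.unique(yl)]
        scores[j] = (max(means) - min(means)) / (col.std() + 1e-9)
    feats = np.argsort(scores)[-n_features:]
    i = np.argmin(np.linalg.norm(Xtr[:, feats] - x[feats], axis=1))
    return ytr[i]
```

In an ensemble setting, each member would repeat this selection with its own scoring rule or neighbourhood, and the votes would be combined.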
In the literature, there are studies that consider only one type of measurement, such as Wi-Fi or Bluetooth RSS, but these values alone are not sufficient to overcome the problems of dynamically changing environments. To deal with this, we propose a novel fingerprint database that contains both Wi-Fi and Bluetooth RSS values, in addition to magnetic-field measurements obtained from mobile devices. This study also presents a verification and validation of the RFKON database, determining a suitable machine learning algorithm and comparing the performance of candidate algorithms with and without feature selection. For this purpose, deterministic algorithms such as k-nearest neighbor, Support Vector Machine, and decision tree, and probabilistic algorithms such as Naive Bayes and Bayesian Networks, are tested on this database. Ensemble learning algorithms, namely AdaBoost and Bagging, and feature selection algorithms are additionally applied to improve the performance of the selected classifiers. Finally, the test results are re-evaluated with a multi-criteria optimization technique to find an admissible algorithm in terms of both accuracy and computation time.
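The simplest of the tested classifiers, k-nearest neighbor over fingerprint vectors, can be sketched directly; each database row here is an illustrative RSS vector for a known zone (the RFKON rows would also carry magnetic-field readings):

```python
import numpy as np

def knn_fingerprint(db_rss, db_labels, query_rss, k=3):
    """Fingerprint-based zone classification: match a query measurement
    vector against the database by Euclidean distance and take a
    majority vote among the k nearest fingerprints."""
    dists = np.linalg.norm(db_rss - query_rss, axis=1)
    votes = db_labels[np.argsort(dists)[:k]]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```

In a multi-criteria evaluation like the one described above, such a classifier would be scored on both its accuracy and the cost of the distance computations at query time.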