Feature selection of Complex Power Quality Disturbances and Parameter Optimization of Random Forest
Abstract:
A hybrid method is proposed that couples a modified discrete artificial bee colony algorithm (MDABC) with feature selection for power quality disturbance (PQD) signals and parameter optimization of a random forest (RF). First, time-frequency features of the complex PQD signal are extracted with the S-transform (ST) to form an original feature set. Then, using the default RF parameters, the out-of-bag (OOB) permutation-test value of each feature in the original set is computed as that feature's weight, and the features of the training and validation sets are rearranged in descending order of weight. Finally, taking the generalization error of the RF as the objective function, MDABC optimizes the forest size nTree, the number of input features Ni, and the node feature-subset size q to determine the optimal RF parameters and the optimal feature set. Experiments show that, compared with the RF classifier before optimization, the MDABC-RF classifier achieves higher accuracy in classifying 16 and 19 complex PQD signals, and its operating efficiency is greatly improved.
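The second stage above (OOB permutation values as feature weights, then reordering the feature columns by descending weight) can be sketched as follows. This is a minimal illustration on synthetic data: the dataset, the held-out validation split (used here as a stand-in for true OOB samples), and all parameter values are assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the S-transform time-frequency feature matrix.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# RF with default-style parameters, as in the paper's first stage.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Permutation-test importance on held-out data (a proxy for the paper's
# OOB permutation values), used as the per-feature weight.
imp = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
order = np.argsort(imp.importances_mean)[::-1]  # descending weight

# Rearrange the columns of the training/validation sets accordingly.
X_train_ranked = X_train[:, order]
X_val_ranked = X_val[:, order]
print(order[:3])  # indices of the three highest-weight features
```

With the columns ranked, the MDABC search over (nTree, Ni, q) reduces to choosing how many leading columns Ni to keep, alongside the forest size and node-subset size.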
The military-specialized vocational high school policy was introduced to supply high-quality military human resources; grounded in school-military cooperation, it has the character of military human resource management (Military HRM). This study uses machine learning to empirically analyze the human resource development aspects of the policy, and presents a prediction model for specialist-soldier selection together with its important variables.

To this end, education and career data of about 850 graduates of a Korean military-specialized high school ("School A") were preprocessed, yielding about 50 input variables. With "specialist-soldier selection" as the target variable, the class imbalance of the target was resolved by oversampling before training the prediction models.

To establish an optimal model for predicting specialist-soldier selection, five machine-learning algorithms (Random Forest, XGBoost, LightGBM, SVM, and logistic regression) were applied both to the imbalanced source data and to the oversampled data, training ten models in total. Stratified k-fold cross-validation was performed during training to prevent overfitting and to search for hyperparameters suited to the optimal model.

The models trained with the Random Forest algorithm showed the best predictive performance in every case, on both the source and the oversampled data. By AUC, the Random Forest model trained on the source data (RF) reached about 0.76, while the Random Forest model trained on the oversampled data (RF_over) improved to about 0.85. Evaluating input-variable importance showed that, among the roughly 50 input variables, those related to major-specific expertise, such as "license obtained/not obtained" (면허_취득/미취득) and "major technician certificate" (전공기능사), had the greatest influence on specialist-soldier selection.

Additionally, to check model bias, both models were evaluated on randomly resampled source and oversampled data; the AUC values of both RF and RF_over converged to 0.5. This can be understood as the trained models achieving a substantial level of performance without depending on any particular variable.

These results not only demonstrate the potential of machine-learning research on military-specialized high schools, but also show that factors contributing to the policy's effectiveness can be identified in actual educational settings. They argue for human resource management that strengthens major expertise and education/training to support the smooth selection and supply of specialist soldiers. They also point toward machine-learning-driven insight and data-based, enterprise-wide defense human resource management.
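The modeling pipeline described above (oversample the minority target class, then score a Random Forest with stratified k-fold cross-validated AUC) can be sketched on synthetic data. Everything here is an illustrative stand-in: the data, the naive random-oversampling step, and the parameters are assumptions, not the study's actual data or method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the graduate data (the real study used
# ~850 records and ~50 input variables; class 1 plays "selected").
X, y = make_classification(n_samples=850, n_features=50, n_informative=10,
                           weights=[0.8, 0.2], random_state=0)

# Naive random oversampling of the minority class to balance the target --
# a simple stand-in for whatever oversampling method the study applied.
minority = np.where(y == 1)[0]
extra = np.random.default_rng(0).choice(minority,
                                        size=(y == 0).sum() - minority.size)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# Stratified k-fold cross-validated AUC for the RF model, as in the study.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(RandomForestClassifier(random_state=0),
                      X_bal, y_bal, cv=cv, scoring="roc_auc")
print(round(auc.mean(), 2))
```

Stratifying the folds keeps the class ratio constant across splits, which matters when comparing models trained on the balanced versus the original data.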
Citations (0)
Random Forest (RF) is a powerful supervised learner that has been widely used in applications such as bioinformatics. In this work we propose the guided random forest (GRF) for feature selection. Like the earlier guided regularized random forest (GRRF), GRF is built using the importance scores from an ordinary RF. However, the trees in GRRF are built sequentially, are highly correlated, and do not allow for parallel computing, while the trees in GRF are built independently and can be implemented in parallel. Experiments on 10 high-dimensional gene data sets show that, with a fixed parameter value (without tuning), RF applied to the features selected by GRF outperforms RF applied to all features on 9 data sets, 7 of them with significant differences at the 0.05 level. Both accuracy and interpretability are therefore significantly improved. GRF selects more features than GRRF but leads to better classification accuracy. Note that while the guided random forest here is guided by the importance scores of an ordinary random forest, it can also be guided by other sources such as human insight (by specifying $\lambda_i$). GRF is available in the "RRF" package v1.4 (and later versions), which also includes the regularized random forest methods.
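The evaluation protocol the abstract describes, an ordinary RF supplying importance scores that guide feature selection, followed by an RF fit on the reduced set, can be sketched as below. This is only the guiding idea on synthetic data; true GRF applies the scores inside independent tree construction (via $\lambda_i$), and the threshold used here is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a high-dimensional gene data set.
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=1)

# Stage 1: an ordinary RF supplies the importance scores that "guide" GRF.
guide = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
scores = guide.feature_importances_ / guide.feature_importances_.max()

# Stage 2 (simplified): keep features whose normalized score clears a
# threshold, then fit RF on the reduced set -- the accuracy comparison
# the abstract reports pits this model against RF on all features.
keep = scores > 0.2  # illustrative cut-off, not a GRF parameter
rf_sel = RandomForestClassifier(n_estimators=200, random_state=1)
rf_sel.fit(X[:, keep], y)
print(int(keep.sum()), "features kept of", X.shape[1])
```

Because each guided tree is built from the same fixed score vector, the trees remain independent, which is what permits the parallel construction the abstract contrasts with GRRF.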
Citations (60)
Aims: The aim of this work is to develop an enhanced predictive system for Coronary Heart Disease (CHD).
Study Design: Synthetic Minority Oversampling Technique and Random Forest.
Methodology: The Framingham heart disease dataset, collected from a study in Framingham, Massachusetts, was used; the data were cleaned, normalized, and rebalanced. Classifiers such as random forest, artificial neural network, naïve Bayes, logistic regression, k-nearest neighbor, and support vector machine were used for classification.
Results: Random Forest outperformed the other classifiers with an accuracy of 98%, a sensitivity of 99%, and a precision of 95.8%. Feature selection was employed for better classification, but no significant improvement in classifier performance was recorded. A train-test split also performed better than cross-validation.
Conclusion: Random Forest is recommended for research in the Coronary Heart Disease prediction domain.
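The rebalancing step named in the study design, SMOTE, synthesizes new minority samples by interpolating between a minority point and one of its nearest minority neighbours. A minimal pure-NumPy sketch of that core idea follows; the toy data and parameters are hypothetical stand-ins, not the Framingham data, and real work would normally use a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(X_min, n_new, k=5, rng=rng):
    """Minimal SMOTE sketch: each synthetic point is interpolated between a
    minority sample and one of its k nearest minority neighbours."""
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]      # k nearest, excluding self
    base = rng.integers(0, len(X_min), n_new)   # anchor sample per new point
    nbr = nn[base, rng.integers(0, k, n_new)]   # random neighbour of each anchor
    gap = rng.random((n_new, 1))                # interpolation coefficient
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Toy minority cluster; a hypothetical stand-in for the CHD-positive class.
X_min = rng.normal(loc=5.0, size=(20, 3))
X_syn = smote(X_min, n_new=30)
print(X_syn.shape)
```

Because each synthetic point is a convex combination of two minority points, the new samples stay inside the minority class's region of feature space, unlike naive duplication, which only repeats existing points.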
Citations (2)
Feature selection techniques have been widely applied in bioinformatics, where random forest (RF) is an important one. To demonstrate the advantage of RF, significance analysis of microarrays (SAM) and ReliefF were compared against it. A support vector machine (SVM) was used to test the feature genes selected by the three methods. The comparison shows that the feature genes selected by RF contain more classification information and achieve a higher accuracy rate when applied to classification. As a reliable method, RF should be applied broadly in bioinformatics.
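The comparison protocol above, ranking genes with RF importances and then testing the selected subset with an SVM, can be sketched on synthetic data. The microarray-style matrix, the number of selected features, and all parameters are illustrative assumptions, not the paper's experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a microarray matrix (few samples, many genes).
X, y = make_classification(n_samples=100, n_features=200, n_informative=8,
                           random_state=2)

# RF ranks the "genes"; an SVM then tests the selected subset, mirroring
# the selector-vs-tester split of the comparison protocol.
rf = RandomForestClassifier(n_estimators=300, random_state=2).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:20]  # top-20 features

acc_all = cross_val_score(SVC(), X, y, cv=5).mean()
acc_sel = cross_val_score(SVC(), X[:, top], y, cv=5).mean()
print(round(acc_all, 2), round(acc_sel, 2))
```

Using a classifier different from the selector for the final test helps show that the selected genes carry classification information in themselves, rather than information exploitable only by RF.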
Citations (7)
Ensemble methods such as random forest work well on high-dimensional datasets. However, when the number of features is extremely large compared to the number of samples and the percentage of truly informative features is very small, the performance of a traditional random forest declines significantly. To address this, we develop a novel approach that enhances the traditional random forest by reducing the contribution of trees whose nodes are populated with less informative features. The proposed method selects eligible subsets at each node by weighted random sampling, as opposed to the simple random sampling of a traditional random forest. We refer to this modified algorithm as "Enriched Random Forest". Using several high-dimensional micro-array datasets, we evaluate the performance of our approach in both regression and classification settings. In addition, we demonstrate the effectiveness of balanced leave-one-out cross-validation in reducing computational load and sample size while computing feature weights. Overall, the results indicate that Enriched Random Forest improves the prediction accuracy of the traditional random forest, especially when relevant features are very few.
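The key mechanism, drawing each node's candidate-feature subset by weighted rather than simple random sampling, can be demonstrated in a few lines. The weight values below are hypothetical (the paper derives its weights from the data, e.g. via feature screening scores); this only shows how weighting tilts the node subsets toward informative features.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical feature weights: 3 informative features with large weight,
# 97 noise features with small weight (illustrative values only).
weights = np.array([5.0, 5.0, 5.0] + [0.1] * 97)
p = weights / weights.sum()

def node_subset(mtry, p, rng=rng):
    """Draw the candidate-feature subset for one node by weighted sampling
    without replacement (Enriched RF), not uniform sampling (standard RF)."""
    return rng.choice(len(p), size=mtry, replace=False, p=p)

# Over many simulated nodes, informative features dominate the candidates;
# under uniform sampling they would appear in only ~3% of draws.
draws = np.concatenate([node_subset(10, p) for _ in range(500)])
informative_rate = np.isin(draws, [0, 1, 2]).mean()
print(round(informative_rate, 2))
```

Since splits can only be chosen among the sampled candidates, boosting the sampling probability of informative features directly reduces the number of trees built on pure noise.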
Citations (47)
Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfying, possibly due to its rigid feature selection and the increased correlation between the trees of the forest. We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node when building trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features. We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases. By incorporating the variable importance scores into the random feature selection step, our method can better utilize the more informative features without completely ignoring the less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package "viRandomForests" based on the original R package "randomForest"; it can be freely downloaded from http://zhaocenter.org/software.
Citations (64)
In the paper, we present an empirical evaluation of five feature selection methods: ReliefF, random forest feature selector, sequential forward selection, sequential backward selection, and Gini index. Among the evaluated methods, the random forest f
Citations (44)
This paper presents a new feature selection method based on the changes in out-of-bag (OOB) Cohen kappa values of a random forest (RF) classifier, which was tested on the automatic detection of sleep apnea based on the oxygen saturation signal (SpO2). The feature selection method is based on the RF predictor importance defined as the increase in error when features are permuted. This method is improved by changing the classification error into the Cohen kappa value, by adding an extra factor to avoid correlated features, and by adapting the OOB sample selection to obtain a patient-independent validation. When applying the method to sleep apnea classification, an optimal feature set of 3 parameters was selected out of 286. This was half of the 6 features obtained in our previous study. The feature reduction resulted in an improved interpretability of our model, but also a slight decrease in performance, without affecting the clinical screening performance. Feature selection is an important issue in machine learning and especially biomedical informatics. This new feature selection method introduces interesting improvements to RF feature selection methods, which can lead to a reduced feature set and an improved classifier interpretability.
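The core substitution, scoring permutation importance by the drop in Cohen kappa rather than raw error, can be sketched as follows. This uses a held-out split as a stand-in for the paper's patient-independent OOB selection, omits the correlated-feature factor, and runs on synthetic data rather than SpO2 features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for SpO2-derived features (the paper ranked 286 of them).
X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=4)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=4)

rf = RandomForestClassifier(n_estimators=200, random_state=4).fit(X_tr, y_tr)

# Importance = drop in Cohen kappa (not classification error) when a
# feature is permuted; computed on a held-out split rather than true OOB.
kappa_imp = permutation_importance(
    rf, X_va, y_va, n_repeats=10, random_state=4,
    scoring=make_scorer(cohen_kappa_score))
ranking = np.argsort(kappa_imp.importances_mean)[::-1]
print(ranking[:3])
```

Kappa corrects for chance agreement, so on imbalanced data such as apnea detection it penalizes features whose removal only hurts the majority class less than raw error would.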
Citations (26)