    A Machine Learning-Based QSAR Model for Benzimidazole Derivatives as Corrosion Inhibitors by Incorporating Comprehensive Feature Selection
    Citations: 40 | References: 51 | Related papers: 10
    A set of benzimidazole derivatives previously tested for their tuberculostatic activities was analyzed using the quantitative structure-activity relationship (QSAR) method. The activity contributions of structural and substituent effects were determined from correlation equations derived using a stepwise regression technique. The resulting QSAR showed that the activity contributions of benzimidazoles depend on the size of the substituents at R2, the field effect of the substituents at R1, and the structural parameter IY, which indicates the presence of oxygen between the benzimidazole and benzyl or phenyl groups.
    Benzimidazole
    Citations (5)
    Quantitative structure-activity (property) relationship (QSAR/QSPR) models are typically generated with a single modeling technique using one type of molecular descriptor. Recently, we have begun to explore a combinatorial QSAR approach which employs various combinations of optimization methods and descriptor types and includes rigorous and consistent model validation (Kovatcheva, A.; Golbraikh, A.; Oloff, S.; Xiao, Y.; Zheng, W.; Wolschann, P.; Buchbauer, G.; Tropsha, A. Combinatorial QSAR of Ambergris Fragrance Compounds. J. Chem. Inf. Comput. Sci. 2004, 44, 582-95). Herein, we have applied this approach to a data set of 195 diverse substrates and nonsubstrates of P-glycoprotein (P-gp), which plays a crucial role in drug resistance. Modeling methods included k-nearest neighbors classification, decision tree, binary QSAR, and support vector machines (SVM). Descriptor sets included molecular connectivity indices, atom pair (AP) descriptors, VolSurf descriptors, and Molecular Operating Environment (MOE) descriptors. Each descriptor type was used with every QSAR modeling technique, so in total 16 combinations of techniques and descriptor types were considered. Although all combinations resulted in models with a high correct classification rate for the training set (CCR(train)), not all of them had high classification accuracy for the test set (CCR(test)). Thus, predictive models were generated only for some combinations of methods and descriptor types, and the best models were obtained using SVM classification with either AP or VolSurf descriptors; they were characterized by CCR(train) = 0.94 and 0.88 and CCR(test) = 0.81 and 0.81, respectively. The combinatorial QSAR approach identified models with higher predictive accuracy than those reported previously for the same data set.
We suggest that, in the absence of any universally applicable "one-for-all" QSAR methodology, the combinatorial QSAR approach should become the standard practice in QSPR/QSAR modeling.
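The 4 x 4 grid of modeling techniques and descriptor types described in this abstract can be enumerated in a few lines. The sketch below is illustrative only: the abstract does not define CCR, so the `ccr` helper assumes a common definition (the mean of per-class recall), and the list names `METHODS` and `DESCRIPTORS` are hypothetical labels rather than anything taken from the paper.

```python
from itertools import product

# Labels taken from the abstract; the variable names are hypothetical.
METHODS = ["kNN", "decision tree", "binary QSAR", "SVM"]
DESCRIPTORS = ["molecular connectivity indices", "atom pair", "VolSurf", "MOE"]


def ccr(y_true, y_pred):
    """Correct classification rate, ASSUMED here to be the mean per-class recall."""
    recalls = []
    for label in sorted(set(y_true)):
        # Indices of all samples that truly belong to this class.
        members = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in members if y_pred[i] == label)
        recalls.append(hits / len(members))
    return sum(recalls) / len(recalls)


# Every descriptor type paired with every modeling technique.
combinations = list(product(METHODS, DESCRIPTORS))
```

In a full workflow, each pair in `combinations` would be trained and scored with `ccr` on both the training and the test set, and only pairs that score well on both would be kept as predictive models.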
    P-glycoprotein
    Citations (135)
    Feature selection techniques have been widely applied in bioinformatics, and random forests (RF) is an important one. To demonstrate the advantage of RF, significance analysis of microarrays (SAM) and ReliefF were compared with it. A support vector machine (SVM) was used to test the feature genes selected by the three methods. The comparison shows that the feature genes selected by RF contain more classification information and achieve a higher accuracy rate when applied to classification. As a reliable method, RF should be applied broadly in bioinformatics.
    Feature (linguistics)
    Citations (7)
    Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfying, possibly due to its rigid feature selection and the increased correlations between the trees of the forest. We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node to build up trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features. We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases. By incorporating the variable importance scores into the random feature selection step, our method can better utilize more informative features without completely ignoring less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package "viRandomForests" based on the original R package "randomForest", and it can be freely downloaded from http://zhaocenter.org/software.
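The node-level sampling step described in this abstract can be sketched as a weighted draw without replacement. This is a minimal illustration, not the viRandomForests implementation: the abstract does not say how importance scores are turned into probabilities, so sampling proportional to the raw score is an assumption, and `sample_features_weighted` and `mtry` are hypothetical names.

```python
import random


def sample_features_weighted(importances, mtry, rng):
    """Draw up to mtry distinct feature indices, weighted by importance.

    ASSUMPTION: selection probability is proportional to the (non-negative)
    importance score; a tiny floor keeps zero-importance features drawable.
    """
    candidates = list(range(len(importances)))
    weights = [max(score, 0.0) + 1e-12 for score in importances]
    chosen = []
    for _ in range(min(mtry, len(candidates))):
        # Pick one remaining feature, then remove it so it cannot repeat.
        position = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(position))
        weights.pop(position)
    return chosen
```

A standard random forest corresponds to passing equal importances, in which case every feature is equally likely to be a split candidate. The best split would then be chosen among the returned indices, as the abstract describes.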
    Feature (linguistics)
    Citations (64)
    In the paper, we present an empirical evaluation of five feature selection methods: ReliefF, random forest feature selector, sequential forward selection, sequential backward selection, and Gini index. Among the evaluated methods, the random forest f
    Feature (linguistics)
    Citations (44)
    This paper presents a new feature selection method based on the changes in out-of-bag (OOB) Cohen kappa values of a random forest (RF) classifier, which was tested on the automatic detection of sleep apnea based on the oxygen saturation signal (SpO2). The feature selection method is based on the RF predictor importance, defined as the increase in error when features are permuted. This method is improved by replacing the classification error with the Cohen kappa value, by adding an extra factor to avoid correlated features, and by adapting the OOB sample selection to obtain a patient-independent validation. When applying the method to sleep apnea classification, an optimal feature set of 3 parameters was selected out of 286. This was half of the 6 features that were obtained in our previous study. This feature reduction resulted in improved interpretability of our model, but also a slight decrease in performance, without affecting the clinical screening performance. Feature selection is an important issue in machine learning and especially in biomedical informatics. This new feature selection method introduces interesting improvements to RF feature selection methods, which can lead to a reduced feature set and improved classifier interpretability.
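The core idea of scoring a feature by the drop in Cohen's kappa after permuting it can be sketched as follows. This is a simplified illustration under stated assumptions: it works on any prediction function rather than on the out-of-bag samples of a random forest, and it omits the correlation factor and the patient-independent sample selection mentioned in the abstract. The names `cohen_kappa` and `kappa_drop` are hypothetical.

```python
import random


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; this sketch returns 0 by convention
    return (observed - expected) / (1.0 - expected)


def kappa_drop(predict, X, y, feature, rng):
    """Importance of one feature: kappa before minus kappa after permuting it."""
    baseline = cohen_kappa(y, [predict(row) for row in X])
    column = [row[feature] for row in X]
    rng.shuffle(column)  # break the link between this feature and the labels
    permuted = [row[:feature] + [value] + row[feature + 1:]
                for row, value in zip(X, column)]
    return baseline - cohen_kappa(y, [predict(row) for row in permuted])
```

Here `X` is assumed to be a list of rows, each row a list of feature values. A feature the model ignores yields a drop of exactly zero, while a feature the model relies on yields a positive drop.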
    Interpretability
    Feature (linguistics)
    Statistical classification
    Citations (26)
    High-dimensional data and a large number of redundant features in bioinformatics research have created an urgent need for feature selection. In this paper, a novel random forests-based feature selection method is proposed that adopts the idea of stratifying the feature space and combines generalised sequence backward searching and generalised sequence forward searching strategies. A random forest variable importance score is used to rank features, and different classifiers are used as the feature subset evaluation function. The proposed method is examined on five microarray expression datasets, including leukaemia, prostate, breast, nervous and DLBCL, and the average accuracies of the SVM classifier on these datasets are 100%, 95.24%, 85%, 91.67%, and 91.67%, respectively. The results show that the proposed method could not only improve the classification accuracy but also greatly reduce the computation time of the feature selection process.
    Feature vector
    Feature (linguistics)
    Citations (24)
    Quantitative Structure-Activity Relationship (QSAR) modelling has been applied extensively to predict the toxicity of Disinfection By-Products (DBPs) in drinking water. Among many toxicological properties, the acute and chronic toxicities of DBPs have been widely used in health risk assessment of DBPs. These toxicities are correlated with molecular properties, which are usually correlated with molecular descriptors. The primary goals of this thesis are: 1) to investigate the effects of molecular descriptors (e.g., chlorine number) on molecular properties such as the energy of the lowest unoccupied molecular orbital (ELUMO) via QSAR modelling and analysis; 2) to validate the models by using internal and external cross-validation techniques; 3) to quantify the model uncertainties through Taylor and Monte Carlo simulation. QSAR analysis is one of the most important ways to predict molecular properties such as ELUMO. In this study, the number of chlorine atoms (NCl) and the number of carbon atoms (NC), as well as the energy of the highest occupied molecular orbital (EHOMO), are used as molecular descriptors. There are typically three approaches used in QSAR model development: 1) Linear or Multi-linear Regression (MLR); 2) Partial Least Squares (PLS); and 3) Principal Component Regression (PCR). In QSAR analysis, a critical step is model validation, after the QSAR models are established and before they are applied to toxicity prediction. The DBPs to be studied include five chemical classes: chlorinated alkanes, alkenes, and aromatics. In addition, validated QSARs are developed to describe the toxicity of selected groups (i.e., chloro-alkane and aromatic compounds with a nitro- or cyano group) of DBP chemicals to three types of organisms (e.g., fish, T. pyriformis, and P. phosphoreum) based on experimental toxicity data from the literature.
The results show that: 1) QSAR models built by MLR, PLS, or PCR to predict molecular properties can be used either to select valid data points or to eliminate outliers; 2) the leave-one-out cross-validation procedure by itself is not enough to give a reliable representation of the predictive ability of the QSAR models, whereas leave-many-out/K-fold cross-validation and external validation can be applied together to achieve more reliable results; 3) ELUMO is shown to correlate highly with NCl for several classes of DBPs; and 4) according to uncertainty analysis using the Taylor method, the uncertainty of the QSAR models comes mostly from NCl for all DBP classes.
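The difference between leave-one-out and K-fold cross-validation for a one-descriptor linear QSAR (for example, ELUMO against NCl) can be sketched with a cross-validated Q². This is a minimal sketch under assumptions: the thesis does not give its Q² formula, so the version below uses the predictive residual sum of squares divided by the total sum of squares about the overall mean, and all function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def loo_folds(n):
    """Leave-one-out: each sample is held out once."""
    return [[i] for i in range(n)]


def k_folds(n, k):
    """K-fold (leave-many-out): interleaved held-out groups."""
    return [list(range(start, n, k)) for start in range(k)]


def q2_cv(xs, ys, folds):
    """Cross-validated Q2 = 1 - PRESS / SS_total (ASSUMED definition)."""
    press = 0.0
    for held_out in folds:
        train = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Accumulate squared prediction errors on the held-out samples only.
        press += sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out)
    mean_y = sum(ys) / len(ys)
    return 1.0 - press / sum((y - mean_y) ** 2 for y in ys)
```

Comparing `q2_cv(xs, ys, loo_folds(n))` with `q2_cv(xs, ys, k_folds(n, k))` on the same data gives a quick view of how much the leave-one-out estimate depends on holding out only a single compound at a time.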
    Molecular descriptor
    HOMO/LUMO
    Quantum chemical
    Citations (1)