Machine Learning in Prediction of Intrinsic Aqueous Solubility of Drug-like Compounds: Generalization, Complexity or Predictive Ability?

2020 
Here, we present a collection of publicly available intrinsic aqueous solubility data for 829 drug-like compounds. Four different machine learning algorithms (random forest, LightGBM, partial least squares, and LASSO), coupled with multi-stage permutation importance for feature selection and Bayesian hyperparameter optimization, were employed to predict solubility from chemical structural information. Our results show that LASSO yielded the best predictive ability on an external test set, with an RMSE(test) of 0.70 log points and 105 features in the model. Taking the number of descriptors into account as well, a random forest (RF) model achieved the best balance between complexity and predictive ability, with an RMSE(test) of 0.72 using only 17 features. We propose a ranking score for choosing the best model, as test set performance is only one of the factors in creating an applicable model. The ranking score is a weighted combination of generalization, number of features involved, and test set performance. The data related to this paper can be downloaded from 10.5281/zenodo.3968754
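The abstract does not give the formula for the ranking score, so the following is only a minimal sketch of how a weighted combination of generalization, feature count, and test set performance could be computed. It assumes that generalization is measured as the absolute gap between training and test RMSE, that each criterion is min-max scaled across the candidate models, and that lower scores are better; none of these choices, nor the equal default weights, are confirmed by the source. The names `Candidate`, `min_max`, and `ranking_scores` are hypothetical, and the training RMSE values are placeholders invented for illustration. Only the test RMSE values and feature counts come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    rmse_train: float  # HYPOTHETICAL placeholder, not reported in the abstract
    rmse_test: float   # from the abstract
    n_features: int    # from the abstract


def min_max(values):
    """Scale values to [0, 1]; if all values are equal, return zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ranking_scores(candidates, w_gen=1 / 3, w_feat=1 / 3, w_test=1 / 3):
    """Weighted sum of three scaled criteria (assumed form; lower is better)."""
    # Generalization is assumed here to be the train/test RMSE gap.
    gaps = min_max([abs(c.rmse_test - c.rmse_train) for c in candidates])
    feats = min_max([c.n_features for c in candidates])
    tests = min_max([c.rmse_test for c in candidates])
    return {
        c.name: w_gen * g + w_feat * f + w_test * t
        for c, g, f, t in zip(candidates, gaps, feats, tests)
    }


candidates = [
    # Training RMSE values below are illustrative assumptions only.
    Candidate("LASSO", rmse_train=0.60, rmse_test=0.70, n_features=105),
    Candidate("RF", rmse_train=0.65, rmse_test=0.72, n_features=17),
]

scores = ranking_scores(candidates)
best = min(scores, key=scores.get)
print(scores, best)

# Weighting test performance alone changes which model ranks first.
test_only = ranking_scores(candidates, w_gen=0.0, w_feat=0.0, w_test=1.0)
best_test_only = min(test_only, key=test_only.get)
print(test_only, best_test_only)
```

Under these assumed inputs and equal weights, the model with fewer features ranks first, while weighting test RMSE alone favors the model with the lowest test error. This mirrors the trade-off described in the abstract, although the actual weights and scaling used by the authors may differ.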