Improved data sets and evaluation methods for the automatic prediction of DNA-binding proteins

2021 
MotivationAccurate automatic annotation of protein function relies on both innovative models and robust datasets. Due to their importance in biological processes, the identification of DNA-binding proteins directly from protein sequence has been the focus of many studies. However, the data sets used to train and evaluate these methods have suffered from substantial flaws. We describe some of the weaknesses of the data sets used in previous DNA-binding protein literature and provide several new data sets addressing these problems. We suggest new evaluative benchmark tasks that more realistically assess real-world performance for protein annotation models. We propose a simple new model for the prediction of DNA-binding proteins and compare its performance on the improved data sets to two previously published models. Additionally, we provide extensive tests showing how the best models predict across taxonomies. ResultsOur new gradient boosting model, which uses features derived from a published protein language model, outperforms the earlier models. Perhaps surprisingly, so does a baseline nearest neighbor model using BLAST percent identity. We evaluate the sensitivity of these models to perturbations of DNA-binding regions and control regions of protein sequences. The successful data-driven models learn to focus on DNA-binding regions. When predicting across taxonomies, the best models are highly accurate across species in the same kingdom and can provide some information when predicting across kingdoms. Code and Data AvailabilityAll the code and data for this paper can be found at https://github.com/AZaitzeff/tools_for_dna_binding_proteins. Contactalexander.zaitzeff@twosixtech.com
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    44
    References
    0
    Citations
    NaN
    KQI
    []