DSResSol: A sequence-based solubility predictor created with Dilated Squeeze Excitation Residual Networks

2021 
ABSTRACT Protein solubility is an important thermodynamic parameter that is critical for characterizing a protein’s function and a key determinant of production yield in both research and industrial (e.g., pharmaceutical) settings. A highly accurate in silico tool for predicting protein solubility from protein sequence is therefore sought. In this study, we developed DSResSol, a deep learning sequence-based solubility predictor that integrates squeeze excitation residual networks with dilated convolutional neural networks. The model captures frequently occurring amino acid k-mers and their local and global interactions, and highlights the importance of long-range interaction information between amino acid k-mers for achieving higher performance than existing deep learning-based models. DSResSol takes protein sequence as input and outperforms all available sequence-based solubility predictors by at least 5% in accuracy when evaluated on two independent test sets. Compared to existing predictors, DSResSol not only reduces prediction bias for insoluble proteins but also predicts soluble proteins within the test sets with an accuracy that is at least 13% higher. We derive the key amino acids, dipeptides, and tripeptides contributing to protein solubility, identifying glutamic acid and serine as critical amino acids for protein solubility prediction. Overall, DSResSol can be used for fast, reliable, and inexpensive prediction of a protein’s solubility to guide experimental design.

Availability: The source code, datasets, and web server for this model are available at https://tgs.uconn.edu/dsres_sol
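
The two building blocks named in the abstract, dilated convolution and squeeze excitation gating inside a residual connection, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the per-channel (depthwise) convolution, the kernel values, and the weight matrices `w1` and `w2` are all assumptions chosen to keep the example small. It shows only how a dilated kernel reads positions that are far apart in the sequence and how a channel-wise gate rescales feature maps before they are added back to the input.

```python
import math

def dilated_conv1d(x, kernel, dilation):
    """Zero-padded 1D convolution over one channel; taps are `dilation` apart.

    Assumes an odd-length kernel so the output has the same length as x.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # input position read by this tap
            if 0 <= j < len(x):            # positions outside the sequence count as zero
                acc += w * x[j]
        out.append(acc)
    return out

def squeeze_excite(channels, w1, w2):
    """Rescale each channel by a gate computed from global channel averages.

    channels: C lists of length L; w1: R x C reduction weights;
    w2: C x R expansion weights (all hypothetical values here).
    """
    squeezed = [sum(ch) / len(ch) for ch in channels]  # squeeze: global average pool
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]  # ReLU
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]                            # sigmoid gate per channel
    return [[g * v for v in ch] for g, ch in zip(gates, channels)]

def se_residual_block(channels, kernel, dilation, w1, w2):
    """Dilated convolution, then squeeze excitation gating, then skip connection."""
    convolved = [dilated_conv1d(ch, kernel, dilation) for ch in channels]
    gated = squeeze_excite(convolved, w1, w2)
    return [[a + b for a, b in zip(ch_in, ch_out)]
            for ch_in, ch_out in zip(channels, gated)]

# A size-3 kernel with dilation 2 reads positions i-2, i, and i+2.
features = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dilated_conv1d(features, [1.0, 0.0, 1.0], dilation=2))
```

With a kernel of size 3 and dilation d, a single layer spans 2d + 1 sequence positions, so stacking layers with increasing dilation widens the receptive field without adding weights per layer. This is one common way convolutional models reach long-range context; the dilation rates and layer counts used by DSResSol are not stated in the abstract.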