Deep learning-based prediction of protein structure using learned representations of multiple sequence alignments

2020 
The use of amino acid covariation and other sequence-based features as inputs to deep learning-based predictors of contacts and distances in proteins is now commonplace. The prediction process usually begins by constructing a multiple sequence alignment (MSA) containing homologues of the target protein. The most successful approaches combine large feature sets derived from MSAs, so considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, making the predictor faster to run and easier to install and use. Our approach constructs a directly learned representation of all the sequences in an MSA, starting from a one-hot encoding of the sequences. The learned representation is then used as the input to a ResNet, now the standard deep architecture for contact and distance prediction. When supplemented with a fast approximation of a precision matrix, the learned representation can be used to produce distance predictions with accuracy comparable to or greater than that of our original DMPfold method. Constructing representations of complete MSAs also opens up ways of deriving other informative properties, such as predictions of likely eventual model accuracy made solely from the MSA, as well as a complete end-to-end method for predicting α-carbon coordinates, again from the MSA alone. Our methods will be made available on GitHub under a permissive license, as part of an upcoming new version of DMPfold.
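The starting point named in the abstract, a one-hot encoding of the aligned sequences, can be illustrated with a minimal sketch. This is not the authors' code: the 21-symbol alphabet (20 standard amino acids plus a gap), the function name `one_hot_msa`, and the choice to map unknown residues to the gap channel are all assumptions made for illustration.

```python
# Hypothetical alphabet: 20 standard amino acids plus a gap symbol.
# The paper's actual encoding may use a different ordering or extra channels.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot_msa(msa):
    """Encode an MSA (a list of equal-length aligned strings) as a nested
    list of shape [n_sequences][n_columns][len(ALPHABET)]."""
    if not msa:
        raise ValueError("MSA must contain at least one sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    encoded = []
    for seq in msa:
        rows = []
        for ch in seq.upper():
            vec = [0] * len(ALPHABET)
            # Assumption: symbols outside the alphabet (e.g. X, B, Z)
            # are placed in the gap channel.
            vec[INDEX.get(ch, INDEX["-"])] = 1
            rows.append(vec)
        encoded.append(rows)
    return encoded


# Toy alignment of three sequences with five columns each.
msa = ["MKV-A", "MRVLA", "MKILG"]
x = one_hot_msa(msa)
```

In a setup like the one described, a tensor of this shape would be passed to the network that learns the MSA representation; how that network consumes it is not specified in the abstract.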