Comprehensive annotations of the mutational spectra of SARS-CoV-2 spike protein: a fast and accurate pipeline.

2020 
To explore nonsynonymous mutations and deletions in the spike (S) protein of SARS-CoV-2, we comprehensively analyzed 35,750 complete S protein gene sequences from six continents and five climate zones, as documented in the GISAID database up to June 24, 2020. Using a custom Python-based mutation-analysis pipeline, we identified 27,801 mutated strains (77.77% of spike sequences) relative to the Wuhan-Hu-1 reference strain. Of these strains, 84.40% carried only a single amino-acid (aa) substitution, whereas an outlier strain from Bosnia and Herzegovina (EPI_ISL_463893) possessed six aa substitutions. As expected, the D614G variant of the major G clade was predominant across all regions and climates. We found 988 unique aa replacements across 660 positions along the S protein, which differed significantly among continents (p = 0.003) and climate zones (p = 0.021) based on the Kruskal-Wallis test. In addition, eleven sites showed high variability in aa frequency, each with four types of aa variation. Furthermore, 17 in-frame deletions in four major regions (three in the N-terminal domain and one just downstream of the RBD) may have an impact on attenuation. The mutational frequency also differed significantly (p = 0.003, Kruskal-Wallis test) among SARS-CoV-2 strains worldwide. This study presents a fast and accurate pipeline for identifying nonsynonymous mutations and deletions from large datasets for any particular protein-coding sequence, with the S protein data as a representative analysis. By using separate multiple sequence alignment with MAFFT, removing ambiguous sequences and in-frame stop codons, and applying pairwise alignment, the method derives nonsynonymous mutations in the form Reference:Position:Strain.
We believe this will aid in the surveillance of any protein encoded by SARS-CoV-2, and will prove crucial in tracking the ever-increasing variation of many other divergent RNA viruses in the future.
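The final step described above, deriving mutations from a pairwise alignment in Reference:Position:Strain form, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `call_mutations`, the colon-separated output format, the use of `-` for gaps and `X` for ambiguous residues, and the toy sequences are all assumptions made for illustration. The sketch also assumes the two amino-acid sequences have already been aligned to equal length (for example, taken from a MAFFT alignment).

```python
def call_mutations(ref_aln, strain_aln):
    """Compare two pre-aligned amino-acid strings of equal length.

    Returns (substitutions, deletions):
      substitutions: list of 'Reference:Position:Strain' strings
      deletions:     list of 1-based reference positions deleted in the strain
    Positions are counted along the ungapped reference sequence.
    """
    if len(ref_aln) != len(strain_aln):
        raise ValueError("aligned sequences must have equal length")

    substitutions = []
    deletions = []
    ref_pos = 0  # 1-based position in the ungapped reference

    for ref_aa, strain_aa in zip(ref_aln, strain_aln):
        if ref_aa == "-":
            # Insertion relative to the reference: no reference position
            # to report, so this sketch simply skips it.
            continue
        ref_pos += 1
        if strain_aa == "-":
            deletions.append(ref_pos)
        elif strain_aa == "X":
            # Ambiguous residue (assumed convention): not called as a mutation.
            continue
        elif strain_aa != ref_aa:
            substitutions.append(f"{ref_aa}:{ref_pos}:{strain_aa}")

    return substitutions, deletions


# Toy example with hypothetical sequences (not real spike data).
subs, dels = call_mutations("MKVDLG", "MKAD-G")
print(subs)  # substitutions in Reference:Position:Strain form
print(dels)  # deleted reference positions
```

Counting positions along the ungapped reference keeps the reported coordinates comparable across strains, which is what allows a call such as D614G to refer to the same site in every sequence.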