Building comprehensive MS-friendly databases for proteomic analysis of bacterial species of unknown genetic background

2018 
In proteogenomic analysis of prokaryotes of unknown genetic background, merging the gene annotations derived from the genomic data of all strains of a given species is a valuable strategy for characterizing the sample. It is also relevant for identifying important amino acid polymorphisms and for validating coding regions. We designed a bioinformatic tool that constructs FASTA databases including both the conserved and the unique sequences of the strains of a given species. Using mass spectrometry data collected from 8 clinical strains of Mycobacterium tuberculosis, we compared the protein identification performance of three sequence databases: one including all proteins from 65 sequenced strains; one built with our tool from the same 65 strains; and one based on the assembly of the model strain H37Rv. Finally, we built databases for 10 species with completely sequenced genomes and monitored features that are relevant for probability-based protein identification in proteomics. As expected, database complexity increased with pangenomic complexity. However, Mycobacterium tuberculosis and Bordetella pertussis generated very complex databases despite having low pangenomic complexity or no pangenome at all, respectively. This indicates that differences in gene annotation between strains of those species are greater than average.
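To make the idea of a merged multi-strain database concrete, the following is a minimal sketch, not the authors' tool. It assumes the simplest possible merging rule: protein sequences that are identical across strains are collapsed into a single entry, and every distinct sequence is kept once. The function names (`parse_fasta`, `merge_strains`, `to_fasta`), the header layout, and the toy sequences are all hypothetical and chosen only for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_strains(strain_fastas):
    """Collapse identical protein sequences across strains.

    strain_fastas maps a strain name to its FASTA text.
    Returns a dict mapping each distinct sequence to the list of
    'strain|header' labels in which it was found.
    Assumption: exact string identity is the only merging criterion.
    """
    merged = {}
    for strain, text in strain_fastas.items():
        for header, seq in parse_fasta(text):
            merged.setdefault(seq, []).append(f"{strain}|{header}")
    return merged


def to_fasta(merged, width=60):
    """Write the merged entries back out as FASTA text (hypothetical header layout)."""
    lines = []
    for i, (seq, sources) in enumerate(merged.items(), 1):
        lines.append(f">entry{i} n_sources={len(sources)} " + ";".join(sources))
        lines.extend(seq[j:j + width] for j in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy input: two strains sharing one identical protein and differing in another.
strains = {
    "strainA": ">p1\nMKTAYIAK\n>p2\nMSLLNV\n",
    "strainB": ">p1\nMKTAYIAK\n>p2\nMSLVNV\n",
}
merged = merge_strains(strains)
print(to_fasta(merged))
```

In this toy case the four input proteins collapse to three entries: the shared sequence is stored once with both strains listed as sources, while the two polymorphic variants are kept separately so that either can be matched by a search engine.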