PaSiT: A novel approach based on short oligo-nucleotide frequencies for efficient bacterial identification and typing

2020 
MOTIVATION: One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short-oligonucleotide-based methods offer a faster alternative, but at the expense of accuracy. Here we aim to address this shortcoming by providing software that implements a novel method based on short oligonucleotide frequencies to compute inter-genomic distances.

RESULTS: Our tetranucleotide and hexanucleotide implementations, which were optimized on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short-oligonucleotide-based method TETRA for identifying bacterial species and as accurate as average nucleotide identity for identifying strains. Moreover, the lightweight nature of this method makes it applicable to large-scale analyses.

AVAILABILITY: The method introduced here was implemented, together with other existing methods, in GenDisCal, a dependency-free software package written in C and available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is also available.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Raw reads for all newly sequenced bacteria are available from the European Nucleotide Archive under accession number PRJEB32402.
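
To make the idea of an alignment-free, frequency-based distance concrete, the following is a minimal sketch of the general technique: count every short oligonucleotide (here tetranucleotides, k = 4) in two sequences, normalize the counts to frequencies, and compare the two frequency vectors. This is not the PaSiT algorithm and not GenDisCal code. The function names `kmer_frequencies` and `frequency_distance` are hypothetical, and the Euclidean distance is an assumption chosen only for illustration; the abstract does not specify the distance the authors use.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=4):
    """Return the normalized frequencies of all 4**k DNA k-mers in seq.

    Hypothetical helper for illustration; words containing characters
    other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide a window of width k over the sequence and count each word.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[w] / total for w in kmers]


def frequency_distance(seq_a, seq_b, k=4):
    """Euclidean distance between two k-mer frequency vectors.

    Assumption: Euclidean distance stands in for whatever distance a
    real tool would use; it is not taken from the source.
    """
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

Because the cost is a single pass over each genome plus a comparison of fixed-length vectors (256 entries for k = 4, 4096 for k = 6), this kind of computation avoids the alignment step that makes average nucleotide identity expensive.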