MeShClust2: Application of alignment-free identity scores in clustering long DNA sequences

Benjamin T James, Hani Z Girgis

2018

Grouping similar sequences into clusters is an important part of sequence analysis. Widely used clustering tools sacrifice quality for speed. Previously, we developed MeShClust, which uses k-mer counts in an alignment-assisted classifier and the mean-shift algorithm to cluster DNA sequences. Although MeShClust outperformed related tools in cluster quality, the alignment algorithm used to generate training data for the classifier did not scale to longer sequences. In contrast, MeShClust2 generates semi-synthetic sequence pairs with known mutation rates, avoiding alignment algorithms altogether. MeShClust2 clustered 3600 bacterial genomes, providing, for the first time, a utility for clustering long sequences using identity scores.
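
The idea of a semi-synthetic pair with a known mutation rate can be illustrated with a minimal Python sketch. This is not the MeShClust2 implementation: the function names (kmer_counts, mutate, positional_identity, count_distance), the substitution-only mutation model, and the Manhattan distance are all assumptions chosen for illustration. The sketch mutates a template at a fixed number of positions, so the identity of the pair is known without running an alignment, and then computes an alignment-free feature from k-mer counts.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mutate(seq, rate, rng):
    """Return a copy of seq with round(rate * len(seq)) substitutions.

    Assumption: substitutions only, at distinct positions, each to a
    different base, so the identity of the pair is known by construction.
    """
    n = round(rate * len(seq))
    out = list(seq)
    for p in rng.sample(range(len(seq)), n):
        out[p] = rng.choice([b for b in "ACGT" if b != out[p]])
    return "".join(out)


def positional_identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_distance(c1, c2):
    """Manhattan distance between two k-mer count vectors (illustrative choice)."""
    return sum(abs(c1[m] - c2[m]) for m in set(c1) | set(c2))


rng = random.Random(0)
template = "".join(rng.choice("ACGT") for _ in range(1000))
mutant = mutate(template, 0.10, rng)  # known mutation rate: 10%

label = positional_identity(template, mutant)  # training label, no alignment needed
feature = count_distance(kmer_counts(template, 3), kmer_counts(mutant, 3))
```

Pairs of (feature, label) generated this way at several mutation rates could serve as training data for a classifier or regressor that predicts identity from k-mer statistics alone. Insertions and deletions are left out of this sketch to keep the label exact.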

Keywords:

Training set
Cluster analysis
Sequence analysis
DNA sequencing
Mutation rate
Bacterial genome size
Biology
Bioinformatics
Scalability
Computational biology
Genetics
Classifier (linguistics)

