FASTCAR: Rapid alignment-free prediction of sequence alignment identity scores

2018 
Pairwise alignment has been the predominant algorithm in bioinformatics since the field began. Several tools have been developed to speed it up using heuristics, but almost all of them still depend on the slow, quadratic-time alignment algorithm. Many applications, such as scanning a database for sequences similar to a query or clustering sequences, use sequence identity scores without the corresponding alignments. For these applications, we propose FASTCAR, the first machine-learning tool that predicts alignment identity scores using completely alignment-free methods. Training data are produced from the input database by a generative method that mutates sequences to produce pairs with known alignment identity scores, thereby bypassing alignment algorithms. We evaluated FASTCAR, USEARCH, and BLAST by using them to scan three large-scale databases consisting of millions of sequences. FASTCAR is up to 100 times faster than USEARCH and BLAST. It has reasonable sensitivity and accuracy while achieving the highest specificity, precision, and F-measure. Identity scores produced by FASTCAR are closer to those of the pure alignment algorithm than the scores produced by USEARCH and BLAST. This is the first time that identity scores can be obtained in linear time and linear space.
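
The generative idea described above, mutating a sequence so that the identity score of the resulting pair is known without running an alignment, can be illustrated with a minimal sketch. This is not FASTCAR's actual mutation procedure: it assumes a DNA alphabet and uses point substitutions only, so the position-wise identity is known by construction. The names `mutate_to_identity` and `make_training_pairs` are hypothetical.

```python
import random

ALPHABET = "ACGT"  # assumption: DNA sequences over this alphabet


def mutate_to_identity(seq, identity, rng):
    """Substitute bases at distinct positions so that the mutant matches
    `seq` at roughly `identity` of its positions.

    Assumption: substitutions only (no insertions or deletions), so the
    position-wise identity is known exactly without any alignment.
    Returns (mutant, achieved_identity).
    """
    n = len(seq)
    n_mut = round(n * (1.0 - identity))
    positions = rng.sample(range(n), n_mut)  # distinct positions to change
    out = list(seq)
    for p in positions:
        # Pick a base that differs from the original one at this position.
        out[p] = rng.choice([b for b in ALPHABET if b != seq[p]])
    return "".join(out), (n - n_mut) / n


def make_training_pairs(seqs, identities, seed=0):
    """Build (original, mutant, known_identity) triples for a regressor."""
    rng = random.Random(seed)
    pairs = []
    for s in seqs:
        for target in identities:
            mutant, achieved = mutate_to_identity(s, target, rng)
            pairs.append((s, mutant, achieved))
    return pairs
```

Each triple pairs two sequences with a label that required no alignment to compute. A model trained on alignment-free features of such pairs could then predict identity scores for unseen pairs.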