Calibrating seed-based heuristics to map short DNA reads

2019 
The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost for accuracy. In the seeding approach, short matches between the reads and the genome are used to narrow the search to a set of candidate locations. Several seeding variants used in modern mappers show good empirical performance but they are difficult to calibrate or to optimize for lack of theoretical results. Here we develop a theory to estimate the probability that reads are mapped to a wrong location due to limitations at the seeding step. We describe the properties of simple exact seeds, skip-seeds and MEM seeds (Maximal Exact Match). The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generative function to estimate the probabilities of interest. We provide several algorithms, which combined together give a practical solution to the problem of calibrating seeding heuristics for short reads. These results can improve current mapping algorithms and lay the foundation of a general strategy to tackle sequence alignment problems.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    39
    References
    1
    Citations
    NaN
    KQI
    []