Faster Computation of Genome Mappability with one Mismatch

2018 
The genome mappability problem refers to cataloging repetitive occurrences of every substring of length m in a genome, and its k-mappability variant extends this to approximate repeats by allowing up to k mismatches. This problem is formulated as follows: Given a sequence S[1, n] of length n over the constant DNA alphabet Σ = {A, C, G, T}, and two integers k and m ≤ n, output an integer array F k , such that: F k [i] = |{j ≠ i|d H (S[i, i + m − 1], S[j, j + m − 1]) ≤ k}| where d H (•,•) represents the hamming distance. Derrien et al. [PLoS one 2012] represented this problem within the framework of genome analysis. In this work we present a provably efficient algorithm for 1-mappability with O(n log n) worst case run time and O(n) spece. The fundamental technique is the heavy path decomposition on the suffix tree (ST) of S, and the entire work is based on the framework by Thankachan et al. [RECOMB 2018]. The previous best known run time is O(n log n log log n) [Alzamel et al., COCOA 2017].
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    3
    References
    1
    Citations
    NaN
    KQI
    []