Increased accuracy and speed in whole genome bisulfite read mapping using a two-letter alphabet

2020 
DNA methylation, characterized by the presence of a methyl group at cytosines in a DNA sequence, is an important epigenomic mark with a wide range of functions across diverse organisms. Whole genome bisulfite sequencing (WGBS) has emerged as the gold standard to interrogate cytosine methylation. Accurately mapping WGBS reads to a reference genome allows reconstruction of tissue methylomes at single-base resolution. Algorithms used to map WGBS reads often encode the four-base DNA alphabet with three letters by reducing two bases to a common letter. We introduce another bisulfite mapping algorithm (abismal), based on the novel idea of encoding a four-letter DNA sequence as two letters, one for purines and one for pyrimidines. We show theoretically that this encoding benefits from higher uniformity and specificity when subsequences are selected from reads for filtration. In our implementation, this leads to a decreased mapping time relative to the three-letter encoding. We demonstrate, using data from multiple public studies, that the abismal software tool improves mapping accuracy at significantly lower mapping times compared to commonly used mappers, with the most notable improvements observed in samples originating from the random priming post-bisulfite adapter tagging protocol.
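The two encodings described in the abstract can be illustrated with a minimal sketch. It is not taken from the abismal source code: the function names, the `R`/`Y` output letters, and the toy sequence are assumptions made for illustration. It shows only that both encodings map a bisulfite-converted read and its unconverted reference to the same string, which is the property a bisulfite mapper needs from its alphabet.

```python
# Illustrative sketch only; names and the toy sequence are hypothetical,
# not part of the abismal implementation.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def two_letter(seq):
    """Encode purines (A, G) as 'R' and pyrimidines (C, T) as 'Y'."""
    out = []
    for base in seq.upper():
        if base in PURINES:
            out.append("R")
        elif base in PYRIMIDINES:
            out.append("Y")
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return "".join(out)


def three_letter(seq):
    """Collapse C into T, a three-letter reduction for C-to-T converted reads."""
    return seq.upper().replace("C", "T")


def bisulfite_convert(seq, methylated=frozenset()):
    """Simulate conversion: an unmethylated C reads as T, a methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq.upper())
    )


reference = "ACGTCCGATC"
# Assume only the cytosine at index 1 is methylated.
read = bisulfite_convert(reference, methylated={1})

# C and T are both pyrimidines, so conversion does not change the
# two-letter code; the three-letter code is likewise unchanged.
assert two_letter(read) == two_letter(reference)
assert three_letter(read) == three_letter(reference)
```

The sketch covers only C-to-T conversion on one strand. Reads from the complementary strand show G-to-A changes instead, and since G and A are both purines, the same purine/pyrimidine reduction applies there as well.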