MLML: consistent simultaneous estimates of DNA methylation and hydroxymethylation.

2013 
Motivation: The two major epigenetic modifications of cytosines, 5-methylcytosine (5-mC) and 5-hydroxymethylcytosine (5-hmC), coexist with each other in a range of mammalian cell populations. Increasing evidence points to important roles of 5-hmC in demethylation of 5-mC and epigenomic regulation in development. Recently developed experimental methods allow direct single-base profiling of either 5-hmC or 5-mC. Meaningful analyses seem to require combining these experiments with bisulfite sequencing, but doing so naively produces inconsistent estimates of 5-mC or 5-hmC levels.
Results: We present a method to jointly model read counts from bisulfite sequencing, oxidative bisulfite sequencing and Tet-Assisted Bisulfite sequencing, providing simultaneous estimates of 5-hmC and 5-mC levels that are consistent across experiment types.
base resolution measurements of 5-mC and 5-hmC, respectively. Any two of BS-seq, TAB-seq or oxBS-seq can be combined to profile both the 5-mC and 5-hmC methylomes of a cell population, and especially when studying 5-hmC, proper interpretation of results depends on having some estimate of the 5-mC level. However, naive manipulation of read count frequencies from independent sequencing experiments often produces two kinds of 'overshoot' problems in estimating 5-mC and 5-hmC levels. When combining BS-seq with TAB-seq, the 5-mC level at a given CpG site can be estimated by subtracting the 5-hmC level (TAB-seq) from the combined 5-mC + 5-hmC level (BS-seq). The result can be negative, because of random sampling (or systematic error) in each experiment. Similarly, combining TAB-seq and oxBS-seq could lead to estimates of 5-mC and 5-hmC levels exceeding 100%. These overshoot sites may constitute a substantial proportion. In one dataset based on oxBS-seq technology, 17% of CpG sites captured by reduced representation bisulfite sequencing (RRBS) and oxRRBS experiments exhibited overshoot (Booth et al., 2012).
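The first kind of overshoot can be illustrated with a minimal sketch of the naive subtraction for BS-seq and TAB-seq. The function name, argument names and read counts below are hypothetical and chosen only to make the problem visible; they are not taken from any dataset.

```python
def naive_levels(bs_c, bs_t, tab_c, tab_t):
    """Naive per-site estimates from BS-seq and TAB-seq read counts.

    bs_c, bs_t:   C and T reads in BS-seq (C reads reflect 5-mC + 5-hmC).
    tab_c, tab_t: C and T reads in TAB-seq (C reads reflect 5-hmC).
    All names are illustrative assumptions.
    """
    combined = bs_c / (bs_c + bs_t)   # estimate of 5-mC + 5-hmC
    hmc = tab_c / (tab_c + tab_t)     # estimate of 5-hmC
    mc = combined - hmc               # naive 5-mC estimate, may be negative
    return mc, hmc


# Hypothetical low-coverage site: 3/10 C reads in BS-seq, 5/10 in TAB-seq.
mc, hmc = naive_levels(bs_c=3, bs_t=7, tab_c=5, tab_t=5)
print(mc < 0)  # the naive 5-mC estimate falls below zero
```

With only ten reads per experiment, sampling noise alone is enough to push the difference of the two frequencies below zero.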
Fully leveraging the information in these data requires some method for making consistent estimates of 5-mC and 5-hmC levels. We present maximum likelihood methylation levels (MLML) for simultaneous estimation of 5-mC and 5-hmC, combining data from any two of BS-seq, TAB-seq or oxBS-seq, or all three when available. Our estimates are consistent in that 5-mC and 5-hmC levels are non-negative and never sum to more than 1. In an important subset of cases, our estimates are not only consistent but also show significantly greater accuracy at sites with lower coverage.
2 METHODS
Each of BS-seq, TAB-seq and oxBS-seq provides some amount of information about both the 5-mC and 5-hmC levels. Our approach is to combine information from any pair or all three of these experiments, and arrive at maximum likelihood estimates (MLEs) for the 5-mC and 5-hmC levels. A similar method has been developed in the context of haplotype frequency estimation from pooled sequencing (Kessner et al., 2013). To explain our method, we assume the data are from TAB-seq and BS-seq experiments for the same biological sample. The more general formulation is provided in Supplementary Information.
Focusing on an individual CpG site, let $p_m$ denote the methylation level (a probability), $p_h$ the hydroxymethylation level and $p_u$ ($= 1 - p_m - p_h$) the level of unmethylated C. In the TAB-seq experiment, let $h$ denote the number of C reads mapping over the CpG site, and let $g$ denote the number of T reads mapping over the same CpG. The total number of reads covering the CpG site in the TAB-seq experiment is then $h + g$. Similarly, let $t$ denote the number of C reads mapping over the site in the BS-seq
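For the TAB-seq plus BS-seq case, a constrained maximum likelihood estimate can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the TAB-seq C-read count h is binomial with success probability p_h, the BS-seq C-read count t is binomial with success probability p_m + p_h, and the two experiments are independent. The symbol u for the BS-seq T-read count and the function name are introduced here for illustration only. Under these assumptions the log-likelihood is concave, so when the unconstrained estimates would make p_m negative, the maximum lies on the boundary p_m = 0 and the two experiments are pooled to estimate p_h.

```python
def mle_tab_bs(h, g, t, u):
    """Constrained MLE of (p_m, p_h) from TAB-seq and BS-seq counts.

    h, g: C and T reads in TAB-seq; h ~ Binomial(h + g, p_h) is assumed.
    t, u: C and T reads in BS-seq; t ~ Binomial(t + u, p_m + p_h) is assumed.
    The name `u` is an assumption of this sketch.
    """
    if h + g == 0 or t + u == 0:
        raise ValueError("both experiments need at least one read")

    p_h = h / (h + g)    # unconstrained estimate of 5-hmC
    p_mh = t / (t + u)   # unconstrained estimate of 5-mC + 5-hmC

    if p_mh >= p_h:
        # Interior solution: the naive subtraction is already consistent.
        return p_mh - p_h, p_h

    # Boundary solution: p_m = 0, so both experiments measure p_h
    # and their reads are pooled.
    return 0.0, (h + t) / (h + g + t + u)


# Hypothetical site where naive subtraction would give a negative 5-mC level.
p_m, p_h = mle_tab_bs(h=5, g=5, t=3, u=7)
print(p_m, p_h)
```

Both returned values are non-negative and their sum never exceeds 1, which is the consistency property described above.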