
Statistics of motifs

2006 
In this lecture we will focus on the statistical analysis of the number of overlapping occurrences (the count) of a given oligonucleotide (word), or of a given degenerate oligonucleotide (motif or word family), in a DNA sequence. Of course, the approach is not restricted to sequences over a 4-letter alphabet. Related topics will only be mentioned briefly at the end, with appropriate references. Note also that this lecture is part of a more complete presentation published in the book DNA, Words and Models (Robin et al., 2003, 2005), which contains many more references. The question we would like to address is "does this word occur in this sequence with an expected frequency?" In other words, could we observe it so many times, or so few times, just by chance? Usually, when the answer is no, such a word is a candidate for having a particular biological meaning, but only a candidate: statistical significance is not equivalent to biological significance. As a guiding example, we will look at the occurrences of the octamer gctggtgg in the complete genome of Escherichia coli (leading strands). This word is known as the Chi motif of the bacterium; it is very frequent, with 762 occurrences on the leading strands, and it is necessary for the stability of the chromosome. Let us do the following simple calculation: if all the 4^8 octamers had the same occurrence probability in a sequence of length 4638858, then one would expect to see each of them 4638851/4^8 ≃ 70 times in the sequence, where 4638851 = 4638858 - 8 + 1 is the number of positions at which an octamer can start. At this point, the Chi motif seems highly over-represented in E. coli, because we compare 762 observed occurrences with about 70 expected ones. The key idea is indeed to compare the observed count with the one we could expect given some knowledge of the sequence. To decide whether a word count is expected or not, we need to know what to expect. This will be defined by a probabilistic model, i.e. by a description of what is "random". After choosing the appropriate model (Section 2),