
DNA sequencing theory

DNA sequencing theory is the broad body of work that attempts to lay analytical foundations for determining the order of specific nucleotides in a sequence of DNA, otherwise known as DNA sequencing. The practical aspects revolve around designing and optimizing sequencing projects (known as 'strategic genomics'), predicting project performance, troubleshooting experimental results, characterizing factors such as sequence bias and the effects of software processing algorithms, and comparing various sequencing methods to one another. In this sense, it could be considered a branch of systems engineering or operations research. The permanent archive of work is primarily mathematical, although numerical calculations are often conducted for particular problems as well. DNA sequencing theory addresses the physical processes related to sequencing DNA and should not be confused with theories of analyzing resultant DNA sequences, e.g. sequence alignment. Publications sometimes do not make a careful distinction, but the latter are primarily concerned with algorithmic issues. Sequencing theory is based on elements of mathematics, biology, and systems engineering, so it is highly interdisciplinary. The subject may be studied within the context of computational biology.

All mainstream methods of DNA sequencing rely on reading small fragments of DNA and subsequently reconstructing these data to infer the original DNA target, either via assembly or alignment to a reference. The abstraction common to these methods is that of a mathematical covering problem. For example, one can imagine a line segment representing the target and a subsequent process where smaller segments are 'dropped' onto random locations of the target. The target is considered 'sequenced' when adequate coverage accumulates (e.g., when no gaps remain). The abstract properties of covering have been studied by mathematicians for over a century; however, direct application of these results has not generally been possible. Closed-form mathematical solutions, especially for probability distributions, often cannot be readily evaluated; that is, they involve inordinately large amounts of computer time for parameters characteristic of DNA sequencing. Stevens' solution is one such example.
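The covering abstraction is easy to explore numerically. Below is a minimal Monte Carlo sketch in Python that drops $N$ fragments of length $L$ uniformly at random onto a target of length $G$ and reports the fraction of positions covered at least once; the parameter values are purely illustrative and are not taken from the text.

# Minimal Monte Carlo sketch of the covering abstraction described above.
# Fragments of length L are dropped uniformly at random onto a target of
# length G; we measure the fraction of positions covered at least once.
# All parameter values below are illustrative assumptions.
import random

def simulate_coverage(G, L, N, seed=0):
    """Drop N fragments of length L onto a target of length G;
    return the fraction of positions covered at least once."""
    rng = random.Random(seed)
    covered = [False] * G
    for _ in range(N):
        start = rng.randrange(G - L + 1)  # fragment lies wholly inside target
        for pos in range(start, start + L):
            covered[pos] = True
    return sum(covered) / G

if __name__ == "__main__":
    G, L = 100_000, 500          # target and fragment lengths (illustrative)
    for N in (100, 400, 1000):   # number of fragments dropped
        R = N * L / G            # redundancy, as defined below
        print(f"N={N:5d}  R={R:4.1f}  simulated coverage="
              f"{simulate_coverage(G, L, N):.4f}")

Runs such as these show coverage saturating as fragments accumulate, which is exactly the behavior the closed-form results below describe.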
Moreover, results obtained from the perspective of pure mathematics do not account for factors that are actually important in sequencing, for instance detectable overlap between sequencing fragments, double-stranding, edge effects, and target multiplicity. Consequently, the development of sequencing theory has proceeded more according to the philosophy of applied mathematics: in particular, it has been problem-focused and makes expedient use of approximations, simulations, and the like.

The earliest result may be found directly from elementary probability theory. Suppose the above process is modeled with $L$ and $G$ as the fragment length and target length, respectively. The probability of 'covering' any given location on the target with one particular fragment is then $L/G$. (This presumes $L \ll G$, which is often valid, but not for all real-world cases.) The probability of a single fragment not covering a given location on the target is therefore $1 - L/G$, and $\left[1 - L/G\right]^{N}$ for $N$ fragments. The probability of covering a given location on the target with at least one fragment is therefore

$$P = 1 - \left[1 - \frac{L}{G}\right]^{N}.$$

This equation was first used to characterize plasmid libraries, but it may appear in a modified form. For most projects $N \gg 1$, so that, to a good degree of approximation,

$$P \approx 1 - e^{-R},$$

where $R = NL/G$ is called the redundancy. Note the significance of redundancy as representing the average number of times a position is covered with fragments. Note also that, in considering the covering process over all positions in the target, this probability is identical to the expected value of the random variable $C$, the fraction of target coverage. The final result,

$$E\langle C\rangle = 1 - e^{-R},$$

remains in widespread use as a 'back of the envelope' estimator and predicts that coverage for all projects evolves along a universal curve that is a function only of the redundancy.

In 1988, Eric Lander and Michael Waterman published an important paper examining the covering problem from the standpoint of gaps. Although they focused on the so-called mapping problem, the abstraction to sequencing is much the same. They furnished a number of useful results that were adopted as the standard theory from the earliest days of 'large-scale' genome sequencing. Their model was also used in designing the Human Genome Project and continues to play an important role in DNA sequencing. Ultimately, the main goal of a sequencing project is to close all gaps, so the 'gap perspective' was a logical basis for developing a sequencing model. One of the more frequently used results from this model is the expected number of contigs, given the number of fragments sequenced. If one neglects the amount of sequence that is essentially 'wasted' by having to detect overlaps, their theory yields

$$E\langle \text{contigs}\rangle = N e^{-R}.$$
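For concreteness, the closed-form expressions above can be evaluated directly. The following Python sketch (again with illustrative parameter values of my choosing) compares the exact coverage probability $1 - [1 - L/G]^{N}$ with its approximation $1 - e^{-R}$, and tabulates the expected contig count $N e^{-R}$ with overlap detection neglected.

# Sketch evaluating the formulas above: exact coverage probability,
# the 1 - exp(-R) approximation, and the Lander-Waterman expected
# contig count with overlap detection neglected.
# Parameter values are illustrative assumptions only.
import math

def coverage_exact(G, L, N):
    return 1.0 - (1.0 - L / G) ** N

def coverage_approx(R):
    return 1.0 - math.exp(-R)

def expected_contigs(N, R):
    return N * math.exp(-R)

if __name__ == "__main__":
    G, L = 3_000_000, 600  # e.g. a small bacterial-scale target (assumed)
    print(f"{'N':>7} {'R':>5} {'exact':>8} {'approx':>8} {'contigs':>9}")
    for N in (5_000, 25_000, 50_000):
        R = N * L / G
        print(f"{N:7d} {R:5.1f} {coverage_exact(G, L, N):8.4f} "
              f"{coverage_approx(R):8.4f} {expected_contigs(N, R):9.1f}")

Because the exact and approximate coverage values agree to several decimal places at these parameters, the table also illustrates the 'universal curve' claim: coverage depends on the project essentially only through the redundancy $R$, while the contig count falls off as $e^{-R}$.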

[ "Whole genome sequencing", "Shotgun sequencing", "Genomics", "Sequence assembly" ]