An intermediate grade of finished genomic sequence suitable for comparative analyses

2004 
The strategy of “shotgun sequencing” (Sanger et al. 1982; Wilson and Mardis 1997b; Green 2001) has emerged as the most cost-effective approach for the de novo generation of large amounts of genomic sequence data. Whether applied to individual large-insert clones (C. elegans Sequencing Consortium 1998; International Human Genome Sequencing Consortium 2001), whole genomes (Adams et al. 2000; Venter et al. 2001; Aparicio et al. 2002; Mouse Genome Sequencing Consortium 2002), or a combination of both (Rat Genome Sequencing Project Consortium 2004), shotgun-sequencing strategies are typically performed in two broad phases. In the initial “shotgun” phase, highly redundant sequence data are obtained by generating sequence reads from one or both insert ends of randomly selected subclones derived from the starting DNA (large-insert clone or whole genome). This phase involves high-throughput methodologies and is responsible for generating the great majority of the actual sequence. In the second “finishing” phase, the assembled sequence emanating from the shotgun phase is analyzed and refined, with additional sequence data typically generated to attain long-range continuity and to improve accuracy. Sequence finishing is a low-throughput, craftsman-like process in which highly skilled personnel perform both computational and experimental procedures in a customized fashion; as a result, it is also relatively expensive.

For sequencing the human genome, the Human Genome Project appropriately set very high standards with respect to the quality of the finished sequence (Felsenfeld et al. 1999; International Human Genome Sequencing Consortium 2001; see www.genome.wustl.edu/Overview/finrulesname.php?G16=1).
Specifically, there was a rigorous set of standards that ensured consistency among different sequencing centers and a well-defined quality specification that required a low error rate (less than one error per 10,000 bases), the absence of gaps, and confirmation of the final sequence by comparison with a restriction enzyme digest-based fingerprint of each clone. Implementation of these standards yielded a remarkably accurate human genome sequence (International Human Genome Sequencing Consortium 2004), which has provided a powerful foundation for subsequent annotation efforts (Stein 2001; Ashurst and Collins 2003), comparisons with other species' sequences (Aparicio et al. 2002; Mouse Genome Sequencing Consortium 2002; Rat Genome Sequencing Project Consortium 2004), and efforts to untangle complex genomic structures, such as segmental duplications (Bailey et al. 2002). However, achieving such high standards required a considerable investment in sequence finishing, estimated to have been 30%–40% of the total cost. At present, with the recent decline in the costs of producing shotgun-sequence data, the resources required to perform such high-quality sequence finishing correspond to 40%–70% of the total cost (data not shown).

It is well recognized that the quality of the sequence generated for the human genome, which we refer to as human-grade finished sequence, is substantially better than that available at the end of the shotgun phase. The latter, which we refer to as full-shotgun draft sequence, is simply derived from the automated assembly of the full collection of shotgun sequence reads (e.g., a collection providing greater than eightfold average sequence coverage). Importantly, in the progression from full-shotgun to human-grade finished sequence, the relationship between the associated additional costs and the enhancement in sequence quality is not linear.
Indeed, early in this progression, significant gains in quality can be achieved with even small amounts of additional effort (Wilson and Mardis 1997b; Gordon et al. 2001), whereas in later stages, large amounts of effort are often required to accomplish even small quality improvements.

In contemplating the sequencing of additional vertebrate genomes beyond the first pair of high-quality reference sequences (i.e., those of the human [International Human Genome Sequencing Consortium 2001, 2004] and mouse [Mouse Genome Sequencing Consortium 2002] genomes), the relative value of sequence finishing is of great interest. Specifically, understanding the relationship between overall sequence quality and the ability to extract relevant information by comparative analyses becomes important, especially in the context of analyzing sequences from multiple species.

Motivated to generate genomic sequence from multiple species suitable for comparative analyses (Margulies et al. 2003a,b; Thomas et al. 2003), we sought to investigate whether an intermediate grade of finished sequence could be produced that was both cost-effective and appropriate in terms of quality. Toward that end, we have established an approach for generating what we call comparative-grade finished sequence. Here we report details about comparative-grade finished sequence, as generated on a large scale for bacterial artificial chromosome (BAC) clones (Shizuya et al. 1992; Birren et al. 1998). In addition, we assess the relative quality of this sequence and the effort and costs associated with producing it.
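
The quantitative thresholds mentioned above (an error rate below one per 10,000 bases, greater than eightfold average coverage, and finishing consuming a growing share of total cost as shotgun costs decline) can be made concrete with a small sketch. The Python below is illustrative only and is not taken from this study: the function names are hypothetical, the Poisson coverage estimate is a standard idealization that ignores cloning bias and repeats, and the cost figures are arbitrary placeholder units.

```python
import math


def phred_from_error_rate(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality.

    Uses the standard definition Q = -10 * log10(P_error).
    """
    return -10.0 * math.log10(error_rate)


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized fraction of bases left unsampled at a given average coverage.

    Assumption: reads land uniformly at random (Poisson model), so the
    probability that a base is covered by zero reads is exp(-coverage).
    """
    return math.exp(-coverage)


def finishing_share(finishing_cost: float, shotgun_cost: float) -> float:
    """Fraction of total cost spent on finishing."""
    return finishing_cost / (finishing_cost + shotgun_cost)


if __name__ == "__main__":
    # One error per 10,000 bases, expressed on the Phred scale.
    print("Phred quality:", phred_from_error_rate(1.0 / 10_000))

    # Idealized uncovered fraction at eightfold average coverage.
    print("Uncovered fraction:", expected_uncovered_fraction(8.0))

    # Hypothetical units: finishing stays at 35 while shotgun cost halves.
    print("Share before:", finishing_share(35.0, 65.0))
    print("Share after: ", finishing_share(35.0, 32.5))
```

In this hypothetical scenario the finishing effort itself does not change; its share of the total grows only because the shotgun component becomes cheaper, which is the effect described in the text.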