A general approach to single-nucleotide polymorphism discovery

1999 
Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits 1. The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2–5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence 6,7 as a template on which to layer often unmapped, fragmentary sequence data 8–11, and to use base quality values 12 to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence, we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth. We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.
We started with 1,268,211 bp of finished (less than 1 error per 10,000 bp) human reference sequence from 10 genomic clones, with EST content typical of gene-bearing clones. To begin the analysis procedure (Fig. 1) and identify human ESTs that originated from these clones, we performed a database search against the public EST set (dbEST) and recovered 1,954 hits (representing potentially multiple exons of 1,365 unique ESTs) for which chromatograms were available. Sequence clusters were constructed as groups of overlapping alignments (147 clusters). Sequence traces were re-processed with the PHRED base-calling program 13,14 to obtain base quality values.
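To illustrate how base quality values can separate likely allelic variation from sequencing error, the following is a minimal Python sketch, not the POLYBAYES implementation. It relies on the standard PHRED relation Q = -10 log10(P_error). Everything else is an assumption made for illustration: the function names, the prior of 0.001 for a site being polymorphic, the equal spread of an error over the three other bases, and the uniform prior within the monomorphic and polymorphic classes of true-base assignments.

```python
from typing import List, Tuple

BASES = "ACGT"


def phred_to_error(q: float) -> float:
    """Convert a PHRED quality value to a base-call error probability."""
    return 10.0 ** (-q / 10.0)


def base_likelihood(observed: str, true_base: str, q: float) -> float:
    """P(observed call | true base), assuming an error is equally
    likely to produce any of the three other bases (an assumption)."""
    e = phred_to_error(q)
    return 1.0 - e if observed == true_base else e / 3.0


def snp_posterior(column: List[Tuple[str, float]],
                  prior_poly: float = 0.001) -> float:
    """Posterior probability that an alignment column is polymorphic.

    column: (called base, PHRED quality) for each aligned read.
    prior_poly: assumed prior that a site is polymorphic (hypothetical).
    """
    n = len(column)
    if n < 2:
        return 0.0  # a single read cannot show two alleles
    # Likelihood summed over the four "all reads share one true base" cases.
    w_mono = 0.0
    for t in BASES:
        w = 1.0
        for obs, q in column:
            w *= base_likelihood(obs, t, q)
        w_mono += w
    # Per-read likelihoods sum to 1 over the true bases, so the sum over
    # all 4**n assignments is 1 and the remainder is the polymorphic part.
    w_poly = max(0.0, 1.0 - w_mono)
    n_poly = 4 ** n - 4  # number of non-monomorphic assignments
    num = prior_poly / n_poly * w_poly
    den = num + (1.0 - prior_poly) / 4.0 * w_mono
    return num / den


if __name__ == "__main__":
    print(snp_posterior([("A", 40), ("A", 40), ("G", 40), ("G", 40)]))
    print(snp_posterior([("A", 40), ("A", 40), ("A", 40), ("G", 5)]))
```

In this toy model, two high-quality A calls and two high-quality G calls give a posterior close to 1, while a single low-quality mismatch is attributed to sequencing error. This is the behaviour that lets low-quality portions of a read be included without manual trimming.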
Subsequent analyses used the full length of the ESTs, including low-quality portions. Cluster members were multiply aligned with an anchored alignment technique. Unlike traditional algorithms, this method rapidly produces correct multiple alignments even in the presence of abundantly expressed or alternatively spliced transcripts. In total, EST clusters represented 80,469 bp of expressed genomic sequence, 38% of this in regions of single EST coverage and 81% in regions covered by 8 or fewer ESTs (Table 1). Inclusion of sequences representing highly similar regions duplicated elsewhere in the genome may give rise to false SNP predictions, and the presence of such sequence paralogues points to difficulties during marker development. We devised a Bayesian 15