High-Throughput Variation Detection and Genotyping Using Microarrays
2001
The central goal of human genetics is to identify, characterize and ultimately understand the specific DNA variants that contribute to human phenotypes in general, and human disease in particular (Lander and Schork 1994; Chakravarti 1999; Zwick et al. 2000, 2001; On-line Mendelian Inheritance in Man 2001). The genetic approach to this problem is, in principle, straightforward. First, we identify individuals showing phenotypic variation for the trait of interest. Second, we genotype genetic variants, such as microsatellites or SNPs, in all of the individuals in a study. Third, we perform appropriate statistical tests to identify any genetic variants correlated with variation in the phenotype. Finally, if such variants are found, we perform additional experiments to demonstrate a causal relationship.
Step two poses a question: What genetic variants should be examined? The answer to this question must balance technological and practical considerations. Nevertheless, in the best of all worlds, a researcher would be able to determine the genotype of every base in every sample, that is, a complete resequencing of the entire genome of all individuals under study. No technology currently exists to do this in an economical manner. Moreover, any technology used for this purpose must be capable of extraordinary resequencing accuracy.
Nucleotide diversity in the general human population is ∼8 × 10⁻⁴ per site (Cargill et al. 1999; Halushka et al. 1999; The International SNP Map Working [TISMW] Group 2001; Venter et al. 2001; this study). This implies that a randomly selected chromosome will differ from the human reference sequence at ∼8 of every 10,000 bases. Now, imagine a technology that allowed one to rapidly and inexpensively determine the genotype of an individual at every nucleotide site of interest with an accuracy of 99.9%. Such a technology would be remarkable, but insufficient. The problem with only 99.9% accuracy is that it implies 10 errors for every 10,000 bases. Because the true rate of variation is eight in 10,000, roughly 56% of the identified variants (10 of every 18) will be errors. This is unacceptably high. The error rate needs to be much lower.
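The arithmetic behind this argument can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes, for illustration only, that every miscalled base is reported as a spurious variant and that every true variant is detected.

```python
def false_variant_fraction(accuracy, diversity=8e-4):
    """Fraction of reported variants that are actually errors.

    Illustrative assumptions (not from the original analysis): each
    miscalled base appears as a spurious variant, and all true variants
    are found. `diversity` is the per-site nucleotide diversity.
    """
    error_rate = 1.0 - accuracy
    return error_rate / (error_rate + diversity)


if __name__ == "__main__":
    # Compare a few per-base accuracies against a diversity of 8 in 10,000.
    for acc in (0.999, 0.9999, 0.999999):
        frac = false_variant_fraction(acc)
        print(f"accuracy {acc:.4%}: {frac:.1%} of reported variants are errors")
```

Under these assumptions, the share of false variants falls quickly as per-base accuracy rises, which is why the error rate, rather than the raw accuracy figure, is the quantity to watch.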
Microarrays are inherently parallel devices that offer the promise of determining the genotypes of individuals at every site of interest with a limited level of effort (Fodor et al. 1991; Southern et al. 1992; Pease et al. 1994; McGall et al. 1996; Lipshutz et al. 1999). Variation Detection Arrays (VDAs) manufactured by Affymetrix have been used to this end with success (Chee et al. 1996; Hacia et al. 1996, 1998a,b, 1999, 2000; Wang et al. 1998; Hacia and Collins 1999; Halushka et al. 1999). Unfortunately, it has also been reported that between 12% and 45% of the detected variants are false (Wang et al. 1998; Cargill et al. 1999; Halushka et al. 1999). Given a true variation rate of about eight in 10,000 bases, this indicates that VDAs are, on average, between 99.93% and 99.99% accurate.
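The step from a reported false-variant fraction to an average per-base accuracy can be checked with a short calculation. The sketch below is illustrative: the function name is hypothetical, and it assumes a nucleotide diversity of 8 × 10⁻⁴ and that every miscalled base is reported as a variant.

```python
def implied_accuracy(false_fraction, diversity=8e-4):
    """Per-base accuracy implied by a given fraction of false variants.

    If a fraction f of reported variants is false, then
    f = e / (e + diversity), so the per-base error rate is
    e = diversity * f / (1 - f). Assumes errors always look like variants.
    """
    error_rate = diversity * false_fraction / (1.0 - false_fraction)
    return 1.0 - error_rate


if __name__ == "__main__":
    # The reported range of false variants: 12% to 45%.
    for f in (0.12, 0.45):
        print(f"{f:.0%} false variants -> {implied_accuracy(f):.4%} accurate")
```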
Although microarrays may be, on average, insufficiently accurate, it is certainly possible that a large fraction of genotype calls are, in fact, much more accurate than 99.9%, while a smaller fraction are much less accurate. The approach used here is to construct an objective statistical framework that distinguishes genotype calls that can be made with extraordinary accuracy from those that are less reliable. The need for such a framework for microarrays is not a new idea (Southern et al. 1992). The objective is to achieve for microarrays some of what Green and colleagues (Nickerson et al. 1997; Ewing and Green 1998; Ewing et al. 1998; Gordon et al. 1998; Rieder et al. 1998) have accomplished for automated sequencing, namely the assignment to each genotype call of a quality score that is larger for calls more likely to be accurate. Green and colleagues, in fact, have done even more: phred provides not only a quality score that increases with increasing accuracy, but also a direct estimate of the probability that a base call is correct.
Researchers performing automated sequencing routinely rely on these phred scores (Ewing and Green 1998; Ewing et al. 1998) and, by combining them with certain neighborhood quality rules (Altshuler et al. 2000; Mullikin et al. 2000), can achieve an extremely high level of accuracy for SNP discovery (TISMW Group 2001). This work attempts the same task. An objective statistical framework is developed to assign a quality score to each VDA genotype call. Certain simple neighborhood rules are then applied, and sites in which extraordinarily high confidence can be placed are distinguished from less reliable sites. In contrast to automated sequencing experiments that employ only haploid targets (Altshuler et al. 2000; Mullikin et al. 2000), this statistical method can be applied to both haploid and diploid targets. We call the system ABACUS (from Adaptive Background genotype Calling Scheme, see below) and will show that, in general, greater than 99.9999% accuracy can be achieved on >80% of the genotype calls on a VDA.
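For orientation, the conventional phred scale relates a quality score Q to an error probability P by Q = −10 log₁₀ P. The sketch below converts between the two. The function names are hypothetical, and expressing the 99.9999% accuracy target on this scale is only an illustration; this passage does not say that ABACUS quality scores use the phred scale.

```python
import math


def phred_from_error(p_error):
    """Phred-style quality score for a given error probability."""
    return -10.0 * math.log10(p_error)


def error_from_phred(q):
    """Error probability implied by a phred-style quality score."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Illustrative comparison: 99.9% versus 99.9999% accuracy.
    for accuracy_error in (1e-3, 1e-6):
        q = phred_from_error(accuracy_error)
        print(f"error probability {accuracy_error:g} -> Q = {q:.0f}")
```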