Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences

2000 
by statistical analysis of human expressed sequence tags (ESTs), associated primarily with coding regions of genes. We used Bayesian inference to weigh evidence for true polymorphism versus sequencing error, misalignment or ambiguity, misclustering or chimaeric EST sequences, assessing data such as raw chromatogram height, sharpness, overlap and spacing, sequencing error rates, context-sensitivity and cDNA library origin. Three separate validations, namely comparison with 54 genes screened independently for SNPs, verification of HLA-A polymorphisms and restriction fragment length polymorphism (RFLP) testing, verified 70%, 89% and 71% of our predicted SNPs, respectively. Our method detects tenfold more true HLA-A SNPs than previous analyses of the EST data. We found SNPs in a large fraction of known disease genes, including some disease-causing mutations (for example, the HbS sickle-cell mutation). Our comprehensive analysis of human coding region polymorphism provides a public resource for mapping of disease genes (available at http://www.bioinformatics.ucla.edu/snp).
We analysed more than 542 million base pairs of EST and mRNA sequences from Unigene 12 (release 29 March 2000), producing 48,196 candidate SNPs with a lod score greater than 3 in favour of a polymorphism as opposed to sequencing error. To test our SNPs experimentally, we selected a subset of candidates that cause RFLPs in several score ranges (lod 2‐6, 6‐20, >20; Fig. 1). Of 79 SNP candidates tested so far in 8‐24 DNA samples, 56 (71%) showed the expected pattern of polymorphism. The verification rate was lower in the 2‐6 lod score range (38%) than at higher lod scores (6‐20, 69%; >20, 79%).
To further validate our results, we examined genes in which independent experimental studies have systematically searched for polymorphisms. For these genes, we determined whether any of our SNPs were not found independently by these studies.
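As a rough illustration of what a lod score threshold means, the sketch below computes a log10 likelihood ratio for one aligned column of EST reads under a deliberately simplified binomial model. It is not the Bayesian model described above, which also weighs chromatogram quality, sequence context and library origin. The function name `lod_score` and the default values for `error_rate` and `minor_freq` are assumptions made for this example only.

```python
import math


def lod_score(n_reads, n_minor, error_rate=0.01, minor_freq=0.2):
    """Log10 likelihood ratio of 'polymorphism' versus 'sequencing error'.

    Simplified model (an assumption, not the paper's method):
    - under the error hypothesis, a read shows the minor base only through
      a miscall, with probability error_rate / 3;
    - under the polymorphism hypothesis, a read carries the minor allele
      with probability minor_freq and is read correctly, or carries the
      major allele and is miscalled as the minor base.
    """
    p_err = error_rate / 3.0
    p_snp = minor_freq * (1.0 - error_rate) + (1.0 - minor_freq) * error_rate / 3.0

    def log10_likelihood(p):
        # The binomial coefficient is identical under both hypotheses,
        # so it cancels in the ratio and is omitted here.
        return n_minor * math.log10(p) + (n_reads - n_minor) * math.log10(1.0 - p)

    return log10_likelihood(p_snp) - log10_likelihood(p_err)


# A column of 10 reads, with 3 versus 1 reads showing the minor base.
for minor_reads in (3, 1):
    score = lod_score(10, minor_reads)
    print(minor_reads, round(score, 2), "candidate" if score > 3 else "below threshold")
```

With these assumed parameters, three minor-allele reads out of ten clear the lod > 3 threshold, whereas a single minor-allele read does not, which matches the intuition that an isolated discrepant read is more plausibly a sequencing error.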
We evaluated all SNPs (lod>3) within the protein-coding regions of 54 genes independently screened for polymorphisms in 40 people by the Whitehead Institute/Affymetrix coding single nucleotide polymorphisms (WIAF-CSNP) project (ref. 9). We mapped each of our SNPs onto the gene sequence referenced by WIAF-CSNP, and required a perfect match of its location, major nucleotide and minor nucleotide with a SNP reported by WIAF-CSNP. For the 54 genes analysed, 70% of our SNPs matched polymorphisms found independently by WIAF-CSNP (Table 1). Of the SNPs that were not reported by WIAF-CSNP, review of the sequencing and alignment evidence indicates that several are good candidates for real polymorphisms, and may have been missed due to experimental sampling. Several others appear to result from misclustering of paralogous sequences in the Unigene database. At least six additional SNPs reported by WIAF-CSNP within these genes were also identified by our calculations with lod scores less than 3, indicating that there are many more real SNPs in our data below our scoring threshold. We did not count these SNPs in the validation results. Our calculations also identified a number of SNPs (lod>3) in non-coding regions that were confirmed by WIAF-CSNP, but again we did not count these.
To assess the rate of SNP false positives and false negatives in a larger population sample, we examined a gene (HLA-A) whose