Synthetic Associations Created by Rare Variants Do Not Explain Most GWAS Results

2011 
Complex traits and diseases, such as body-mass index, height, diabetes, heart disease, and psychiatric disorders, are undoubtedly caused by multiple genetic and environmental factors, although it has been a major challenge to identify specific genes. Recently, genome-wide association studies (GWAS) have resulted in the detection of many robustly associated single nucleotide polymorphism (SNP) variants across a range of outcomes [1], although for any particular disease or trait the SNP variants detected explain only a fraction of the total genetic variance calculated from family studies. The gap between the two has been termed the “missing heritability” [2],[3]. Many reasons for the missing heritability have been given [3]. One plausible explanation is that rare variants, which existing GWAS platforms are not designed to capture, make significant contributions to the heritability of many traits and diseases. It is indeed likely that many multifactorial and heterogeneous phenotypes will be influenced by a diverse array of genetic factors that span the spectrum from private mutation to common variant. Dickson and colleagues [4],[5] recently went a step further, arguing that rare variants might not only explain some of the heritability that is currently missing but may also be the cause of a proportion of detected associations between complex traits and common SNPs from GWAS. Based on computer simulations, they proposed that some constellations of variants within a narrow frequency and effect size range can account for “many” of the observed associations between complex traits and common SNPs from GWAS.
This is a strong claim and one that they say has important implications for the “design of future studies to detect causal variants.” It is of great importance to the research community to establish whether “many” represents an important proportion of GWAS results to date, since this can affect decisions on experimental design and the allocation of research funds. Dickson et al. define synthetic association as the association of a genotyped common marker resulting from multiple unobserved low-frequency causal variants (see Figure 1). The variance contributed by the causal variants would be much higher than the variance explained by the associated genotyped SNP, because the genotyped SNPs will not “tag” (see Box 1) the causal variants with great precision, thus leading to the “missing” heritability from GWAS. Importantly, synthetic associations may arise many hundreds of kilobases (kb) from the site of the causal variant(s), which would hamper attempts to locate the causal variants responsible for association signals by fine-mapping. Dickson et al. claim that rare variants can give rise to synthetic associations that are similar to many observed GWAS associations. As we show below, however, synthetic associations in fact tend to differ in some important ways from observations from GWAS. Furthermore, even if rare variants can, in principle, give rise to associations detectable in GWAS, the converse proposition (that, for a given trait, many, or even any, detected GWAS associations arise from rare variants) does not automatically follow.
Box 1. The Dickson et al. Genetic Model and Simulations
Dickson et al. [4] used coalescence theory (Box 2) [19] to simulate patterns of LD that are consistent with an evolutionary process, and then mimicked a GWAS by simulating cases and controls and testing for association between disease status and common tagging SNPs (MAF>0.05).
Specifically, each simulation was of a genomic region of length 100 kb (representing approximately 1/30,000th of the genome). To generate realistic patterns of SNP frequencies they assumed an effective population size of 10,000 and a mutation rate of $10^{-8}$. Within a 100 kb region, up to 9 SNPs, each with frequency between 0.005 and 0.02, were allocated to influence disease (causal SNPs). Therefore, at a locus with 9 such variants, ∼20% of the general population would be expected to carry at least one disease risk allele. The baseline probability of disease was 1% or 10%, and each risk variant had the same increased risk for disease (genotype relative risk, GRR, see Box 3) compared to the baseline. Each simulation generated 10,000 haplotypes of the 100 kb region. Individuals in the population were simulated by sampling, with replacement, pairs of haplotypes; these were allocated case or control status based on the probability of disease associated with the number of risk loci they carried (with GRR combining multiplicatively when an individual carried multiple risk alleles; this is not a common event, as only about 1% of individuals will carry more than one risk allele when there are 9 causal SNPs in the 100 kb region). A case-control study was simulated by selecting equal numbers of cases and controls. The simulations varied three parameters: the number of causal SNPs (1, 3, 5, 7, 9), the sample size of the case-control study (2,000, 4,000, 6,000), and the GRR associated with each risk allele (2, 3, 4, 5, 6). Most simulations were conducted in the absence of recombination. The more realistic scenario of recombination (comparing different rates) was considered only when GRR = 4. The simulation of recombination divided the 100 kb region into 200 fragments of 500 bp, with no recombination within segments and recombination only between them. Additional simulations also considered 9 causal variants of GRR = 4 in a 10 Mb region with recombination of 1 cM/Mb.
Box 2. Glossary of Linkage Disequilibrium
We consider two loci on a chromosome. The causal locus has alleles C and c and the genotyped marker (SNP) has alleles M and m. These alleles have frequencies $p_C$, $1-p_C$, $p_M$, $1-p_M$. The loci can make four possible haplotypes CM, cM, Cm, cm with frequencies $p_{CM}$, $p_{cM}$, $p_{Cm}$, $p_{cm}$.
Linkage equilibrium – When the frequencies of haplotypes are the frequencies expected from the random association of the alleles, e.g., $p_{CM} = p_C p_M$.
Linkage disequilibrium (LD) – The non-random association between alleles on a chromosome, e.g., $p_{CM} > p_C p_M$. Recombination breaks down linkage disequilibrium.
Recombination – Chromosomal cross-over between the paired chromosomes during meiosis so that the chromosomes passed to offspring comprise a mixture of the chromosomes inherited from its two parents. If the cross-over event occurs between loci C and M, then the LD between them is broken down in the transmitted chromosome. It may take several generations or multiple recombination events to have a substantial impact on the LD in the population.
Coupled alleles – Alleles at two loci that tend to be found together on a chromosome. For example, a locus with one rare allele (rare allele C, common allele c) will usually make only three chromosomal haplotypes with any other locus (minor allele M, major allele m): CM, cM, cm. In this example, the rare allele C is only found in the population coupled with the allele M. This is called complete LD. Recombination breaks down the coupling of alleles, so that all four haplotypes exist in the population. However, while there is linkage disequilibrium, the coupled alleles are those making combinations of haplotypes with frequency greater than expected if there was linkage equilibrium.
Measures of LD – The two commonly used measures of LD are $r^2$ and $|D'|$; both scale the covariance between the loci, $D = p_{CM} - p_C p_M$, but in different ways.
$r^2 = D^2/(p_C p_M (1-p_C)(1-p_M))$, so $r$ is the correlation between the loci, which scales $D$ by the standard deviation of allelic frequency at the two loci. When $p_C < p_M$ and C and M are coupled, $|D'| = D/(p_C(1-p_M))$, so that $D$ is scaled by the maximum allelic association possible given the allele frequencies at the two loci. Rare variants often make only three haplotypes with common SNPs; in this case $r^2$ can be close to zero while $|D'| = 1$.
Perfect LD – When the alleles at one locus (C and c) have the same frequency as the alleles at another locus (M and m) and when the alleles are perfectly coupled so that only two haplotypes exist, CM and cm. In this case $r^2 = |D'| = 1$.
Complete LD – When the alleles at one locus (C and c) have different frequencies from the alleles at another locus (M and m), but alleles from the C and M loci are coupled as much as is possible given the different allele frequencies. In this case, only three haplotypes exist in the population, e.g., CM, cM, cm. Here $|D'| = 1$ and $r^2$ can range from very close to zero to 1 (when $r^2 = 1$, the allele frequencies of the two loci are equal and there is perfect LD). The value of $r^2$ depends on the allele frequency difference between the two loci.
Maximum $r^2$ – The maximum $r^2$ possible between two loci given their allele frequencies occurs when the two loci make only three haplotypes so that there is complete LD. If C has the lowest frequency out of C, c, M and m, and if allele C is coupled with allele M (where M might be either the minor or major allele at this locus), then the difference in allele frequencies between the coupled loci is $v = p_M - p_C$. The maximum $r^2$ between them is $r^2_{\max} = \frac{p_C(1-p_C-v)}{(p_C+v)(1-p_C)}$ [20]. If allele C is very rare then $r^2_{\max} \approx p_C(1-p_M)/p_M$, and when $p_M$ is close to 0.5, $r^2_{\max} \approx p_C$.
Tagging – When a genotyped SNP is in LD with a non-genotyped variant, the genotyped SNP tags the non-genotyped variant.
Coalescence theory – A population genetics model of inheritance relationships among alleles at a given locus.
The coalescence of two alleles is the most recent point (going back in time) at which they shared a common ancestor. Simulation under coalescence theory is an efficient way to generate a realistic distribution of SNP frequencies and LD between them.
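The LD measures defined in Box 2 can be checked numerically. The following is a minimal Python sketch, not part of the original analysis: the function names `ld_measures` and `max_r2` are hypothetical, the general-case normalisation of $|D'|$ uses the standard definition (the glossary gives only the case where C is coupled with M and $p_C < p_M$), and the allele frequencies in the example are illustrative assumptions. The maximum $r^2$ expression is the one obtained by setting $p_{CM} = p_C$ (complete LD) in the glossary's definition of $r^2$.

```python
def ld_measures(p_CM, p_C, p_M):
    """Return (D, r2, |D'|) for two biallelic loci.

    p_CM: frequency of the CM haplotype
    p_C, p_M: frequencies of alleles C and M
    """
    # Covariance between the loci, as in the glossary: D = p_CM - p_C * p_M
    D = p_CM - p_C * p_M
    # r2 scales D^2 by the product of the allelic variances at the two loci
    r2 = D**2 / (p_C * p_M * (1 - p_C) * (1 - p_M))
    # |D'| scales D by the largest |D| possible given the allele frequencies
    # (standard normalisation; assumed here for the general case)
    if D >= 0:
        D_max = min(p_C * (1 - p_M), p_M * (1 - p_C))
    else:
        D_max = min(p_C * p_M, (1 - p_C) * (1 - p_M))
    abs_D_prime = abs(D) / D_max if D_max > 0 else 0.0
    return D, r2, abs_D_prime


def max_r2(p_C, p_M):
    """Maximum r2 when the rarer allele C occurs only on M haplotypes (p_C <= p_M)."""
    return p_C * (1 - p_M) / (p_M * (1 - p_C))


# Illustrative values (assumptions): a rare causal allele C at 1% that is
# found only on haplotypes carrying a common marker allele M at 40%.
p_C, p_M = 0.01, 0.40
D, r2, d_prime = ld_measures(p_CM=p_C, p_C=p_C, p_M=p_M)
print(f"D = {D:.4f}, r2 = {r2:.4f}, |D'| = {d_prime:.2f}")
print(f"maximum r2 from allele frequencies: {max_r2(p_C, p_M):.4f}")
```

With these illustrative frequencies the two loci are in complete LD ($|D'| = 1$), yet $r^2$ is only about 0.015, which is the situation the glossary describes for rare variants tagged by common SNPs.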