Efficient identification of trait-associated loss-of-function variants in the UK Biobank cohort by exome-sequencing based genotype imputation
Abstract:
The large-scale, open-access whole-exome sequencing (WES) data of ~200,000 UK Biobank (UKB) participants is accelerating a new wave of genetic association studies that aim to identify rare, functional loss-of-function (LoF) variants associated with a broad range of complex traits and diseases. However, the community still lacks stringent replication of new associations. In this study, we proposed merging the WES genotypes and the genome-wide genotyping (GWAS) genotypes of 167,000 UKB Caucasian participants into a combined reference panel, and then imputing 241,911 UKB Caucasian participants who had GWAS genotypes only. We then proposed using the imputed data to replicate associations identified in the discovery WES sample. Using a leave-100-out imputation strategy in the reference panel, we showed that the average imputation accuracy measure r² is modest to high at LoF variants in all minor allele frequency (MAF) intervals, including ultra-rare ones: 0.942 at MAF interval [1%, 50%], 0.807 at [0.1%, 1.0%), 0.805 at [0.01%, 0.1%), 0.664 at [0.001%, 0.01%) and 0.410 at (0, 0.001%). As applications, we studied single-variant-level and gene-level associations of LoF variants with estimated heel bone mineral density (eBMD) and 4 lipid traits: high-density-lipoprotein cholesterol (HDL-C), low-density-lipoprotein cholesterol (LDL-C), triglycerides (TG) and total cholesterol (TC). In addition to replicating dozens of previously reported genes, such as MEPE for eBMD and PCSK9 for more than one lipid trait, the results identified 2 novel gene-level associations for HDL-C: PLIN1 (cumulative MAF=0.10%, discovery BETA=0.38, P=1.20×10⁻¹³; replication BETA=0.25, P=1.03×10⁻⁶) and ANGPTL3 (cumulative MAF=0.10%, discovery BETA=−0.36, P=4.70×10⁻¹¹; replication BETA=−0.30, P=6.60×10⁻¹¹), as well as one novel single-variant-level association for TG (11:14843853:C:T in PDE3B, MAF=0.11%, discovery BETA=−0.31, P=2.70×10⁻⁹; replication BETA=−0.31, P=8.80×10⁻¹⁴). Our results highlighted the strength of WES-based genotype imputation and provided useful imputed data within the UKB cohort.
Keywords:
Imputation (statistics)
Genome-wide Association Study
Minor allele frequency
Trait
Genetic Association
Exome
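The per-interval accuracy figures in the abstract can be illustrated with a minimal sketch. It assumes that r² is the squared Pearson correlation between true genotypes (0/1/2) and imputed dosages for each variant, averaged within the MAF intervals quoted above; the abstract does not spell out the exact computation, and the function names (`maf_bin`, `dosage_r2`, `mean_r2_by_bin`) are hypothetical.

```python
from bisect import bisect_right

# Interior edges (as fractions) of the MAF intervals quoted in the abstract.
_INNER_EDGES = [1e-5, 1e-4, 1e-3, 1e-2]
BIN_LABELS = ["(0, 0.001%)", "[0.001%, 0.01%)", "[0.01%, 0.1%)",
              "[0.1%, 1%)", "[1%, 50%]"]


def maf_bin(maf):
    """Return the label of the MAF interval containing `maf`, or None if out of range."""
    if not 0.0 < maf <= 0.5:
        return None
    return BIN_LABELS[bisect_right(_INNER_EDGES, maf)]


def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Returns None when either vector is constant, since r2 is undefined there.
    """
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    s_tt = sum((t - mean_t) ** 2 for t in true_genotypes)
    s_dd = sum((d - mean_d) ** 2 for d in imputed_dosages)
    s_td = sum((t - mean_t) * (d - mean_d)
               for t, d in zip(true_genotypes, imputed_dosages))
    if s_tt == 0.0 or s_dd == 0.0:
        return None
    return (s_td * s_td) / (s_tt * s_dd)


def mean_r2_by_bin(variants):
    """Average r2 per MAF interval.

    `variants` is an iterable of (maf, true_genotypes, imputed_dosages) tuples;
    variants with undefined r2 or out-of-range MAF are skipped.
    """
    per_bin = {label: [] for label in BIN_LABELS}
    for maf, truth, dosage in variants:
        label = maf_bin(maf)
        r2 = dosage_r2(truth, dosage)
        if label is not None and r2 is not None:
            per_bin[label].append(r2)
    return {label: sum(v) / len(v) for label, v in per_bin.items() if v}
```

In a leave-100-out setting, the true genotypes would presumably be pooled over all held-out batches before calling `dosage_r2`, because an ultra-rare variant is usually monomorphic within any single batch of 100 samples.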
Exome association studies to date have generally been underpowered to systematically evaluate the phenotypic impact of very rare coding variants. We leveraged extensive haplotype sharing between 49,960 exome-sequenced UK Biobank participants and the remainder of the cohort (total N ~500K) to impute exome-wide variants at high accuracy (R² > 0.5) down to minor allele frequency (MAF) ~0.00005. Association and fine-mapping analyses of 54 quantitative traits identified 1,189 significant associations (P < 5×10⁻⁸) involving 675 distinct rare protein-altering variants (MAF < 0.01) that passed stringent filters for likely causality; 600 of the 675 variants (89%) were not present in the NHGRI-EBI GWAS Catalog. We replicated the effect directions of 28 of 28 height-associated variants genotyped in previous exome array studies, including missense variants in newly-associated collagen genes COL16A1 and COL11A2. Across all traits, 49% of associations (578/1,189) occurred in genes with two or more hits; follow-up analyses of these genes identified long allelic series containing up to 45 distinct likely-causal variants within the same gene (on average exhibiting 93%-concordant effect directions). In particular, 24 rare coding variants in IFRD2 independently associated with reticulocyte indices, suggesting an important role of IFRD2 in red blood cell development, and 11 rare coding variants in NPR2 (a gene previously implicated in Mendelian skeletal disorders) exhibited intermediate-to-strong effects on height (0.18-1.09 s.d.). Our results demonstrate the utility of within-cohort imputation in population-scale GWAS cohorts, provide a catalog of likely-causal, large-effect coding variant associations, and foreshadow the insights that will be revealed as genetic biobank studies continue to grow.
Exome
Minor allele frequency
Imputation (statistics)
Genetic Association
Genome-wide Association Study
Linkage Disequilibrium
Citations (9)
Genome-wide Association Study
Exome
Imputation (statistics)
Genetic Association
Citations (10)
Background: In recent years, capabilities for genotyping large sets of single nucleotide polymorphisms (SNPs) have increased considerably, and it is now possible to genotype over 1 million SNP markers across the genome. This advance in technology has led to an increase in the number of genome-wide association studies (GWAS) for various complex traits. These GWAS have implicated over 1500 SNPs associated with disease traits. However, the SNPs identified by these GWAS are not necessarily the functional variants. Therefore, the next phase in GWAS will involve refining these putative loci. Methodology: A next step for GWAS would be to catalog all variants, especially rarer variants, within the detected loci, followed by association analysis of the detected variants with the disease trait. However, sequencing a locus in a large number of subjects is still relatively expensive. A more cost-effective approach would be to sequence a portion of the individuals and then apply genotype imputation methods to impute markers in the remaining individuals. A potentially attractive alternative would be to impute based on the 1000 Genomes Project; however, this has the drawback of using a reference population that does not necessarily match the disease status and LD pattern of the study population. We explored a variety of approaches for carrying out the imputation with a reference panel consisting of sequence data for a fraction of the study participants, using data from both a candidate gene sequencing study and the 1000 Genomes Project. Conclusions: Imputation of genetic variation based on a proportion of sequenced samples is feasible. Our results indicate the following sequencing study design guidelines, which take advantage of recent advances in genotype imputation methodology: select the largest and most diverse reference panel for sequencing, and genotype as many "anchor" markers as possible.
Imputation (statistics)
Genome-wide Association Study
1000 Genomes Project
Genetic Association
SNP
Minor allele frequency
Citations (23)
In light of the complex nature of multiple sclerosis (MS) and the recently estimated contribution of low-frequency variants to the disease, decoding its genetic risk components requires novel variant prioritization strategies. By reviewing MS genome-wide association studies (GWAS), we selected 107 candidate loci marked by intragenic single nucleotide polymorphisms (SNPs) with a remarkable association (p-value ≤ 5×10⁻⁶). A whole exome sequencing (WES)-based pilot study of SNPs with minor allele frequency (MAF) ≤ 0.04, conducted in three Italian families, revealed 15 exonic low-frequency SNPs with affected parent-child transmission. These variants were detected in 65/120 unrelated Italian MS patients, in some cases in combination (22 patients). Compared with databases (controls gnomAD, dbSNP150, ExAC, Tuscany-1000 Genome), the allelic frequencies of C6orf10 rs16870005 and IL2RA rs12722600 were significantly higher (i.e., controls gnomAD, p=9.89×10⁻⁷ and p<1×10⁻²⁰). TET2 rs61744960 and TRAF3 rs138943371 frequencies were also significantly higher, except in Tuscany-1000 Genome. Interestingly, the association of C6orf10 rs16870005 (Ala431Thr) with MS did not depend on its linkage disequilibrium with the HLA-DRB1 locus. Sequencing of the C6orf10 3' region in the MS cohort revealed 14 rare mutations (10 not previously reported). Four variants were null, and significantly more frequent than in the databases. Further, the C6orf10 rare variants were observed in combinations, both intra-locus and with other low-frequency SNPs. The C6orf10 Ser389Xfr variant was found in the homozygous state in a patient with early-onset MS. Taking into account the potential functional impact of the identified exonic variants, their expression in combination at the protein level could provide functional insights into the heterogeneous pathogenetic mechanisms contributing to MS.
Minor allele frequency
Linkage Disequilibrium
Genome-wide Association Study
Exome
Genetic Association
1000 Genomes Project
Citations (15)
Genome-wide Association Study
Imputation (statistics)
1000 Genomes Project
Human genetics
Human genetic variation
Minor allele frequency
Genetic Association
Exome
Citations (16)
Adult body height is a quantitative trait for which genome-wide association studies (GWAS) have identified numerous loci, primarily in European populations. These loci, comprising common variants, explain <10% of the phenotypic variance in height. We searched for novel associations between height and common (minor allele frequency, MAF ≥5%) or infrequent (0.5% < MAF < 5%) variants across the exome in African Americans. Using a reference panel of 1692 African Americans and 471 Europeans from the National Heart, Lung, and Blood Institute's (NHLBI) Exome Sequencing Project (ESP), we imputed whole-exome sequence data into 13 719 African Americans with existing array-based GWAS data (discovery). Variants achieving a height-association threshold of P < 5E−06 in the imputed dataset were followed up in an independent sample of 1989 African Americans with whole-exome sequence data (replication). We used P < 2.5E−07 (=0.05/196 779 variants) to define statistically significant associations in meta-analyses combining the discovery and replication sets (N = 15 708). We discovered and replicated three independent loci for association: 5p13.3/C5orf22/rs17410035 (MAF = 0.10, β = 0.64 cm, P = 8.3E−08), 13q14.2/SPRYD7/rs114089985 (MAF = 0.03, β = 1.46 cm, P = 4.8E−10) and 17q23.3/GH2/rs2006123 (MAF = 0.30, β = 0.47 cm, P = 4.7E−09). Conditional analyses suggested 5p13.3 (C5orf22/rs17410035) and 13q14.2 (SPRYD7/rs114089985) may harbor novel height alleles independent of previous GWAS-identified variants (r² with GWAS loci < 0.01), whereas 17q23.3/GH2/rs2006123 was correlated with GWAS-identified variants in European and African populations. Notably, 13q14.2/rs114089985 is infrequent in African Americans (MAF = 3%), extremely rare in European Americans (MAF = 0.03%), and monomorphic in Asian populations, suggesting it may be an African-American-specific height allele. Our findings demonstrate that whole-exome imputation of sequence variants can identify low-frequency variants and discover novel variants in non-European populations.
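The significance threshold quoted in this abstract (0.05/196 779 variants) is a Bonferroni correction, and the arithmetic can be checked with a short sketch. The helper name `bonferroni_threshold` is hypothetical; the P-values are the three meta-analysis values reported above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# 0.05 divided by the 196 779 tested variants; the abstract rounds this to 2.5E-07.
threshold = bonferroni_threshold(0.05, 196_779)

# Meta-analysis P-values reported for the three replicated loci.
reported_p = {
    "rs17410035": 8.3e-08,
    "rs114089985": 4.8e-10,
    "rs2006123": 4.7e-09,
}

# True for every variant whose P-value falls below the corrected threshold.
passes = {rsid: p < threshold for rsid, p in reported_p.items()}
```

All three reported P-values fall below the corrected threshold, consistent with the abstract describing them as statistically significant.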
Genome-wide Association Study
Imputation (statistics)
Exome
Genetic Association
1000 Genomes Project
Minor allele frequency
Citations (14)
Genetic Association
Genetic architecture
Genome-wide Association Study
Exome
Minor allele frequency
Imputation (statistics)
Missing heritability problem
Citations (1,039)
Imputation (statistics)
Exome
Citations (193)
Exome
Imputation (statistics)
Genome-wide Association Study
Minor allele frequency
Genetic Association
1000 Genomes Project
Linkage Disequilibrium
Citations (140)