Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
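The read N50 reported above summarizes the read-length distribution: it is the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of the computation follows; the function name `read_n50` and the toy read lengths are illustrative assumptions, not part of the consortium's pipeline.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths (in bases).

    The N50 is the largest length L such that reads of length >= L
    together contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases until
    # the running total covers half of the data set.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: 20 bases in total; the reads of length 6 and 5
# together hold 11 bases, which is more than half.
print(read_n50([2, 3, 4, 5, 6]))  # prints 5
```

Because the statistic weights reads by the bases they contribute, a few very long reads raise the N50 far more than they raise the mean read length.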
Abstract Disparities in the data underlying clinical genomic interpretation are an acknowledged problem, but there is a paucity of data demonstrating them. The All of Us Research Program is collecting data, including whole-genome sequences, health records, and surveys, for at least a million participants with diverse ancestry and access to healthcare, representing one of the largest biomedical research repositories of its kind. Here, we examine pathogenic and likely pathogenic variants that were identified in the All of Us cohort. The European ancestry subgroup showed the highest overall rate of pathogenic variation, with 2.26% of participants having a pathogenic variant. Other ancestry groups had lower rates of pathogenic variation, including 1.62% for the African ancestry group and 1.32% for the Latino/Admixed American ancestry group. Pathogenic variants were most frequently observed in genes related to Breast/Ovarian Cancer or Hypercholesterolemia. Variant frequencies in many genes were consistent with the data from the public gnomAD database, with some notable exceptions resolved using gnomAD subsets. Differences in pathogenic variant frequency observed between ancestral groups generally indicate ascertainment biases in knowledge about those variants, but some deviations may be indicative of differences in disease prevalence. This work will allow precision medicine efforts to target the disparities it reveals.
Abstract Nonsyndromic oculocutaneous albinism (nsOCA) is clinically characterized by the loss of pigmentation in the skin, hair, and iris. OCA is amongst the most common causes of vision impairment in children. To date, pathogenic variants in six genes have been identified in individuals with nsOCA. Here, we determined the identities, frequencies, and clinical consequences of OCA alleles in 94 previously unreported Pakistani families. A combination of Sanger and exome sequencing revealed 38 alleles, including 22 novel variants, segregating with the nsOCA phenotype in 80 families. Variants of the TYR and OCA2 genes were the most common cause of nsOCA, occurring in 43 and 30 families, respectively. The 22 novel variants include nine missense, four splice-site, two nonsense, one insertion, and six gross deletions. In vitro studies revealed retention of OCA proteins harboring novel missense alleles in the endoplasmic reticulum (ER) of transfected cells. Exon-trapping assays with constructs containing splice-site alleles revealed errors in splicing. As eight alleles account for approximately 56% (95% CI: 46.52–65.24%) of nsOCA cases, primarily enrolled from the Punjab province of Pakistan, hierarchical strategies for variant detection would offer feasible and cost-efficient genetic testing for OCA in families of similar origin. Thus, we developed Tetra-primer ARMS assays for rapid, reliable, reproducible, and economical screening of most of these common alleles.
Abstract Background: Colorectal cancer (CRC) is a complex disease with monogenic, polygenic, and environmental risk factors. Polygenic risk scores (PRS) are being developed to identify individuals at high polygenic risk. Due to differences in genetic background, PRS distributions vary by ancestry, necessitating calibration. Methods: We compared four calibration methods using the All of Us Research Program Whole Genome Sequence data for a CRC PRS previously developed in participants of European and East Asian ancestry. The methods contrasted results from linear models fit with (A) either the entire data set or an ancestrally diverse training set and (B) covariates including either principal components of ancestry or admixture. Calibration with the training set adjusted the variance in addition to the mean. Results: All methods performed similarly within ancestry, with OR (95% C.I.) per s.d. change in PRS: African 1.5 (1.02, 2.08), Admixed American 2.2 (1.27, 3.85), European 1.6 (1.43, 1.89), and Middle Eastern 1.1 (0.71, 1.63). Using admixture and an ancestrally diverse training set provided distributions closest to standard Normal with accurate upper tail frequencies. Conclusion: Although the PRS is predictive of CRC risk for most ancestries, its performance varies by ancestry. Post-hoc calibration preserves the risk prediction within ancestries. Training a calibration model on ancestrally diverse participants to adjust both the mean and variance of the PRS, using admixture as covariates, created standard Normal z-scores. These z-scores can be used to identify patients at high polygenic risk and can be incorporated into comprehensive risk scores that include other known risk factors, allowing for more precise risk estimates.
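The mean-and-variance calibration described in this abstract can be illustrated with a short sketch. The abstract does not give the exact model form, so everything below is an assumption made for illustration: the function names `fit_calibration` and `calibrate`, the use of ordinary least squares for both the mean and the squared residuals, and the simulated single admixture proportion. It is not the authors' implementation.

```python
import numpy as np


def fit_calibration(prs, covariates):
    """Fit linear models for the PRS mean and variance (illustrative).

    Assumed form: the mean is linear in the covariates, and the
    variance is modeled by regressing squared residuals on the same
    covariates.
    """
    X = np.column_stack([np.ones(len(prs)), covariates])
    beta_mean, *_ = np.linalg.lstsq(X, prs, rcond=None)
    resid_sq = (prs - X @ beta_mean) ** 2
    beta_var, *_ = np.linalg.lstsq(X, resid_sq, rcond=None)
    return beta_mean, beta_var


def calibrate(prs, covariates, beta_mean, beta_var, min_var=1e-8):
    """Convert raw PRS values to z-scores using the fitted models."""
    X = np.column_stack([np.ones(len(prs)), covariates])
    mean = X @ beta_mean
    # Guard against non-positive predicted variances.
    var = np.clip(X @ beta_var, min_var, None)
    return (prs - mean) / np.sqrt(var)


# Simulated data (hypothetical): both the mean and the variance of
# the raw PRS shift with a single admixture proportion.
rng = np.random.default_rng(0)
n = 20000
admix = rng.uniform(0.0, 1.0, n)
prs = 0.5 + 2.0 * admix + np.sqrt(1.0 + 3.0 * admix) * rng.standard_normal(n)

# Fit on a training half, then calibrate the held-out half.
half = n // 2
beta_mean, beta_var = fit_calibration(prs[:half], admix[:half])
z = calibrate(prs[half:], admix[half:], beta_mean, beta_var)
```

In this simulation the held-out z-scores have mean close to 0 and standard deviation close to 1 across the admixture range, which is the property that makes a single upper-tail threshold comparable between ancestries. Adjusting only the mean would leave the spread dependent on admixture, so tail frequencies would still differ by ancestry.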
Abstract Pharmacogenomics promises improved outcomes through individualized prescribing. However, the lack of diversity in studies impedes clinical translation and equitable application of precision medicine. We evaluated the frequencies of pharmacogenomic (PGx) variants, predicted phenotypes, and medication exposures using whole-genome sequencing and electronic health record (EHR) data from nearly 100,000 diverse All of Us Research Program participants. We report that 100% of participants carried at least one pharmacogenomic variant and nearly all (99.13%) had a predicted phenotype with prescribing recommendations. Clinical impact was high, with over 20% having both an actionable phenotype and a prior exposure to an impacted medication with pharmacogenomic prescribing guidance. Importantly, we also report hundreds of alleles and predicted phenotypes that deviate from known frequencies and/or were previously unreported, including within admixed American and African ancestry groups.