Systematic analysis of dark and camouflaged genes: disease-relevant genes hiding in plain sight

2019 
Background: The human genome contains 'dark' gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions that are 'dark by depth' (few mappable reads) and others that are 'camouflaged' (ambiguous alignment), and we assess how well long-read technologies resolve these regions. We further present an algorithm to resolve most camouflaged regions (including in short-read data) and apply it to the Alzheimer's Disease Sequencing Project (ADSP; 13142 samples) as a proof of principle.

Results: Based on standard whole-genome Illumina sequencing data, we identified 37873 dark regions in 5857 gene bodies (3635 protein-coding) from pathways important to human health, development, and reproduction. Of the 5857 gene bodies, 494 (8.4%) were 100% dark (142 protein-coding) and 2046 (34.9%) were ≥5% dark (628 protein-coding). Exactly 2757 dark regions were in protein-coding exons (CDS) across 744 genes. Long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduced dark CDS regions to approximately 45.1%, 33.3%, and 18.2%, respectively. Applying our algorithm to the ADSP, we rescued 4622 exonic variants from 501 camouflaged genes, including a rare, ten-nucleotide frameshift deletion in CR1, a top Alzheimer's disease gene.

Conclusions: While we could not formally assess the CR1 frameshift mutation in Alzheimer's disease (insufficient sample size), we believe it merits investigation in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
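The distinction between 'dark by depth' and 'camouflaged' positions can be illustrated with a minimal sketch. It assumes that ambiguous alignment shows up as low mapping quality (MAPQ) on most reads covering a position. The thresholds `MIN_DEPTH`, `LOW_MAPQ`, and `LOW_MAPQ_FRACTION`, and the function names, are hypothetical values chosen for illustration; the abstract does not state them, and this is not the authors' algorithm.

```python
# Hypothetical thresholds for illustration only; not taken from the abstract.
MIN_DEPTH = 5             # at or below this many reads, call the position dark by depth
LOW_MAPQ = 10             # reads below this MAPQ are treated as ambiguously aligned
LOW_MAPQ_FRACTION = 0.9   # fraction of low-MAPQ reads needed to call a position camouflaged


def classify_position(mapqs):
    """Classify one position from the MAPQ values of the reads covering it."""
    depth = len(mapqs)
    if depth <= MIN_DEPTH:
        return "dark_by_depth"
    low = sum(1 for q in mapqs if q < LOW_MAPQ)
    if low / depth >= LOW_MAPQ_FRACTION:
        return "camouflaged"
    return "resolved"


def percent_dark(labels):
    """Percentage of positions in a region that are not resolved."""
    if not labels:
        return 0.0
    dark = sum(1 for label in labels if label != "resolved")
    return 100.0 * dark / len(labels)


if __name__ == "__main__":
    # Toy gene body: two well-covered positions, one sparsely covered, one ambiguous.
    positions = [[60] * 30, [60] * 25, [60] * 3, [0] * 30]
    labels = [classify_position(p) for p in positions]
    print(labels, percent_dark(labels))
    # The gene-body percentages reported in the abstract follow from simple ratios.
    print(round(100 * 494 / 5857, 1), round(100 * 2046 / 5857, 1))
```

A gene body would then count as "≥5% dark" when `percent_dark` over its positions reaches 5, which is how the per-gene summaries in the Results can be read.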