A novel mathematical basis for predicting somatic single nucleotide variants from next-generation sequencing

2012 
Acute lymphoblastic leukemia (ALL) is the most common pediatric cancer and the leading cause of cancer-related death among children. Advances in the understanding of the pathobiology of ALL have led to risk-targeted treatment regimens and increased survival rates. However, treatment is far from optimal. The advent of next-generation sequencing (NGS) technologies has enabled genome-wide identification of human disease-related variants and mutations. Using this technology, matched normal-tumor pairs from various cancers have been sequenced to detect pathogenic single nucleotide variants (SNVs). In our ongoing pediatric oncogenomics study, we used SOLiD technology to conduct deep whole-exome re-sequencing of a unique cohort of over 120 exomes from childhood ALL quartets, each consisting of the patient's tumor and matched-normal material as well as DNA from both parents, to uncover the full spectrum of germline and somatic genetic alterations in childhood ALL genomes. The difficulty, however, lies in the complexity of the system. Finding tumor-specific mutations is not straightforward, and appropriate analysis methods are needed to distinguish somatic from germline mutations. Although several algorithms and tools have been designed to detect sequence variants in NGS data, few mathematical approaches can accurately estimate sensitivity and precision in tumor samples. Here, we present a novel unified probabilistic framework that incorporates parental sequencing data to accurately detect leukemia-specific variations, by computing genotype likelihoods using base and mapping qualities, tumor purity estimates, and tumor mutation rates, estimated by numerical optimization methods.
In contrast to existing methods, which rely solely on normal-tumor pairs to detect somatic events, incorporating both matched-normal and parental information into our probabilistic model allows more accurate estimation of genotype probabilities and SNV error rates, and is of particular interest where familial data are available. Our Java-based SNV calling software will be released soon and made freely available.
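To make the idea of a purity-aware genotype likelihood concrete, the following is a minimal Python sketch of a textbook-style pileup model. It is not the authors' algorithm, which is not specified in this abstract. All names (`phred_to_error`, `read_log_likelihood`, `tumor_genotype_log_likelihoods`) are hypothetical, and the modeling choices are assumptions: a biallelic site, the effective per-read quality taken as the minimum of base and mapping quality, and the expected alternate-allele fraction in the tumor sample taken as a purity-weighted mix of the tumor and matched-normal genotypes. Parental data and mutation-rate priors are left out.

```python
import math

# Expected alternate-allele fraction for each diploid genotype
# (R = reference allele, A = alternate allele).
ALT_FRACTION = {"RR": 0.0, "RA": 0.5, "AA": 1.0}


def phred_to_error(q):
    """Convert a Phred-scaled quality into an error probability."""
    # Cap at 0.75: a base call with no information is wrong 3 times out of 4.
    return min(10.0 ** (-q / 10.0), 0.75)


def read_log_likelihood(base, base_q, map_q, ref, alt, alt_fraction):
    """Log-probability of one observed base given the alt-allele fraction."""
    # Assumption: the weaker of base and mapping quality bounds the error.
    e = phred_to_error(min(base_q, map_q))
    p_given_ref = 1.0 - e if base == ref else e / 3.0
    p_given_alt = 1.0 - e if base == alt else e / 3.0
    return math.log(alt_fraction * p_given_alt
                    + (1.0 - alt_fraction) * p_given_ref)


def tumor_genotype_log_likelihoods(reads, ref, alt, normal_genotype, purity):
    """Log-likelihood of each tumor genotype for a list of
    (base, base_quality, mapping_quality) reads from the tumor sample."""
    result = {}
    for genotype, tumor_fraction in ALT_FRACTION.items():
        # Tumor sample = purity * tumor cells + (1 - purity) * normal cells.
        mixed = (purity * tumor_fraction
                 + (1.0 - purity) * ALT_FRACTION[normal_genotype])
        result[genotype] = sum(
            read_log_likelihood(b, bq, mq, ref, alt, mixed)
            for b, bq, mq in reads
        )
    return result


# Illustrative pileup (made-up data): 6 reference and 4 alternate reads.
reads = [("C", 30, 60)] * 6 + [("T", 30, 60)] * 4
log_liks = tumor_genotype_log_likelihoods(
    reads, ref="C", alt="T", normal_genotype="RR", purity=0.8
)
best_genotype = max(log_liks, key=log_liks.get)
```

Under these assumptions, a fuller caller would combine such likelihoods with priors (for example a somatic mutation rate and, where parental reads are available, Mendelian transmission probabilities) to obtain posterior genotype probabilities.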