SmartRNASeqCaller: improving germline variant calling from RNAseq

2019 
Background: Transcriptomics data, often referred to as RNA-Seq, are increasingly being adopted in clinical practice because the same data can answer several questions (e.g., gene expression, splicing, and allele-specific expression), even without matching DNA. Indeed, recent studies have shown how RNA-Seq can help decipher the impact of germline variants. These efforts have dramatically improved the diagnostic yield in specific rare disease patient cohorts. Nevertheless, RNA-Seq is not routinely adopted for germline variant calling in the clinic. This is mostly due to a combination of technical noise and biological processes that affect the reliability of results and are difficult to reduce using standard filtering strategies.

Results: To provide reliable germline variant calling from RNA-Seq for clinical use, such as the diagnosis of Mendelian diseases, we developed SmartRNASeqCaller, a machine learning system designed to reduce the burden of false positive calls from RNA-Seq. Thanks to the availability of large amounts of high-quality data, we could comprehensively train SmartRNASeqCaller using a suitable feature set to characterize each potential variant. The model integrates information from multiple sources, capturing variant-specific characteristics, contextual information, and external sources of annotation. We tested our tool against state-of-the-art workflows on a set of 376 independent validation samples from the GIAB, Neuromics, and GTEx consortia. SmartRNASeqCaller markedly increases the precision of RNA-Seq germline variant calls, reducing the false positive burden by 50% without a strong impact on sensitivity. This translates to an average precision increase of 20.9%, with a consistent effect on samples of different origins and characteristics.

Conclusions: SmartRNASeqCaller shows that a general strategy adopted in different areas of applied machine learning can be exploited to improve variant calling. Switching from a naive hard-filtering scheme to a more powerful, data-driven solution enabled a qualitative and quantitative improvement in precision/recall performance. This is key for the intended use of SmartRNASeqCaller within clinical settings to identify disease-causing variants.
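As a rough illustration of how halving the false positive burden relates to precision and recall, the following minimal sketch computes both metrics before and after filtering. The counts are hypothetical and chosen only for readability; they are not taken from the study, and the sketch assumes (for simplicity) that no true variants are lost by the filter.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of called variants that are true variants."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of true variants that were called (sensitivity)."""
    return tp / (tp + fn)


# Hypothetical call counts for one sample (illustrative only, not from the paper).
tp, fp, fn = 8000, 2000, 1000

precision_before = precision(tp, fp)
recall_before = recall(tp, fn)

# Assume the filter removes half of the false positives and keeps every
# true positive; a real filter would typically lose a few true calls too.
fp_after = fp // 2
precision_after = precision(tp, fp_after)
recall_after = recall(tp, fn)

print(f"precision: {precision_before:.3f} -> {precision_after:.3f}")
print(f"recall:    {recall_before:.3f} -> {recall_after:.3f}")
```

With these made-up counts, precision rises while recall is unchanged. The size of the precision gain depends on the initial ratio of true to false positives, so it will differ from sample to sample.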