Accelerating large-scale genomic analysis with Spark

2016 
High-throughput next-generation sequencing technologies are producing a flood of inexpensive genomic data, giving precision medicine the opportunity to better understand the primary causes of complex diseases such as cancer. However, even current state-of-the-art analysis approaches lag far behind the pace of data generation due to limited scalability, accuracy, and computational efficiency. To explore how to synthesize genomic data into knowledge efficiently and effectively, we propose GATK-Spark, a balanced parallelization approach that implements an in-memory version of GATK using Apache Spark. First, we performed a rigorous analysis of current GATK optimization strategies and identified three major scalability bottlenecks: poor compute resource utilization, text-based data formats, and time-consuming single-threaded file splitting and merging operations. Second, we share our experience designing GATK-Spark, a new approach that optimizes GATK with the big-data computing framework Apache Spark, reducing the original execution time of 20 hours to 30 minutes, a speedup in excess of 37 on 256 CPU cores. This work will facilitate understanding of the genomics analytics pipeline and the design of strategies for accelerating large-scale genomic analysis applications.
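The abstract describes the approach only at a high level. The sketch below illustrates one way a Spark pipeline of this kind could partition aligned reads by genomic region and process each region in memory, sidestepping the single-threaded file splitting and merging the paper identifies as a bottleneck. The record type, input path, and per-region processing stage are illustrative placeholders, not the paper's actual implementation.

```scala
import org.apache.spark.sql.SparkSession

object GatkSparkSketch {
  // Simplified stand-in for an aligned read; real pipelines work on
  // BAM/SAM records rather than this toy representation.
  case class Read(chrom: String, pos: Long, seq: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gatk-spark-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Load reads as an RDD. Per the paper's bottleneck analysis, a
    // production pipeline would prefer a splittable binary format over
    // text; a tab-separated file is used here only for illustration.
    val reads = sc.textFile("hdfs:///data/reads.txt").map { line =>
      val Array(chrom, pos, seq) = line.split("\t")
      Read(chrom, pos.toLong, seq)
    }

    // Group reads by chromosome so each region can be processed
    // independently and in parallel, with intermediates kept in memory
    // instead of being cut into files and merged back single-threaded.
    val resultsPerChrom = reads
      .groupBy(_.chrom)
      .mapValues { regionReads =>
        // Placeholder for a per-region analysis stage (e.g. variant
        // calling); here we simply count the reads in the region.
        regionReads.size
      }

    resultsPerChrom.collect().foreach { case (chrom, n) =>
      println(s"$chrom: $n reads processed")
    }
    spark.stop()
  }
}
```

Grouping by chromosome is a natural partitioning choice because regions can be analyzed independently, so Spark can schedule them across all available cores without any cross-region synchronization.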