Optimization of data-intensive next generation sequencing in high performance computing

2015 
Advances in Next Generation Sequencing (NGS) technology are associated with an ever-increasing volume of genomic data every year. These genomic data are efficiently processed by empirical parallelism using High Performance Computing (HPC). The processed data can be used for genome-wide association studies, genetics, personalized medicine and many other areas. Different kinds of algorithms and implementations are used in the different phases of genome processing. In this paper, we used BWAKIT- and GATK-based software to process larger volumes of genomic data, which we refer to as the "NGS workflow at SIDRA". In this workflow we used BWAKIT for genome alignment and GATK for variant discovery, which required large computation and a large amount of memory, respectively. We observed that CPU utilization does not exceed 45% during variant discovery; hence, it is necessary to understand the optimal selection of resources (in terms of number of threads or cores) during NGS workflow automation. We analyzed the performance bottleneck and application optimization in terms of "scalability" (using the maximum available CPUs and memory) and "multiple instances of the NGS workflow with different genome data within a node" (processing a larger volume of genome data concurrently with a limited set of CPUs and memory). We observed 40%, 65%, 71% and 76% improvements in performance while processing 2, 4, 8 and 16 samples concurrently, respectively, using our own scheduling heuristics. As a result, our proposed NGS workflow automation improves performance by up to 76% compared to workflows based on application scalability.
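The contrast between "scalability" and "multiple instances within a node" can be illustrated with a minimal Python sketch. This is not the authors' scheduling heuristic, which the abstract does not describe; it only assumes an even split of a node's cores across concurrently running samples. The names `plan_instances`, `run_batch` and `run_one` are hypothetical, and `run_one` stands in for whatever launches one workflow instance with a given thread count.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_instances(total_cores, n_samples, min_threads=1):
    """Return (concurrent_instances, threads_per_instance) for one node.

    Assumption: cores are split evenly; this is an illustrative policy,
    not the scheduling heuristic used in the paper.
    """
    if total_cores < 1 or n_samples < 1 or min_threads < 1:
        raise ValueError("all arguments must be positive")
    # Never start more instances than there are samples or core slots.
    concurrent = max(1, min(n_samples, total_cores // min_threads))
    threads_each = max(1, total_cores // concurrent)
    return concurrent, threads_each


def run_batch(samples, total_cores, run_one):
    """Run one workflow instance per sample, several at a time.

    run_one(sample, threads) is a hypothetical callable that would start
    the alignment and variant-discovery steps for a single sample.
    """
    concurrent, threads_each = plan_instances(total_cores, len(samples))
    with ThreadPoolExecutor(max_workers=concurrent) as pool:
        return list(pool.map(lambda s: run_one(s, threads_each), samples))
```

With a hypothetical 32-core node, `plan_instances(32, 4)` gives four concurrent instances of eight threads each, whereas the pure scalability approach corresponds to one instance using all 32 threads.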