Characterizing Performance Variation of Genomic Data Analysis Workflows on the Public Cloud

2020 
Public Infrastructure-as-a-Service (IaaS) clouds abstract the physical hardware implementation of resources provided to users. Users are not informed about the exact physical location of their virtual machines (VMs), the specific hardware used, the number of co-resident VMs sharing the same host, or the workloads that those co-resident VMs are running. Detecting when VMs underperform can help identify resource contention from co-resident VMs and prompt replacement of the underperforming VMs. In addition, resource utilization metrics may help classify the performance of runs for use in VM performance model datasets that sample the distribution of performance outcomes. VM performance models are key to optimizing the cost of bioinformatics analyses in the public cloud. In this paper, we investigate the performance variation of big data genomics workflows running in the public cloud. We examine causes of performance variation including VM provisioning, CPU heterogeneity, and resource contention. We leverage Amazon Elastic Compute Cloud (EC2) placement groups, a feature designed to influence VM placement on Amazon EC2, to examine how VM placement impacts performance variation. As a use case, we investigate the performance of a multi-stage bioinformatics RNA sequencing (RNA-seq) analytical workflow consisting of four distinct phases, which executes in ~90 minutes on average on 8-core public cloud VMs. In addition, we investigate whether Linux resource utilization metrics collected by profiling workflow runs can help identify performance variations.