Constructing Suffix Array of Next-Generation Sequencing upon In-Memory Lookup Cloud and MapReduce

2019 
TeraSort [7] is a standard MapReduce sort that serves as a benchmark for measuring the time needed to sort terabytes of randomly distributed data. TeraSuffix [5] adopts TeraSort to construct suffix arrays for NGS (Next-Generation Sequencing) data. When TeraSuffix is used for this purpose, the intermediate data of the MapReduce framework contains multiple copies of the suffixes, so the shuffle of intermediate data between the map and reduce phases becomes a bottleneck. Because a suffix can be represented by its index into the NGS reads data, the suffix itself need not be recorded in the intermediate data, which reduces the shuffle time. Disk-based Indexed TeraSuffix [6] adopts this index structure to represent a suffix and stores the NGS reads data on disk. However, when constructing the suffix array, reduce tasks still need many random disk accesses to retrieve suffixes for further processing, and these massive disk I/O operations become a bottleneck. To increase the efficiency of Disk-based Indexed TeraSuffix, this paper proposes an in-memory lookup cloud (MLC) that stores the NGS reads data in the memory of remote servers in a cloud. When a map or reduce task needs a suffix, it retrieves the suffix from the memory of the MLC over the network. Experimental tests show that retrieving suffixes over the network outperforms retrieving them from disk. Experiments were also performed on Amazon Elastic MapReduce with the 20Gbp-Grouper sequence (about 20 GB). They show that the proposed architecture reduces the pre-processing time of data replication and the processing time of reduce tasks by 58% and 8%, respectively. It improves both the time and the space efficiency of TeraSort for constructing suffix arrays.
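The index representation described above can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the dictionary `reads` is a hypothetical stand-in for the in-memory lookup cloud, the names `map_suffix_indexes`, `lookup_suffix`, and `build_suffix_array` are invented for illustration, and breaking ties between equal suffixes by index is an assumption. The point is only that the sort step can carry small `(read_id, offset)` pairs and fetch suffix text on demand.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the in-memory lookup cloud: read id -> read string.
Reads = Dict[int, str]
# A suffix is represented by its index: (read id, start offset within the read).
SuffixIndex = Tuple[int, int]


def map_suffix_indexes(reads: Reads) -> List[SuffixIndex]:
    """Map step (sketch): emit one small index pair per suffix, not the suffix text."""
    return [(rid, off) for rid, seq in reads.items() for off in range(len(seq))]


def lookup_suffix(reads: Reads, idx: SuffixIndex) -> str:
    """Retrieve the suffix text for an index (a network lookup in the real setting)."""
    rid, off = idx
    return reads[rid][off:]


def build_suffix_array(reads: Reads) -> List[SuffixIndex]:
    """Reduce step (sketch): sort indexes by the suffix they point to.

    Ties between identical suffixes are broken by the index itself,
    which is an assumption made here for determinism.
    """
    indexes = map_suffix_indexes(reads)
    return sorted(indexes, key=lambda idx: (lookup_suffix(reads, idx), idx))


reads = {0: "ACG", 1: "CG"}
suffix_array = build_suffix_array(reads)
for idx in suffix_array:
    print(idx, lookup_suffix(reads, idx))
```

In this toy version every comparison key is materialized in memory; in a distributed setting the lookups would be batched or cached, since each one stands for a round trip to a remote server.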