FastDRC: Fast and Scalable Genome Compression Based on Distributed and Parallel Processing

2020 
With the advent of next-generation sequencing technology, sequencing costs have fallen sharply compared with earlier sequencing technologies, and genomic data has become a significant big data application. As genomic data grows, its storage and migration face enormous challenges. Researchers have therefore proposed a variety of genome compression algorithms, but these algorithms cannot meet the requirements of processing large volumes of biological data at high speed. This manuscript proposes a parallel and distributed referential genome compression algorithm, Fast Distributed Referential Compression (FastDRC). The algorithm compresses a large number of genomic sequences in parallel under the Apache Hadoop distributed computing framework. Experiments show that FastDRC greatly improves compression efficiency when compressing large quantities of genomic data. Moreover, to our knowledge, FastDRC is the only distributed computing method in the field of genome compression. The source code for FastDRC can be obtained from https://github.com/GhostCCCatHenry/FastDRC.
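To make the idea of referential compression concrete, here is a minimal sketch in Python. It is not the FastDRC implementation: the function names, the seed length K, and the match/literal encoding are all hypothetical, and a local thread pool stands in for the Hadoop cluster purely to show that each target sequence can be encoded independently against one shared reference.

```python
from concurrent.futures import ThreadPoolExecutor

K = 4  # hypothetical seed length, kept tiny for illustration


def build_index(reference, k=K):
    """Map each k-mer of the reference to the first position where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], i)
    return index


def compress(target, reference, index, k=K):
    """Encode target as ('M', ref_pos, length) matches and ('L', text) literals."""
    ops, literal, i = [], [], 0
    while i < len(target):
        pos = index.get(target[i:i + k])
        if pos is None:
            # No seed match in the reference: keep this base as a literal.
            literal.append(target[i])
            i += 1
            continue
        # Extend the seed match as far as target and reference agree.
        length = k
        while (i + length < len(target) and pos + length < len(reference)
               and target[i + length] == reference[pos + length]):
            length += 1
        if literal:
            ops.append(("L", "".join(literal)))
            literal = []
        ops.append(("M", pos, length))
        i += length
    if literal:
        ops.append(("L", "".join(literal)))
    return ops


def decompress(ops, reference):
    """Rebuild the target from match and literal operations."""
    parts = []
    for op in ops:
        if op[0] == "M":
            _, pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op[1])
    return "".join(parts)


def compress_many(targets, reference, workers=4):
    """Compress independent targets concurrently against one shared reference."""
    index = build_index(reference)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: compress(t, reference, index), targets))
```

Because every target depends only on the shared reference and its index, the per-target work has no cross-dependencies, which is the property a distributed framework can exploit. The greedy first-occurrence matching here is chosen for brevity and is not tuned for compression ratio.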