Improve Spark-based Application Performance Using Minimizer

2020 
SpaRC (Spark Reads Clustering) is a generic sequence clustering algorithm built on Spark that provides a scalable solution for clustering billions of reads. SpaRC measures the similarity between reads using k-mers. This approach works well when the amount of data is small, but as the data grows, its long running time and high memory consumption become increasingly prominent. Here we explore a sequence similarity measurement method that alleviates these problems by using minimizers instead of k-mers. The method applies the minimizer strategy and extracts the overlap rate between reads to measure the sequence similarity between different reads. Results indicate that the method greatly improves clustering performance and, compared with the traditional k-mer method, makes more effective use of memory resources in SpaRC.
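The idea of replacing k-mers with minimizers can be illustrated with a minimal Python sketch. It is not SpaRC's implementation: the function names, the lexicographic ordering of k-mers, the values of `k` and `w`, and the use of a Jaccard ratio as a stand-in for the overlap rate are all assumptions made for illustration, since the abstract does not define them.

```python
def kmers(seq, k):
    """Return every k-mer of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Return the set of (w, k)-minimizers of seq.

    Assumption: the minimizer of a window of w consecutive k-mers is the
    lexicographically smallest k-mer in that window.
    """
    ks = kmers(seq, k)
    if not ks:
        return set()
    if len(ks) <= w:
        # Read shorter than one full window: keep its single smallest k-mer.
        return {min(ks)}
    return {min(ks[i:i + w]) for i in range(len(ks) - w + 1)}


def jaccard(a, b):
    """Shared fraction of two sets (hypothetical stand-in for the overlap rate)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two made-up reads that overlap in all but one base at each end.
read_a = "ACGTACGTGACCTGA"
read_b = "CGTACGTGACCTGAT"
k, w = 4, 3

kmers_a, kmers_b = set(kmers(read_a, k)), set(kmers(read_b, k))
mins_a, mins_b = minimizers(read_a, k, w), minimizers(read_b, k, w)

print("distinct k-mers per read:", len(kmers_a), len(kmers_b))
print("minimizers per read:", len(mins_a), len(mins_b))
print("similarity from minimizers:", jaccard(mins_a, mins_b))
```

Because each window contributes only one k-mer, a read is represented by a subset of its k-mers, which is where the memory saving would come from when the sketches of billions of reads are held and shuffled in Spark.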