Sub-linear Sequence Search via a Repeated And Merged Bloom Filter (RAMBO).
2019
Whole-genome shotgun sequencing (WGS), especially that of microbial genomes, has been the core of recent research advances in large-scale comparative genomics. The data deluge has resulted in exponential growth in genomic datasets over the past years and has shown no sign of slowing down. Several recent attempts have been made to tame the computational burden of read classification and sequence search on these ultra large-scale datasets, including both raw reads and assembled genomes. A notable recent method is BItsliced Genomic Signature Index (BIGSI). It is a data-structure of array of Bloom Filters and offers very efficient query sequence search times. However, querying with BIGSI still requires probing Bloom Filters (or sets of bitslices) which scales linearly with the number of datasets. As a result, querying complexity of BIGSI scales linearly with the number of files in a dataset. In this paper, we propose a sequence search method based on Repeated and Merged BloOm Filter (RAMBO). Here the number of Bloom filter probes is significantly less than BIGSI due to sub-linear scaling for the same false-positive rate. Our idea is theoretically sound and inspired by the count-min sketch data structure, a popular streaming algorithm. RAMBO provides a significant improvement over BIGSI in terms of query time when evaluated on real genome datasets. Furthermore the insertion and query process supports parallelism. Due to the sub-linear scaling of query time, the larger the size and number of datasets, the bigger the gains are with RAMBO over BIGSI.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
24
References
0
Citations
NaN
KQI