Fast nGram-based string search over data encoded using algebraic signatures
2007
We propose a novel string search algorithm for data stored once and read many times. Our search method combines the sublinear traversal of the record (as in Boyer Moore or Knuth-Morris-Pratt) with the agglomeration of parts of the record and search pattern into a single character -- the algebraic signature -- in the manner of Karp-Rabin. Our experiments show that our algorithm is up to seventy times faster for DNA data, up to eleven times faster for ASCII, and up to a six times faster for XML documents compared with an implementation of Boyer-Moore. To obtain this speed-up, we store records in encoded form, where each original character is replaced with an algebraic signature.
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
15
References
0
Citations
NaN
KQI