Tree Contraction for Compressed Suffix Arrays on Modern Processors

2015 
We propose a novel processor-aware compaction technique for pattern matching that is widely-used in databases, information retrieval, and text mining. As the amount of data increases, it is getting important to efficiently store data on memory. A compressed suffix array (CSA) is a compact data structure for in-memory pattern matching. However, CSA suffers from tremendous processor penalties, such as a flood of instructions and cache/TLB misses due to the lack of processor-aware design. To mitigate these penalties, we propose a novel compaction technique for CSA, called suffix trie contraction (STC). The frequently accessed suffixes of CSA are transformed to a trie (e.g., a suffix trie), and then inter-connected nodes in the trie are repeatedly ’\(contracted\)’ to a single node, which enables lightweight sequential scans in a processor-friendly way. In detail, STC consists of two contraction techniques: fixed-length path contraction (FPC) and sub-tree contraction (SC). FPC is applied to the parts with a few branches in the trie, and SC is applied to the parts with many branches. Our experiment results indicate that FPC outperforms naive CSA by two orders of magnitude for short pattern queries and by three times for long pattern queries. As the number of branches inside the trie increases, SC gradually becomes superior to CSA and FPC for short pattern queries. Finally, the latency and throughput of STC are 7 times and 72 times better than those of CSA for the TREC test data set at the expense of additional 7.1 % space overhead.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    19
    References
    0
    Citations
    NaN
    KQI
    []