Grammatical Ziv-Lempel Compression: Achieving PPM-Class Text Compression Ratios with LZ-Class Decompression Speed

Kennon J. Conrad,Paul R. Wilson

Grammatical Ziv-Lempel Compression: Achieving PPM-Class Text Compression Ratios with LZ-Class Decompression Speed

2016

GLZA is a free, open-source, enhanced grammar-based compressor that constructs a low entropy grammar amenable to entropy coding, using a greedy hill-climbing search guided by estimates of encoded string lengths; the estimates are efficiently computed incrementally during (parallelized) suffix tree construction in a batched iterative repeat replacement cycle. The grammar-coded symbol stream is further compressed by order-1 Markov modeling of trailing/leading subsymbols and selective recency modeling, MTF-coding only symbols that tend to recur soon. This combination results in excellent compression ratios—similar to PPMC's for small files, averaging within about five percent of PPMd's for large text files (1 MB – 10 MB)—with fast decompression on one core or two. Compression time and memory use are not dramatically higher than for similarly high-performance asymmetrical compressors of other kinds. GLZA is on the Pareto frontier for text compression ratio and decompression speed on a variety of benchmarks (LTCB, Calgary, Canterbury, Large Canterbury, Silesia, Maximum Compression, World Compression Challenge), compressing better and/or decompressing faster than its competitors (PPM, LZ77-Markov, BWT, etc.), with better compression ratios than previous grammar-based compressors such as RePair, Sequitur, Offline 3 (Greedy), Sequential/grzip, and IRR-S.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations