Modeling word occurrences for the compression of concordances

1995 
Summary form only given. Effective compression of a text-based information retrieval system involves compressing not only the text itself, but also the concordance by which one accesses that text, which occupies an amount of storage comparable to the text itself. The concordance can be a rather complicated data structure, especially if it permits hierarchical access to the database, but one or more components of the hierarchy can usually be conceptualized as a bit-map. We conceptualize our bit-map as being generated as follows. At any bit-map site we are in one of two states: a cluster state (C) or a between-cluster state (B). In a given state, we generate a bit-map value of zero or one and, governed by the transition probabilities of the model, enter a new state as we move to the next bit-map site. Such a model has been referred to in the literature as a hidden Markov model. Unfortunately, this model is analytically difficult to use. To approximate it, we introduce several traditional Markov models with four states each: B and C as above, and two transitional states. We present the models, show how they are connected, and state the formal compression algorithm based on these models. We also include some experimental results.
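To make the generative model concrete, the following is a minimal sketch of a two-state hidden Markov model that produces a bit-map of word occurrences. Since the paper is available in summary form only, all emission and transition probabilities below are illustrative assumptions, not values or an algorithm taken from the paper.

```python
import random

# States: 'C' (cluster) and 'B' (between-cluster).
# In state C a 1 bit (word occurrence) is likely; in state B a 0 bit is likely.
# All probability values are assumed for illustration only.
EMIT_ONE_PROB = {'C': 0.9, 'B': 0.05}      # P(bit = 1 | state)
TRANSITION = {                              # P(next state | current state)
    'C': {'C': 0.8, 'B': 0.2},
    'B': {'C': 0.1, 'B': 0.9},
}

def generate_bitmap(length, start_state='B', seed=None):
    """Generate a bit-map of the given length from the two-state model."""
    rng = random.Random(seed)
    state = start_state
    bits = []
    for _ in range(length):
        # Emit a bit according to the current state's emission probability.
        bits.append(1 if rng.random() < EMIT_ONE_PROB[state] else 0)
        # Move to the next site, choosing the new state from the
        # current state's transition probabilities.
        state = 'C' if rng.random() < TRANSITION[state]['C'] else 'B'
    return bits

if __name__ == '__main__':
    bitmap = generate_bitmap(60, seed=1)
    # Output shows clustered runs of 1s separated by longer runs of 0s,
    # the kind of structure the paper's compression models exploit.
    print(''.join(map(str, bitmap)))
```

The hidden-state structure is what makes the model hard to use analytically: the observed bits alone do not reveal whether a site was generated in state C or B, which motivates the approximation by four-state Markov models described in the abstract.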