Probability estimation and compression involving large alphabets

2006 
Author(s): Santhanam, Narayana | Abstract: Many results in statistics and information theory are asymptotic in nature, with the implicit assumption that we operate in a regime where the data size is much larger than the alphabet size. In this dissertation, we will be concerned with large alphabets, namely alphabets for which the above assumption does not hold. We consider universal compression, i.e., compression when data statistics are unknown, and probability estimation involving data drawn from large alphabets. Both these problems are tackled using a notion of the structure of the string, the string's pattern. For example the pattern of "abracadabra" is 12314151231. It has long been known that universal compression of even independent identically distributed strings incurs unbounded extra number of bits over the entropy of the source as the alphabet size grows. For such applications, we describe new approaches that isolate the pattern of the string from the dictionary. These approaches are analyzed using results in analysis as well as those on integer partitions studied by Hardy and Ramanujan. We then consider a related problem of estimating a distribution over a large alphabet using samples drawn from it. The problem considered was posed by a prolific statistician, I.J. Good and Alan Turing. The large alphabet size renders practical sample sizes too small for conventional approaches, hence Good and Turing developed new estimators (without proof) for this problem. The Good-Turing estimator is empirically known to work well. We use the framework developed for the universal compression problem above and provide an explanation of why the estimator developed by Good and Turing works well and propose other provably optimal variants. We show that for a large class of processes, the entropy rate of patterns equals the process entropy rate. We also state an asymptotic equipartition property for patterns
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    0
    References
    16
    Citations
    NaN
    KQI
    []