Quotient filter

A quotient filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set (an approximate member query filter, AMQ). A query will elicit a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; i.e., the test does not generate false negatives. But with the latter result there is some probability, ε, of the test returning 'element is in the set' when in fact the element is not present in the set (i.e., a false positive). There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include 'insert' and 'optionally delete'. The more elements are added to the set, the larger the probability of false positives. A quotient filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set (an approximate member query filter, AMQ). A query will elicit a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; i.e., the test does not generate false negatives. But with the latter result there is some probability, ε, of the test returning 'element is in the set' when in fact the element is not present in the set (i.e., a false positive). There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include 'insert' and 'optionally delete'. The more elements are added to the set, the larger the probability of false positives. A typical application for quotient filters, and other AMQ filters, is to serve as a proxy for the keys in a database on disk. As keys are added to or removed from the database, the filter is updated to reflect this. Any lookup will first consult the fast quotient filter, then look in the (presumably much slower) database only if the quotient filter reported the presence of the key. If the filter returns absence, the key is known not to be in the database without any disk accesses having been performed. A quotient filter has the usual AMQ operations of insert and query. In addition it can also be merged and re-sized without having to re-hash the original keys (thereby avoiding the need to access those keys from secondary storage). This property benefits certain kinds of log-structured merge-trees. The compact hash table underlying a quotient filter was described by Cleary in 1984. First known reference to using the structure as an AMQ filter is by Pagh et. al. in 2005. In 2009, Dillinger and Manolios optimized the structure's metadata, added in-place accommodation of more elements, and applied the structure to explicit-state model checking. In 2011, Bender et al. penned the name 'quotient filter' and applied the structure to database storage. The quotient filter is based on a kind of hash table in which entries contain only a portion of the key plus some additional meta-data bits. These bits are used to deal with the case when distinct keys happen to hash to the same table entry. By way of contrast, other types of hash tables that deal with such collisions by linking to overflow areas are not compact because the overhead due to linkage can exceed the storage used to store the key. In a quotient filter a hash function generates a p-bit fingerprint. The r least significant bits is called the remainder while the q = p - r most significant bits is called the quotient, hence the name quotienting (coined by Knuth.)The hash table has 2q slots. For some key d which hashes to the fingerprint dH, let its quotient be dQ and the remainder be dR.QF will try to store the remainder in slot dQ, which is known as the canonical slot.However the canonical slot might already be occupied because multiple keys can hash to the same fingerprint—a hard collision—or because even when the keys' fingerprints are distinct they can have the same quotient—a soft collision. If the canonical slot is occupied then the remainder is stored in some slot to the right. As described below, the insertion algorithm ensures that all fingerprints having the same quotient are stored in contiguous slots. Such a set of fingerprints is defined as a run. Note that a run's first fingerprint might not occupy its canonical slot if the run has been forced right by some run to the left. However a run whose first fingerprint occupies its canonical slot indicates the start of a cluster. The initial run and all subsequent runs comprise the cluster, which terminates at an unoccupied slot or the start of another cluster.

Parent Topic

Child Topic

No Parent Topic