A Cuckoo filter is a space-saving probabilistic data structure that checks whether an element is contained in a set. False-positive matches are possible, but false-negative matches are not. Thus, a query returns either “possibly in the set” or “definitely not in the set”. A Cuckoo filter can also delete existing items, which is not supported by Bloom filters. In addition, for applications that store many elements and aim for a moderately low false positive rate, Cuckoo filters can have a smaller footprint than space-optimized Bloom filters.
The description of this filter comes from the paper Cuckoo Filter: Practically Better Than Bloom unless otherwise noted.
The Cuckoo filter is a compact variant of a Cuckoo hash table that stores only fingerprints (hash) of the data, instead of key/value pairs.
The size of the fingerprint depends on the false positive rate. The lower the false positive rate should be, the longer the fingerprint must be.
Cuckoo hashing #
There are two hash functions and one fingerprint function. The following then happens during insertion:
- Calculation of the first index to find the first bucket. Check if free, then insert and end.
2a) Calculation of the second index to find the second bucket. Check if free, then insert and end.
2b) If not free: swap the existing element with the new element and try to place the original element into its other bucket. If both buckets again are not free start with next element at 1. If last element is not placeable abort, otherwise end.\
An extension is that a bucket contains multiple elements that are linearly concatenated.
Cuckoo Filter Insert #
Since cuckoo hashing may require recalculation on insertion, the original element would have to be present. Since space must be saved, the recalculation must be based solely on the fingerprint.
The hash functions are:
\[h1(x) = hash(x)\] \[h2(x) = h1(x) \oplus hash(fingerprint of x)\]
Since the fingerprint is stored in the filter, the index of the other bucket can be calculated from the index of the current bucket ( \(h_1\) or \(h_2\) ) and the fingerprint. This is a property of \(\oplus\) (XOR).
The pseudocode looks like this:
fingerprint = fingerprint(x); bucket_index1 = hash(x); bucket_index2 = bucket_index1 XOR hash(fingerprint); if bucket[bucket_index1] or bucket[bucket_index2] == empty then add fingerprint to this bucket; return Done; // Relocation if no bucket fits bucket_index = randomly pick bucket_index1 or bucket_index2 ; for n = 0; n < MaxNumKicks; n++ do swap fingerprint of x and the fingerprint stored in bucket; bucket_index = bucket_index XOR hash(fingerprint_swapped_fingerprint); if bucket[bucket_index] has an empty entry then add fingerprint to bucket[bucket_index]; return Done; // Hashtable is full return Failure;
Cuckoo Filter Delete #
Deletion is possible in this filter. The two indices are calculated. If one of the buckets contains the fingerprint, it can be deleted.
Cuckoo Filter Lookup #
The two indices are calculated. If one of the buckets contains the fingerprint, the filter contains the element.