木鸟杂记

大规模数据系统

Cuckoo Hash and Cuckoo Filter

Introduction

The essence of hashing is mapping from a larger space to a smaller space. Therefore, after inserting enough data, according to the pigeonhole principle, there will inevitably be position conflicts. Common hash tables (Hash Table or dictionary) handle conflicts through chaining, open addressing, and other methods. Single-bucket multi-function cuckoo hashing is a type of hash table that uses open addressing to handle conflicts; the difference is that when a conflict occurs, instead of linearly searching for a new position, it uses additional hash functions to find one.

Author: 木鸟杂记 https://www.qtmuniao.com/2021/12/07/cuckoo-hash-and-cuckoo-filter, please indicate the source when reposting

Principle

Cuckoo hashing was first presented by Rasmus Pagh and Flemming Friche Rodler at a conference in 2001 [1]. The basic idea is:

  1. Use two hash functions h1(x), h2(x) and two hash buckets T1, T2.
  2. Insert element x:
    1. If T1[h1(x)] or T2[h2(x)] is empty, insert into it; if both are empty, choose either one.
    2. If both T1[h1(x)] and T2[h2(x)] are full, randomly choose one (let it be y), kick it out, and insert x.
    3. Repeat the above process to insert element y.
    4. If there are too many kicks during insertion, it means the hash buckets are full. Then expand and rehash before inserting again.
  3. Query element x:
    1. Simply read T1[h1(x)] and T2[h2(x)] and compare with x.

cuckoo-insert.pngcuckoo-insert.png

The cuckoo (Cuckoo), also known as the great spotted cuckoo, likes to lay eggs in other birds’ nests. After hatching, cuckoo chicks will kick out the other eggs and grow up in the borrowed nest. The key design of cuckoo hashing lies in the “kick out” action—both the design and the name are truly inspired.

Variants

Cuckoo hashing can have many variants, such as:

  1. Using two hash functions and one hash bucket.
  2. Using two hash functions and four-way hash buckets.

cuckoo-hash.pngcuckoo-hash.png

It can be proven that the latter has higher bucket utilization; interested readers can check the original paper [3] for the proof.

Application

Bin Fan et al. from CMU published a paper in 2014 titled: Cuckoo Filter: Practically Better Than Bloom [3]. The basic idea is to apply the concept of cuckoo hashing (key-value queries) to the set membership direction, which can replace the commonly used Bloom Filter in engineering. It has the following advantages:

  1. Supports deleting elements
  2. Higher query efficiency, especially at high load factors
  3. Easier to implement compared to other filters that support deletion
  4. If the desired false positive rate is below 3%, it uses less space than Bloom Filter

To achieve the above effects, the Cuckoo Filter made the following changes to the original Cuckoo Hash:

  1. To improve bucket utilization, it uses multi-way hash buckets.
  2. To reduce memory usage, it only stores key fingerprints.

In addition, when a value x is kicked out, its other position needs to be found. Cuckoo Hash calculates this through an additional hash function applied to x: h2(x). However, in Cuckoo Filter, to save memory, a fixed-length fingerprint finger(x) is stored instead of the original value x. When x is kicked out, how is its other position found? Intuitively, I have two ideas:

Method 1: Calculate the other position through h2(finger(x)). But when calculating the second position, this is equivalent to forcibly compressing the original data space into the fingerprint space, which loses a lot of information and increases the collision probability.

Method 2: Record the other position in the value, pair(finger, the other position), but this greatly increases space usage.

The Cuckoo Filter uses a clever approach: it obtains the other position by XORing the position (h1(x)) with the hash of the corresponding value (hash(finger(x))). We know that XOR satisfies the property: XORing any two of the three values yields the third. When the value at the other position is kicked out, the original position can also be obtained through the same method, as shown in the figure below.

xor-position-val.pngxor-position-val.png

Why not directly XOR with finger(x) to calculate the other position? Because finger(x) generally has few bits, such as 8 bits. With this method, physically speaking, it means finding another position within ±2^8 = 256 of the original position, because XOR only changes the lower 8 bits. This range is too small and is not conducive to even distribution.

There are two additional points to note. First, without additional information, the Cuckoo Filter cannot be expanded, because we have lost the original value x and cannot calculate the new position hash(x) after expansion. Of course, if the corresponding key set is saved, it can also be reinserted after expansion. Second, if two different values x and y happen to satisfy hash(x) == hash(y) && finger(x) == finger(y), a false positive will occur. To understand false positives, we can start from the essence of hashing. As stated at the beginning, hashing is essentially mapping from a large space to a small space, so collisions will inevitably occur in the small space. If one of the two colliding original values exists, it will make people think the other also exists. Returning to the Cuckoo Filter, if both x and y exist in the Cuckoo Filter, when deleting x or y, deleting any one of the two identical fingerprints is sufficient.

Implementation

The implementation of the cuckoo filter is very simple; the paper also provides pseudocode, which is pasted here.

Insert(x)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
f = fingerprint(x);
i1 = hash(x);
i2 = i1 ⊕ hash(f);
if bucket[i1] or bucket[i2] has an empty entry then
add f to that bucket;
return Done;

// must relocate existing items;
i = randomly pick i1 or i2;
for n = 0; n < MaxNumKicks; n++ do
randomly select an entry e from bucket[i];
swap f and the fingerprint stored in entry e;
i = i ⊕ hash(f);
if bucket[i] has an empty entry then
add f to bucket[i];
return Done;

// Hashtable is considered full;
return Failure;

Lookup(x)

1
2
3
4
5
6
f = fingerprint(x);
i1 = hash(x);
i2 = i1 ⊕ hash(f);
if bucket[i1] or bucket[i2] has f then
return True;
return False;

Delete(x)

1
2
3
4
5
6
7
f = fingerprint(x);
i1 = hash(x);
i2 = i1 ⊕ hash(f);
if bucket[i1] or bucket[i2] has f then
remove a copy of f from this bucket;
return True;
return False;

References

Cuckoo Hash and Cuckoo Filter

  1. Cuckoo Hashing: https://www.itu.dk/people/pagh/papers/cuckoo-jour.pdf
  2. Cuckoo Hashing Wikipedia: https://en.wikipedia.org/wiki/Cuckoo_hashing#cite_note-Cuckoo-1
  3. Cuckoo Filter: Practically Better Than Bloom https://www.cs.cmu.edu/~binfan/papers/conext14_cuckoofilter.pdf
  4. Cuckoo Filter Implementation: [https://coolshell.cn/articles/17225.html]


我是青藤木鸟,一个喜欢摄影、专注大规模数据系统的程序员,欢迎关注我的公众号:“木鸟杂记”,有更多的分布式系统、存储和数据库相关的文章,欢迎关注。 关注公众号后,回复“资料”可以获取我总结一份分布式数据库学习资料。 回复“优惠券”可以获取我的大规模数据系统付费专栏《系统日知录》的八折优惠券。

我们还有相关的分布式系统和数据库的群,可以添加我的微信号:qtmuniao,我拉你入群。加我时记得备注:“分布式系统群”。 另外,如果你不想加群,还有一个分布式系统和数据库的论坛(点这里),欢迎来玩耍。

wx-distributed-system-s.jpg