
A Casual Discussion of LevelDB Data Structures (II): Bloom Filter

I had heard about LevelDB long ago. This time, on a whim, I skimmed through the code together with some reference materials, and it truly lives up to its reputation. If you are interested in storage, if you want to use C++ elegantly, or if you want to learn how to architect a project, I recommend taking a look. Not to mention that the authors are Sanjay Ghemawat and Jeff Dean.

Given my memory, if I don’t write something down after reading it, I will surely forget it soon. So I wanted to write about the wonders of LevelDB, but without taking the usual path of starting with an architectural overview and ending with a module-by-module analysis. During these days of reading the code, I kept thinking about where to start. By the time I finished, what had impressed me most was actually LevelDB’s various exquisite data structures: fitted to their scenarios, built from scratch, appropriately tailored, and concise in code. Why not start the LevelDB series with these small corner components?

This series shares three classic data structures, all common in engineering practice, as they appear in LevelDB: the Skip List used for fast memtable reads and writes, the Bloom Filter used for fast sstable filtering, and the LRUCache used for caching parts of sstables. This is the second post, on the Bloom Filter.

Introduction

LevelDB is a single-machine KV storage engine, but it does not use a traditional balanced search tree to balance read/write performance. Instead, it uses the LSM-tree structure to organize data, sacrificing some read performance in exchange for higher write throughput. Let’s use a diagram to introduce how LSM-tree is organized across different storage media.

[Figure: leveldb.png]

LevelDB divides data into two main parts, stored in memory and the file system respectively. The main data modules include WAL log, memtable, immutable memtable, and sstable. In order of data flow:

  1. When LevelDB receives a write request put(k, v), it first appends the operation log to the log file (WAL) for recovery in case of unexpected node crashes.
  2. After writing the WAL, LevelDB inserts this kv pair into the lookup structure in memory: memtable.
  3. After memtable accumulates to a certain degree, it rotates into a read-only memtable, i.e., immutable memtable; at the same time, a new memtable is generated for writing.
  4. When memory is under pressure, the immutable memtable is sequentially written to the file system, generating a level0 sstable (sorted strings table) file. This process is called a minor compaction.
  5. Query operations need to traverse the memtable, immutable memtable, and sstables level by level. As more and more sstable files accumulate, query performance inevitably degrades, so sstables must periodically be merged; this is called a major compaction.
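The data flow above can be sketched in a toy model for intuition. Everything here is made up for illustration (the name `MiniLSM`, the flush threshold, the in-memory maps); the immutable-memtable stage and WAL replay are collapsed away, and this is not LevelDB’s real API:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct MiniLSM {
  std::vector<std::string> wal;                 // step 1: append-only log
  std::map<std::string, std::string> memtable;  // step 2: sorted in-memory map
  std::vector<std::map<std::string, std::string>> sstables;  // step 4: flushed runs
  size_t memtable_limit = 2;  // tiny threshold so a flush happens quickly

  void Put(const std::string& k, const std::string& v) {
    wal.push_back(k + "=" + v);  // 1. WAL first, for crash recovery
    memtable[k] = v;             // 2. then the in-memory structure
    if (memtable.size() >= memtable_limit) {  // 3. memtable is "full"...
      sstables.push_back(memtable);           // 4. ...flush it as a new sstable
      memtable.clear();
    }
  }

  // 5. reads probe the memtable first, then sstables from newest to oldest
  bool Get(const std::string& k, std::string* v) const {
    auto it = memtable.find(k);
    if (it != memtable.end()) { *v = it->second; return true; }
    for (auto t = sstables.rbegin(); t != sstables.rend(); ++t) {
      auto i = t->find(k);
      if (i != t->end()) { *v = i->second; return true; }
    }
    return false;
  }
};
```

Note how a single `Get` may touch every run in the worst case; that is exactly the read amplification discussed below.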

LevelDB Hierarchical Organization

All sstable files in the file system are logically organized by LevelDB into multiple levels (usually 7 levels), satisfying the following properties:

  1. The larger the level, the earlier its data was written. That is, data is first “placed” in the upper level (minor compaction), and when the upper level is “full” (reaches capacity limit), it “overflows” to the lower level for merging (major compaction).
  2. Each level has a limit on total file size, growing exponentially. For example, the total size limit of level0 is 10MB, level1 is 100MB, and so on. The highest level (level6) has no limit.
  3. Since each sstable file in level0 is directly flushed from memtable, the key ranges of multiple sstable files may overlap. For other levels, multiple sstable files are guaranteed not to overlap through certain rules.
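The exponential-growth rule from property 2 fits in a one-line function. The constants below follow the text’s example (10MB for level0, 10x per level, level6 unbounded); LevelDB’s actual constants live in its `version_set.cc` and differ in detail:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Size limit in MB for a level, per the example above; the highest
// level (6) is unbounded.
double LevelSizeLimitMB(int level) {
  if (level >= 6) return std::numeric_limits<double>::infinity();
  return 10.0 * std::pow(10.0, level);  // level0: 10MB, level1: 100MB, ...
}
```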

A read operation in LevelDB must first search the memtable and immutable memtable, and then search each level in the file system in turn. Clearly, reads are much less efficient than writes. We call this phenomenon, where a single read request from the client turns into multiple read requests inside the system, read amplification.

To reduce read amplification, LevelDB takes several measures:

  1. Reduce the number of sstable files through major compaction
  2. Use fast filtering methods to quickly determine whether a key is in a certain sstable file

And for quickly determining whether a key is in a key set, LevelDB uses precisely the Bloom filter. Strictly speaking, a Bloom filter can only determine with certainty that a key is *not* in a given sstable, but that is enough to skip some sstables during the level-by-level lookup. The reason will become clear later; we will leave it at that for now.

Author: 木鸟杂记 https://www.qtmuniao.com/2020/11/18/leveldb-data-structures-bloom-filter, please indicate the source when reposting

Principles

The Bloom filter was proposed by Burton Howard Bloom in 1970 in the paper “Space/time trade-offs in hash coding with allowable errors”. By giving up a small amount of accuracy, it gains efficiency in both time and space: a truly ingenious design.

Data Structure

Bloom Filter uses a bit array underneath. Initially, when the represented set is empty, all bits are 0:

[Figure: empty-bloom-filter.png]

When inserting an element x into the set, the k independent hash functions are applied to x in turn, and each hash value, taken modulo the array length, gives a position in the array to set to 1:

[Figure: set-x-bloom-filter.png]

Lookup mirrors insertion: the same k hash functions are applied to the element being looked up, yielding k positions. If all k positions in the bit array are 1, the element may exist; if any of them is 0, the element definitely does not exist. In the figure below, x1 may exist, while x2 definitely does not.

[Figure: get-x-bloom-filter.png]

As elements are inserted, more and more positions in the array are set to 1, and some positions will inevitably collide, causing false positives. Using k independent hash functions partially mitigates this. Still, if for some value y the k positions computed from hash(y) happen to have all been set to 1 by elements inserted earlier, y will be falsely judged to be in the set.
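A minimal Bloom filter following the description above might look like this. The class name `SimpleBloom` and the use of seeded `std::hash` as the “k independent hash functions” are illustrative choices for this sketch, not LevelDB’s implementation:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

class SimpleBloom {
 public:
  // m bits in the array, k hash functions.
  SimpleBloom(size_t m, size_t k) : bits_(m, false), k_(k) {}

  void Insert(const std::string& x) {
    for (size_t i = 0; i < k_; ++i) bits_[Pos(x, i)] = true;  // set k bits
  }

  // true means "may exist" (false positives possible);
  // false means "definitely absent".
  bool MayContain(const std::string& x) const {
    for (size_t i = 0; i < k_; ++i)
      if (!bits_[Pos(x, i)]) return false;  // any 0 bit proves absence
    return true;
  }

 private:
  // The i-th "hash function": std::hash over the string plus a seed suffix,
  // taken modulo the array length.
  size_t Pos(const std::string& x, size_t i) const {
    return std::hash<std::string>{}(x + '#' + std::to_string(i)) % bits_.size();
  }

  std::vector<bool> bits_;
  size_t k_;
};
```

Note that only the two one-sided guarantees can be asserted: an inserted element always reports “may exist”, and an empty filter reports “definitely absent” for everything.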

Time and Space Advantages

Compared with other data structures for representing a set, such as balanced binary search trees, tries, hash tables, or even plain arrays and linked lists, the Bloom filter has huge time and space advantages. The structures above mostly need to store the data items themselves (though some, like tries, can compress them). The Bloom filter takes another path: instead of storing the items, it stores several hash values of each item, packed into an efficient, compact bit array that avoids the extra overhead of pointers.

This design makes the size of Bloom Filter independent of the size of the data items themselves (such as the length of strings). For example, for a Bloom Filter with 1% error rate and optimal k (number of hash functions), each element requires only 9.6 bits on average.

This advantage can be understood as taking a hash table and dropping collision handling, thereby saving the corresponding overhead.

Parameter Trade-offs

From the above, we can intuit that the false positive rate of a Bloom filter depends on the following parameters:

  1. The number of hash functions k
  2. The length of the underlying bit array m
  3. The dataset size n

The derivation of the relationship between these parameters and the false positive rate is not detailed here. You can refer to Wikipedia, or this CSDN article. Here we directly give the conclusion:

When k = ln2 * (m/n), the Bloom filter achieves its best accuracy. Here m/n is the bits per key: the average number of bits allocated to each key in the set.
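These formulas are easy to sanity-check numerically. The helper names below are mine; only `OptimalK` mirrors LevelDB’s floor behavior, and the constants come from the text (optimal k = ln2 * (m/n), and the 9.6 bits per key quoted earlier for a 1% error rate):

```cpp
#include <cassert>
#include <cmath>

// Optimal number of hash functions: k = ln2 * (m/n), rounded down.
int OptimalK(double bits_per_key) {
  return static_cast<int>(bits_per_key * std::log(2.0));
}

// False-positive rate with bits_per_key = m/n and k hash functions:
// p = (1 - e^{-k / (m/n)})^k.
double FalsePositiveRate(double bits_per_key, int k) {
  return std::pow(1.0 - std::exp(-static_cast<double>(k) / bits_per_key), k);
}

// Bits per key needed for a target false-positive rate p at optimal k:
// m/n = -ln(p) / (ln2)^2.
double BitsPerKeyFor(double p) {
  return -std::log(p) / (std::log(2.0) * std::log(2.0));
}
```

With 10 bits per key (LevelDB’s common default for `NewBloomFilterPolicy`), k floors to 6 and the false-positive rate lands just under 1%.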

Source Code

After laying the background and basic principles of Bloom Filter, let’s see how LevelDB source code embeds it into the system.

General Interface

To reduce read amplification, especially read amplification from disk access, LevelDB abstracts a FilterPolicy interface to quickly rule out SSTables that cannot contain the key being looked up. The filter data is stored alongside the data in the SSTable file and loaded into memory when needed, which requires that the filter not take up too much space.

```cpp
class LEVELDB_EXPORT FilterPolicy {
 public:
  virtual ~FilterPolicy();

  // The name of the filter policy, which uniquely identifies the encoding
  // used when this filter is persisted and later loaded into memory.
  virtual const char* Name() const = 0;

  // Create a filter for the given set of n keys (which may contain
  // duplicates), serialize the filter to a string, and append it to dst.
  virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const = 0;

  // The "filter" function. If CreateFilter was called with the set keys,
  // this must return true whenever key was in keys. If key was not in keys,
  // it may return true or false, but should return false with high probability.
  virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};
```

Abstracting this interface allows users to implement other filtering strategies according to their own needs. Naturally, LevelDB provides a built-in filtering policy that implements this interface: BloomFilterPolicy:

```cpp
class BloomFilterPolicy : public FilterPolicy {
 public:
  explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
    // Using the conclusion above, k = ln2 * (m/n), compute the number of hash
    // functions k. Rounding down and clamping the maximum slightly raises the
    // false-positive rate, but it also lowers the filtering cost.
    k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
    if (k_ < 1) k_ = 1;
    if (k_ > 30) k_ = 30;
  }

  const char* Name() const override { return "leveldb.BuiltinBloomFilter2"; }

  void CreateFilter(const Slice* keys, int n, std::string* dst) const override {
    // Compute the length of the bloom filter's bit array, rounded up to a
    // multiple of 8, because the bit array is ultimately stored as chars.
    size_t bits = n * bits_per_key_;
    if (bits < 64) bits = 64;  // A very short array would have a high
                               // false-positive rate, so enforce a minimum.
    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;

    const size_t init_size = dst->size();
    dst->resize(init_size + bytes, 0);
    dst->push_back(static_cast<char>(k_));  // Record the number of hash functions.
    char* array = &(*dst)[init_size];
    for (int i = 0; i < n; i++) {
      // Use double hashing to generate k hash values from a single hash
      // function, approximating the effect of k independent hash functions.
      // For a detailed analysis, see:
      // https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
      uint32_t h = BloomHash(keys[i]);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits as the step.
      for (size_t j = 0; j < k_; j++) {
        const uint32_t bitpos = h % bits;
        array[bitpos / 8] |= (1 << (bitpos % 8));
        h += delta;
      }
    }
  }

  bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const override {
    const size_t len = bloom_filter.size();
    if (len < 2) return false;

    const char* array = bloom_filter.data();
    const size_t bits = (len - 1) * 8;  // -1 excludes the byte that stores k.

    // Retrieve the number of hash functions k saved when the filter was built.
    const size_t k = array[len - 1];
    if (k > 30) {
      // Larger than any k we would set; return true rather than filtering
      // out this SSTable.
      return true;
    }

    uint32_t h = BloomHash(key);
    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits as the step.
    for (size_t j = 0; j < k; j++) {
      const uint32_t bitpos = h % bits;
      if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
      h += delta;
    }
    return true;
  }

 private:
  size_t bits_per_key_;
  size_t k_;
};
```

According to the above source code, there are several points to note in LevelDB’s implementation:

  1. The bit array mentioned earlier is represented using a char array in C++ in LevelDB. Therefore, when calculating the bit array length, it needs to be aligned to a multiple of 8, and appropriate conversion is needed when calculating indices.
  2. LevelDB does not actually use k hash functions in its implementation, but uses the double-hashing method for an optimization, claiming to achieve similar accuracy.
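To see the double-hashing trick in isolation, here is a standalone re-creation of the two inner loops above. `BloomHash` is swapped for `std::hash` so the snippet is self-contained; the free functions `AddKey` and `KeyMayMatch` are illustrative names, not LevelDB’s API:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Set the k bit positions for key in the byte array, as CreateFilter does.
void AddKey(const std::string& key, size_t k, std::vector<uint8_t>* array) {
  const size_t bits = array->size() * 8;
  uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(key));
  const uint32_t delta = (h >> 17) | (h << 15);  // rotate right 17 bits
  for (size_t j = 0; j < k; j++) {
    const uint32_t bitpos = h % bits;
    (*array)[bitpos / 8] |= (1 << (bitpos % 8));
    h += delta;  // the "next hash function" is just h plus a fixed step
  }
}

// Probe the same k positions; any 0 bit proves the key was never added.
bool KeyMayMatch(const std::string& key, size_t k,
                 const std::vector<uint8_t>& array) {
  const size_t bits = array.size() * 8;
  uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(key));
  const uint32_t delta = (h >> 17) | (h << 15);
  for (size_t j = 0; j < k; j++) {
    const uint32_t bitpos = h % bits;
    if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
    h += delta;
  }
  return true;
}
```

Because probing regenerates exactly the same h and delta from the key, an added key is always found, while an all-zero filter rejects everything.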

Summary

Bloom Filter is usually used to quickly determine whether an element is in a set. It essentially exchanges tolerance for a certain error rate for time and space efficiency. From an implementation perspective, it can be understood as saving the collision handling part on the basis of a hash table, and reducing false positives through k independent hash functions. LevelDB uses an optimization in its implementation: using one hash function to achieve an effect approximating that of k hash functions.

In the next post, I will continue with another classic data structure used in LevelDB: the LRU cache.

References

  1. LevelDB handbook: https://leveldb-handbook.readthedocs.io/zh/latest/bloomfilter.html
  2. Wikipedia: https://en.wikipedia.org/wiki/Bloom_filter

I am 青藤木鸟, a programmer who loves photography and focuses on large-scale data systems. For more articles on distributed systems, storage, and databases, follow my WeChat public account: “木鸟杂记”.