木鸟杂记

大规模数据系统

Data Structures and Algorithms (II): Binary Search

Overview

My previous understanding of binary search was limited to finding a given integer in a sorted array. Later, I discovered that a whole class of problems can be solved using the binary search mindset. To generalize: if there exists a monotonic (mapping) relationship between the set where the desired result lies (the value range) and the set of numbers being searched (the domain), then the problem can be solved using the binary search idea. This sounds a bit abstract; I will illustrate it with several examples below.

Thanks to its efficiency of halving the problem size with each iteration, the binary search mindset has an extremely wide range of applications.

This article is divided into two major parts. The first part discusses the various details of binary search; the second part extends binary search to the general binary search mindset.

Author: Muniao’s Notes https://www.qtmuniao.com. Please indicate the source when reposting.

Binary Search — Corner Cases

Basic Code (C++)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
int find(vector<int> &arr, int target) {
int left = 0, right = arr.size()-1;

while (left <= right) {
int mid = (left + right) / 2;

if (arr[mid] == target) {
return mid;
} else if (arr[mid] < target){
left = mid + 1;
} else {
right = mid - 1;
}
}

return -1;
}

Despite being such a plain and unremarkable piece of code, it has many variations in practice. The basic idea of binary search needs no further explanation; here we only analyze the corner cases. This involves a fundamental principle: we must give every element a chance to be searched, while avoiding infinite loops. Specifically, there are the following issues:

  1. Loop condition. When should left <= right be used, and when should left < right be used?
  2. Midpoint calculation. There are several common ways to calculate mid: mid = (left + right) / 2, mid = left + (right - left) / 2, and mid = (left + right + 1) / 2.
  3. Boundary movement. Both left and right have two ways to move. Taking left as an example: left = mid + 1 and left = mid.
  4. Final state. If the specified value is not found—for instance, searching for 4 in [1, 2, 3, 5]—then after the loop exits, left and right will point to 5 and 3 respectively. In other words, they end up on either side of the position where 4 should be inserted, with left > right.

The above issues are actually interrelated and should be combined appropriately depending on the problem at hand. For example, to find the left and right boundary indices of a number {3} in a sorted array {1, 2, 3, 3, 3, 5}: [2, 4]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
int find_range(vector<int> &arr, int target, bool left_range) {
int left = 0, right = arr.size() - 1;

while (left < right) {
int mid = left_range ? (left + right) / 2 : (left + right + 1) / 2;
if (arr[mid] == target) {
// Find left boundary: if we round right when calculate mid, then
// we will trap into infinite loop here(imagine
// we only have two elements in the array).
// Find right boundary: the similar reason.
left_range ? right = mid : left = mid;
} else if (arr[mid] < target) {
left = mid + 1;
} else {
right = mid - 1;
}
}

return left;
}

The above code is correct when the number being searched exists; if it does not exist, additional checks are required. What checks need to be added? This brings us to another point: when performing binary search, should we exit (return) as soon as we find the result midway, or continue until the loop condition is broken? The former is used for the simplest case of finding a specific value, while the latter is generally used for finding a boundary or an extremum that satisfies certain conditions. Since this problem exits because the loop condition left < right is violated, we need to check whether arr[left] equals target.

Therefore:

  1. The loop condition is related to the movement method and the check performed when the boundary is crossed.

  2. The midpoint calculation is related to whether overflow will occur. For example, if both left and right are very large, (left + right) / 2 is generally [prone to overflow].

  3. Boundary movement depends on whether we need the left and right boundary results when the loop finally exits, or merely need to narrow the range.

  4. The final state depends on whether we stop when a specific value is found, or keep halving until the boundaries narrow down and trigger an exit.

Careful consideration of the binary search process on a sorted array reveals several characteristics:

  1. The array values (search range) and their indices (search target) exhibit a monotonic mapping relationship.
  2. Each binary search step halves the current data scale, hence it is also called “half-interval search.”

Monotonic Mapping

Extending the binary search mindset, as long as the function formed by the values in the search range and the target values is monotonic, we can narrow down the range of the target value by comparing the search subject.

Summary

Some points haven’t been fully thought through yet; I will supplement them later.


我是青藤木鸟,一个喜欢摄影、专注大规模数据系统的程序员,欢迎关注我的公众号:“木鸟杂记”,有更多的分布式系统、存储和数据库相关的文章,欢迎关注。 关注公众号后,回复“资料”可以获取我总结一份分布式数据库学习资料。 回复“优惠券”可以获取我的大规模数据系统付费专栏《系统日知录》的八折优惠券。

我们还有相关的分布式系统和数据库的群,可以添加我的微信号:qtmuniao,我拉你入群。加我时记得备注:“分布式系统群”。 另外,如果你不想加群,还有一个分布式系统和数据库的论坛(点这里),欢迎来玩耍。

wx-distributed-system-s.jpg