木鸟杂记

Large-Scale Data Systems

MIT 6.824 2020 Video Notes 2: RPC and Threads

(Figure: 6.824 course schedule, 6.824-schedule.png)

MIT finally released the lecture videos on YouTube this year. I had followed about half of this course before, and this year I plan to watch through the videos and take some lecture notes. The course is structured around the fundamentals of distributed systems—fault tolerance, replication, and consistency—using carefully selected industrial-grade system papers as the backbone, supplemented with detailed reading materials and well-crafted labs, bridging academic theory and industrial practice. It is truly an excellent distributed systems course. Course videos: YouTube, Bilibili. Course materials: 6.824 homepage. This post is the notes for the second lecture, RPC and threads.

Why Go

  1. Modern syntax. It has built-in language-level support for threads (goroutines) and channels. It also has good support for locking and synchronization between threads.
  2. Type and memory safe, making it hard to write bugs such as the out-of-bounds memory accesses that are easy to write in C++.
  3. Garbage collection (GC). No need for manual memory management, which is especially important in multi-threaded programming, since it is easy to reference some memory and then forget where it was referenced.
  4. Concise and intuitive. It has far fewer complex language features than C++, and its error messages are friendly.

Author: 木鸟杂记 https://www.qtmuniao.com/2020/02/29/6-824-video-notes-2/, please indicate the source when reposting

Threads

Why are threads so important? Because they are our primary means of expressing concurrency, and concurrency is the foundation of building distributed systems. In Go, you can think of a goroutine as a thread; the two terms are used interchangeably below. Each thread has its own stack and registers, but all threads of a process share one address space.

Reasons for Use

IO concurrency: A historical term; in the single-core era, IO was the main bottleneck. To make full use of the CPU, when one thread is performing IO, it can yield the CPU so that another thread can perform computation, read, or send network messages, etc. Here, it can be understood as: you can use multiple threads to send multiple network requests in parallel (such as RPC, HTTP, etc.), and then wait for their responses separately.

Parallelism: Make full use of multi-core CPUs.

Regarding the difference and relationship between concurrency and parallelism, you can read this article. Just remember two keywords: logical concurrent design vs. physical parallel execution.

Convenience: For example, you can start a thread in the background to execute something periodically, or to check something regularly (such as a heartbeat).

Q&A:

  1. How can concurrency be handled without using threads? Event-driven asynchronous programming. However, the multi-threaded model is easier to understand, since the execution order inside each thread is roughly consistent with your code order.
  2. What is the difference between a process and a thread? A process is an abstraction provided by the operating system that includes an independent address space. A Go program starts as a process and can launch many threads (though I recall that goroutines are user-space execution flows).

Challenges

Shared memory is error-prone. A classic problem is that when multiple threads execute the statement n = n + 1 in parallel, since this operation is not atomic, without locking, it is easy for n to end up with an unexpected value.

We call this situation a race: two or more threads accessing the same variable at the same time, with at least one of them writing.

The solution is locking, but how to place locks well, balancing performance while avoiding deadlock, is a subject of its own.

Q&A:

  1. Does Go know the mapping between locks and resources (some shared variables)? Go does not know; a lock only supports waiting for, acquiring, and releasing it. Programmers must maintain the lock-to-data mapping themselves, in their heads and in their code.
  2. Does Go lock all variables of an object or only some? Same as the previous question—Go does not know any relationship between locks and variables. The primitive of Lock itself is very simple: when goroutine0 calls mu.Lock, if no other goroutine holds the lock, goroutine0 acquires it; if another goroutine holds the lock, it waits until the lock is released. Of course, in some languages, such as Java, objects or instances are bound to locks to indicate the scope of the lock.
  3. Should a Lock be a private variable of some object? If possible, it is best to do so. But if there is a need for cross-object locking, it needs to be exposed, while being careful to avoid deadlocks.

Coordination

  1. channels: the approach Go recommends most, available in unbuffered (blocking) and buffered forms.
  2. sync.Cond: a condition-variable/signaling mechanism.
  3. sync.WaitGroup: blocks until a group of goroutines has finished executing; it comes up again below.

Deadlock

Conditions for occurrence: multiple locks, circular dependencies, and hold-and-wait.

If your program stops working but hasn’t crashed, you need to check whether it is deadlocked.

Web Crawler

  1. Start from a seed web page URL.

  2. Fetch its content text via an HTTP request.

  3. Parse all URLs contained in the content, and repeat steps 2 and 3 for all URLs.

    To avoid fetching the same page repeatedly, all fetched URLs need to be recorded.

Parallel fetching is needed because:

  1. The number of web pages is huge.
  2. Network requests are slow.

Fetching pages one at a time would take far too long. The hard part is then deciding when all pages have been fetched, so that crawling can stop.

Crawler Code

The code is available in the reading materials.

Serial crawling. Perform a depth-first traversal (DFS) of the graph structure formed by all web pages, using a set named fetched to store all URLs that have already been crawled.

```go
func Serial(url string, fetcher Fetcher, fetched map[string]bool) {
	if fetched[url] {
		return
	}
	fetched[url] = true
	urls, err := fetcher.Fetch(url)
	if err != nil {
		return
	}
	for _, u := range urls {
		Serial(u, fetcher, fetched)
	}
	return
}
```

Concurrent crawling

  1. Turn the fetching part into a parallel operation using the go keyword. But if only this change is made, without using some mechanism (such as sync.WaitGroup) to wait for child goroutines before returning, then it may only crawl the seed URL, while also causing child goroutines to leak.
  2. If accessing the already-fetched URL set fetched without a lock, it is very likely to fetch the same page multiple times.

WaitGroup

```go
var done sync.WaitGroup
for _, u := range urls {
	done.Add(1)
	go func(u string) {
		defer done.Done()
		ConcurrentMutex(u, fetcher, f)
	}(u) // u is copied
}
done.Wait()
```

WaitGroup internally maintains a counter: calling wg.Add(n) increases it by n; calling wg.Done() decreases it by 1. Calling wg.Wait() will block until the counter reaches 0. Therefore, WaitGroup is well suited for scenarios where you need to wait for a group of goroutines to finish.

Q&A

  1. What if a goroutine exits abnormally without calling wg.Done()? You can use defer at the beginning of the goroutine: defer wg.Done().

  2. If two goroutines call wg.Done() at the same time, could a race cause the internal counter to be decremented incorrectly? WaitGroup has its own mechanisms (such as atomic operations) to make Done() safe for concurrent use.

  3. What is the relationship between variables in an anonymous function and variables of the same name in the outer function? This is a closure question. If a variable in the anonymous function is not shadowed by a parameter (such as fetcher in the code above), it refers to the same location as the variable of the same name in the outer scope. If it is passed in as a parameter (such as u in the code above), the anonymous function uses the passed-in copy, not the outer variable, even though the names look the same. Loop variables in particular are usually passed as parameters so that they are copied at call time; otherwise, every goroutine started by the for loop would share the one variable that the loop keeps reassigning.

    For closures, Go relies on escape analysis ("variable escape"): if a variable is still referenced after its function returns, it is allocated on the heap rather than on the function's stack. In particular, a variable referenced by both the inner and outer function is allocated on the heap.

  4. Since the string u is immutable, why do all goroutines still reference a constantly changing value? A string is indeed immutable, but the value of u keeps changing, and the goroutine shares the reference to u with the outer goroutine.

Removing Locks

If you remove the lock when updating the map and run it a few times, you may find nothing abnormal, because races are actually hard to detect. Fortunately, Go provides a race detection tool to help you find potential races: go run -race crawler.go.

Note that this tool does not perform static analysis; instead, it observes and records the execution traces of each goroutine during dynamic execution and analyzes them.

Number of Threads

Q&A

  1. How many threads (goroutines) are running concurrently during the entire execution of this code?
    The code does not impose any obvious limit, but it is clearly positively correlated with the number of URLs and the fetching time. In the example, the input contains only five URLs, so there is no problem. But in reality, doing this might launch millions of goroutines at the same time. Therefore, one improvement is to start a fixed number of worker goroutines in a pool; each worker asks for or is assigned the next task after finishing its current one.

Communicating via Channels

We can implement a new version of the crawler that does not use locks + shared variables, but instead uses Go’s built-in syntax: channels for communication. The specific approach is similar to implementing a producer-consumer model, using a channel as a message queue.

  1. Initially, put the seed URL into the channel.
  2. Consumer: the master continuously takes URLs from the channel, checks whether they have been fetched, and then starts a new worker goroutine to crawl them.
  3. Producer: the worker goroutine crawls the given task URL and puts the parsed result URLs back into the channel.
  4. The master uses a variable n to track the number of outstanding tasks; increment by one when a task is issued; decrement by one when a result is retrieved from the channel and processed (i.e., scheduled to a worker); exit the program when all tasks are finished.
```go
func worker(url string, ch chan []string, fetcher Fetcher) {
	urls, err := fetcher.Fetch(url)
	if err != nil {
		ch <- []string{}
	} else {
		ch <- urls
	}
}

func master(ch chan []string, fetcher Fetcher) {
	n := 1
	fetched := make(map[string]bool)
	for urls := range ch {
		for _, u := range urls {
			if !fetched[u] {
				fetched[u] = true
				n += 1
				go worker(u, ch, fetcher)
			}
		}
		n -= 1
		if n == 0 {
			break
		}
	}
}

func ConcurrentChannel(url string, fetcher Fetcher) {
	ch := make(chan []string)
	go func() {
		ch <- []string{url}
	}()
	master(ch, fetcher)
}
```

Q&A:

  1. The master reads from the channel and multiple workers write to the channel—won’t there be a race? Channels are thread-safe.
  2. Does the channel not need to be closed at the end? We track the number of in-flight tasks with n, so when we exit at n == 0 there are no tasks or results left in the channel. At that point neither the master nor any worker holds a reference to the channel, and the garbage collector reclaims it shortly afterwards.
  3. Why does ConcurrentChannel write the seed URL into the channel from a goroutine? Because ch is an unbuffered channel, a send blocks until someone receives. master only starts running after that line, so sending directly would block forever before master ever reads; doing the send in a goroutine lets master start first.

I am Qingteng Muniao (青藤木鸟), a programmer who loves photography and focuses on large-scale data systems. Follow my WeChat official account "木鸟杂记" for more articles on distributed systems, storage, and databases. After following, reply "资料" to get a set of distributed-database study materials I have compiled, or reply "优惠券" for a 20%-off coupon for my paid column on large-scale data systems, "系统日知录".

We also have chat groups on distributed systems and databases; add me on WeChat (qtmuniao) and I will invite you in. When adding me, please note "分布式系统群". If you would rather not join a group, there is also a forum on distributed systems and databases (click here); you are welcome to drop by.
