
A Gentle Introduction to Python3 Generators


Introduction

Once during an interview, I asked a candidate: what is a generator in Python? The answer: a function with the yield keyword. But as I remembered it, the value returned by such a function is the generator, not the function itself. For example:

In [1]: def get_nums(n):
   ...:     for i in range(n):
   ...:         yield i
   ...:
In [2]: type(get_nums)
Out[2]: function

In [3]: nums = get_nums(10)

In [4]: type(nums)
Out[4]: generator

However, seeing the candidate so confident, I had a vague feeling that something was off, which led to the following exploration.

Author: Muniao’s Notes https://www.qtmuniao.com, please indicate the source when reposting

Concepts

To clarify the distinctions among these concepts in Python3, the most authoritative source is of course the official Glossary:

generator

​ A function which returns a generator iterator. It looks like a normal function except that it contains yield expressions for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function.

​ Usually refers to a generator function, but may refer to a generator iterator in some contexts. In cases where the intended meaning isn’t clear, using the full terms avoids ambiguity.

generator iterator

​ An object created by a generator function.

​ Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator iterator resumes, it picks up where it left off (in contrast to functions which start fresh on every invocation).

generator expression

​ An expression that returns an iterator. It looks like a normal expression followed by a for clause defining a loop variable, range, and an optional if clause.

As we can see, there are three main concepts related to Python generators: generator, generator expression, and generator iterator. The glossary specifically notes that generator usually refers to a generator function (as the candidate said), but in some contexts refers to a generator iterator (which is what the interpreter reported in the example above). To avoid ambiguity, it is best to use the full terms.

Based on my experience, they can be summarized as follows:

  • Generator Function: A function containing the yield keyword. Calling it returns a generator iterator whose values can be retrieved one at a time with next().
  • Generator Iterator: The object returned by a generator function. It can be iterated over once.
  • Generator Expression: An expression that evaluates to a generator iterator. It is defined using parentheses and for, as shown in the following example:
In [5]: a = (i*i for i in range(10))

In [6]: type(a)
Out[6]: generator

Going Deeper

Generator vs Iterator

The relationship may not be obvious from the Chinese translations, but looking at the English terms mentioned above—generator iterator—makes it clear at a glance:

Iterator is a broader concept, while generator (generator iterator) is a kind of iterator, but the converse is not true.

An iterator is any object that implements the __next__ method (by convention it also implements __iter__, returning itself). It can be advanced with next(iterator), and it raises a StopIteration exception when the iteration is complete.

while True:
    try:
        x = next(an_iterator)
        do_sth_with(x)
    except StopIteration:
        break

Usually, we simplify this iteration with a for loop:

for x in an_iterator:
    do_sth_with(x)
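Under the hood, a for loop performs essentially the pattern above: it calls iter() on the object, then calls next() repeatedly until StopIteration. A minimal sketch of this desugaring (the variable names here are just for illustration):

```python
# Roughly what a for loop does under the hood: call iter() on the
# iterable, then call next() repeatedly until StopIteration is raised.
items = [10, 20, 30]          # any iterable works here

it = iter(items)              # obtain an iterator from the iterable
collected = []
while True:
    try:
        x = next(it)          # fetch the next value
    except StopIteration:     # the iterator is exhausted
        break
    collected.append(x)

print(collected)              # same values a for loop would visit
```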

How yield Works

yield is a magical keyword. It temporarily suspends the current function, remembers its execution context (including local variables and pending try statements), and returns control to the caller. The next time the generator is advanced, it restores the saved context and continues executing from where it left off until it hits another yield or exits.

The familiar return keyword is the counterpart of yield: it ends the function call, destroys the context (pops the stack frame), and returns control to the caller.

Hence, a function that uses yield for control flow is called a generator function, while a function that only uses return is just an ordinary function.
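The two keywords can even appear in the same function: a return inside a generator function ends the iteration, and since Python 3.3 the returned value is attached to the resulting StopIteration exception. A small sketch (countdown is a made-up example):

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1
    return "done"  # ends the generator; stored as StopIteration.value

g = countdown(2)
print(next(g))      # 2
print(next(g))      # 1
try:
    next(g)
except StopIteration as e:
    print(e.value)  # done
```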

Of course, because it can temporarily suspend the execution of a function, yield has a more advanced use case: acting as a bridge for interaction between its caller and the suspended function:

In [1]: def dynamic_step_seq(size, start=0, default_step=1):
   ...:     x = start
   ...:     for _ in range(size):
   ...:         given_step = yield x
   ...:         if given_step is not None:
   ...:             x += given_step
   ...:         else:
   ...:             x += default_step
   ...:

In [2]: for x in dynamic_step_seq(10):
   ...:     print(x, end=' ')
   ...: print()
0 1 2 3 4 5 6 7 8 9

In [3]: dss = dynamic_step_seq(10)

In [4]: next(dss)  # Note
Out[4]: 0

In [5]: dynamic_step = 1
   ...: while True:
   ...:     try:
   ...:         x = dss.send(dynamic_step)
   ...:         dynamic_step += 1
   ...:         print(x, end=' ')
   ...:     except StopIteration:
   ...:         print()
   ...:         break
   ...:
1 3 6 10 15 21 28 36 45

Note: At first glance this may seem strange, but the principle follows from how yield works: you must first call next() to run the function up to its first yield statement before you can send a value into the generator via generator.send().
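In fact, gen.send(None) is exactly equivalent to next(gen), so either can be used to advance a fresh generator to its first yield; sending any other value first raises a TypeError. A self-contained sketch (redefining dynamic_step_seq from above):

```python
def dynamic_step_seq(size, start=0, default_step=1):
    x = start
    for _ in range(size):
        given_step = yield x
        if given_step is not None:
            x += given_step
        else:
            x += default_step

gen = dynamic_step_seq(3)
first = gen.send(None)  # equivalent to next(gen): "primes" the generator
print(first)            # 0

fresh = dynamic_step_seq(3)
try:
    fresh.send(5)       # not yet suspended at a yield, so this fails
except TypeError as e:
    print(e)            # can't send non-None value to a just-started generator
```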

Utility

So what are the benefits of using generators? In short, there are two main advantages:

  • Concise code
  • Improved performance

Concise Code

Using the yield keyword or generator expressions makes it very convenient to create an iterator object. To illustrate this, let’s compare different implementations for a simple requirement: getting the first n natural numbers.

The most intuitive approach is to build a list and return it:

# Build and return a list
def firstn(n):
    num, nums = 0, []
    while num < n:
        nums.append(num)
        num += 1
    return nums

sum_of_first_n = sum(firstn(1000000))

When n is small, this implementation is fine, but when n becomes very large, your machine’s memory cannot handle it:

In [4]: a = firstn(10000000000000000000)
Killed

IPython gets killed due to running out of memory.
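The footprint difference can be quantified on a smaller scale with sys.getsizeof, which reports only the container object's own size, not that of the elements it references. A quick sketch:

```python
import sys

n = 1_000_000
as_list = [x for x in range(n)]   # materializes a million ints up front
as_gen = (x for x in range(n))    # stores only the iteration state

# The list occupies several megabytes; the generator, a couple hundred bytes.
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_gen))
```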

So we naturally turn to the generator pattern; but suppose you still refuse to use yield. Then you need to define a class that implements the __iter__ and __next__ methods:

# Using the generator pattern (an iterable)
class firstn(object):
    def __init__(self, n):
        self.n = n
        self.num = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.num < self.n:
            cur, self.num = self.num, self.num + 1
            return cur
        else:
            raise StopIteration()

sum_of_first_n = sum(firstn(1000000))

Note: In Python 3, a class that implements __iter__ is an iterable, and calling iter(instance) returns an iterator. A class whose instances implement __next__ (with __iter__ returning self, as above) is itself an iterator, advanced one value at a time with next(instance).
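One practical consequence of that note: an object whose __iter__ hands out a fresh iterator on every call can be traversed repeatedly, while an iterator that returns itself (like the firstn class above) is exhausted after a single pass. A minimal sketch, using a hypothetical FirstN class:

```python
class FirstN:
    """An iterable, not an iterator: __iter__ hands out a fresh iterator."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))  # a new iterator on every call

seq = FirstN(5)
print(sum(seq))   # 10
print(sum(seq))   # 10 again: each sum() gets its own iterator

it = iter(seq)
print(sum(it))    # 10
print(sum(it))    # 0: a single iterator is one-shot
```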

This code finally meets our requirement. However, it has the following problems:

  1. To implement a simple requirement, you have to write lengthy code.
  2. It uses many language conventions and is not intuitive to read.
  3. The logic is convoluted.

Finally, you can’t stand it anymore and use yield:

# A generator that yields items instead of returning a list
def firstn(n):
    num = 0
    while num < n:
        yield num
        num += 1

sum_of_first_n = sum(firstn(1000000))

With just a few lines, you have constructed a generator function.

In fact [grin], Python 3 already ships with this function: range(n). (Strictly speaking, range() returns a lazy sequence object rather than a generator, but it is equally memory-friendly.)

Combining range() with generator expressions, you can construct many iterable sequences with very compact code:

squares = (x * x for x in range(n))
evens = (2 * x for x in range(n))

Note that they can only be iterated once.
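The one-pass behavior is easy to demonstrate:

```python
squares = (x * x for x in range(5))
print(sum(squares))  # 30
print(sum(squares))  # 0: the generator is already exhausted
```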

Improved Performance

This point is mainly about memory usage. Because iterators do not store all values, but dynamically compute each value of the sequence on the fly and discard previous values, they are very memory-efficient, as demonstrated in the previous example. Here is another more practical example.

Suppose we have a very large file (say, 8 GB), but your computer only has 4 GB of memory. How do you process it with Python?

The answer is to use yield to construct a generator iterator:

def read_by_chunks(file, chunk_size=1024):
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data


with open('your_big_file.dat', 'rb') as f:
    for chunk in read_by_chunks(f):
        process_chunk(chunk)

This method of reading a file chunk by chunk using a generator is also called lazy loading, or streaming reading, and is a common way to handle large files. If it is a text file and you can read it line by line, the code is even simpler:

with open('your_big_file.txt') as f:
    for line in f:
        process_line(line)

In Python3, a file handle is itself an iterator, which by default splits the file by lines for lazy loading.

References

  1. Python 3 Glossary: https://docs.python.org/3/glossary.html

  2. Python 3 wiki: https://wiki.python.org/moin/Generators

  3. Iterator vs Iterable: https://stackoverflow.com/questions/52056146/separating-the-iter-and-next-methods/52056290

  4. Lazy read big file: https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python

