Introduction
Once during an interview, I asked a candidate: What is a generator in Python? The answer was: A function with the yield keyword. But in my impression, the value returned by such a function is the generator, not the function itself. For example:
```python
In [1]: def get_nums(n):
   ...:     for i in range(n):
   ...:         yield i
   ...:

In [2]: get_nums
Out[2]: <function __main__.get_nums(n)>

In [3]: get_nums(10)
Out[3]: <generator object get_nums at 0x...>
```
However, seeing the candidate so confident, I had a vague feeling that something was off, which led to the following exploration.
Author: Muniao’s Notes https://www.qtmuniao.com, please indicate the source when reposting
Concepts
To clarify the distinctions among these concepts in Python3, the most authoritative source is of course the official Glossary:
generator
A function which returns a generator iterator. It looks like a normal function except that it contains yield expressions for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Usually refers to a generator function, but may refer to a generator iterator in some contexts. In cases where the intended meaning isn’t clear, using the full terms avoids ambiguity.

generator iterator
An object created by a generator function.
Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator iterator resumes, it picks up where it left off (in contrast to functions which start fresh on every invocation).

generator expression
An expression that returns an iterator. It looks like a normal expression followed by a for clause defining a loop variable, range, and an optional if clause.
As we can see, there are mainly three concepts related to Python generators: generator, generator expression, and generator iterator. The glossary also specifically notes that generator may refer to a generator function in some contexts (as the candidate said), but in other contexts it refers to a generator iterator (as the interpreter told us in the example above). To avoid ambiguity, it is best to use the full terms when expressing these ideas.
Based on my experience, they can be summarized as follows:
- Generator Function: A function containing the yield keyword. Calling it returns a generator iterator, which produces a series of values and can be advanced with next().
- Generator Iterator: The object returned by a generator function. It can be iterated over only once.
- Generator Expression: An expression that evaluates to a generator iterator. It is defined using parentheses and for, as shown in the following example:
```python
In [5]: a = (i*i for i in range(10))

In [6]: a
Out[6]: <generator object <genexpr> at 0x...>

In [7]: next(a)
Out[7]: 0
```
Going Deeper
Generator vs Iterator
The relationship may not be obvious from the Chinese translations, but looking at the English terms mentioned above—generator iterator—makes it clear at a glance:
Iterator is a broader concept, while generator (generator iterator) is a kind of iterator, but the converse is not true.
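This containment can be checked directly with the abstract base classes in collections.abc; a small sketch:

```python
from collections.abc import Iterator, Generator

gen = (x for x in range(3))   # a generator iterator
lst_it = iter([1, 2, 3])      # a plain list iterator

print(isinstance(gen, Iterator))      # True: every generator is an iterator
print(isinstance(gen, Generator))     # True
print(isinstance(lst_it, Iterator))   # True
print(isinstance(lst_it, Generator))  # False: not every iterator is a generator
```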
An iterator is any object that implements the __next__ method. It can be iterated using next(iterator), and raises a StopIteration exception when the iteration is complete.
```python
# Iterate manually with next(); the iterator signals exhaustion
# by raising StopIteration.
while True:
    try:
        x = next(an_iterator)
        do_something(x)
    except StopIteration:
        break
```
Usually, we simplify the iteration using for in:
```python
for x in an_iterator:
    do_something(x)
```
How yield Works
yield is a magical keyword. It temporarily suspends the current function, remembers its context (including local variables and pending try statements), and returns control to the caller. The next time the generator iterator is resumed (via next() or send()), it restores the saved context and continues executing from where it stopped, until it hits another yield or exits.
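A minimal sketch of this suspend-and-resume behavior (the function and variable names are illustrative):

```python
def counter():
    print('start')      # runs only when the first next() arrives
    n = 0
    while n < 2:
        yield n         # suspend here; the local variable n is preserved
        n += 1          # execution resumes on this line
    print('done')       # runs just before StopIteration is raised

c = counter()           # calling the function executes none of its body
print(next(c))          # prints 'start', then returns 0
print(next(c))          # resumes after the yield, returns 1
```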
The familiar return keyword is the counterpart to yield, but return ends the function call, destroys the context (pops the stack frame), and returns control to the caller.
Therefore, a function that uses yield for control flow is called a generator function, while a function that uses return for control flow is just an ordinary function.
Of course, because it can temporarily suspend the execution of a function, yield has a more advanced use case: acting as a bridge for interaction between its caller and the suspended function:
```python
In [1]: def dynamic_step_seq(size, start=0, default_step=1):
   ...:     cur = start
   ...:     for _ in range(size):
   ...:         # yield evaluates to the value passed in via send(),
   ...:         # or None when advanced with plain next()
   ...:         step = yield cur
   ...:         cur += step if step is not None else default_step
   ...:

In [2]: gen = dynamic_step_seq(5)

In [3]: next(gen)        # run to the first yield
Out[3]: 0

In [4]: next(gen)        # nothing sent: the default step of 1 is used
Out[4]: 1

In [5]: gen.send(10)     # send a new step for this iteration
Out[5]: 11
```
Note: At first glance this may seem strange, but we can infer the principle from how yield works: you need to first call next to run the function up to the yield statement before you can send an object to the generator via generator.send.
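The rule in the note can be demonstrated with a tiny sketch (echo is an illustrative name): sending into a generator that has not yet reached its first yield raises a TypeError.

```python
def echo():
    # Loop forever, printing whatever the caller sends in.
    while True:
        received = yield
        print(received)

g = echo()
try:
    g.send('hi')        # sending before the first next() is an error
except TypeError as e:
    print(e)            # can't send non-None value to a just-started generator

next(g)                 # advance the generator to its first yield
g.send('hi')            # now the value is received and printed
```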
Utility
So what are the benefits of using generators? In short, there are two main advantages:
- Concise code
- Improved performance
Concise Code
Using the yield keyword or generator expressions makes it very convenient to create an iterator object. To illustrate this, let’s compare different implementations for a simple requirement: getting the first n natural numbers.
The most intuitive approach is to build a list and return it:
```python
# Build and return a list
def firstn(n):
    num, nums = 0, []
    while num < n:
        nums.append(num)
        num += 1
    return nums

sum_of_first_n = sum(firstn(1000000))
```
When n is small, this implementation is fine, but when n becomes very large, your machine’s memory cannot handle it:
```python
In [4]: a = firstn(10000000000000000000)
```
IPython gets killed due to running out of memory.
So we naturally turn to the generator pattern. If you still don’t want to use yield, you need to write a class that implements the __iter__ and __next__ methods:
```python
# Using the generator pattern (an iterable)
class firstn(object):
    def __init__(self, n):
        self.n = n
        self.num = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.num < self.n:
            cur = self.num
            self.num += 1
            return cur
        raise StopIteration()

sum_of_first_n = sum(firstn(1000000))
```
Note: In Python3, if your class implements __iter__, it becomes an iterable object, and you can call iter(instance) to get an iterator. If your class implements __next__, then the class instance itself becomes an iterator, which can be called via next(instance) to iterate.
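A small sketch of that distinction, using hypothetical Countdown classes: the iterable hands out a fresh iterator on every iter() call, so it can be looped over repeatedly, while any single iterator is one-shot.

```python
class Countdown:
    """An iterable: each iter() call returns a fresh iterator."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return CountdownIterator(self.n)

class CountdownIterator:
    """An iterator: returns itself from __iter__ and implements __next__."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

cd = Countdown(3)
print(list(cd))   # [3, 2, 1]
print(list(cd))   # [3, 2, 1]: the iterable can be iterated again
```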
This code finally meets our requirement. However, it has the following problems:
- To implement a simple requirement, you have to write lengthy code.
- It uses many language conventions and is not intuitive to read.
- The logic is convoluted.
Finally, you can’t stand it anymore and use yield:
```python
# a generator that yields items instead of returning a list
def firstn(n):
    num = 0
    while num < n:
        yield num
        num += 1

sum_of_first_n = sum(firstn(1000000))
```
With just a few lines, you have constructed a generator function.
In fact, Python 3 already ships with this functionality: range(n). (Strictly speaking, range returns a lazy sequence object rather than a generator, but it is just as memory-friendly.)
Combining such built-ins with generator expressions, you can construct many iterable sequences with very compact code:
```python
squares = (x * x for x in range(n))
```
Note that they can only be iterated once.
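A quick sketch of this one-shot behavior: once a generator iterator is exhausted, iterating it again yields nothing.

```python
squares = (x * x for x in range(5))
print(list(squares))   # [0, 1, 4, 9, 16]
print(list(squares))   # []: the generator iterator is already exhausted
```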
Improved Performance
This point is mainly about memory usage. Because iterators do not store all values, but dynamically compute each value of the sequence on the fly and discard previous values, they are very memory-efficient, as demonstrated in the previous example. Here is another more practical example.
Suppose we have a very large file (say, 8 GB), but your computer only has 4 GB of memory. How do you process it with Python?
The answer is to use yield to construct a generator iterator:
```python
def read_by_chunks(file, chunk_size=1024):
    """Read a file piece by piece, lazily."""
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data

with open('your_big_file.dat', 'rb') as f:
    for chunk in read_by_chunks(f):
        process(chunk)  # handle one chunk at a time
```
This method of reading a file chunk by chunk using a generator is also called lazy loading, or streaming reading, and is a common way to handle large files. If it is a text file and you can read it line by line, the code is even simpler:
```python
with open('your_big_file.txt') as f:
    for line in f:
        process(line)
```
In Python3, a file handle is itself an iterator, which by default splits the file by lines for lazy loading.
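The memory savings described in this section can be observed directly with sys.getsizeof; a small sketch (the exact byte counts vary across Python versions):

```python
import sys

nums_list = list(range(1_000_000))       # stores a million ints up front
nums_gen = (i for i in range(1_000_000)) # stores only a paused stack frame

print(sys.getsizeof(nums_list))  # on the order of megabytes
print(sys.getsizeof(nums_gen))   # a few hundred bytes, regardless of range size
```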
References
- Python 3 Glossary: https://docs.python.org/3/glossary.html
- Python wiki, Generators: https://wiki.python.org/moin/Generators
- Iterator vs Iterable: https://stackoverflow.com/questions/52056146/separating-the-iter-and-next-methods/52056290
- Lazy method for reading a big file: https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python
