
Wednesday, October 1, 2014

Iteration 2: Generators

This is a follow-up to the iteration post, which covered what I consider the basics of iteration. This post covers generators, which are a special type of iterator. I'm going to try to avoid, for the most part, using generators to calculate infinite streams of integers (Fibonacci, et al.).

I also need to admit being a bit of a David Beazley fan boy. His tutorials, writings and talks on generators were extremely influential on my understanding of generators, their uses and composition. As a consequence, I tend to quote him a lot.

What makes a generator a generator

The biggest, most obvious difference between generators and other things that make iterators is that generators yield values rather than return them. Consider the "canonical" implementation of the Fibonacci sequence:

In [1]:
def gen_fib():
    a, b = 0, 1
    yield a
    yield b
    while True:
        yield a+b
        a, b = b, a+b

The way a lot of Fibonacci sequences are implemented is with a list, essentially saying, "I want the first ten Fibonacci numbers":

In [2]:
def list_fib(n):
    fibs = []
    if n < 1:
        pass
    elif n == 1:
        fibs.append(0)
    elif n == 2:
        fibs.extend([0,1])
    else:
        a, b = 0, 1
        fibs.extend([a,b])
        while len(fibs) < n:
            fibs.append(a+b)
            a, b = b, a+b
    return fibs

print(list_fib(0))
print(list_fib(1))
print(list_fib(2))
print(list_fib(10))
[]
[0]
[0, 1]
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

list_fib is a mess: we have to check what was passed in, we need to monitor the size of the list (even using collections.deque doesn't quite solve this problem), and there's a while loop we might hit. But all of this is needed to make sure we correctly construct the list of Fibonacci numbers.

By contrast, the gen_fib function is simple and clean; there's only one form of flow control. Well, it looks like there's only one form of flow control, but the logic inside of it is much more complicated.

Under the hood

So what happens when we call gen_fib?

In [3]:
print(gen_fib())
<generator object gen_fib at 0x7f61c427cb88>

A generator object pops out. Odd. I don't see any object instantiation going on inside of the function. Let's take a closer look using the inspect module; in particular, there are two functions of interest: isgenerator and isgeneratorfunction. Of course, there are also the usual tools of type and hasattr.

In [4]:
from inspect import isgenerator, isgeneratorfunction

f = gen_fib()

print("isgeneratorfunction(gen_fib):", isgeneratorfunction(gen_fib))
print("isgenerator(f):", isgenerator(gen_fib()))
print("type(f):", type(f))
print("type(iter(gen_fib())):", type(iter(f)))
print("Are generators their own iterators?", f is iter(f))
print('f hasattr __iter__:', hasattr(f, '__iter__'))
print('f hasattr __next__:', hasattr(f, '__next__'))
isgeneratorfunction(gen_fib): True
isgenerator(f): True
type(f): <class 'generator'>
type(iter(gen_fib())): <class 'generator'>
Are generators their own iterators? True
f hasattr __iter__: True
f hasattr __next__: True

Under the hood, when Python sees yield inside of a function, it marks that function as a generator function. Calling gen_fib doesn't run the code in its body; instead, out pops a generator object wrapping that logic, which is very similar to what happens when you call MyFibClass and out pops an instance. The actual implementation of generators is beyond the scope of this post, however.
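
One way to convince yourself of this: the body of a generator function doesn't run at all when you call it. Here's a minimal sketch (noisy_gen is my own throwaway example, not code from above) showing that nothing executes until the first next:

def noisy_gen():
    print("body started")    # doesn't run until iteration begins
    yield 1
    print("resumed after the first yield")
    yield 2

g = noisy_gen()               # nothing is printed here
print(next(g))                # "body started", then 1
print(next(g))                # "resumed after the first yield", then 2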

Actually using generators

So, now we have this object, how can we get values out of it? The obvious answer is iteration! However, notice that gen_fib's while loop never exits; it's infinite. Attempting to consume the "whole" sequence will exhaust time (but honestly, it'll probably consume all your memory first, Fibonacci numbers get really big really quickly). But just like with other iterators, next can be used to manually pull values out of it.

In [5]:
f = gen_fib()
print(next(f))
print(next(f))
print(next(f))
print(next(f))
0
1
1
2

However, I mentioned that the flow control is actually more complicated than list_fib's. Here's why, and where the biggest difference between yield and return becomes apparent:

  • Assign the result of gen_fib() (a generator object) to f
  • Call next on f and print the value
    • This runs the code down until the first yield statement which is:
    • Assign 0 to a and 1 to b
    • yield a (which is 0)
  • Call next on f and print the value
    • yield b (which is 1)
  • Call next on f and print the value
    • enter the while loop (its condition is always true)
    • yield a+b (1)
  • Call next on f and print the value
    • assign b to a, assign a+b to b (this happens via tuple unpacking)
    • while loop condition is still true
    • yield a+b (2)

If it's not obvious, a generator function is a function that "pauses" its execution when it hits yield, which also causes it to spit out a value. Regular functions get one chance to run and return a value: they take some arguments, do something and then return a value. A generator takes some arguments, does something, yields a value and waits around to be called again. By not building a list that is potentially massive, generators can be incredibly memory efficient.
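
If you want to watch the pausing happen, inspect.getgeneratorstate (available since Python 3.2) reports whether a generator hasn't started yet, is suspended at a yield, or has finished. A quick sketch using the gen_fib from above:

from inspect import getgeneratorstate

f = gen_fib()
print(getgeneratorstate(f))   # GEN_CREATED: the body hasn't run yet
next(f)
print(getgeneratorstate(f))   # GEN_SUSPENDED: paused at the first yield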

This opens up a lot of possibilities:

  • Creating infinite sequences (perhaps the most mundane application, although incredibly handy)
  • Processing a large amount of data.
  • Creating a long sequence that you might not need all the values from (say I have a function that returns twenty items but I only need the first ten; see the itertools.islice sketch after this list)
  • Creating a long sequence where you don't need all the data at once (memory concerns, etc)
  • And others!
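
For that "only need the first ten" case, the usual trick is to bolt a slice onto the stream with itertools.islice, which is itself lazy. A quick sketch using gen_fib:

from itertools import islice

print(list(islice(gen_fib(), 10)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34] -- same result as list_fib(10), no special casing needed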

And what's really cool: if you have a simple generator -- something like "for each of these lines, yield one if it meets this simple requirement" -- you can write it as a generator expression. Here's a very contrived example.

In [6]:
nums = (num for num in range(1, 100) if not num%3)
print(list(nums))
[3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99]
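
For reference, that expression does roughly the same work as spelling out a small generator function (multiples_of_three is my name for it, purely for comparison):

def multiples_of_three():
    for num in range(1, 100):
        if not num % 3:
            yield num

print(list(multiples_of_three()) == list(num for num in range(1, 100) if not num % 3))
# True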

Gotchas

The biggest, most important gotcha is that generators are forgetful. What I mean by that is that when a generator hits yield, it sends a value out. Unless the internal state that created that value comes around again, you can't get it back from the generator. Unlike lists and other types of iterables, where you can iterate over the same object multiple times, generators are one time use. You can create multiple generator objects from the same function and iterate over each (and each will maintain its own state).
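
Here's a tiny sketch of what "one time use" looks like in practice:

nums = (n for n in range(3))
print(list(nums))   # [0, 1, 2]
print(list(nums))   # [] -- the generator is already exhausted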

As a consequence of this, searching a generator with in or by explicitly hitting __contains__ will partially or wholly consume the generator. This is because in asks the generator, "Hey, do you ever yield this value?" The generator gets to work yielding values out until one matches, and all the values it yielded along the way are gone. I suppose this could potentially be helpful in some situations, but I do want this caveat to be known.
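
A short sketch of that caveat, again with gen_fib:

f = gen_fib()
print(3 in f)    # True, but 0, 1, 1 and 2 were yielded and thrown away to find it
print(next(f))   # 5 -- the search left the generator sitting just past 3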

Another thing that will catch people off guard the first few times is that generators don't have a __len__ method. Essentially, a generator has no idea how many values it will yield; it can't tell you that. Generators are also non-indexable for the same reason.
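
Both limitations show up as a TypeError if you try. A quick sketch:

f = gen_fib()
try:
    len(f)
except TypeError as e:
    print(e)    # generators have no len()
try:
    f[0]
except TypeError as e:
    print(e)    # and they aren't subscriptable either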

Delegating

Up until now, these examples have been compatible between Python 2 and 3. However, Python 3.3 made a change to yield that took it from awesome to amazing: yield gained an optional keyword, from. yield from delegates access from the current generator to one it's calling. The simplest way to understand this is to know that the next two code snippets output the same thing:

In [7]:
def my_chain_1(*iters):
    '''This is actually the example code in the 
    itertools module for chain
    '''
    for it in iters:
        for item in it:
            yield item

def my_chain_2(*iters):
    '''This example completely removes a loop
    '''
    for it in iters:
        yield from it

a = range(2)
b = range(5,7)

print(list(my_chain_1(a,b)))
print(list(my_chain_2(a,b)))
[0, 1, 5, 6]
[0, 1, 5, 6]

The true power of yield from is that when you call my_chain_2 you aren't simply being fed values from an inner generator; you are interacting directly with the inner generator. The impact of this is profound. However, you don't need to construct an event loop to make use of it.
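
A small sketch of what "interacting directly" means (inner and outer are mine, just for illustration): values passed to send on the outer generator travel straight through yield from to the inner one.

def inner():
    while True:
        received = yield
        print('inner got:', received)

def outer():
    yield from inner()

g = outer()
next(g)            # prime the generator, running it to inner's first yield
g.send('hello')    # prints "inner got: hello" -- send reaches inner directly
g.send('world')    # prints "inner got: world"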

Real world use

My canonical example is walking my ~/Music directory and doing something with...

In [8]:
%%bash
find ~/Music -type f -name '*.mp3' | wc -l
18868

...that number of files. To be honest, I'm not concerned about creating a list in memory of 19k file paths (typically a one-time operation that I run every now and then). What I'm concerned with is processing 19k file paths in a timely fashion, let alone opening up the files, pulling information out of them and handling them. For the time being, I'm only going to operate on a very small subset of my library. I'm also going to show off how to build "generator pipelines", as Dave Beazley calls them.

I do use a third-party library here: mutagenx, a Python 3 implementation of mutagen.

In [9]:
import os

from pprint import pprint

from mutagenx import File

valid_types=('m4a', 'flac', 'mp3', 'ogg', 'oga')

def find(basedir, valid_types=valid_types):
    '''Utilize os.walk to only select out the files we'd like to potentially
    parse and yield them one at a time.'''
    basedir = os.path.abspath(basedir)
    for current, dirs, files in os.walk(basedir):
        files = filter(lambda f: f.endswith(valid_types), sorted(files))
        files = [os.path.join(current, f) for f in files]

        if files:
            yield from files

def adaptor(track):
    '''Take in a Mutagen track object and
    parse it into a dictionary.
    '''
    return dict(
        artist=track['artist'][0],
        album=track['album'][0],
        position=int(track['tracknumber'][0].split('/')[0]),
        length=int(track.info.length),
        location=track.filename,
        name=track['title'][0],
        )

def process_directory(basedir, valid_types=valid_types):
    '''Hook up find and adaptor into a pipeline'''
    
    files = find(basedir, valid_types)
    tracks = (File(f, easy=True) for f in files)
    yield from (adaptor(t) for t in tracks)

tracks = process_directory('/home/justanr/Music/At the Drive-In/Relationship of Command')
pprint(next(tracks))
{'album': 'Relationship of Command',
 'artist': 'At the Drive‐In',
 'length': 177,
 'location': '/home/justanr/Music/At the Drive-In/Relationship of Command/01 '
             'Arcarsenal.mp3',
 'name': 'Arcarsenal',
 'position': 1}

Because of the nature of generators, only one item gets pulled down the line at a time. So, instead of processing the whole directory at once, we can process the directory one file at a time.
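
That also means slicing the pipeline only does the work for the items actually pulled. A sketch using itertools.islice on the same directory as above (each File call, and therefore each file read, only happens on demand):

from itertools import islice

tracks = process_directory('/home/justanr/Music/At the Drive-In/Relationship of Command')
for info in islice(tracks, 3):
    print(info['position'], info['name'])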

yield and yield from also make processing trees easy. In the post on iteration, I showed a code example from the Python Cookbook that used a very complex iterator to maintain the state of iterating depth first over a tree. This is the same class built with generators:

In [10]:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)

    def depth_first(self):
        yield self
        for c in self:
            yield from c.depth_first()
            
root = Node(0)
child1 = Node(1)
child2 = Node(2)
root.add_child(child1)
root.add_child(child2)
child1.add_child(Node(3))
child1.add_child(Node(4))
child2.add_child(Node(5))

for ch in root.depth_first():
    print(ch)
Node(0)
Node(1)
Node(3)
Node(4)
Node(2)
Node(5)

The entire Node class is 18 lines. The DepthFirstIterator alone is 30 lines. The logic is less complex.

And of course, you can do bad things, too.

In [11]:
from itertools import chain

class IterInt(int):
    
    def __iter__(self):
        yield self

def countdown(n):
    n = IterInt(n)
    if n < 0:
        # wait... why are we returning here?
        # I thought generators never returned!
        return "Countdown finished"
    else:
        yield from chain(n, countdown(n-1))


print(list(countdown(10)))
try:
    next(countdown(-1))
except StopIteration as wut:
    # no seriously...wut
    print(wut)
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
Countdown finished

Consider it incentive for me to write more about generators.

Further Reading

Generators are a really broad topic. For something that ranges from creating infinite sequences to serving as the backbone of event loops, there's quite a bit of ground to cover, and there are a lot of really smart people out there writing things about them.
