A collection of computer, gaming and general nerdy things.

Monday, September 29, 2014

Describing Descriptors

An Aside: Just like my post on decorators, I've decided to rewrite this post as well because it suffered from the same issue: "Look at all this code...and hey, there's explainations as well." Instead of exploring patterns, like I did in the decorator post, I'm going to focus in on one example use that explores several aspects of descriptors all at once. Of course, I'll step through it piece by piece. There's actually going to be two major sections to this post:

  • Python Object Attribute Access
  • Writing Our First Descriptor

Also, this post was built with Python 3.4 in mind. While it's foolish to think that everyone everywhere is using the latest and greatest Python release, it's what I've been using primarily lately. That's me.

Updated Nov. 8th, 2014: Added concrete examples of behind the scenes action of descriptors as well as a brief explaination of what __delete__ does.

Controlling Attribute Access

Before digging into descriptors, it's important to talk about attribute access. Because at the end of the day, that's what descriptors do for us. There's really two ways of going about this with explicitly building our own descriptor.

Getters and Setters

This is what you'll see in many languages: explicit getters and setters. They're methods that handle attributes for us. This is very common in Java and PHP (or at least as of the last time I seriously used PHP). Essentially, the idea is to always expect to interact with a method instead of an attribute itself. There's nothing wrong with this if it's what your language of choice supports and you need to control access.

In [1]:
# it's a contrived example
# but bear with me here
# pretend this is *important business logic*
def to_lower(value):
    return value.lower()

class Person:
    def __init__(self, name):
        self.__name = None
        self.set_name(name)
    
    def get_name(self):
        return self.__name
    
    def set_name(self, name):
        self.__name = to_lower(name)

monty = Person(name="John Cleese")
print(monty.get_name())
monty.set_name("Eric Idle")
print(monty.get_name())
john cleese
eric idle

That's fine and dandy. If that's what you language supports. Python, in my opinion, handles this better.

@property

The real way you'd write this in Python is by using @property. Which is a decorator, which we are familiar with. I won't go into details about what's going on behind the scenes yet.

In [2]:
class Person:
    
    def __init__(self, name):
        self.__name = None
        self.name = name
    
    @property
    def name(self):
        return self.__name
    
    @name.setter
    def name(self, name):
        self.__name = to_lower(name)

monty = Person(name="Graham Chapman")
print(monty.name)
monty.name = "Terry Gilliam"
print(monty.name)
graham chapman
terry gilliam

For now, don't worry that I have two methods called name. It'll become apparent in a little bit.

That is a much cleaner interface to the class. As far as the calling code is concerned, name is just another attribute. This is fantastic if you designed an object and later realized that you need to control how an attribute is returned or set (or deleted, but I'm not going to delve into that aspect of descriptors at all in this post).

To many people, this is where they'd stop with controlling attribute access. And frankly, I don't really blame them. @property seems to be magic enough behind the scenes already. Why is it disappearing into name, where does setter come from, how does it work? These are questions that might go unasked or, worse: unanswered.

@property suffices in most situations, especially when you only need to control one attribute in a specific way. But imagine if we had to ensure two or three attributes were all lower case? You might be tempted to replicate the code all the way down. You don't mind a little repitition, do you?

In [3]:
class Person:
    
    def __init__(self, email, firstname, lastname):
        self.__f_name = None
        self.__l_name = None
        self.__email = None
        self.f_name = firstname
        self.l_name = lastname
        self.email = email
    
    @property
    def f_name(self):
        return self.__f_name
    
    @f_name.setter
    def f_name(self, value):
        self.__f_name = to_lower(value)

    @property
    def l_name(self):
        return self.__l_name
    
    @l_name.setter
    def l_name(self, value):
        self.__l_name = to_lower(value)

    @property
    def email(self):
        return self.__email
    
    @email.setter
    def email(self, value):
        self.__email = to_lower(value)

monty = Person(firstname='Michael', lastname='Palin', email='MichaelPalin@montypython.com')
print(monty.f_name, monty.l_name, monty.email)
michael palin michaelpalin@montypython.com

Like I said, it's a contrived example. But instead of ensuring things are lower cased, imagine you're attempting to keep text fields in a GUI synchronized with the state of an object or you're working with a database ORM. Things will very quickly get out of hand if you have to property a ton of stuff with the logic repeated except for the names.

Behind the Scenes

I'm not going to 100% faithfully recreate @property here, frankly I don't see a point. But I do want to dissect it from what we can observe on the surface.

  • property is a decorator. As a decorator it's a function that takes a function and returns a callable. This we know.
  • The callable created by property is an object
  • The object has at least one method on it, setter, that somehow controls how a variable is set

However, if we inspect property (my preferred way is with IPython's ? and ?? magics) we learn that there is one more method and three attributes that aren't immediately obvious to us.

  • deleter is the missing method, which handles how an attribute is deleted with del
  • fget, fset and fdel are the attributes, which are unsurprisingly the original functions for getting, setting and deleting attributes.

For the above example, fget and fset are our two name methods above. They actually get hidden away into an object decorator, which is how we have two methods with the same name without worry.

Data model

This about as far as we can get without understanding how Python accesses attributes on objects. I won't attempt to give a complete in depth analysis of how Python actually accesses and sets attributes on objects, but this is a simplified, high level view of what's going on (ignoring the existence of __slots__, which actually replaces the underlying __dict__ with a set of descriptors and what's going on there I'm not 100% sure of).

  1. Call __getattribute__
  2. Look up in the object's dictionary
  3. Look up in the class's dictionary
  4. Walk the MRO and look up in those dictionaries
  5. If the name resolves to a descriptor, return the value of it's __get__ method
  6. If all other look ups have failed and __getattr__ is present, call that method
  7. If all all else fails, raise an AttributeError

That fifth point is the most pertinent to us. An object that defines a __get__ method is known as a descriptor. property is actually an object that does this. In effect, it's __get__ method resembles this:

In [4]:
def __get__(self, instance, type=None):
    return self.fget(instance)

Similarly, the resolution for setting an attribute looks like this:

  1. If present, call __setattr__
  2. If the name resolves to a descriptor, call it's __set__ method
  3. Stuff the value into the object's dict

So, property's __set__ looks like this:

In [5]:
def __set__(self, instance, value):
    self.fset(instance, value)

I'm glossing over raising attribute errors for attributes that don't support reading or writing, but property does that. There's two reasons I'm positive this is how these two methods look is because I'm familar with the descriptor protocol and Raymond Hettinger wrote about it here.

Descriptor Protocol

There's three parts to the descriptor protocol: __get__, __set__ and __delete__. Like I said, I'm not delving into deleting attributes here, so we'll remain unconcerned with that. But any object that defines at least one of these methods is a descriptor.

The best way to dissect a descriptor is provide an example implementation. This example isn't actually going to do anything except emulate regular attribute access.

In [6]:
class Descriptor:
    
    def __init__(self, name):
        self.name = name
    
    def __get__(self, instance, cls):
        return instance.__dict__.get(self.name, None)
    
    def __set__(self, instance, value):
        instance.__dict__[self.name] = value
        
    def __delete__(self, instance):
        del instance.__dict__[self.name]

class Thing:
    frob = Descriptor(name='frob')
    
    def __init__(self, frob):
        self.frob = frob

t = Thing(frob=4)
print(t.frob)
4

get

def __get__(self, instance, type):

  • self is the instance of the descriptor itself, just like any other object.
  • instance is an instance of the class it's attached to
  • type is the actual object that it's attached to, I typically prefer to use cls as the name here because it's slightly more clear to me. owner is another common name, but slightly more confusing to me.

Above, when we request t.frob what's actually happening behind the scenes is Python is calling Thing.frob.__get__(t, Thing) instead of passing Thing.frob.__get__ directly to the print function. The reason the actual class is passed as well is "to give you information about what object the descriptor is part of", to quote Chris Beaumont. While I've not made use of inspecting which class the descriptor is part of, this could be valuable information.

You could also call Thing.frob.__get__(t, Thing) explicitly if you'd like, but Python's data model will handle this for us.

set

def __set__(self, instance, value):

  • self again no surprises here, this is the instance of the descriptor
  • instance this is an instance of the class it's attached to
  • value is the value you're passing in, if you've used property before, there's no surprise here.

Again, what's happening behind the scenes when we set t.frob to something (in this case, just in Thing.__init__), Python passes information to Thing.frob.__set__, the information just being the instance of Thing and the value we're setting.

delete

def __delete__(self, instance):

No surprises here. And despite that I said I wasn't going to go into deleting attributes with descriptors, I've included it for completion's sake. The delete method handles what happens when we call del t.frob.

Data vs Non-Data Descriptor

This is something you're going to encounter when reading about and working with descriptors: the difference between a data and non-data descriptor and how Python treats both when looking up an attribute.

Data Descriptor

A data descriptor is a descriptor that defines both a __get__ and __set__ method. These descriptors recieve higher priority if Python finds a descriptor and a __dict__ entry for the attribute being looked up. Already, you can see that attribute access isn't as clear cut as we thought it was.

Non-Data Descriptor

A non-data descriptor is a descriptor that defines only a __get__ method. These descriptors recieve a lower priority if Python finds both the descriptor and a __dict__ entry.

What does this mean?

By using descriptors we can create reusable properties, as Chris Beaumont calls them and I find to be an incredibly apt definition. But there's quite a few pits we can fall into. For the rest of this post, I'm going to focus on rebuilding our lower case properties as a reusable descriptor. In another post, I'm going to more fully explore some of the power these put at our finger tips.

Our First Descriptor

So far, we know a descriptor needs to define at least one of __get__, __set__, or __delete__. Let's try our hand at building a LowerString descriptor.

In [7]:
class LowerString:
    
    def __init__(self, value=None):
        self.value = value
    
    def __get__(self, instance, cls):
        return self.value
    
    def __set__(self, instance, value):
        self.value = to_lower(value)

class Person:
    f_name = LowerString()
    l_name = LowerString()
    email = LowerString()
    
    def __init__(self, firstname, lastname, email):
        self.f_name = firstname
        self.l_name = lastname
        self.email = email
        
monty = Person(firstname="Terry", lastname="Jones", email="TerryJONES@montyPython.com")
print(monty.f_name, monty.l_name, monty.email)
terry jones terryjones@montypython.com

While this isn't as perfectly clean as we might like, it's certainly a lot prettier than using a series of property decorators and way nicer than defining explicit getters and setters. However, there's a big issue here. If you can't spot it, I'll point it out.

In [8]:
me = Person(firstname="Alec", lastname="Reiter", email="alecreiter@fake.com")
print(me.f_name, me.l_name, me.email)
print(monty.f_name, monty.l_name, monty.email)
alec reiter alecreiter@fake.com
alec reiter alecreiter@fake.com

...oh. Well that happened. And the reason for this, and this what tripped me up when I began reading about descriptors, is that each instance of person shares the same instances of LowerString for the three properties. Descriptors enforce a shared state by virture of being instances attached to a class rather than instances. So instead of composing an instance of an object with other objects (say a Person object composed of Job, Nationality and Gender instances), we compose a class out of object instances.

If we examine the __dict__ for both the class and the instance, it becomes apparent where Python finds these values at:

In [9]:
print(Person.__dict__)
print(monty.__dict__)
{'__doc__': None, '__weakref__': <attribute '__weakref__' of 'Person' objects>, '__init__': <function Person.__init__ at 0x7f337e2e58c8>, 'f_name': <__main__.LowerString object at 0x7f337ce4a5c0>, 'l_name': <__main__.LowerString object at 0x7f337ce4a550>, 'email': <__main__.LowerString object at 0x7f337ce4a630>, '__module__': '__main__', '__dict__': <attribute '__dict__' of 'Person' objects>}
{}

Since the descriptors aren't attached at the instance level, Python moves up to the class level where it finds the attribute we're requesting, sees it's a descriptor and then calls the __get__ method.

If you attempt to attach these descriptors at the instance level instead, you end up with this:

In [10]:
class Person:
  
    def __init__(self, firstname, lastname, email):
        self.f_name = LowerString(firstname)
        self.l_name = LowerString(lastname)
        self.email = LowerString(email)
        
me = Person(firstname="Alec", lastname="Reiter", email="alecreiter@fake.com")
print(me.f_name, me.l_name, me.email)
<__main__.LowerString object at 0x7f337ce4e390> <__main__.LowerString object at 0x7f337ce4e400> <__main__.LowerString object at 0x7f337ce4e438>

So explicitly attaching them to the instances won't work. But remember, Python passes the instance for us automatically. Let's try storing the value on the underlying object by accessing it's __dict__ attribute:

In [11]:
class LowerString:
    
    def __init__(self, label):
        self.label = label
    
    def __get__(self, instance, cls):
        return instance.__dict__[self.label]
    
    def __set__(self, instance, value):
        instance.__dict__[self.label] = to_lower(value)
        
class Person:
    f_name = LowerString('f_name')
    l_name = LowerString('l_name')
    email = LowerString('email')
    
    def __init__(self, firstname, lastname, email):
        self.f_name = firstname
        self.l_name = lastname
        self.email = email
        
monty = Person(firstname="Carol", lastname="Cleaveland", email="seventh@montypython.com")
print(monty.f_name, monty.l_name, monty.email)
carol cleaveland seventh@montypython.com

And surely this works, but we've run into the issue of repeating ourselves again. It'd be nice if we could simply do something to automatically fill in the label for us. David Beazley addressed this problem in the 3rd Edition of the Python Cookbook.

In [12]:
class checkedmeta(type):
    def __new__(cls, clsname, bases, methods):
        # Attach attribute names to the descriptors
        for key, value in methods.items():
            if isinstance(value, Descriptor):
                value.name = key
        return type.__new__(cls, clsname, bases, methods)

Of course this means, we need to make two small changes to our descriptor: changing label to name and inheriting from a base Descriptor class.

In [13]:
class Descriptor:
    
    def __init__(self, name=None):
        self.name = name

class LowerString(Descriptor):
    
    def __get__(self, instance, cls=None):
        return instance.__dict__[self.name]
    
    def __set__(self, instance, value):
        instance.__dict__[self.name] = to_lower(value)
        
class Person(metaclass=checkedmeta):
    f_name = LowerString()
    l_name = LowerString()
    email = LowerString()
    
    def __init__(self, firstname, lastname, email):
        self.f_name = firstname
        self.l_name = lastname
        self.email = email
        
monty = Person(firstname="Carol", lastname="Cleaveland", email="seventh@montypython.com")
print(monty.f_name, monty.l_name, monty.email)
carol cleaveland seventh@montypython.com

And this is very nice and handy. If later, we wanted to create an EmailValidator descriptor, so long as we adhere to the pattern laid out here, we can attach them to any class that uses the checkedmeta metaclass and it'll behave as expected.

But there's something still very annoying going on and it's one of the biggest gripes with property is that a getter has to be defined even if I'm only interested in the setter. If you set fget to None, you end up getting an attribute error that says it's write only. If we examine our current implementation, we'll notice something else as well:

In [14]:
print(Person.__dict__)
print(monty.__dict__)
{'__doc__': None, '__weakref__': <attribute '__weakref__' of 'Person' objects>, '__init__': <function Person.__init__ at 0x7f337ce38c80>, 'f_name': <__main__.LowerString object at 0x7f337ce31588>, 'l_name': <__main__.LowerString object at 0x7f337ce31550>, 'email': <__main__.LowerString object at 0x7f337ce316a0>, '__module__': '__main__', '__dict__': <attribute '__dict__' of 'Person' objects>}
{'email': 'seventh@montypython.com', 'f_name': 'carol', 'l_name': 'cleaveland'}

There's now the descriptors living at the class level and the values living at the instance level. Let's add some "debugging" print calls to see what's happening on the inside.

In [15]:
class LowerString(Descriptor):
    
    def __init__(self, name=None):
        self.name = name
    
    def __get__(self, instance, cls=None):
        print("Calling LowerString.__get__")
        return instance.__dict__[self.name]
    
    def __set__(self, instance, value):
        print("Calling LowerString.__set__")
        instance.__dict__[self.name] = to_lower(value)
        
class Person(metaclass=checkedmeta):
    f_name = LowerString()
    l_name = LowerString()
    email = LowerString()
    
    def __init__(self, firstname, lastname, email):
        self.f_name = firstname
        self.l_name = lastname
        self.email = email
        
monty = Person(firstname="Carol", lastname="Cleaveland", email="seventh@montypython.com")
print(monty.f_name, monty.l_name, monty.email)
Calling LowerString.__set__
Calling LowerString.__set__
Calling LowerString.__set__
Calling LowerString.__get__
Calling LowerString.__get__
Calling LowerString.__get__
carol cleaveland seventh@montypython.com

Python gives special preference to data descriptors (as described before). However, we can remove this special preference by simply removing the __get__ method. Arguably, this is the most useless part of this descriptor anyways, it's not transforming the result or providing a lazy calculation, it's simply aping what Person.__getattribute__ would do in the first place: Find the value in the object's dictionary. If we remove this, then we're left with only a setter, which is what we really wanted in the first place:

In [16]:
class LowerString(Descriptor):
    
    def __init__(self, name=None):
        self.name = name
    
    def __set__(self, instance, value):
        print("Calling LowerString.__set__")
        instance.__dict__[self.name] = to_lower(value)
        
class Person(metaclass=checkedmeta):
    f_name = LowerString()
    l_name = LowerString()
    email = LowerString()
    
    def __init__(self, firstname, lastname, email):
        self.f_name = firstname
        self.l_name = lastname
        self.email = email
        
monty = Person(firstname="Carol", lastname="Cleaveland", email="seventh@montypython.com")
print(monty.f_name, monty.l_name, monty.email)
monty.f_name = "Cheryl"
print(monty.f_name)
Calling LowerString.__set__
Calling LowerString.__set__
Calling LowerString.__set__
carol cleaveland seventh@montypython.com
Calling LowerString.__set__
cheryl

And this has to do with how Python sets attributes, which examined above. Again, it's giving special precedence to the descriptor, which is what we want in the first place. However, when we access the attribute, it sees there's an entry in the object's __dict__ and the class's __dict__ but the latter doesn't have a __get__ method, which causes it to default back to the object's entry.

In [17]:
print(Person.__dict__)
print(monty.__dict__)
{'__doc__': None, '__weakref__': <attribute '__weakref__' of 'Person' objects>, '__init__': <function Person.__init__ at 0x7f337ce43730>, 'f_name': <__main__.LowerString object at 0x7f337ce55710>, 'l_name': <__main__.LowerString object at 0x7f337ce55748>, 'email': <__main__.LowerString object at 0x7f337ce55780>, '__module__': '__main__', '__dict__': <attribute '__dict__' of 'Person' objects>}
{'email': 'seventh@montypython.com', 'f_name': 'cheryl', 'l_name': 'cleaveland'}

Leaving us with just the setter and no unnecessary data or method duplication.

Going forward

In the original post, I also explored building an oberserver pattern with descriptors, something Chris Beaumont also touches upon briefly but leaves a lot on the table as far registering callbacks on every instance and specific instances of classes. I plan on touching on this again in a future post.

But for now, I'm hoping this leaves a much better impression of descriptors than my original post. Again, this isn't meant to be a tell all about descriptors but hopefully serves to clarify a lot of the magic that appears to happen behind the scenes when you're using SQLAlchemy and defining models.

Further Learning

No comments:

Post a Comment