A collection of computer, gaming and general nerdy things.

Sunday, January 18, 2015

Self-Describing Functions

This post isn't about writing functions. It's not about what *args and **kwargs mean, closures, decorators, callbacks, or def. It's about writing self-describing functions -- that nebulous thing everyone knows they should be doing and has a rough idea of what it means, but that no one can quite pin down.

Before I dive into this, I will point out that way smarter people than I have written entire books about this very thing. This is more a distillation of what I've found works best for me. Your experience and philosophies on software design may lead you down an alternative path.

Since functions are the most basic code block in Python, it's easiest to lay out my thoughts with them. From there, they're easily extrapolated to classes, modules and packages.

Describing Self-Describing

"Self-describing" is a lot like "synergy" or "webscale": people have ideas and vague definitions of what it means to be "synergistic" or "webscale," but mostly it ends up sounding like a CutCo sales pitch.

To me, self-describing means a number of things, but the most important bits are:

  • Well Named
  • Avoids "Ands"
  • Documentation
  • Clean, Concise Code

Well Named

There's a joke in metal circles that the late guitarist for Mayhem -- Euronymous -- chose his name because he thought it was a Greek demon, but actually it's Greek for "well named." I'm unsure of the validity of the rumor (It's all Greek to me), but I've always been amused by it.

The importance of choosing a good name for functions can't be overstated. Since functions take a snippet of code and turn it into a reusable piece, when you see that piece being used you need to know exactly what it's doing. For a minute, let's place ourselves as researchers in the field of Frobincated Mathematics. We spend all day calculating frobs. Notebooks, white boards and even our code bases are littered with the Frobincator Theorem:

In []:
frob = (a + b) * x

A simple, yet powerful formula in our everyday work to find ever bigger and better frobs. However, we type it out a lot. One of our CompSci friends points out, "Hey, why don't you make that a function instead of typing it everywhere?" Which leads to a debate on the differences between mathematical and programmatic functions that is neither here nor there. Eventually, we do, and we end up with this:

In [1]:
def calc(a, b, x):
    return (a + b) * x

See any problems? I do. While we've taken a common piece of our code base and made it reusable, we've actually obfuscated our intent! calc? calc what? Pythag triples? Fibonacci numbers? Our most precious frobs? If we had to get our CompSci buddy to help us debug an issue, he's gonna wonder exactly that. He'll either ask us what it means and does, or he'll run grep -ir "def calc" . to find it. Either way, we've done ourselves a great disservice. But if instead we change it, ever so slightly, to this:

In [2]:
def calc_frob(a, b, x):
    ...

Then, everyone knows that it's calculating frobs! Sure, we may still need to track down the actual source code to see if it's causing problems, but there's no headscratching at what it does.

Currently, in one of my projects' code bases, there is a function that isn't well named. It's called find. What does find find? Well...actually, it's more of a filter than a finder. Sure, it uses os.walk to get file names, but it doesn't really find things so much as make sure we only grab the correct file types (based on the fool-proof method of trusting the extension...). Had I named it filter_file_exts or sift_extensions, there wouldn't be an issue.
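
Just to make that concrete, here's a rough sketch of what the better-named version might look like -- the name sift_extensions, its signature, and the default extensions are all made up for this example rather than being the project's actual code:

In []:
import os

def sift_extensions(filenames, allowed_exts=('.mp3', '.flac', '.ogg')):
    """Keeps only the file names whose extension is in allowed_exts."""
    return [name for name in filenames
            if os.path.splitext(name)[1].lower() in allowed_exts]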

But sadly, I have learned the hard way, so you don't have to. Recently, I had to revisit some old PHP I wrote between two and four years ago -- it was a time of drunken code debauchery. Some day you'll write code that you'll need to revise two years -- or even two days -- later and you'll look at a function call and go "What in the world does \manager\extractAll even do?" Do yourself a favor now, and start naming and renaming functions to better things. Even if naming things is hard.

The same logic applies to variables -- inside and outside functions. In our calc_frob function, it makes sense what a, b and x are doing. But if you have a thirty-line function, will you remember what x means by line fifteen? If you're dealing with interest on a loan, try naming it interest_percent, or, if that's too long, percent -- as long as it's still clear.

There are shorthand variables that most people will recognize, like fh for a file handle. And if you're opening (and closing!) a single file, then name it fh. It won't confuse me or you or anyone else involved. But if you're opening two files -- maybe you're shuffling data between them -- then don't name them fh1 and fh2; consider source_fh and target_fh instead.
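
Something like this reads far better than fh1 and fh2 ever would (the file names here are just placeholders for the example):

In []:
# Shuffle data from one file to another with names that say which is which.
with open('raw_tags.txt') as source_fh, open('clean_tags.txt', 'w') as target_fh:
    for line in source_fh:
        target_fh.write(line.strip().lower() + '\n')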

Avoids "Ands"

By "ands" I of course don't mean things like boolean ands or if/elif/else blocks, rather I mean when you're describing the function you don't need to use the word "and" to get it's job across. "This functions gathers a list of file names AND filters them." Why not write a function that just filters the file names and pass it an iterable of file names?

This ties into a later point, but in that same project code base there's a function called "store_directory" -- which I consider well-ish named, since it makes sense in the context of its module. This thing is 54 lines long and does way more than just stuff ID3 info into a database: it outputs to the screen, times the operation, counts the number of files, massages ID3 info into a dictionary, converts the dictionary to data models, figures out if genres are being tagged, what genres to tag, who's tagging, AND stuffs the data into the database. To be a little fair to myself, some of that work (not much) is farmed out to other functions, but it's a mess of try-except-finally blocks, loops, and conditionals. The actually important part of the function is so buried that, when I get around to refactoring it, I'll probably write a Cliff's Notes comment about it and just :dG. Goodbye, horrible function!

Why? Because my application isn't a single monolithic function that does twelve different things. It's a series of cooperative pieces that work together to do something. If your coffee grinder also had a salad mixer built in, would you consider it a well-designed coffee grinder? Who even has salad and coffee together? Maybe a fruit salad for breakfast, but still.

By contrast, there's a function called break_tag at the top of the file which has six lines of documentation and four examples for one line of code:

In [3]:
import re

def break_tag(tag, breakers=('\\\\', '/', '&', ',', ' ', '\\.')):
    '''Breaks a composite tag into smaller pieces based on certain
    punctuation. Smaller tags are stripped of excess white space before
    being placed into a set.

    .. code-block:: python
        break_tag('viking / folk')  # -> {'viking', 'folk'}
        break_tag('dance & pop') # -> {'dance', 'pop'}
        break_tag('thrash metal') # -> {'thrash', 'metal'}
        break_tag('progressive metal / black metal')
            # -> {'progressive', 'black', 'metal'}

    :param tag str: The tag to be analyzed and broken.
    :param breakers iterable: A group of characters to break larger tags with.
    :returns set:
    '''
    return set(t.strip() for t in re.split('|'.join(breakers), tag) if t.strip())

Is there a technical "and" in there? Sure, if you want to split hairs, it does break a larger tag down and strips white space off the smaller bits, but if you (like me) consider stripping the white space off to be part of breaking the bigger tag down, then there's not. Plus, I didn't have to make up some obtuse phrase like "atomic tags."

Rather than writing functions that are described by "and," why not write pure functions and pass them data? Pure functions are wonderful: they don't have side effects, they do one thing, you can have deterministic results!

Think of the simple unit tests you can write! You know how I'd test break_tag? I'd give it three inputs that should do the right thing, and three inputs that should do the "wrong" thing, and verify those. How would I test store_directory? ...I didn't, because by the time I started mocking and patching things to make it "testable" I'd written about ninety lines of code that were just setup. And then I threw it away and called it a failure.
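
For break_tag, a minimal sketch is enough -- just the examples already sitting in the docstring, plus an input that shouldn't break apart at all (the test name and plain asserts are my choice; any runner would do):

In []:
def test_break_tag():
    # Composite tags split on the punctuation and get stripped.
    assert break_tag('viking / folk') == {'viking', 'folk'}
    assert break_tag('dance & pop') == {'dance', 'pop'}
    assert break_tag('thrash metal') == {'thrash', 'metal'}
    # Inputs with nothing to break stay whole (or come back empty).
    assert break_tag('pop') == {'pop'}
    assert break_tag('') == set()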

Will you always avoid "ands"? No, because just like your application isn't a single god function, it also isn't a set of completely independent operations. You can't expect a black box with no entry point and no exit point to do anything. It won't even heat up when you turn it on, because you can't turn it on. You'll have "ands," but they should appear where you're retrieving data from I/O, processing it in the pure functions, and then spitting it back out to I/O. "Ands" should be used to wire pieces together.
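
To make that shape concrete, here's a rough sketch that reuses break_tag from earlier -- the function name and file paths are invented for the example, not real code from the project:

In []:
def summarize_tags(tag_path, report_path):
    """Glue code: pull raw tags in from disk, run them through the pure
    break_tag function, and write the results back out."""
    with open(tag_path) as source_fh:               # I/O in
        raw_tags = [line.strip() for line in source_fh if line.strip()]
    broken = [break_tag(tag) for tag in raw_tags]   # pure work in the middle
    with open(report_path, 'w') as target_fh:       # I/O out
        for pieces in broken:
            target_fh.write(', '.join(sorted(pieces)) + '\n')

All of the "and" lives in this one thin function; everything interesting stays pure and testable.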

Documentation

The first thing I do after I finish writing def calc_frob(a, b, x): is immediately open a docstring. To me, docstrings are the function: they describe what a function does and what the different parameters are, and they might even provide example usage or doctests. Below the docstring is just a language-specific implementation. (a + b) * x is obvious to the point where you might think, "Do I need a docstring?" The answer is yes.

In [4]:
def calc_frob(a, b, x):
    """Calculates a frob based on the Frobincator Theorem."""
    return (a + b) * x

In addition to giving it a concise name, we've also given more information to those that come behind us. Maybe they're unfamiliar with frobincated mathematics and don't know the theorem. Now they do.

Now, what I mean by language-specific implementation is this: even though we could write the exact same function with the exact same implementation in Haskell, someone could also write it like this:

In []:
frob a b x = (*) x $ (+) a b

Which looks completely different but does the exact same thing. A simple comment or docstring above it stating, "This is the Frobincator Theorem" makes it immediately clear what the function does.

Even better, Python makes it ridiculously easy to look at docstrings without having to open a text editor!

In [5]:
print(calc_frob.__doc__)
Calculates a frob based on the Frobincator Theorem.

You can do that in a script or in the shell. IPython does it one better with the ? and ?? line magicks. ? displays the argument signature (if available) and the docstring if it's present, along with a few more things. ?? will pull any Python source code (not C/Java/Whatever, just pure Python) available for the function (or class, object, module, etc).
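
For example, from an IPython prompt:

In []:
calc_frob?

In []:
calc_frob??

The first gives you the signature and docstring; the second tacks the source code on as well.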

Since Python makes it easy to get to docstrings, I can't think of a valid reason not to write them. And no, "I was lazy" isn't a valid excuse -- even if I've used it before.

Another important aspect of documentation is comments. There are people who will tell you that if you have comments, you've done something wrong. I completely disagree with that notion. Consider this field in a SQLAlchemy model:

In []:
_trackpositions = db.relationship(
    'TrackPosition',
    backref='tracklist',
    order_by='TrackPosition.position',
    collection_class=ordering_list('position'),
   )

Are you familiar with the collection_class keyword? Or ordering_list? I wasn't at one point, and chances are that one day someone else who isn't will stumble across that field, scratch their head, and let out an audible "WTF?" before either giving up or looking up the meaning. But with the original comment in place...

In []:
_trackpositions = db.relationship(
    'TrackPosition',
    backref='tracklist',
    # ordering_list will automatically update the position attribute
    # on the proxied Tracks. However, it must also be fed a correct
    # initial ordering.
    order_by='TrackPosition.position',
    collection_class=ordering_list('position'),
   )

...the basic intent is communicated. If docstrings are the definition of the function, then comments are just notes on the language-specific implementation of it. When you're using someone else's code, you can't just change the name of something, nor does your intent always get cleanly communicated through the code. However, comments allow you to clarify what is meant. They're also helpful for noting what needs to be improved.

Just real quick, cd over to a project you have and run grep -ir todo . and see what pops out. I have a todo in another project that reads "#TODO: cleanly handle multiple endpoints" in the context of JSONPath. I didn't want to spend two hours at the time writing an implementation that would muddy up an otherwise nice-looking function, and vim is ever so helpful and highlights it in a really ugly yellow that I feel a compulsive urge to remove.

This doesn't mean you should place comments willy-nilly all over the place; for example, this comment isn't helpful at all:

In [6]:
def frob(a, b, x):
    # multiplies x by the sum of a and b
    return (a + b) * x

But when you come across a piece of code you've written that maybe needs a little clarification on how it works, then toss a comment on there for mental health's sake.

However, documentation and comments are no reason not to write...

Clean, Concise Code

This is the final point and the least important of them to me. If you've named your function well, avoided doing "and-y" things, and written a nice docstring and some comments, I can stomach poorly written code. I can reason about what it's supposed to do. However, I do ask you to do your best.

When you actually get around to writing code, remember Einstein's take on Occam's Razor: Everything should be made as simple as possible, but not simpler. I'm not actually sure if Einstein said that, but it's widely attributed to him. This is the approach we should take to writing code and it ties in greatly with the "Avoid And" principle and pure functions.

Consider building a basic Markov Chain with a dictionary and lists. There are three ways to ensure a list of words exists at a given key. The most basic is this:

In [7]:
markov = {}

def add_entry(chain, state, new):
    if state not in chain:
        chain[state] = []
    chain[state].append(new)

However, Python dictionaries have a setdefault method, which checks if the key already exists and, if not, creates a default entry for it.

In [8]:
markov = {}

def add_entry(chain, state, new):
    chain.setdefault(state, []).append(new)

But by far my favorite way to handle this is with collections.defaultdict:

In [9]:
from collections import defaultdict

markov = defaultdict(list)

def add_entry(chain, state, new):
    chain[state].append(new)

I'd consider all three clean and concise, and all equally appropriate. However, as much as I love defaultdict, the last version doesn't communicate that we're relying on one. It makes things too simple, whereas the first two communicate, "Hey, this new state might not exist in the chain yet." In actual practice, how self-contained the chains are would determine which method you'd use.

Parting Thoughts

Like I said, this stuff is easily extrapolated to classes all the way up to whole applications. And I will also admit that avoiding "ands" is largely inspired by the Clean Architecture and Onion Architecture design patterns I've been reading about lately, and specifically by Brandon Rhodes's The Clean Architecture in Python presentation at PyOhio 2014.

If you've got some poorly named, undocumented or and-y functions in your projects, I urge you to go and look at them and see what can be changed to help your project. I know not every function name can be changed easily, especially if it shares a name with something or is a common word. But if you can change it easily and it improves your project, why not?