Adventures in Manipulating Python ASTs

A while back, I explored the possibility of simplifying¹ PyMC4’s model specification API by manipulating the Python abstract syntax tree (AST) of the model code. The PyMC developers didn’t end up pursuing those API changes any further, but not before I had the chance to learn a lot about Python ASTs.

Enough curious people have asked me about my experience tinkering with ASTs that I figured I’d write a short post about the details of my project, in the hope that someone else will find it useful.

You should read this blog post as a quick overview of my experience with Python ASTs, or an annotated list of links, and not a comprehensive tutorial on model specification APIs or Python ASTs. For a full paper trail of my adventures with Python ASTs, check out my notebooks on GitHub.

The Problem

Originally, PyMC4’s proposed model specification API looked something like this:

The main drawback to this API was that the yield keyword was confusing. Many users don’t really understand Python generators, and those who do might only understand yield as a drop-in replacement for return (that is, they might understand what it means for a function to end in yield foo, but would be uncomfortable with bar = yield foo).
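To make the bar = yield foo form concrete, here is a toy sketch in plain Python. It is not PyMC code: ToyNormal, model and run are hypothetical names invented for illustration, and the "sampled" values are hard-coded. The point is only that yield hands an object to whoever is driving the generator, and whatever the driver sends back becomes the value on the left-hand side.

```python
class ToyNormal:
    """Hypothetical stand-in for a distribution object (not a PyMC or TensorFlow class)."""

    def __init__(self, name, value):
        self.name = name
        self.value = value  # pretend this is a sampled value


def model():
    # `yield` hands the distribution to the driver; the driver's reply
    # becomes the value of the whole `yield` expression.
    slope = yield ToyNormal("slope", 2.0)
    intercept = yield ToyNormal("intercept", 1.0)
    return slope * 3.0 + intercept


def run(model_fn):
    """Drive the generator, recording each name and sending back a value."""
    gen = model_fn()
    seen = []
    try:
        dist = next(gen)  # run up to the first yield
        while True:
            seen.append(dist.name)
            dist = gen.send(dist.value)  # resume the model with a value
    except StopIteration as stop:
        return seen, stop.value  # the generator's return value


names, result = run(model)
```

The driver sees every distribution before the model continues, which is what makes the generator form useful to an inference engine and, at the same time, what makes it unfamiliar to users.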

Furthermore, the yield keyword introduces a leaky abstraction²: users don’t care about whether model is a function or a generator, and they shouldn’t need to. More generally, users shouldn’t have to know anything about how PyMC works in order to use it: ideally, the only thing users would need to think about would be their data and their model. Having to graft several yield keywords into their code is a fairly big intrusion in that respect.

Finally, this model specification API essentially moves the problem off of our plates and onto our users, which runs counter to the entire point of the PyMC project: to provide a friendly and easy-to-use interface for Bayesian modelling.

To state the problem more concretely, we wanted to:

Hide the yield keyword from the user-facing model specification API.
Obtain the user-defined model as a generator.

The main difficulty with the first goal is that as soon as we remove yield from the model function, it is no longer a generator. However, the PyMC inference engine needs the model as a generator, since this allows us to interrupt the control flow of the model at various points to do certain things:

Manage random variable names.
Perform sampling.
Other arbitrary PyMC magic that I’m truthfully not familiar with.

In short, the user writes their model as a function, but we require the model as a generator.

I opine on why this problem is challenging a lot more here.

The Solution

First, I wrote a FunctionToGenerator class:

Subclassing ast.NodeTransformer (as FunctionToGenerator does) is the recommended way of modifying ASTs. The functionality of FunctionToGenerator is pretty well described by the docstring: the visit_Assign method adds the yield keyword to all assignments by wrapping the value of the visited Assign node in a Yield node, so that x = foo becomes x = yield foo. The visit_FunctionDef method removes the decorator and renames the function to _pm_compiled_model_generator. All told, after the NodeTransformer is done with the AST, we have one function, _pm_compiled_model_generator, which is a modified version of the user-defined function.
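As a rough illustration of the kind of transformer described above, here is a minimal sketch. It is not the FunctionToGenerator class from this project: the class name AssignToYield and the sample source are assumptions made up for this example, and the names Model, HalfCauchy and Normal in the sample are never executed, only parsed. It uses ast.unparse, which requires Python 3.9 or later.

```python
import ast


class AssignToYield(ast.NodeTransformer):
    """Hypothetical sketch: rewrite `x = expr` as `x = yield expr`."""

    def visit_Assign(self, node):
        self.generic_visit(node)
        # Yield is an expression, so it wraps the assignment's value,
        # not the assignment statement itself.
        node.value = ast.Yield(value=node.value)
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # transform the assignments in the body
        node.name = "_pm_compiled_model_generator"
        node.decorator_list = []  # drop the decorator
        return node


source = '''
@Model
def linear_regression(x):
    scale = HalfCauchy(0, 1)
    coefs = Normal(0, 1)
    return coefs
'''

tree = AssignToYield().visit(ast.parse(source))
ast.fix_missing_locations(tree)  # new nodes need line/column info

# Turn the modified tree back into source text to inspect the result.
new_source = ast.unparse(tree)
print(new_source)
```

Printing the unparsed source is a handy way to check what a transformer actually did before compiling and executing anything.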

Second, the Model class:

This class isn’t meant to be instantiated: rather, it’s meant to be used as a Python decorator. Essentially, it “uncompiles” the function to get the Python source code of the function. This source code is then passed to the parse_snippet³ function, which returns the AST for the function. We then modify this AST with the FunctionToGenerator class that we defined above. Finally, we recompile this AST and execute it. Recall that executing this recompiled AST defines a new function called _pm_compiled_model_generator. This new function, accessed via the locals variable⁴, is then bound to the class’s self.model_generator, which explains the confusing-looking line 25.
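The parse, transform, recompile and execute cycle can be sketched roughly as follows. This is a simplified stand-in under several assumptions, not the Model class itself: compile_model_generator and _AddYield are hypothetical names, ast.parse plus textwrap.dedent stands in for parse_snippet, and the function takes the source as a string, whereas a real decorator would obtain it from the decorated function with inspect.getsource.

```python
import ast
import textwrap


class _AddYield(ast.NodeTransformer):
    """Condensed, hypothetical transformer: add yields, rename, drop decorators."""

    def visit_Assign(self, node):
        node.value = ast.Yield(value=node.value)
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = "_pm_compiled_model_generator"
        node.decorator_list = []
        return node


def compile_model_generator(source, global_vars=None):
    """Parse, transform, recompile and execute model source code."""
    # A decorator would get `source` via inspect.getsource(func).
    tree = ast.parse(textwrap.dedent(source))
    tree = _AddYield().visit(tree)
    ast.fix_missing_locations(tree)
    code = compile(tree, filename="<model>", mode="exec")

    # Executing the module-level code defines the renamed function in
    # `namespace`, from which we can then fetch it by name.
    namespace = {}
    exec(code, global_vars if global_vars is not None else {}, namespace)
    return namespace["_pm_compiled_model_generator"]


source = """
    def model():
        a = 1
        b = a + 1
        return a + b
"""

model_generator = compile_model_generator(source)
gen = model_generator()
first = next(gen)         # pauses at `a = yield 1`
second = gen.send(first)  # resumes with a = first, pauses at the next yield
```

Fetching the function out of the namespace passed to exec plays the same role as the locals lookup described above: the recompiled code only defines the function, and something else has to go and pick it up.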

Finally, the user-facing API looks like this:

As you can see, the users need not write yield while specifying their models, and the PyMC inference engine can now simply call the model_generator method of linear_regression to produce a generator called _pm_compiled_model_generator, as desired. Success!

Lessons Learnt

Again, PyMC4’s model specification API will not be incorporating these changes: the PyMC developers have since decided that the yield keyword is the most elegant (but not necessarily the easiest) way for users to specify statistical models. This post is just meant to summarize the lessons learnt while pursuing this line of inquiry.

Reading and parsing the AST is perfectly safe: that’s basically just a form of code introspection, which is totally a valid thing to do! It’s when you want to modify or even rewrite the AST that things start getting ~~janky~~ dangerous (especially if you want to execute the modified AST instead of the written code, as I was trying to do!).
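For contrast, here is a small read-only sketch of the "safe" kind of AST work: it parses some source and lists the names bound by simple assignments, without modifying or executing anything. The sample source and the names in it are made up for illustration.

```python
import ast

source = """
def model(x):
    scale = HalfCauchy(0, 1)
    coefs = Normal(0, 1)
    return coefs
"""

tree = ast.parse(source)

# Collect the names bound by simple assignments; the tree is only read.
assigned = [
    target.id
    for node in ast.walk(tree)
    if isinstance(node, ast.Assign)
    for target in node.targets
    if isinstance(target, ast.Name)
]
```

Because nothing here is compiled or executed, the worst outcome of a bug is a wrong answer about the code, not different behaviour from the code.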

If you want to programmatically modify the AST (e.g. “insert a yield keyword in front of every assignment of a TensorFlow Distribution”, as in our case), stop and consider whether you’re attempting to modify the semantics of the written code, and whether you’re sure that that’s a good idea (e.g. the yield keywords in the code mean something, and removing those keywords changes the apparent semantics of the code).

George Ho
