Mar 1, 2018

Guildsman Scope Creep

…and wondering about others' ML workflows

It’s been a few months in a row now that I’ve pushed back the Guildsman alpha deadline by a month. I’m ok with that. At work, we adhere to the practice of continuously shipping small, iterative improvements. I’m very happy that we do that. However, for better or worse, with personal projects, I often like to take a luxurious amount of time to play around and explore ideas. With Guildsman, I’ve been thinking a lot about the differences between traditional software and machine learning, and how those differences affect the development workflow.

DISCLAIMER: I’m a novice in the field of ML. I’ve spent a significant amount of time googling and asking questions of others, but my understanding of the topics covered below is not that of an expert.

If I’m getting things wrong, please get in touch and help me head in the right direction. Thanks!

Training

The Process, The Product and The Problems

Compared to traditional software development, machine learning includes the additional step of training and also an additional artifact, the trained model.

In traditional software development, you perform some sort of test on modified source before considering it for production use. Whether it’s running a full test suite or just exercising a few cases from the REPL, there’s some process that yields feedback and informs decisions about whether this development session is complete.

Training Can Be Slow

In machine learning, there’s no exact analog to the boolean correctness result that a test gives you, but there is some sort of fitness score. "This trained model correctly identifies cats 90% of the time. Let’s ship it!" This fitness score is the result of testing a trained model. Therefore, training the model is part of the development process, and a critical part at that. Training a model, however, can take orders of magnitude longer than the typical automated steps of the development workflow: compilation and testing. The length of the development iteration cycle can explode, progress slows to a crawl, and disillusionment sets in.

There’s only so much that can be done to accelerate model training, but I believe there is a lot that can be done to mitigate its negative impact on the development process. I’ve been noodling on this goal.

Where Do You Keep the Trained Model?

Ok, training is done and you have a trained model. It correctly identifies cats at an astounding level of accuracy. Yay! But what do you do with that model? How do you deploy it to production? In the normal Clojure/Java world, deployment might mean dropping a JAR file on a server. What’s the equivalent for ML? Do you include your trained model in the JAR as a resource, or drop it nearby on the file system? These both seem fine, I guess. But then, how do you keep track of which model has been deployed and what the expected performance is? My team’s deployment system remembers which git commit is on the server. We also have a record that all tests passed for that commit in our CI tool. Where do you record test results for the deployed model? (Or any model?) Or, stepping back, we need data to run these tests. There are all sorts of organizational questions related to that:

Where is the test data kept?
How is it made available to the test runner?
Is the test data versioned?
Where is it recorded which version of the test data produced which test result?

I’m not the only one asking these questions, of course. There are services and software solutions that attempt to address at least some of the above. There are more in the works. For TensorFlow, Google offers TensorFlow Serving. I think they’re headed in the right direction, but it is far from addressing everything above.

I want a high-quality ML dev experience. I don’t think we’re there yet.

Attempting a Foundation for Solutions

I haven’t solved these problems, but I have given them a lot of attention. Below I’ve outlined the features I’ve built into Guildsman and how they attempt to at least serve as a good foundation for possible future solutions.

Dataset Packages

Think about how great git commits are.

They are immutable.
They have a unique identifier.
They expose their history (parent commits).

And with a bit of context (the repo), they are easily:

discoverable
obtainable

The ideal dataset would have all of these properties. The ideal ML workflow would leverage an infrastructure that supports such datasets. With this in mind, I instilled in Guildsman the concept of packages. As is the case with most features of Guildsman, the implementation is minimal, but the foundation is there.
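To make the git analogy concrete, here is a minimal sketch of a dataset version that is immutable, has a unique identifier, and exposes its parent. It is written in Python purely for illustration and is not Guildsman code; the names `DatasetVersion` and `make_version` are hypothetical, and hashing the parent identifier together with the content is just one plausible way to get these properties.

```python
import dataclasses
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DatasetVersion:
    version_id: str         # hex SHA-256 over parent id + content
    parent: Optional[str]   # id of the previous version, None for the first
    content: bytes


def make_version(content: bytes,
                 parent: Optional[DatasetVersion] = None) -> DatasetVersion:
    """Derive the identifier from the content and the parent's identifier."""
    parent_id = parent.version_id if parent else None
    h = hashlib.sha256()
    h.update((parent_id or "").encode("ascii"))
    h.update(b"\0")  # separator so parent id and content cannot run together
    h.update(content)
    return DatasetVersion(version_id=h.hexdigest(), parent=parent_id,
                          content=content)


v1 = make_version(b"cat,1\ndog,0\n")
v2 = make_version(b"cat,1\ndog,0\nlynx,1\n", parent=v1)
```

Because the identifier is derived from the data itself, the same content with the same history always yields the same identifier, and any change to either yields a different one.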

The goal is a scenario where you can drop a package identifier as an input into your graph and Guildsman will be able to:

determine whether the package (dataset) has been previously procured
obtain it if necessary (e.g., download it from a web server)
wire up the necessary nodes and point them to wherever the dataset is (likely a file on the local filesystem)
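The three steps above can be sketched roughly as follows. This is an illustration in Python, not Guildsman's actual API: `resolve_package`, the cache directory layout, and the injected `fetch` function are all assumptions, and the final "wire up the nodes" step is reduced to returning the local path that input nodes would point to.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def resolve_package(package_id: str, cache_dir: Path,
                    fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for package_id, fetching only on a cache miss."""
    target = cache_dir / package_id
    if not target.exists():                      # step 1: already procured?
        cache_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(package_id))    # step 2: obtain it
    return target                                # step 3: path for input nodes


# Stand-in for a real download; records each call so cache hits are visible.
calls: List[str] = []


def fake_fetch(package_id: str) -> bytes:
    calls.append(package_id)
    return b"label,pixels\n"


cache = Path(tempfile.mkdtemp())
first = resolve_package("cats-v1", cache, fake_fetch)
second = resolve_package("cats-v1", cache, fake_fetch)  # served from the cache
```

Passing the fetcher in as an argument keeps the lookup logic independent of where packages actually come from, whether that is a web server or a shared dataset repo.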

Added bonus: In addition to the dataset itself, packages can contain portions of graphs, as well. That opens up all sorts of possibilities, but the primary intended use is to be able to package a dataset together with the TensorFlow nodes to interpret the format of the dataset. I’m pretty excited about the potential here.

Imagine a world where repos of versioned, publicly accessible datasets are plentiful. Guildsman’s packaging system could integrate into such repos. For more on that dream, check out https://qri.io/.

Workflows and Plugins

How do you make the training process faster? By training on high-end GPUs! But these resources are expensive. It’s not cost-effective for every dev to have their own. Instead, training-server resources must be shared somehow. The development workflow now extends outside the local machine to include these servers. For me, this is not typical or ideal.

We’ve improved the development iteration cycle in one respect, but likely worsened it in another. Ideally, a training server would be made available for development use in a seamless manner. That’s my goal with Guildsman’s workflow feature.

Workflows are how Guildsman carries out activities, such as training and prediction. They are a series of actions to perform on a TensorFlow graph. In that respect, they are like a function. There’s an important difference, though: Guildsman provides mechanisms that allow workflows to be specified as a data structure. I guess it’s a DSL of some kind. The data structure gets compiled to Clojure code and then eval’d to produce an actual function. Data-defined workflows buy you two things:

easy transmission to a remote server (for a seamless dev experience!)
ample surface area for plugins to hook in and alter behavior
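As a rough illustration of the idea, the sketch below describes a workflow as plain data and turns it into a function. It is Python rather than Clojure, and it builds a closure that interprets the steps instead of generating and eval'ing code, so it shows the general pattern and not Guildsman's implementation. The action names and the `ACTIONS` registry are hypothetical, and "training" just bumps a counter.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical action registry: each action takes the running state and the
# step's arguments, and returns a new state.
ACTIONS: Dict[str, Callable[[dict, dict], dict]] = {
    "train": lambda state, args: {
        **state, "steps": state["steps"] + args["steps"]},
    "checkpoint": lambda state, args: {
        **state, "saved_at": state.get("saved_at", []) + [state["steps"]]},
}


def compile_workflow(workflow: List[Dict[str, Any]]) -> Callable[[dict], dict]:
    """Turn a list of {"action": ..., "args": ...} steps into one function."""
    def run(state: dict) -> dict:
        for step in workflow:
            state = ACTIONS[step["action"]](state, step.get("args", {}))
        return state
    return run


workflow = [
    {"action": "train", "args": {"steps": 100}},
    {"action": "checkpoint"},
    {"action": "train", "args": {"steps": 50}},
    {"action": "checkpoint"},
]

# Because the workflow is plain data, it survives a round trip through JSON,
# which is what would make sending it to a remote server straightforward.
received = json.loads(json.dumps(workflow))
result = compile_workflow(received)({"steps": 0})
```

A function value cannot easily be shipped to another machine or inspected by other code, but a list of maps can, which is the point of keeping the workflow as data until the last moment.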

Why do we need plugins? I only have two well-defined ideas about what plugins will be good for, but they’re both pretty big deals.

a plugin to neatly interleave behavior and features that are useful for development but not necessary for production
a plugin to seamlessly coordinate with remote training servers
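One way to picture the first kind of plugin is as a function from workflow data to workflow data. The sketch below is an assumption about how such a hook could look, written in Python for illustration only; `dev_logging_plugin`, `apply_plugins`, and the "log-metrics" step are hypothetical names, not Guildsman features.

```python
from typing import Any, Callable, Dict, List

Workflow = List[Dict[str, Any]]
Plugin = Callable[[Workflow], Workflow]


def dev_logging_plugin(workflow: Workflow) -> Workflow:
    """Insert a hypothetical "log-metrics" step after every "train" step."""
    out: Workflow = []
    for step in workflow:
        out.append(step)
        if step["action"] == "train":
            out.append({"action": "log-metrics"})
    return out


def apply_plugins(workflow: Workflow, plugins: List[Plugin]) -> Workflow:
    """Run each plugin in order; the original workflow list is not modified."""
    for plugin in plugins:
        workflow = plugin(workflow)
    return workflow


production = [{"action": "train"}, {"action": "checkpoint"}]
development = apply_plugins(production, [dev_logging_plugin])
```

The production workflow stays exactly as written, and the development-only behavior exists only in the rewritten copy.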

If you are familiar with TensorFlow’s estimators, it might be helpful to think of workflows as a generalization of those.

Checkpoint Repo

As training occurs, TensorFlow automatically and periodically saves the state of variables into files called "checkpoints".

Guildsman expands on TensorFlow’s checkpoints by maintaining a database with details about the saved checkpoint files. This includes:

the structure of the TensorFlow graph
a log of what training was done to arrive at the graph’s state
test results
test dataset identifier
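A minimal sketch of such a metadata store, assuming a single SQLite table with one row per checkpoint file, might look like this. It is written in Python for illustration; the schema, the column names, and the helper functions are hypothetical and say nothing about how Guildsman actually stores this information.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("""
    CREATE TABLE checkpoints (
        path            TEXT PRIMARY KEY,  -- where the checkpoint file lives
        graph_structure TEXT NOT NULL,     -- serialized graph definition
        training_log    TEXT NOT NULL,     -- JSON list of training runs
        test_results    TEXT,              -- JSON map of metric name to value
        test_dataset_id TEXT               -- dataset that produced the results
    )
""")


def record_checkpoint(path, graph_structure, training_log,
                      test_results, test_dataset_id):
    """Store one checkpoint's metadata alongside its file path."""
    conn.execute(
        "INSERT INTO checkpoints VALUES (?, ?, ?, ?, ?)",
        (path, graph_structure, json.dumps(training_log),
         json.dumps(test_results), test_dataset_id),
    )


def results_for(path):
    """Return (test results, test dataset id) for a checkpoint, or None."""
    row = conn.execute(
        "SELECT test_results, test_dataset_id FROM checkpoints WHERE path = ?",
        (path,),
    ).fetchone()
    return None if row is None else (json.loads(row[0]), row[1])


record_checkpoint(
    path="ckpt/model-1500",
    graph_structure="<serialized graph>",
    training_log=[{"dataset": "cats-v1-train", "steps": 1500}],
    test_results={"accuracy": 0.9},
    test_dataset_id="cats-v1-test",
)
```

With a record like this, the earlier question of which version of the test data produced which test result becomes a lookup by checkpoint path.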

In the future, I’d like to add support for pushing checkpoints and their metadata to a central repository. These checkpoint repos could be private and internal to an organization, or a public community resource. By serving as a locus for collaboration and organization of results, they could be an invaluable tool for ML teams. Perhaps a checkpoint repo could serve as a source for deployment as well.

As always, I expect to have an alpha release out in the next few weeks.

If you’d like to read more, or get in touch, check out the Guildsman README.