Research patterns for machine learning

These are some patterns I have found helpful for building a robust machine learning research workflow. The word “pattern” is used here very loosely to mean “general approaches that work well”.

This post is based on this excellent article; credit goes to the original author for many of the ideas.

I’ve summarised the above article in part and added some of my own thoughts (in no particular order):

  1. Always keep something running
    Make the most of the computational resources you have and always have something running in the background. Even if you think an experiment is stupid, it could tell you something interesting. Keep the queue (if you have one) filled with fodder whilst developing; never leave it empty.
  2. Always use version control
    Standard software engineering best-practices should apply and version control is no exception. Regular commits are doubly important here because in ML research reproducibility is paramount.

    Additional thoughts:
    * Tag the code with a “release” for each project so you can roll back with confidence to a point where a certain amount of functionality was implemented. This lets experiments be rerun from different points in the code’s evolution.
    * If it can be regenerated easily, don’t commit it, e.g. large models or experiment results.

  3. Separate code from data
    This is a must. Data generated during experiments (and with dependencies on a particular experiment) should be stored separately; keep the codebase unpolluted. Things that stay the same across lots of experiments, e.g. preprocessed features, should also be moved to a permanent location. Set things up so that data can be swapped in and out without dependencies breaking. In fact, each stage in the pipeline should be isolated, with clearly defined inputs and outputs, so that any one stage can be independently optimised.
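    One way to sketch such an isolated stage (the paths and the length-based “feature extraction” below are placeholders, not a real pipeline) is a function whose only contract is its input file and its output file:

```python
import json
from pathlib import Path

def preprocess(raw_path, features_path):
    """One isolated pipeline stage: a clearly defined input file
    and a clearly defined output file, nothing else shared."""
    raw = json.loads(Path(raw_path).read_text())
    features = [len(str(x)) for x in raw]  # stand-in feature extraction
    out = Path(features_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(features))
    return features
```

    Because the stage touches nothing outside its declared input and output, swapping in a different dataset is just a matter of passing a different path.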
  4. Save everything to disk frequently
    That means during training – and particularly for long runs – model parameters get saved frequently to disk. Ideally, dump to disk at time intervals such that you’re only just comfortable with a hardware failure occurring at any moment. That might mean every model iteration. It might mean every 30 minutes.
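    A toy sketch of the idea, with JSON standing in for a real checkpoint format and a trivial update in place of real training:

```python
import json
import time
from pathlib import Path

CHECKPOINT_EVERY_S = 30 * 60  # dump at most every 30 minutes

def train(num_steps, checkpoint_dir, checkpoint_every_s=CHECKPOINT_EVERY_S):
    """Toy training loop that periodically snapshots its state to disk."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    last_save = time.monotonic()
    state = {"step": 0, "weight": 0.0}
    for step in range(1, num_steps + 1):
        state["step"] = step
        state["weight"] += 0.1  # stand-in parameter update
        if time.monotonic() - last_save >= checkpoint_every_s:
            (checkpoint_dir / f"ckpt_{step}.json").write_text(json.dumps(state))
            last_save = time.monotonic()
    # always save the final state as well
    (checkpoint_dir / "ckpt_final.json").write_text(json.dumps(state))
    return state
```

    The interval is the knob to tune against how much lost work you can tolerate.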
  5. Save with sensible names
    Runs on different data sets and with different parameter settings should have names reflecting those differences. Prepend a date to each folder (a YYYYMMDD prefix sorts correctly in date order!). Include the name of the group of experiments. Example:

    20141021adadeltaTest_alpha0.9_mbSize52

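    A small helper along these lines (mirroring the naming scheme above) keeps folder names consistent across runs:

```python
import datetime

def run_name(group, **params):
    """Build a sortable folder name: YYYYMMDD + experiment group + params."""
    date = datetime.date.today().strftime("%Y%m%d")
    suffix = "_".join(f"{k}{v}" for k, v in sorted(params.items()))
    return f"{date}{group}_{suffix}"

# e.g. run_name("adadeltaTest", alpha=0.9, mbSize=52)
# yields "<today's date>adadeltaTest_alpha0.9_mbSize52"
```

    Sorting the parameters alphabetically means the same settings always produce the same name.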

  6. Make experiments reproducible
    That means storing config files containing hyperparameters, parameters, links to the datasets used and experiment-specific info. Ensure failed runs can be restarted halfway through. This is a lower-priority item but can be useful once an idea has been validated and you start running longer experiments. My best advice here: copy (yes, copy!) your entire codebase alongside each experiment. That might sound like overkill but it helps greatly with reproducibility – knowing that, next to every trained model, sits the exact code that produced it and can run it gives great peace of mind. Logging the git SHA isn’t enough – you will at some point run with uncommitted modifications and you will be caught out.
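    Copying the codebase can be a one-liner with the standard library; the ignore patterns below are assumptions about what you’d rather not snapshot:

```python
import shutil
from pathlib import Path

def snapshot_code(src_dir, experiment_dir):
    """Copy the whole codebase next to the experiment's outputs."""
    dest = Path(experiment_dir) / "code_snapshot"
    shutil.copytree(
        src_dir, dest,
        ignore=shutil.ignore_patterns(".git", "__pycache__", "*.pyc"),
    )
    return dest
```

    Call it once at the start of every run, before training begins.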
  7. Don’t over-automate
    There’s a temptation to script everything to death, but during research you need results to test ideas as fast as possible. Script and automate as much as you can whilst waiting for results; otherwise, try not to waste too much time making things perfect.
  8. Automate parameter sweeps
    If you do automate something, this one is probably worth it. Make it really simple to quickly launch lots of jobs over a range of parameters. Some frameworks already support this. Bash is great for this. My rule of thumb: if it fits on one screen it’s fine in Bash; any longer, consider Python.
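    If you lean towards the Python side of that rule of thumb, a sweep launcher can be as small as this; `train.py` and its flags are placeholders for your own entry point:

```python
import itertools

def sweep_commands(grid):
    """Yield one command line per point in the parameter grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        args = [f"--{k}={v}" for k, v in zip(keys, values)]
        yield ["python", "train.py"] + args

# To actually launch, pass each command to subprocess.Popen
# (or submit it to your cluster's queue).
```

    Each yielded list is ready to hand to your job queue or to `subprocess`.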
  9. Don’t reinvent the wheel
    Reuse code where possible. Correctness is hard to come by in ML and often is a product of sweat and tears. Use someone else’s sweat and tears. Prefer libraries that allow you to test your ideas as fast as possible.
  10. Keep a log of experiments
    Each folder should contain a log entry explaining why you ran that particular experiment. Results should not be buried deep inside multiple log files but collected in one single file that is easy to read. The log should also contain runtime info such as:
    * the date the experiment started
    * which machine it ran on
    * which config files were used
    * how far through the experiment is
    * what the intermediate results are
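    A minimal sketch of such a log, using one append-only JSON-lines file per experiment folder (the field names are just suggestions):

```python
import datetime
import json
import platform
from pathlib import Path

def log_run(experiment_dir, reason, config_files, progress, results):
    """Append one easy-to-read entry to a single experiment log file."""
    entry = {
        "started": datetime.datetime.now().isoformat(timespec="seconds"),
        "machine": platform.node(),
        "reason": reason,
        "config_files": config_files,
        "progress": progress,
        "intermediate_results": results,
    }
    log_path = Path(experiment_dir) / "log.jsonl"
    with log_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

    One line per entry keeps the file both human-skimmable and trivially machine-parseable.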
  11. Kill bad runs early
    Make sure to catch bad runs quickly, either by hand or automatically, so you can tighten the debug/test feedback loop as much as possible. If optimising some objective, check whether poor final performance correlates strongly with poor performance in the first 5% of training. If it does, you have a good signal for killing runs early. This is often the case with overparameterised neural networks.
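    A simple version of that early-kill check might look like this; the 5% warm-up fraction and the loss threshold are assumptions you would calibrate from past runs:

```python
def should_kill(losses, total_steps, frac=0.05, threshold=2.0):
    """Kill a run if, after the first `frac` of training, the loss is
    still above `threshold` (a cutoff learned from earlier runs)."""
    warmup = max(1, int(total_steps * frac))
    if len(losses) < warmup:
        return False  # too early to judge
    return losses[warmup - 1] > threshold
```

    Call it from the training loop after each recorded loss and abort when it returns true.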
  12. Separate options from parameters
    Make sure options (e.g. working directories, hyperparameters) are kept separate from model parameters (e.g. binaries containing model weights).
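    As a sketch of that separation, options go in a small human-readable file while weights go in a separate binary artifact (the file names here are arbitrary):

```python
import json
from pathlib import Path

def save_run(run_dir, options, weights):
    """Options in an editable JSON file; weights in a separate binary."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "options.json").write_text(json.dumps(options, indent=2))
    (run_dir / "weights.bin").write_bytes(weights)  # stand-in for a real model file
```

    Keeping the two apart means you can diff and edit the options without touching the model.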
  13. Completely eliminate sources of variation in the environment
    Nip this one in the bud. Always remove non-determinism in code in the first instance. Once correctness is guaranteed, determinism may be traded off for improved convergence speed (e.g. through distributed multi-GPU training). For example, in TensorFlow, that means forcing these options at the outset:

    tf.logging.set_verbosity(tf.logging.INFO)  # print INFO log messages
    np.random.seed(1)
    random.seed(1)
    tf.reset_default_graph()
    tf.set_random_seed(1)  # seed TensorFlow's graph-level RNG too
    ...
    deterministic_config = tf.ConfigProto(inter_op_parallelism_threads=1,
                                          intra_op_parallelism_threads=1)
    with tf.Session(config=deterministic_config) as sess:
        # train

    If you can, pin all dependencies and run everything inside Docker. That helps guarantee the environment won’t change under your feet.
