Reproducible Experiments with Jupyter Notebooks and Guild AI

Notebooks are designed for flexible, ad hoc Python development. This presents a challenge to reproducibility. Cells can be run in any order and modified at any time. How can you be certain that cell output accurately reflects the current source? Even with careful discipline, you can’t be sure that a result is consistent with the code.

Guild AI – a lightweight, open source experiment tracking tool – addresses this using a simple scheme:

Execute a copy of the notebook from top-to-bottom as an experiment artifact.

The result is a notebook whose output accurately reflects its source code at the time of execution. It’s a reliable experiment record.

Guild performs additional tasks when running a Jupyter Notebook:

  • Modify the notebook copy with run-specific parameter values (e.g. use a different learning rate for the experiment)
  • Save experiment metadata, output, and generated files
  • Export the fully executed notebook as HTML for easy viewing
  • Save images and scalar summaries for comparison in TensorBoard

Example: Run a Notebook

Here’s a sample notebook that simulates a training run. The code is adapted from _Bayesian Optimization with skopt._

The train function simulates a training process. It takes two hyperparameters as inputs and prints a sample "loss".
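Here's a minimal sketch of what such a cell might look like, adapting the objective from the skopt tutorial – the exact code in the original notebook may differ:

import numpy as np

# Hyperparameters defined as top-level assignments (Guild can detect these as flags)
x = 1.0
noise = 0.1

def train(x, noise):
    # Simulated training: a noisy objective whose minimum falls at a value of x just below 0
    return np.sin(5 * x) * (1 - np.tanh(x ** 2)) + np.random.randn() * noise

print("loss: %f" % train(x, noise))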

Run this notebook as an experiment with Guild AI:

$ guild run train.ipynb

Guild inspects the notebook and discovers two hyperparameters: x and noise.

You are about to run train.ipynb
  noise: 0.1
  x: 1.0
Continue? (Y/n)

After confirmation, Guild first copies train.ipynb to a unique directory, which contains the experiment data. This ensures that the experiment is distinct from the original notebook and from any other experiments. Guild modifies the experiment copy to reflect the flag values specified for the run. In this case, Guild uses the default values defined in the notebook. Next, Guild runs all cells in the notebook copy from top-to-bottom. The result of each cell is saved in the notebook copy. The original notebook is unmodified.

Here’s the output for the run:

INFO: [guild] Initializing train.ipynb for run
INFO: [guild] Executing train.ipynb
loss: -0.116769
INFO: [guild] Saving HTML

Because the notebook copy does not change after the run, it’s a trustworthy record.
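To see exactly what was saved, list the files in the run directory (the latest run by default). The listing depends on your notebook, but it includes the executed copy and the exported HTML:

$ guild ls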

Run the original notebook again, this time with a different value for x:

$ guild run train.ipynb x=-1.0
You are about to run train.ipynb
  noise: 0.1
  x: -1.0
Continue? (Y/n)

Guild creates a new experiment with another notebook copy. The copy uses a different value for x.

INFO: [guild] Initializing train.ipynb for run
INFO: [guild] Executing train.ipynb
loss: 0.349546
INFO: [guild] Saving HTML

Note that in this experiment the loss is higher. The new value of x appears to be suboptimal.

You now have two experiments, each with different values for x and corresponding losses. Show available runs:

$ guild runs
[1:66cec0f3]  train.ipynb  completed  noise=0.1 x=-1.0
[2:ccbd822a]  train.ipynb  completed  noise=0.1 x=1.0

In addition to running and recording experiments, Guild provides features to study them:

  • Compare results – e.g. find the run with the lowest loss
  • View files and metadata for a particular experiment (see the example after this list)
  • Diff files to see what changed across experiments
  • Compare plots and images
  • Use results to suggest optimal hyperparameters for new experiments
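For example, Guild's runs info command prints the flags, status, and other metadata for a run (the latest by default):

$ guild runs info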

Compare Runs

Compare flags and results across experiments directly from the command line:

$ guild compare

Guild shows the two runs with their respective flag values and losses. If you have a question about model performance, use this method to quickly evaluate results. Note that the latest experiment, which appears in the top row, has a higher loss than the first experiment.

Compare runs from the command line

Exit the program by pressing q.

You can also compare runs graphically:

$ guild view

This command opens Guild View in your browser, where you can explore run details and compare results side-by-side. To exit, return to the command prompt and press Ctrl-c.

Compare runs graphically

Diff Notebooks

Because each experiment is saved as a separate notebook, you can compare two experiments by diffing their notebook files. Guild uses nbdime, if installed, to show differences side-by-side in your browser.
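If nbdime isn't already available, it can be installed from PyPI:

$ pip install nbdime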

Show the differences between the two experiment notebook files:

$ guild diff --path train.ipynb

This command compares train.ipynb across the two latest runs. You can specify different runs as needed.
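Assuming runs can be referenced by the indexes or IDs shown by guild runs, you might diff the latest run against an older one like this:

$ guild diff --path train.ipynb 3 1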

Compare experiment notebooks side-by-side

Note the differences:

  • The value for x changed from 1.0 to -1.0
  • The loss went up, suggesting that -1.0 is a suboptimal value for x

Note also that the train function is the same for each experiment. If the function were modified, the precise line-by-line changes would appear in the diff as well. This lets you evaluate all relevant aspects of your experiments.

Hyperparameter Search

Guild supports batch runs to generate multiple experiments with a single command. This is used for a variety of hyperparameter search methods, including grid search, random search, and sequential optimization.
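As a simple illustration of a batch run, passing a list of values for a flag generates one trial per value – an exhaustive grid search over those values (a sketch based on Guild's flag-list syntax):

$ guild run train.ipynb x=[-2.0,-1.0,0.0,1.0,2.0]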

For example, you can run a search using Bayesian optimization with Gaussian processes to find values of x that minimize loss:

$ guild run train.ipynb x=[-2.0:2.0] --optimizer gp --minimize loss
You are about to run train.ipynb with 'skopt:gp' optimizer (max 20 trials, minimize loss)
  noise: 0.1
  x: [-2.0:2.0]
Optimizer flags:
  acq-func: gp_hedge
  kappa: 1.96
  noise: gaussian
  random-starts: 3
  xi: 0.05
Continue? (Y/n)

This command runs a series of experiments using values of x between -2.0 and 2.0 that have a higher probability of minimizing loss based on previous results. By default, Guild generates 20 runs. You can change this to run more or fewer experiments as needed.
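For instance, assuming Guild's --max-trials option, you could request 40 trials instead of the default 20:

$ guild run train.ipynb x=[-2.0:2.0] --optimizer gp --minimize loss --max-trials 40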

When Guild completes the command, you have 22 runs – the original two plus 20 from the search.

Use guild compare or guild view (see above) to compare runs.

You can use TensorBoard to evaluate hyperparameters. Guild integrates TensorBoard as a visualization tool for runs.

$ guild tensorboard

This command opens TensorBoard in your browser. Click HPARAMS and then PARALLEL COORDINATES VIEW to visualize the impact of hyperparameters on result metrics.

Click and drag the mouse over the lower portion of the loss axis. This highlights experiments with optimal values for x.

Evaluate experiment results using parallel coordinates view

You can see that optimal values for x fall between -0.5 and 0.0.

Next, click SCATTERPLOT MATRIX VIEW.

Evaluate relationships between hyperparameters and metrics with scatter plots

This shows plots comparing hyperparameters and metrics. The comparison between x and loss shows optimal runs – i.e. where loss is lowest – clustered along the bottom axis for values of x just below 0. The area is highlighted in yellow above.

Compare the highlighted scatter plot to the known relationship between x and loss based on the train function:

Relationship between x and loss (credit)

The points in the scatter plot correspond to the known relationship between x and loss. There are relatively more results around the global minimum for loss than at other values. This suggests some degree of effectiveness of the Bayesian model for predicting optimal hyperparameters, at least for this simple example.

Real-world models aren’t this simple. This example merely highlights what we already know about x and loss. Nonetheless, you can apply this technique to any model to gain insights and guide optimization in ways that would be difficult without systematic measurement.

Reproducibility and Jupyter Notebooks

This method supports fully reproducible notebook experiments. To reproduce an experiment, run the notebook using Guild and compare the results to a baseline.

Results can be compared numerically – for example, by comparing metrics such as loss, accuracy, or AUC. Results can also be compared qualitatively by viewing plots and other experiment artifacts side-by-side.
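As a simple example of the numeric case, you might require a reproduced loss to agree with the recorded baseline within a tolerance – sensible here because the mock train function includes a noise term (the reproduced value below is illustrative):

import math

baseline_loss = -0.116769    # loss recorded for the original experiment
reproduced_loss = -0.118204  # hypothetical loss from the reproduced run

# Treat the experiment as numerically reproduced if the losses agree within tolerance
assert math.isclose(baseline_loss, reproduced_loss, abs_tol=0.01)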

Consider the following SVM classification boundaries for two experiments (unrelated to our mock example):

Compare SVM classification boundaries

A naive numerical comparison of classification loss might show that these models perform similarly. In fact, they are very different, as is obvious when you look at the plots. Whether one result successfully reproduces another is a matter for further discussion. Regardless, having the full range of evaluation criteria is key to that process.

Lightweight Approach

Note Guild’s lightweight approach to running and tracking experiments:

  • The original notebook is not modified to support Guild features. Flags, output, metrics and plots are available as results without the use of Guild-specific APIs.
  • No additional code is required to run experiments – you use external commands. This applies to single runs and to batch runs such as hyperparameter optimization.
  • Results are saved locally to disk. You don’t install databases or other services. You don’t send data to the cloud.
  • Guild integrates with best-in-class tools rather than competing with them. You use TensorBoard and nbdime, both feature-rich tools that specialize in their respective tasks.

Summary

To accurately record an experiment with a Jupyter Notebook, it’s important that cell output reflects cell source code. This is a challenge with notebooks as you’re free to edit and execute cells in any order at any time. While this flexibility can work for ad hoc, experimental development, you need reliable controls for accurate experiment tracking.

Guild AI addresses this by automating notebook execution. When you run a notebook with Guild, there are no opportunities for manual intervention or out-of-order cell execution. Guild creates a notebook copy for each experiment and executes its cells from top to bottom. This yields consistent results and ensures that changes to the original notebook do not affect them.

Alongside this execution model, Guild provides features for viewing and comparing experiments. With these tools you can leverage your work to make informed decisions about next steps. You can detect regressions and catch unexpected changes.

Guild provides this support as a lightweight, feature-rich tool that’s freely available under the Apache 2 open source license.
