
Millrun

A Python library and CLI tool for automating the execution of notebooks with papermill

Motivation

Papermill is great: it parameterizes a single notebook for you. Ok, so what about this whole directory of notebooks that I would like to execute with this list of different parameters?

Millrun will execute either a single notebook or all of the notebooks in a directory (recursively, if you want), using either a list of alternative parameter dictionaries or a dictionary with a list of variations.

In short, it iterates both over notebooks in a directory AND over lists of parameters.

When executed as a CLI tool, notebooks are executed in parallel using multi-processing.

Installation

pip install millrun

Usage: Python Library

import millrun

millrun.execute_run(
    notebook_dir_or_file: pathlib.Path | str,
    bulk_params: list | dict,
    output_dir: Optional[pathlib.Path | str] = None,
    output_prepend_components: Optional[list[str]] = None,
    output_append_components: Optional[list[str]] = None,
    recursive: bool = False,
    exclude_glob_pattern: Optional[str] = None,
    include_glob_pattern: Optional[str] = None,
    use_multiprocessing: bool = False,
    **kwargs, # kwargs are passed through to papermill
)

Usage: CLI tool

millrun --help
                                                                                                       
 Usage: millrun [OPTIONS] NOTEBOOK_DIR_OR_FILE PARAMS                                                  
                                                                                                       
 Executes a notebook or directory of notebooks using the provided bulk parameters JSON file            
                                                                                                       
                                                                                                       
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────╮
│ *    notebook_dir_or_file      TEXT  Path to a notebook file or a directory containing notebooks.   │
│                                      [default: None]                                                │
│                                      [required]                                                     │
│ *    notebook_params           TEXT  JSON file that contains parameters for notebook execution. Can │
│                                      either be a 'list of dict' or 'dict of list'.                  │
│                                      [default: None]                                                │
│                                      [required]                                                     │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────╮
│ --output-dir                                TEXT  Directory to place output files into. If not      │
│                                                   provided the file directory will be used.         │
│                                                   [default: None]                                   │
│ --prepend                                   TEXT  Prepend components to use on output filename. Can │
│                                                   use dict keys from 'params' which will be         │
│                                                   evaluated. (Comma-separated values).              │
│                                                   [default: None]                                   │
│ --append                                    TEXT  Append components to use on output filename. Can  │
│                                                   use dict keys from 'params' which will be         │
│                                                   evaluated. (Comma-separated values).              │
│                                                   [default: None]                                   │
│ --recursive               --no-recursive          [default: no-recursive]                           │
│ --exclude-glob-pattern                      TEXT  [default: None]                                   │
│ --include-glob-pattern                      TEXT  [default: None]                                   │
│ --help                                            Show this message and exit.                       │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯

Example

While the prepend argument is optional, it is highly recommended that you take advantage of it. If you don't, your output file names will be automatically prepended with an integer index to differentiate the output files.

millrun ./Notebooks_Dir params.json --prepend id_key_in_params

Where id_key_in_params is one of the keys in your params.json that you can use to uniquely identify each iteration. If you do not have a single unique key, you can provide a list of keys and they will all be prepended:

Let's say my params.json looked like this:

{
    "x_values": [0, 1, 2],
    "y_values": [45, 32, 60]
}

I could execute like this:

millrun ./Notebooks_Dir params.json --prepend x_values,y_values,results

And my output files would look like:

0-45-results-special_calculation.ipynb
1-32-results-special_calculation.ipynb
2-60-results-special_calculation.ipynb

Notice: Since "results" was not a key in my params.json, it gets passed through as a string literal.
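The naming scheme can be sketched in plain Python (a hypothetical illustration, not millrun's actual implementation): each prepend component is looked up in that iteration's parameter dict, and anything that is not a key falls through as a string literal.

```python
def build_output_name(notebook_name, components, params):
    """Build an output filename by prepending components.

    Components that are keys in `params` are replaced by their
    values; anything else is passed through as a string literal.
    """
    parts = [str(params.get(c, c)) for c in components]
    return "-".join(parts + [notebook_name])

params = {"x_values": 0, "y_values": 45}  # one iteration's parameters
name = build_output_name(
    "special_calculation.ipynb", ["x_values", "y_values", "results"], params
)
print(name)  # 0-45-results-special_calculation.ipynb
```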

Organizing your parameters

You can have your parameters dictionary/JSON in one of two formats:

Format 1: A list of dicts

[
    {"param1": 0, "param2": "hat", "param3": 21.2},
    {"param1": 1, "param2": "cat", "param3": 34.3},
    {"param1": 2, "param2": "bat", "param3": 200.0}
]

Each notebook given to millrun will be executed against each dictionary in the list.

Format 2: A dict of lists

{
    "param1": [0, 1, 2],
    "param2": ["hat", "cat", "bat"],
    "param3": [21.2, 34.3, 200.0]
}

This format is offered as a convenience. Internally, it is converted into "Format 1" prior to execution.
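The conversion from "dict of lists" to "list of dicts" is easy to picture (a sketch of the idea, not millrun's internal code):

```python
def dict_of_lists_to_list_of_dicts(bulk_params):
    """Convert {"k": [v0, v1, ...]} into [{"k": v0}, {"k": v1}, ...].

    Assumes all value lists have the same length, so each index
    describes one complete parameter set.
    """
    keys = list(bulk_params)
    n = len(bulk_params[keys[0]])
    return [{k: bulk_params[k][i] for k in keys} for i in range(n)]

variations = dict_of_lists_to_list_of_dicts({
    "param1": [0, 1, 2],
    "param2": ["hat", "cat", "bat"],
    "param3": [21.2, 34.3, 200.0],
})
print(variations[1])  # {'param1': 1, 'param2': 'cat', 'param3': 34.3}
```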

CLI Profile execution

As of v0.2.0, millrun allows the creation of a "profiles" YAML file, which avoids the need to type really long commands on the command line, especially when, for a particular project, the commands are always going to be the same.

YAML format:

The format describes the kwargs required to execute the command.

The top-level keys can be arbitrarily named, but each represents one command execution. The values underneath each top-level key are the kwargs of the command.

The only required values are notebook_dir_or_file and notebook_params. All other params are optional.

notebook1: # This is the name of the profile. A profile is equal to one command on the command line
  notebook_dir_or_file: ./notebook1/notebook1.ipynb # Req'd
  notebook_params: ./notebook1/notebook1_params.json # Req'd
  output_dir: ./notebook1/output # Optional
  prepend: # Optional
    - name
    - design
  append: # Optional
    - executed

notebook2: # This profile will be executed immediately after the first profile. It's like running the command again.
  notebook_dir_or_file: ./notebook2
  notebook_params: ./notebook2/notebook2_params.json
  output_dir: ./notebook2/output
  prepend:
    - tester

CLI parallel execution

Since millrun iterates over two dimensions (each notebook, and each parameter dict in the list), there are two ways of parallelizing:

  1. Execute each notebook in sequence and parallelize the execution of the different parameter variations
  2. Execute each notebook in parallel and sequentialize the execution of the different parameter variations
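The difference between the two strategies can be sketched with a stand-in worker function (an illustration only; millrun itself uses multiprocessing, while this sketch uses threads to stay self-contained):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def run_notebook(notebook, params):
    """Stand-in for a papermill execution of one notebook/params pair."""
    return f"{notebook} with {params}"

notebooks = ["a.ipynb", "b.ipynb"]
variations = [{"x": 0}, {"x": 1}, {"x": 2}]

# Strategy 1: notebooks in sequence, parameter variations in parallel.
results = []
with ThreadPoolExecutor() as pool:
    for nb in notebooks:
        results.extend(pool.map(partial(run_notebook, nb), variations))

# Strategy 2 would invert the loops: one parallel task per notebook,
# each running its parameter variations sequentially.
print(len(results))  # 6
```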

Because of my own use cases, option 1 is more efficient for me: I have far more parameter variations than I do notebooks.

However, this method becomes inefficient if you have MANY notebooks and only 1-3 variations. In that case, you would probably prefer method 2. It would still be faster than single-process execution.

If you need this use case then feel free to raise an issue and/or contribute a PR to implement it as an option for execution.

Troubleshooting

There seems to be a behaviour I did not plan for with the parallel execution: if there is an error in the execution process, that iteration is simply skipped. I don't have any try/except in the code that would cause this.

So, if you find that execution seems to happen "too quickly" or you have missing files, try executing your run in single-process mode as a Python library and see if you get any errors. Then correct and re-run in CLI mode.

