jupyter-cache

A defined interface for working with a cache of jupyter notebooks.

NOTE: This package is in an Alpha stage and liable to change.

Some desired requirements (not yet all implemented):

  • Persistent
  • Separates out "edits to content" from "edits to code cells". Cell rearranges and code cell changes should require a re-execution. Content changes should not.
  • Allow parallel access to notebooks (for execution)
  • Store execution statistics/reports
  • Store external assets: notebooks being executed often require external assets (imported scripts, data, etc.), which are prepared by the user.
  • Store execution artifacts: created during execution
  • A transparent and robust cache invalidation: imagine the user updating an external dependency or a Python module, or checking out a different git branch.

Install

pip install -e "git+https://github.com/ExecutableBookProject/jupyter-cache.git#egg=jupyter-cache[cli]"

For development:

git clone https://github.com/ExecutableBookProject/jupyter-cache
cd jupyter-cache
git checkout develop
pip install -e .[cli,code_style,testing]

Example API usage

to come ...

Example CLI usage

From the checked-out repository folder:

$ jcache --help
Usage: jcache [OPTIONS] COMMAND [ARGS]...

  The command line interface of jupyter-cache.

Options:
  -v, --version       Show the version and exit.
  -p, --cache-path    Print the current cache path and exit.
  -a, --autocomplete  Print the autocompletion command and exit.
  -h, --help          Show this message and exit.

Commands:
  cache    Commands for adding to and inspecting the cache.
  clear    Clear the cache completely.
  config   Commands for configuring the cache.
  execute  Execute staged notebooks that are outdated.
  stage    Commands for staging notebooks to be executed.

Important: Execute this in the terminal for auto-completion:

eval "$(_JCACHE_COMPLETE=source jcache)"

Caching Executed Notebooks

$ jcache cache --help
Usage: cache [OPTIONS] COMMAND [ARGS]...

  Commands for adding to and inspecting the cache.

Options:
  --help  Show this message and exit.

Commands:
  add                 Cache notebook(s) that have already been executed.
  add-with-artefacts  Cache a notebook, with possible artefact files.
  cat-artifact        Print the contents of a cached artefact.
  diff-nb             Print a diff of a notebook to one stored in the cache.
  list                List cached notebook records in the cache.
  remove              Remove notebooks stored in the cache.
  show                Show details of a cached notebook in the cache.

The first time the cache is required, it will be lazily created:

$ jcache cache list
Cache path: ../.jupyter_cache
The cache does not yet exist, do you want to create it? [y/N]: y
No Cached Notebooks

You can add notebooks straight into the cache. When caching, a check will be made that the notebooks look to have been executed correctly, i.e. the cell execution counts go sequentially up from 1.

$ jcache cache add tests/notebooks/basic.ipynb
Caching: ../tests/notebooks/basic.ipynb
Validity Error: Expected cell 1 to have execution_count 1 not 2
The notebook may not have been executed, continue caching? [y/N]: y
Success!
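The validity check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not jupyter-cache's own code: the function name `find_execution_count_error` is hypothetical, the notebook is assumed to be a plain dict in the nbformat JSON layout, and the cell index in the message is assumed to count all cells (including markdown ones).

```python
def find_execution_count_error(notebook: dict):
    """Return a message for the first out-of-sequence code cell, or None.

    Hypothetical helper: code cells are expected to carry execution counts
    1, 2, 3, ... in document order; non-code cells are ignored.
    """
    expected = 1
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        count = cell.get("execution_count")
        if count != expected:
            return (
                f"Expected cell {index} to have "
                f"execution_count {expected} not {count}"
            )
        expected += 1
    return None


# A notebook executed top to bottom in a fresh kernel.
executed = {
    "cells": [
        {"cell_type": "markdown", "source": "# Title"},
        {"cell_type": "code", "execution_count": 1, "source": "a = 1"},
        {"cell_type": "code", "execution_count": 2, "source": "print(a)"},
    ]
}

# A notebook whose first code cell was run twice.
rerun = {
    "cells": [
        {"cell_type": "markdown", "source": "# Title"},
        {"cell_type": "code", "execution_count": 2, "source": "a = 1"},
    ]
}

print(find_execution_count_error(executed))
print(find_execution_count_error(rerun))
```

A notebook that fails such a check is not necessarily wrong (a cell may simply have been re-run), which is presumably why the CLI asks for confirmation rather than refusing outright.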

Or to skip validation:

$ jcache cache add --no-validate tests/notebooks/basic.ipynb tests/notebooks/basic_failing.ipynb tests/notebooks/basic_unrun.ipynb tests/notebooks/complex_outputs.ipynb tests/notebooks/external_output.ipynb
Caching: ../tests/notebooks/basic.ipynb
Caching: ../tests/notebooks/basic_failing.ipynb
Caching: ../tests/notebooks/basic_unrun.ipynb
Caching: ../tests/notebooks/complex_outputs.ipynb
Caching: ../tests/notebooks/external_output.ipynb
Success!

Once you've cached some notebooks, you can look at the 'cache records' for what has been cached.

Each notebook is hashed (code cells and kernel spec only), and this hash is used to compare against 'staged' notebooks. Multiple hashes for the same URI can be added (the URI is just there for inspection). The size of the cache is limited (current default 1000), so that once this size is reached the least recently accessed records begin to be deleted. You can remove cached records by their ID.

$ jcache cache list
  ID  Origin URI                             Created           Accessed
----  -------------------------------------  ----------------  ----------------
   5  tests/notebooks/external_output.ipynb  2020-03-12 17:31  2020-03-12 17:31
   4  tests/notebooks/complex_outputs.ipynb  2020-03-12 17:31  2020-03-12 17:31
   3  tests/notebooks/basic_unrun.ipynb      2020-03-12 17:31  2020-03-12 17:31
   2  tests/notebooks/basic_failing.ipynb    2020-03-12 17:31  2020-03-12 17:31

Tip: Use the --latest-only option, to only show the latest versions of cached notebooks.
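To illustrate hashing "code cells and kernel spec only", here is a minimal sketch. It is not the hashing routine jupyter-cache uses: the function name `hash_notebook` is hypothetical, the notebook is assumed to be a plain dict in the nbformat JSON layout, and MD5 is an assumption made only because the hash keys shown in this README are 32 hexadecimal characters long.

```python
import hashlib
import json


def hash_notebook(notebook: dict) -> str:
    """Hash only the code cell sources and the kernel spec (hypothetical helper)."""
    code_sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown/raw cells do not affect the hash
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        code_sources.append(source)
    kernelspec = notebook.get("metadata", {}).get("kernelspec", {})
    payload = json.dumps(
        {"code": code_sources, "kernelspec": kernelspec}, sort_keys=True
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


base = {
    "metadata": {"kernelspec": {"name": "python3"}},
    "cells": [
        {"cell_type": "markdown", "source": "Some text"},
        {"cell_type": "code", "source": "a = 1\nprint(a)"},
    ],
}

# Same code, different markdown: the hash should be unchanged.
text_edit = {
    "metadata": base["metadata"],
    "cells": [
        {"cell_type": "markdown", "source": "Some reworded text"},
        {"cell_type": "code", "source": "a = 1\nprint(a)"},
    ],
}

# Different code: the hash should change, so re-execution is needed.
code_edit = {
    "metadata": base["metadata"],
    "cells": [
        {"cell_type": "markdown", "source": "Some text"},
        {"cell_type": "code", "source": "a = 2\nprint(a)"},
    ],
}

print(hash_notebook(base) == hash_notebook(text_edit))
print(hash_notebook(base) == hash_notebook(code_edit))
```

Because the code sources are kept in an ordered list, rearranging code cells also changes the hash in this sketch, in line with the requirement that cell rearrangements trigger re-execution.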

You can also cache notebooks with artefacts (external outputs of the notebook execution).

$ jcache cache add-with-artefacts -nb tests/notebooks/basic.ipynb tests/notebooks/artifact_folder/artifact.txt
Caching: ../tests/notebooks/basic.ipynb
Validity Error: Expected cell 1 to have execution_count 1 not 2
The notebook may not have been executed, continue caching? [y/N]: y
Success!

Show a full description of a cached notebook by referring to its ID:

$ jcache cache show 6
ID: 6
Origin URI: ../tests/notebooks/basic.ipynb
Created: 2020-03-12 17:31
Accessed: 2020-03-12 17:31
Hashkey: 818f3412b998fcf4fe9ca3cca11a3fc3
Artifacts:
- artifact_folder/artifact.txt

Note that artefact paths must be inside the notebook folder (or one of its sub-folders):

$ jcache cache add-with-artefacts -nb tests/notebooks/basic.ipynb tests/test_db.py
Caching: ../tests/notebooks/basic.ipynb
Artifact Error: Path '../tests/test_db.py' is not in folder '../tests/notebooks''
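A folder-containment check of this kind can be sketched with `pathlib`. This is an illustration under assumptions, not the check jupyter-cache performs: the function name `relative_artifact_path` is hypothetical, and the paths are resolved against the current working directory without needing to exist.

```python
from pathlib import Path


def relative_artifact_path(artifact: str, notebook_folder: str) -> Path:
    """Return the artifact path relative to the notebook folder.

    Hypothetical helper: raises ValueError if the artifact is not inside
    the notebook folder or one of its sub-folders.
    """
    folder = Path(notebook_folder).resolve()
    target = Path(artifact).resolve()
    try:
        return target.relative_to(folder)
    except ValueError:
        raise ValueError(
            f"Path '{artifact}' is not in folder '{notebook_folder}'"
        ) from None


inside = relative_artifact_path(
    "tests/notebooks/artifact_folder/artifact.txt", "tests/notebooks"
)
print(inside.as_posix())

try:
    relative_artifact_path("tests/test_db.py", "tests/notebooks")
except ValueError as error:
    print(f"Artifact Error: {error}")
```

Storing the path relative to the notebook folder matches how the artefact is listed in the `jcache cache show` output above (`artifact_folder/artifact.txt`).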

To view the contents of an execution artefact:

$ jcache cache cat-artifact 6 artifact_folder/artifact.txt
An artifact

You can directly remove a cached notebook by its ID:

$ jcache cache remove 4
Removing Cache ID = 4
Success!

You can also diff any of the cached notebooks with any (external) notebook:

$ jcache cache diff-nb 2 tests/notebooks/basic.ipynb
nbdiff
--- cached pk=2
+++ other: ../tests/notebooks/basic.ipynb
## inserted before nb/cells/0:
+  code cell:
+    execution_count: 2
+    source:
+      a=1
+      print(a)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          1

## deleted nb/cells/0:
-  code cell:
-    source:
-      raise Exception('oopsie!')


Success!

Staging Notebooks for execution

$ jcache stage --help
Usage: stage [OPTIONS] COMMAND [ARGS]...

  Commands for staging notebooks to be executed.

Options:
  --help  Show this message and exit.

Commands:
  add              Stage notebook(s) for execution.
  add-with-assets  Stage a notebook, with possible asset files.
  list             List notebooks staged for possible execution.
  remove-ids       Un-stage notebook(s), by ID.
  remove-uris      Un-stage notebook(s), by URI.
  show             Show details of a staged notebook.

Staged notebooks are recorded as pointers to their URI, i.e. no physical copying takes place until execution time.

If you stage some notebooks for execution, then you can list them to see which have existing records in the cache (by hash), and which will require execution:

$ jcache stage add tests/notebooks/basic.ipynb tests/notebooks/basic_failing.ipynb tests/notebooks/basic_unrun.ipynb tests/notebooks/complex_outputs.ipynb tests/notebooks/external_output.ipynb
Staging: ../tests/notebooks/basic.ipynb
Staging: ../tests/notebooks/basic_failing.ipynb
Staging: ../tests/notebooks/basic_unrun.ipynb
Staging: ../tests/notebooks/complex_outputs.ipynb
Staging: ../tests/notebooks/external_output.ipynb
Success!
$ jcache stage list
  ID  URI                                    Created             Assets    Cache ID
----  -------------------------------------  ----------------  --------  ----------
   5  tests/notebooks/external_output.ipynb  2020-03-12 17:31         0           5
   4  tests/notebooks/complex_outputs.ipynb  2020-03-12 17:31         0
   3  tests/notebooks/basic_unrun.ipynb      2020-03-12 17:31         0           6
   2  tests/notebooks/basic_failing.ipynb    2020-03-12 17:31         0           2
   1  tests/notebooks/basic.ipynb            2020-03-12 17:31         0           6
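The "Cache ID" column can be thought of as a lookup from each staged notebook's hash into the cache records. The sketch below shows that lookup with plain dictionaries; it is not jupyter-cache's implementation, and the function name `match_staged_to_cache` and the hash values are made up for illustration.

```python
from typing import Dict, Optional


def match_staged_to_cache(
    staged_hashes: Dict[str, str], cache_ids_by_hash: Dict[str, int]
) -> Dict[str, Optional[int]]:
    """Map each staged URI to a cache ID, or None if no record matches."""
    return {
        uri: cache_ids_by_hash.get(nb_hash)
        for uri, nb_hash in staged_hashes.items()
    }


# Hypothetical hashes: two notebooks with identical code cells share a hash.
staged_hashes = {
    "tests/notebooks/basic.ipynb": "hash-a",
    "tests/notebooks/basic_unrun.ipynb": "hash-a",
    "tests/notebooks/complex_outputs.ipynb": "hash-b",
}
cache_ids_by_hash = {"hash-a": 6}

matches = match_staged_to_cache(staged_hashes, cache_ids_by_hash)
needs_execution = [uri for uri, cache_id in matches.items() if cache_id is None]

print(matches)
print(needs_execution)
```

This would explain why basic.ipynb and basic_unrun.ipynb both show Cache ID 6 in the listing above: presumably their code cells and kernel spec are identical, so a single cache record serves both, while a notebook with no matching record is the one that requires execution.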

You can remove a staged notebook by its URI or ID:

$ jcache stage remove-ids 4
Unstaging ID: 4
Success!

You can then run a basic execution of the required notebooks:

$ jcache cache remove 6 2
Removing Cache ID = 6
Removing Cache ID = 2
Success!
$ jcache execute
Executing: ../tests/notebooks/basic.ipynb
Execution Succeeded: ../tests/notebooks/basic.ipynb
Executing: ../tests/notebooks/basic_failing.ipynb
error: Execution Failed: ../tests/notebooks/basic_failing.ipynb
Executing: ../tests/notebooks/basic_unrun.ipynb
Execution Succeeded: ../tests/notebooks/basic_unrun.ipynb
Finished! Successfully executed notebooks have been cached.
succeeded:
- ../tests/notebooks/basic.ipynb
- ../tests/notebooks/basic_unrun.ipynb
excepted:
- ../tests/notebooks/basic_failing.ipynb
errored: []

Successfully executed notebooks will be added to the cache, along with any 'artefacts' created by the execution that are inside the notebook folder, and any data supplied by the executor.

$ jcache stage list
  ID  URI                                    Created             Assets    Cache ID
----  -------------------------------------  ----------------  --------  ----------
   5  tests/notebooks/external_output.ipynb  2020-03-12 17:31         0           5
   3  tests/notebooks/basic_unrun.ipynb      2020-03-12 17:31         0           6
   2  tests/notebooks/basic_failing.ipynb    2020-03-12 17:31         0
   1  tests/notebooks/basic.ipynb            2020-03-12 17:31         0           6

Execution data (such as execution time) will be stored in the cache record:

$ jcache cache show 6
ID: 6
Origin URI: ../tests/notebooks/basic_unrun.ipynb
Created: 2020-03-12 17:31
Accessed: 2020-03-12 17:31
Hashkey: 818f3412b998fcf4fe9ca3cca11a3fc3
Data:
  execution_seconds: 1.0559415130000005

Failed notebooks will not be cached, but the exception traceback will be added to the stage record:

$ jcache stage show 2
ID: 2
URI: ../tests/notebooks/basic_failing.ipynb
Created: 2020-03-12 17:31
Failed Last Execution!
Traceback (most recent call last):
  File "../jupyter_cache/executors/basic.py", line 152, in execute
    executenb(nb_bundle.nb, cwd=tmpdirname)
  File "/anaconda/envs/mistune/lib/python3.7/site-packages/nbconvert/preprocessors/execute.py", line 737, in executenb
    return ep.preprocess(nb, resources, km=km)[0]
  File "/anaconda/envs/mistune/lib/python3.7/site-packages/nbconvert/preprocessors/execute.py", line 405, in preprocess
    nb, resources = super(ExecutePreprocessor, self).preprocess(nb, resources)
  File "/anaconda/envs/mistune/lib/python3.7/site-packages/nbconvert/preprocessors/base.py", line 69, in preprocess
    nb.cells[index], resources = self.preprocess_cell(cell, resources, index)
  File "/anaconda/envs/mistune/lib/python3.7/site-packages/nbconvert/preprocessors/execute.py", line 448, in preprocess_cell
    raise CellExecutionError.from_cell_and_msg(cell, out)
nbconvert.preprocessors.execute.CellExecutionError: An error occurred while executing the following cell:
------------------
raise Exception('oopsie!')
------------------

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-1-714b2b556897> in <module>
----> 1 raise Exception('oopsie!')

Exception: oopsie!
Exception: oopsie!

Once executed you may leave staged notebooks, for later re-execution, or remove them:

$ jcache stage remove-ids --all
Are you sure you want to remove all? [y/N]: y
Unstaging ID: 1
Unstaging ID: 2
Unstaging ID: 3
Unstaging ID: 5
Success!

You can also stage notebooks with assets: external files that are required by the notebook during execution. As with artefacts, these files must be in the same folder as the notebook, or in a sub-folder.

$ jcache stage add-with-assets -nb tests/notebooks/basic.ipynb tests/notebooks/artifact_folder/artifact.txt
Success!
$ jcache stage show 1
ID: 1
URI: ../tests/notebooks/basic.ipynb
Created: 2020-03-12 17:31
Cache ID: 6
Assets:
- ../tests/notebooks/artifact_folder/artifact.txt

Contributing

Code Style

Code style is tested using flake8, with the configuration set in .flake8, and code formatted with black.

Installing with jupyter-cache[code_style] makes the pre-commit package available, which will ensure this style is met before commits are submitted, by reformatting the code and testing for lint errors. It can be set up by:

>> cd jupyter-cache
>> pre-commit install

Optionally you can run black and flake8 separately:

>> black .
>> flake8 .

Editors like VS Code also have automatic code reformat utilities, which can adhere to this standard.

Pull Requests

To contribute, make Pull Requests to the develop branch (this is the default branch). A PR can consist of one or multiple commits. Before you open a PR, make sure to clean up your commit history and create the commits that you think best divide up the total work (use git rebase and git commit --amend). Ensure all commit messages clearly summarise the changes in the header and the problem that the commit is solving in the body.

Merging pull requests: There are three ways of 'merging' pull requests on GitHub:

  • Squash and merge: take all commits, squash them into a single one and put it on top of the base branch. Choose this for pull requests that address a single issue and are well represented by a single commit. Make sure to clean the commit message (title & body).
  • Rebase and merge: take all commits and 'recreate' them on top of the base branch. All commits will be recreated with new hashes. Choose this for pull requests that require more than a single commit. Examples: PRs that contain multiple commits with individually significant changes; PRs that have commits from different authors (squashing commits would remove attribution).
  • Merge with merge commit: put all commits as they are on the base branch, with a merge commit on top. Choose this for collaborative PRs with many commits. Here, the merge commit provides actual benefits.
