Render Jupyter Notebooks in Metaflow Cards

metaflow-card-notebook

Use @card(type='notebook') to programmatically run and render notebooks in your Flows.

metaflow-card-notebook
Motivation
Installation
Example: Model Dashboard
Usage
Customized Rendering
Common Issues

Motivation

You may have seen the series of blog posts about notebook infrastructure at Netflix. Of particular interest is how notebooks are run programmatically, often in DAGs, to generate reports and dashboards:

Executed Parameterized Notebooks	Notebooks in DAGs	Managing Dependencies & Scheduling

This way of generating reports and dashboards is compelling because it lets data scientists create content using environments and tools they are familiar with. With @card(type='notebook') you can programmatically run and render notebooks as part of a DAG. This card lets you accomplish the following with an easy-to-use API:

Run notebook(s) programmatically in your Metaflow DAGs.
Access data from any step in your DAG so you can visualize it or otherwise use it to generate reports in a notebook.
Render your notebooks as reports or model cards that can be embedded in various apps.
Inject custom parameters into your notebook for execution.
Ensure that notebook outputs are reproducible.

Additionally, you can use all of the features of Metaflow to manage the execution of notebooks, for example:

Managing dependencies (ex: @conda)
Requesting compute (ex: @resources)
Parallel execution (ex: foreach)
etc.

For example, here is a screenshot of a report you will be able to generate if you follow the steps below:

Installation

pip install metaflow-card-notebook

Example: Model Dashboard

Before diving into how this card works, it is instructive to run an example. To run the example, follow the steps below:

Change into the example directory:
```
cd example
```
Run the example flow:
```
python model_dashboard.py run
```
View the card created by the DAG you just ran:
```
python model_dashboard.py card view nb_auto
```
You will be presented with a simple dashboard:

To learn how to use this in your own workflows, proceed to the Usage section.

Usage

Step 1: Prepare your notebook

The notebook card injects the following three variables into your notebook:

run_id
task_id
flow_name

You can use these variables to retrieve the data you need from a Flow. It is recommended that the first cell in your notebook defines these variables, and that you designate this cell with the tag parameters.

For an example of this, see tests/nbflow.ipynb.
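A minimal sketch of such a first cell is shown below. The `None` placeholders are an assumption for illustration. When the notebook is executed through the card, papermill adds a cell right after the one tagged parameters that assigns the real values.

```python
# First cell of the notebook, tagged "parameters".
# The values are placeholders (an assumption for illustration);
# the injected cell that follows overrides them at execution time.
run_id = None
task_id = None
flow_name = None
```

While prototyping interactively, you can temporarily replace the placeholders with the identifiers of a run you executed manually.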

Step 2: Prepare your flow with the notebook card

You can render cards from notebooks using the @card(type='notebook') decorator on a step. For example, in tests/nbflow.py, the notebook tests/nbflow.ipynb is run and rendered programmatically:

from metaflow import step, current, FlowSpec, Parameter, card

class NBFlow(FlowSpec):

    exclude_nb_input = Parameter('exclude_nb_input', default=True, type=bool)

    @step
    def start(self):
        self.data_for_notebook = "I Will Print Myself From A Notebook"
        self.next(self.end)
    
    @card(type='notebook')
    @step
    def end(self):
        self.nb_options_dict = dict(input_path='nbflow.ipynb', exclude_input=self.exclude_nb_input)

if __name__ == '__main__':
    NBFlow()

Note how the start step stores some data that we want to access from a notebook later. We will discuss how to access this data from a notebook in the next step.

By default, a step that is decorated with @card(type='notebook') expects the variable nb_options_dict to be defined in the step. This variable is a dictionary of arguments that is passed to papermill.execute_notebook. Only the input_path argument is required. If output_path is absent, it is automatically set to _rendered_<run_id>_<task_id>_<your_input_notebook_name>.ipynb.

Furthermore, exclude_input is an additional boolean argument that specifies whether to hide cell inputs in the rendered card. It is False by default.
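To make these defaults concrete, here is a small sketch of how the options could be assembled. The helper `build_nb_options` is hypothetical and is not part of the card; it only mirrors the defaults described above, and it assumes the default output file name is built from the notebook name without its directory or extension.

```python
from pathlib import Path


def build_nb_options(input_path, run_id, task_id, **overrides):
    """Hypothetical helper mirroring the defaults described above."""
    stem = Path(input_path).stem  # notebook name without ".ipynb"
    options = {
        "input_path": input_path,  # the only required key
        "output_path": f"_rendered_{run_id}_{task_id}_{stem}.ipynb",
        "exclude_input": False,  # documented default
    }
    options.update(overrides)  # explicitly passed values win over defaults
    return options


options = build_nb_options("nbflow.ipynb", run_id="1234", task_id="5")
print(options["output_path"])  # _rendered_1234_5_nbflow.ipynb
```

In a real flow you only need to set the keys you want to change in nb_options_dict; the card fills in the rest.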

Step 3: Prototype the rest of your notebook

Recall that run_id, task_id, and flow_name are injected into the notebook. We can use these in a notebook with Metaflow's utilities for inspecting Flows and Results. We demonstrate this in tests/nbflow.ipynb:

Some notes about this notebook:

We recommend printing the variables injected into the notebook. This can help with debugging and provide an easy to locate lineage.
We demonstrate how to access this data via a Step or a Task object. You can read more about the relationship between these items in these docs. In short, a Task is a child of a Step, because a Step can have many Tasks (for example, if you use a foreach construct for parallelism).
We recommend executing a run manually and prototyping the notebook interactively by temporarily supplying the run_id, flow_name, etc. to achieve the desired result.
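As a rough sketch of the Task-based access described in these notes, the injected variables can be combined into a pathspec and handed to Metaflow's client API. The helper names `task_pathspec` and `load_artifact` are hypothetical; `load_artifact` assumes Metaflow is installed and that the run already exists in your datastore.

```python
def task_pathspec(flow_name, run_id, step_name, task_id):
    """Build a 'Flow/run/step/task' pathspec for the Metaflow client."""
    return f"{flow_name}/{run_id}/{step_name}/{task_id}"


def load_artifact(flow_name, run_id, step_name, task_id, name):
    """Fetch one artifact from a finished task (needs Metaflow and a real run)."""
    from metaflow import Task  # imported lazily so task_pathspec works without Metaflow

    task = Task(task_pathspec(flow_name, run_id, step_name, task_id))
    return getattr(task.data, name)


# In the notebook, with the injected variables available, for the flow above:
# load_artifact(flow_name, run_id, "end", task_id, "data_for_notebook")
```

The step name "end" is used because that is the step carrying the card in tests/nbflow.py; artifacts set in start are still available there.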

Step 4: Test the card

To test the card in the example outlined above, you must first run the flow (the parentheses run the commands in a subshell):

(cd tests && python nbflow.py run)

Then, render the card:

(cd tests && python nbflow.py card view end)

In this flow, the cell inputs are hidden when the card is rendered, because the exclude_nb_input parameter defaults to True. For learning purposes it can be useful to render the card with the inputs to validate how the card is executed. You can do this by setting the exclude_nb_input parameter defined in the flow to False:

(cd tests && python nbflow.py run --exclude_nb_input=False && python nbflow.py card view end)

Customized Rendering

The @card(type='notebook') decorator is an opinionated way to execute and render notebooks, trading flexibility for significantly less code. While some customization is possible by passing the appropriate arguments to nb_options_dict as listed in papermill.execute_notebook, you can achieve more fine-grained control by executing and rendering the notebook yourself and using the html card. We show an example of this in example/model_dashboard.py:

    @card(type='html')
    @step
    def nb_manual(self):
        """
        Run & Render Jupyter Notebook Manually With The HTML Card.
        
        Using the html card provides you greater control over notebook execution and rendering.
        """
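        import subprocess

        import papermill as pm

        output_nb_path = 'notebooks/rendered_Evaluate.ipynb'
        output_html_path = output_nb_path.replace('.ipynb', '.html')

        # Execute the notebook, injecting the current run's identifiers as parameters.
        pm.execute_notebook('notebooks/Evaluate.ipynb',
                            output_nb_path,
                            parameters=dict(run_id=current.run_id,
                                            flow_name=current.flow_name))
        # Convert the executed notebook to HTML without cell inputs or prompts
        # (subprocess.run stands in for the file's `run` shell helper).
        subprocess.run(['jupyter', 'nbconvert', '--to', 'html', '--no-input',
                        '--no-prompt', output_nb_path], check=True)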
        with open(output_html_path, 'r') as f:
            self.html = f.read()
        self.next(self.end)

You can run the following command in your terminal to see the output of this step (it may take several minutes):

(cd example && python model_dashboard.py run && python model_dashboard.py card view nb_manual)

Common Issues

Papermill Arguments

Many issues can be resolved by providing the right arguments to papermill.execute_notebook. Below are some common issues and examples of how to resolve them:

Kernel Name: The name of the Python kernel you use locally may be different from the one in your remote execution environment. By default, papermill will attempt to find a kernel name in the metadata of your notebook, which is often created automatically when you select a kernel while running a notebook. You can use the kernel_name argument to specify a kernel. Below is an example:

    @card(type='notebook')
    @step
    def end(self):
        self.nb_options_dict = dict(input_path='nbflow.ipynb', kernel_name='Python3')
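To check which kernel name papermill would pick up from the notebook metadata, you can read it directly from the .ipynb file. This is a sketch using only the standard library; the helper name `notebook_kernel_name` is hypothetical, and it assumes the usual notebook layout where the name is stored under metadata.kernelspec.name.

```python
import json


def notebook_kernel_name(path):
    """Return the kernel name stored in a notebook's metadata, or None if absent."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    return nb.get("metadata", {}).get("kernelspec", {}).get("name")


# Example: notebook_kernel_name("nbflow.ipynb")
```

If the returned name does not exist in the environment where the step runs, pass kernel_name explicitly as shown above.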

Working Directory: The working directory may be important when your notebook is executed, especially if your notebook relies on certain files or other assets. You can set the working directory the notebook is executed in with the cwd argument. For example, to set the working directory to data/:

    @card(type='notebook')
    @step
    def end(self):
        self.nb_options_dict = dict(input_path='nbflow.ipynb', cwd='data/')

Dependency Management

If you are running your flow remotely, you must remember to include the dependencies for this notebook card itself. One way to do this is with the @conda decorator:

    @conda(libraries={'metaflow-card-notebook':'1.0.1'}) # use the right version number, this is just illustrative.
    @card(type='notebook')
    @step
    def end(self):
        self.nb_options_dict = dict(input_path='nbflow.ipynb')

Remote Execution

If you are running steps remotely, for example with @batch, you must ensure that your notebooks are uploaded to the remote environment with the CLI argument --package-suffixes=".ipynb". For example, to execute example/model_dashboard.py with this argument:

(cd example && python model_dashboard.py --package-suffixes=".ipynb" run)
