Convert notebooks to modular code
nbmodular
Convert data science notebooks with poor modularity to fully modular notebooks that are automatically exported as python modules.
Motivation
In data science, it is common to develop experimentally and quickly based on notebooks, with little regard for software engineering practices and modularity. It can be challenging to start working on someone else's notebooks when they have no modularity in terms of separate functions and contain a great deal of code duplicated across notebooks. This makes it difficult to understand the logic in terms of semantically separate units, to see the commonalities and differences between the notebooks, and to extend, generalize, and configure the current solution.
Objectives
nbmodular is a library conceived with the objective of helping to convert the cells of a notebook into separate functions with clear dependencies in terms of inputs and outputs. This is done through a combination of tools which semi-automatically understand the data flow in the code, based on mild assumptions about its structure. It also helps test the current logic and compare it against the modularized solution, to make sure that the refactored code is equivalent to the original one.
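As a rough sketch of this kind of data-flow analysis (illustrative only, not nbmodular's actual implementation), Python's `ast` module can be used to infer a cell's inputs (names it loads that earlier cells created) and outputs (names it creates that later cells use):

```python
import ast

def names_in(code, ctx):
    """Collect variable names appearing in the given AST context (ast.Load or ast.Store)."""
    return {node.id for node in ast.walk(ast.parse(code))
            if isinstance(node, ast.Name) and isinstance(node.ctx, ctx)}

def infer_io(cells, i):
    """Infer inputs/outputs of cell i: inputs are names used in cell i and
    created in previous cells; outputs are names created in cell i and used
    in posterior cells."""
    created_before = set().union(*(names_in(c, ast.Store) for c in cells[:i]))
    used_after = set().union(*(names_in(c, ast.Load) for c in cells[i + 1:]))
    inputs = names_in(cells[i], ast.Load) & created_before
    outputs = names_in(cells[i], ast.Store) & used_after
    return sorted(inputs), sorted(outputs)

cells = ["a = 2\nb = 3\nc = a+b", "d = 10", "a = a + d\nb = b + d\nc = c + d"]
print(infer_io(cells, 0))  # ([], ['a', 'b', 'c'])
print(infer_io(cells, 2))  # (['a', 'b', 'c', 'd'], [])
```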
Features
- Convert cells to functions.
- The logic of a single function can be written across multiple cells.
- Optional: processed cells can continue to operate as cells or be only used as functions from the moment they are converted.
- Create an additional pipeline function that provides the data-flow from the first to the last function call in the notebook.
- Write all the notebook functions to a separate python module.
- Compare the result of the pipeline with the result of running the original notebook.
- Converted functions act as nodes in a dependency graph. These nodes can optionally hold the values of local variables for inspection outside of the function. This is similar to having a single global scope, which is the original situation. Since this is memory-consuming, it is optional and may not be the default.
- Optional: once we are able to construct a graph, we may be able to draw it or show it as text, and pass it to DAG processors that can run functions sequentially or in parallel.
- Persist the inputs and outputs of functions, so that we may decide to reuse previous results without running the whole notebook.
- Optional: if we have the dependency graph and persisted inputs / outputs, we may decide to only run those cells that are predecessors of the current one, i.e., the ones that provide the inputs needed by the current cell.
- Optional: if we associate a hash code to input data, we may only run the cells when the input data changes. Similarly, if we associate a hash code with AST-converted function code, we may only run those cells whose code has been updated.
- Optional: have a mechanism for indicating test examples that go into different test python files.
- Optional: the output of a test cell can be used for assertions, where we require that the current output is the same as the original one.
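The persistence and hashing ideas above can be sketched as follows (a minimal illustration; `run_cached` and `code_hash` are hypothetical helpers, not nbmodular's API). Hashing the AST dump instead of the raw source means formatting-only edits do not trigger a re-run:

```python
import ast
import hashlib

_cache = {}  # (function name, code hash, inputs hash) -> stored result

def code_hash(source: str) -> str:
    # Hash the AST dump so whitespace/comment-only edits keep the same hash.
    return hashlib.sha256(ast.dump(ast.parse(source)).encode()).hexdigest()

def run_cached(name, source, inputs, run):
    # Re-run `run` only when the function code or the inputs changed.
    inputs_hash = hashlib.sha256(repr(sorted(inputs.items())).encode()).hexdigest()
    key = (name, code_hash(source), inputs_hash)
    if key not in _cache:
        _cache[key] = run(**inputs)
    return _cache[key]

calls = []
def get_d():
    calls.append(1)
    return 10

run_cached("get_d", "d = 10", {}, get_d)
run_cached("get_d", "d = 10", {}, get_d)  # cache hit: get_d runs only once
```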
Roadmap
- Convert cell code into functions:
- Inputs are those variables detected in current cell and also detected in previous cells. This solution requires that created variables have unique names across the notebook. However, even if a new variable with the same name is defined inside the cell, the resulting function is still correct.
- Outputs are, at this moment, all the variables detected in current cell that are also detected in posterior cells.
- Filter out outputs:
- Variables detected in current cell, and also detected in previous
cells, might not be needed as outputs of the current cell, if the
current cell doesn’t modify those variables. To detect potential
modifications:
- AST:
- If the variable appears only on the right-hand side of assign statements or in if statements.
- If it appears only as an argument of functions which we know don't modify the variable, such as `print`.
- Comparing variable values before and after cell:
- Good for small variables where doing a deep copy is not computationally expensive.
- Using a type checker:
- Annotating the variable as `Final` and using mypy or another type checker to see if it is modified in the code.
- Provide hints:
- Variables that come from other cells might not be needed as output. The remaining are most probably needed.
- Variables that are modified are clearly needed.
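A minimal sketch of the AST-based check (illustrative only; the whitelist of non-mutating functions is an assumption, and method calls such as `x.sort()` are not covered):

```python
import ast

NON_MUTATING = {"print", "len", "repr"}  # assumed not to modify their arguments

def may_be_modified(var: str, cell_code: str) -> bool:
    """Return True if `var` might be modified by the cell: it is an
    assignment target, or an argument of a non-whitelisted function."""
    for node in ast.walk(ast.parse(cell_code)):
        # Assignment target (Store context also covers augmented assignment).
        if isinstance(node, ast.Name) and node.id == var and isinstance(node.ctx, ast.Store):
            return True
        # Argument of a function call outside the whitelist.
        if isinstance(node, ast.Call):
            callee = node.func.id if isinstance(node.func, ast.Name) else None
            if callee not in NON_MUTATING:
                if any(isinstance(a, ast.Name) and a.id == var for a in node.args):
                    return True
    return False

print(may_be_modified("a", "print(a + b)"))  # False: only read by print
print(may_be_modified("a", "a = a + d"))     # True: assignment target
```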
Install
```
pip install nbmodular
```
Usage
Load ipython extension
This allows us to use the following magic commands, among others:
- `function <name_of_function_to_define>`
- `print <name_of_previous_function>`
- `function_info <name_of_previous_function>`
- `print_pipeline`
Let’s go one by one
function

The magic command `%%function` allows us to:

- Run the code in the cell normally, and at the same time detect its input and output dependencies and define a function with these inputs and outputs:

```python
%%function get_initial_values
a = 2
b = 3
c = a+b
print (a+b)
```
5
The code in the previous cell runs as it normally would, and at the same time defines a function named `get_initial_values`, which we can show with the magic command `print`:
```python
def get_initial_values(test=False):
    a = 2
    b = 3
    c = a+b
    print (a+b)
```
This function is defined in the notebook space, so we can invoke it:

```python
get_initial_values()
```

5
The inputs and outputs of the function change dynamically every time we add a new function cell. For example, if we add a new function `get_d`:

```python
%%function get_d
d = 10
```

```python
def get_d():
    d = 10
```
And then a function `add_all` that depends on the previous two functions:

```python
%%function add_all
a = a + d
b = b + d
c = c + d
```

```python
f = %function_info add_all
print(f.code)
```

```python
def add_all(d, a, c, b):
    a = a + d
    b = b + d
    c = c + d
```
```python
from sklearn.utils import Bunch
from pathlib import Path
import joblib
import pandas as pd
import numpy as np

def test_index_pipeline (test=True, prev_result=None, result_file_name="index_pipeline"):
    result = index_pipeline (test=test, load=True, save=True, result_file_name=result_file_name)
    if prev_result is None:
        prev_result = index_pipeline (test=test, load=True, save=True, result_file_name=f"test_{result_file_name}")
    for k in prev_result:
        assert k in result
        if type(prev_result[k]) is pd.DataFrame:
            pd.testing.assert_frame_equal (result[k], prev_result[k])
        elif isinstance(prev_result[k], np.ndarray):
            np.testing.assert_array_equal (result[k], prev_result[k])
        else:
            assert result[k]==prev_result[k]
```
```python
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result

    a, c, b = get_initial_values (test=test)
    d = get_d ()
    add_all (d, a, c, b)

    # save result
    result = Bunch (a=a,c=c,b=b,d=d)
    if save:
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result
```
We can see that the outputs from `get_initial_values` and `get_d` change as needed. We can look at all the functions defined so far by using `print all`:
```python
def get_initial_values(test=False):
    a = 2
    b = 3
    c = a+b
    print (a+b)
    return a,c,b

def get_d():
    d = 10
    return d

def add_all(d, a, c, b):
    a = a + d
    b = b + d
    c = c + d
```
Similarly, the outputs from the last function `add_all` change after we add other functions that depend on it:

```python
%%function print_all
print (a, b, c, d)
```

12 13 15 10
We can see each of the defined functions with `print my_function`, and list all of them with `print all`:
```python
def get_initial_values(test=False):
    a = 2
    b = 3
    c = a+b
    print (a+b)
    return a,c,b

def get_d():
    d = 10
    return d

def add_all(d, a, c, b):
    a = a + d
    b = b + d
    c = c + d
    return a,c,b

def print_all(a, d, c, b):
    print (a, b, c, d)
```
print_pipeline

As we add functions to the notebook, a pipeline function is defined. We can print this pipeline with the magic `print_pipeline`:
```python
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result

    a, c, b = get_initial_values (test=test)
    d = get_d ()
    a, c, b = add_all (d, a, c, b)
    print_all (a, d, c, b)

    # save result
    result = Bunch (a=a,c=c,b=b,d=d)
    if save:
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result
```
This shows the data flow in terms of inputs and outputs. And we can run it:

```python
index_pipeline()
```

{'d': 10, 'b': 13, 'a': 12, 'c': 15}

We can also inspect the list of function processors behind the pipeline:

```python
self = %cell_processor
self.function_list
```

[FunctionProcessor with name get_initial_values, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'all_variables', 'code'])
Arguments: []
Output: ['a', 'c', 'b']
Locals: dict_keys(['a', 'b', 'c']),
FunctionProcessor with name get_d, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'all_variables', 'code'])
Arguments: []
Output: ['d']
Locals: dict_keys(['d']),
FunctionProcessor with name add_all, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'all_variables', 'code'])
Arguments: ['d', 'a', 'c', 'b']
Output: ['a', 'c', 'b']
Locals: dict_keys(['a', 'b', 'c']),
FunctionProcessor with name print_all, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'all_variables', 'code'])
Arguments: ['a', 'd', 'c', 'b']
Output: []
Locals: dict_keys([])]
function_info

We can get access to many of the details of each of the defined functions by calling `function_info` on a given function name:

```python
get_initial_values_info = %function_info get_initial_values
```
This allows us to see:

- The names and values (at the time of running) of the local variables, arguments and results of the function:

```python
get_initial_values_info.arguments
```

[]

```python
get_initial_values_info.current_values
```

{'a': 2, 'b': 3, 'c': 5}

```python
get_initial_values_info.return_values
```

['a', 'c', 'b']
We can also inspect the original code written in the cell…

```python
print (get_initial_values_info.original_code)
```

a = 2
b = 3
c = a+b
print (a+b)
… the code of the defined function:

```python
print (get_initial_values_info.code)
```

def get_initial_values(test=False):
    a = 2
    b = 3
    c = a+b
    print (a+b)
    return a,c,b
… and the AST trees:

```python
print (get_initial_values_info.get_ast (code=get_initial_values_info.original_code))
```
Module(
body=[
Assign(
targets=[
Name(id='a', ctx=Store())],
value=Constant(value=2)),
Assign(
targets=[
Name(id='b', ctx=Store())],
value=Constant(value=3)),
Assign(
targets=[
Name(id='c', ctx=Store())],
value=BinOp(
left=Name(id='a', ctx=Load()),
op=Add(),
right=Name(id='b', ctx=Load()))),
Expr(
value=Call(
func=Name(id='print', ctx=Load()),
args=[
BinOp(
left=Name(id='a', ctx=Load()),
op=Add(),
right=Name(id='b', ctx=Load()))],
keywords=[]))],
type_ignores=[])
None
```python
print (get_initial_values_info.get_ast (code=get_initial_values_info.code))
```
Module(
body=[
FunctionDef(
name='get_initial_values',
args=arguments(
posonlyargs=[],
args=[
arg(arg='test')],
kwonlyargs=[],
kw_defaults=[],
defaults=[
Constant(value=False)]),
body=[
Assign(
targets=[
Name(id='a', ctx=Store())],
value=Constant(value=2)),
Assign(
targets=[
Name(id='b', ctx=Store())],
value=Constant(value=3)),
Assign(
targets=[
Name(id='c', ctx=Store())],
value=BinOp(
left=Name(id='a', ctx=Load()),
op=Add(),
right=Name(id='b', ctx=Load()))),
Expr(
value=Call(
func=Name(id='print', ctx=Load()),
args=[
BinOp(
left=Name(id='a', ctx=Load()),
op=Add(),
right=Name(id='b', ctx=Load()))],
keywords=[])),
Return(
value=Tuple(
elts=[
Name(id='a', ctx=Load()),
Name(id='c', ctx=Load()),
Name(id='b', ctx=Load())],
ctx=Load()))],
decorator_list=[])],
type_ignores=[])
None
Now, we can define another function in a cell that uses variables from the previous function.
cell_processor

This magic allows us to get access to the CellProcessor class managing the logic for running the above magic commands, which can come in handy:

```python
cell_processor = %cell_processor
```
Merging function cells

In order to explore intermediate results, it is convenient to be able to split the code of a function across different cells. This can be done by passing the flag `--merge True`:

```python
x = [1, 2, 3]
y = [100, 200, 300]
z = [u+v for u,v in zip(x,y)]
z
```

[101, 202, 303]
```python
def analyze():
    x = [1, 2, 3]
    y = [100, 200, 300]
    z = [u+v for u,v in zip(x,y)]
    product = [u*v for u, v in zip(x,y)]
```
Test functions
By passing the flag `--test` we can indicate that the logic in the cell is dedicated to testing other functions in the notebook. The test function is defined with the well-known pytest library in mind as the test engine.

This has the following consequences:

- The analysis of dependencies is not associated with variables found in other cells.
- Test functions do not appear in the overall pipeline.
- The data variables used by the test function can be defined in separate test-data cells, which are in turn converted to functions. These functions are called at the beginning of the test cell.
Let's see an example:

```python
a = 5
b = 3
c = 6
d = 7
add_all(d, a, b, c)
```

(12, 10, 13)

```python
# test function add_all
assert add_all(d, a, b, c)==(12, 10, 13)
```
```python
def test_add_all():
    a,c,b,d = test_input_add_all()
    # test function add_all
    assert add_all(d, a, b, c)==(12, 10, 13)
```
```python
def test_input_add_all(test=False):
    a = 5
    b = 3
    c = 6
    d = 7
    return a,c,b,d
```

Test functions are written to a separate test module, with the prefix `test_`:
```
!ls ../tests
```

index.ipynb  test_example.py
Imports
In order to include libraries in our python module, we can use the magic `imports`. These will be written at the beginning of the module:

```python
import pandas as pd
```

Imports can be indicated separately for the test module by passing the flag `--test`:

```python
import matplotlib.pyplot as plt
```