Rapid-iteration data pipelines with automagic caching + parameter passing.

flonb

flonb lets you develop data pipelines (a.k.a. task graphs) with very rapid "modify -> run -> inspect" iteration loops.

It lets you:

Label each run of the pipeline with a human-readable list of options, so it is easy to see what you did to the data and easy to reproduce it.
Cache results automagically based on the labels. Change an option or add a new one, and flonb manages the cache without extra boilerplate!
Run your pipeline in parallel. flonb pipelines are interoperable with dask schedulers.

So you can spend more time focussing on your logic, and less time on error-prone boilerplate code.

Installation

$ pip install flonb

Note that this also installs dask.core. If you require dask extensions (e.g. dask.bag, dask.dataframe, dask.delayed), make sure to install them yourself, because it is not always obvious which dask extensions you have installed. See https://docs.dask.org/en/latest/install.html

Example basic usage

Here is a simple pipeline that counts the number of times a word appears in a block of text.

Define a Task Graph using:

flonb.task_func decorator for each processing step.
flonb.Dep as the default parameter value to tell flonb which tasks to use as inputs.

import flonb


@flonb.task_func
def parse_text(normalise):
    text = (
        "Badger badger badger Mushroom mushroom "
        "Badger badger badger Mushroom mushroom "
        "Badger badger badger SNAKE its a SNAKE "
    )
    words = text.split(" ")
    if normalise:
        words = [w.lower() for w in words]
    return words


@flonb.task_func
def word_count(word, text=flonb.Dep(parse_text)):
    return sum([w == word for w in text])

Run the Task Graph using the .compute method that @flonb.task_func adds to your function.

Note that we supply all the options that specify the entire Task Graph. Even though the function word_count does not explicitly require the option normalise, its dependency parse_text does.

word_count.compute(normalise=False, word="badger")

You can still run your pipeline by manually calling the functions the usual way. This can be handy for debugging:

text = parse_text(normalise=False)
word_count(word="badger", text=text)

Change the options, and everything works like you'd expect.

word_count.compute(normalise=True, word="badger")

Importantly, you can use a single option in multiple places in the same pipeline:

@flonb.task_func
def summary_str(word, normalise, count=flonb.Dep(word_count)):
    return f"word={word}, normalise={normalise} -> count={count}"

print(summary_str.compute(word="mushroom", normalise=True))

word=mushroom, normalise=True -> count=4

A typical workflow using flonb is:

Run pipeline with different options to see the impacts on your data.
Add new options and/or steps.
Repeat.

flonb is especially useful because you can add new options early in the Task Graph without having to manually propagate the option through every function in the pipeline!
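To get a feel for why options do not need to be threaded through every function, here is a minimal sketch of routing options by parameter name. It is an illustration only, not flonb's implementation: the Dep class and the compute function below are hypothetical stand-ins written for this example.

```python
import inspect


class Dep:
    """Hypothetical stand-in for a dependency marker (not flonb.Dep itself)."""

    def __init__(self, func):
        self.func = func


def compute(func, **options):
    """Call `func`, resolving Dep defaults recursively from one flat set of options."""
    kwargs = {}
    for name, param in inspect.signature(func).parameters.items():
        if isinstance(param.default, Dep):
            # A dependency: compute the upstream function with the same options.
            kwargs[name] = compute(param.default.func, **options)
        else:
            # A plain option: look it up by parameter name.
            kwargs[name] = options[name]
    return func(**kwargs)


def parse_text(normalise):
    words = "Badger badger Mushroom".split(" ")
    return [w.lower() for w in words] if normalise else words


def word_count(word, text=Dep(parse_text)):
    return sum(w == word for w in text)


# word_count never mentions `normalise`, yet the option still reaches parse_text.
print(compute(word_count, word="badger", normalise=True))   # 2
print(compute(word_count, word="badger", normalise=False))  # 1
```

Because every function pulls the options it needs by name, adding a new parameter to parse_text in this sketch would require no change to word_count.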

Task Graphs

We can visualize the graph using dask[complete] + graphviz. You can install these with conda:

$ conda install dask python-graphviz

See https://docs.dask.org/en/latest/graphviz.html for other installation options.

import dask
graph, key = word_count.graph_and_key(normalise=False, word="badger")
dask.visualize(graph)

(Image: the rendered task graph for word_count.)
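For orientation, a dask graph is a plain dict that maps keys to tasks, where a task is a tuple whose first element is a callable and whose remaining elements are arguments or other keys. The sketch below builds such a dict by hand and evaluates it with a toy function. The key names and simple_get are assumptions made for this example; the keys flonb actually generates are not shown here.

```python
def parse_text(normalise):
    words = "Badger badger Mushroom".split(" ")
    return [w.lower() for w in words] if normalise else words


def word_count(word, text):
    return sum(w == word for w in text)


# Hypothetical keys, chosen for readability in this example.
graph = {
    "parse_text": (parse_text, False),
    "word_count": (word_count, "badger", "parse_text"),
}


def simple_get(graph, key):
    """Evaluate `key` recursively; a toy stand-in for a dask scheduler's get."""
    task = graph[key]
    func, args = task[0], task[1:]
    # Arguments that are themselves keys in the graph are computed first.
    resolved = [
        simple_get(graph, arg) if isinstance(arg, str) and arg in graph else arg
        for arg in args
    ]
    return func(*resolved)


print(simple_get(graph, "word_count"))  # 1
```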

List dependencies

Oftentimes you will want to combine multiple computations with different versions of an option.

You can do this by supplying a list to flonb.Dep and using the .partial(**options) method.

@flonb.task_func
def count_many_words(
    counts=flonb.Dep(
        [word_count.partial(word="badger"), word_count.partial(word="mushroom")]
    )
):
    return counts  # so we can see how flonb substitutes the counts flonb.Dep parameter

count_many_words.compute(normalise=False)

[6, 2]

Lists can be nested, and non-flonb.Task objects are ignored:

@flonb.task_func
def count_many_words_heterogenous_nested_list(
    counts=flonb.Dep(
        [
            [word, word_count.partial(word=word)]
            for word in ["badger", "mushroom", "snake", "cow"]
        ]
    )
):
    return counts  # so we can see how flonb substitutes the counts flonb.Dep parameter

count_many_words_heterogenous_nested_list.compute(normalise=True)

[['badger', 9], ['mushroom', 4], ['snake', 2], ['cow', 0]]
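The substitution over nested lists can be pictured as a small recursive walk. This is a sketch of the idea under our own assumptions, not flonb's code: the Task class and the substitute function are hypothetical names used only here.

```python
class Task:
    """Hypothetical placeholder for a partially-applied task (not flonb's class)."""

    def __init__(self, func, **options):
        self.func = func
        self.options = options

    def run(self):
        return self.func(**self.options)


def substitute(obj):
    """Replace every Task in a (possibly nested) list with its result."""
    if isinstance(obj, Task):
        return obj.run()
    if isinstance(obj, list):
        return [substitute(item) for item in obj]
    return obj  # anything else passes through unchanged


def length(word):
    return len(word)


spec = [[word, Task(length, word=word)] for word in ["badger", "cow"]]
print(substitute(spec))  # [['badger', 6], ['cow', 3]]
```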

Caching

flonb uses the options supplied to a task to uniquely identify it, and can automagically cache the results on disk using this identifier.

If a task has been cached, flonb fetches the cached result and skips computing both the task and all of its dependencies.

Warning: If you invalidate your cache by changing what an option means (e.g. fixing a bug in your code), make sure to delete your cache!

flonb.set_cache_dir("/tmp")

@flonb.task_func(cache_disk=True)
def power(x, y):
    print(f"computing {x}**{y}")  # side effect: skipped when the cached result is used
    return x**y


# only prints once
power.compute(x=3, y=3)
power.compute(x=3, y=3)

computing 3**3

# each unique set of options labels the cache
power.compute(x=3, y=2)
power.compute(x=2, y=3)

computing 3**2
computing 2**3
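The idea of labelling cache entries by their options can be sketched with the standard library alone. This is an assumption-laden illustration, not how flonb stores its cache: the file naming scheme, cache_path, and cached_power are invented for this example.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # throwaway directory for this demo
calls = []  # records which computations actually ran


def cache_path(func_name, options):
    """Derive a file name from the function name and its sorted options."""
    label = json.dumps({"func": func_name, "options": options}, sort_keys=True)
    digest = hashlib.sha256(label.encode()).hexdigest()
    return CACHE_DIR / f"{digest}.pkl"


def cached_power(x, y):
    path = cache_path("power", {"x": x, "y": y})
    if path.exists():
        return pickle.loads(path.read_bytes())  # cache hit: skip the computation
    calls.append((x, y))
    result = x**y
    path.write_bytes(pickle.dumps(result))
    return result


cached_power(x=3, y=3)
cached_power(x=3, y=3)  # served from disk
cached_power(x=3, y=2)  # different options, so a new cache entry
print(calls)  # [(3, 3), (3, 2)]
```

Note that the key in this sketch depends only on the function name and options, not on the function body. Editing the body would leave stale entries behind, which is the situation the warning above describes.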

Dynamic dependencies

Sometimes you want to know the value of an option before you resolve the dependencies. flonb.DynamicDep has your back here.

@flonb.task_func
def add(x, y):
    return x + y

@flonb.task_func
def multiply(a, b):
    return a * b

@flonb.task_func
def multiply_or_add(
    result=flonb.DynamicDep(lambda mode: {"add": add, "multiply": multiply}[mode])
):
    return result

Here the option 'mode' determines what the task graph looks like, and even what options are available:

multiply_or_add.compute(mode="multiply", a=4, b=3)

multiply_or_add.compute(mode="add", x=4, y=3)
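One way to picture a dynamic dependency is as a two-step lookup: first call a chooser with the options it names, then call whatever it returns with the options that function names. The sketch below shows this with plain functions. It is not flonb's implementation, and call_with_options and compute_dynamic are hypothetical helpers.

```python
import inspect


def add(x, y):
    return x + y


def multiply(a, b):
    return a * b


def call_with_options(func, options):
    """Call `func` with only the options its signature names."""
    names = inspect.signature(func).parameters
    return func(**{name: options[name] for name in names})


def compute_dynamic(chooser, **options):
    """First ask `chooser` which function to run, then run it."""
    chosen = call_with_options(chooser, options)
    return call_with_options(chosen, options)


chooser = lambda mode: {"add": add, "multiply": multiply}[mode]

print(compute_dynamic(chooser, mode="multiply", a=4, b=3))  # 12
print(compute_dynamic(chooser, mode="add", x=4, y=3))       # 7
```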

Multiprocessing

flonb implements the dask Task Graph specification.

Call .graph_and_key() on your task.
Feed the resulting graph and key into a dask scheduler as required.

import dask.multiprocessing
import time

@flonb.task_func
def add_one_in_one_second(x):
    time.sleep(1)  # artificially slow down the task
    result = x + 1
    return result

@flonb.task_func
def do_sums(data=flonb.Dep([add_one_in_one_second.partial(x=x) for x in range(10)])):
    return data


graph, key = do_sums.graph_and_key()

# single process
tic = time.time()
dask.get(graph, key)
print(f"Single-process execution took {time.time() - tic:.2f} seconds.")

print("------------")

# multiple processes
tic = time.time()
dask.multiprocessing.get(graph, key, num_workers=4)
print(f"Multiprocessing execution took {time.time() - tic:.2f} seconds.")

Single-process execution took 10.01 seconds.
------------
Multiprocessing execution took 3.30 seconds.
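The speed-up is possible because the ten tasks do not depend on each other. The same idea can be shown with the standard library alone, using a thread pool instead of a dask scheduler. This is a different mechanism from dask's multiprocessing scheduler and is only meant to illustrate independent tasks running concurrently; add_one_slowly is a made-up name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def add_one_slowly(x):
    time.sleep(0.05)  # stand-in for real work
    return x + 1


inputs = list(range(10))

with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order even though tasks finish concurrently
    results = list(pool.map(add_one_slowly, inputs))

print(results)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```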

Alternatives

There are many high-quality frameworks that let you build and run task graphs. flonb is lightweight and easy to experiment with, but make sure to check out the others if you want to explore your options further. Here are some suggestions:

dask
luigi
airflow

Acknowledgements

A big thanks to Solcast (https://solcast.com) for supporting the development of flonb.


Release history

0.1.2 (this version): Feb 22, 2021
0.1.1: Nov 19, 2020
0.1.0: Nov 2, 2020
0.0.1: Sep 30, 2020

Download files

Source Distribution: flonb-0.1.2.tar.gz (11.5 kB), uploaded Feb 22, 2021

Built Distribution: flonb-0.1.2-py3-none-any.whl (13.0 kB), uploaded Feb 22, 2021, Python 3

Hashes for flonb-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`a01b0cd3ad158ba3b0b3fd263d5fda58a7aa544ca21c93bebd5db5262b6a674b`
MD5	`38a846f8cb708d6695b223d395f81b22`
BLAKE2b-256	`e78754a45cafa93abf5d0df071bef49263f67d265618b232a47af953299ae5d7`

Hashes for flonb-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f7b04252fe4fc6d4a1f9a21d2ab7f61728c28baaf7f8dcaa5c87121ace121084`
MD5	`b8b9f5bb006c28ff60f4bf8b82b81bec`
BLAKE2b-256	`dde9446d8ac93512f012c33e74c3e5d50f66b3e42651967e34b7eb6aecee8515`