
Ploomber


Write better data pipelines without having to learn a specialized framework. By adopting a convention over configuration philosophy, Ploomber streamlines pipeline execution, allowing teams to confidently develop data products.

Installation

pip install ploomber

Compatible with Python 3.5 and higher.

Workflow

Assume you have a collection of scripts, where each one is a task in your pipeline.

To execute your pipeline end-to-end:

  1. Inside each script, state dependencies (other scripts) via an upstream variable

  2. Use a product variable to declare output file(s) that the next script will use as inputs

  3. Run ploomber build --entry-point path/to/your/scripts/

Optional: List your tasks in a pipeline.yaml file for more flexibility.
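
For reference, a minimal pipeline.yaml might look like this (a sketch; the script paths and products are hypothetical):

tasks:
  - source: scripts/get_data.py
    product:
      nb: output/get_data.ipynb
      data: output/data.csv
  - source: scripts/plot.py
    product:
      nb: output/plot.ipynb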

What you get

  1. Automated end-to-end execution

  2. Incremental builds (skip up-to-date tasks)

  3. Integration with Jupyter

  4. Seamlessly integrate SQL with Python/R (i.e. extract data with SQL, plot it with Python/R)

  5. Generate a graphical representation of your pipeline using ploomber plot

  6. Sync teamwork

What it looks like

In Python scripts, declare your parameters like this:

# imports...

# + tag=["parameters"]
upstream = ['some_task', 'another_task']
product = {'nb': 'path/to/executed/nb.ipynb', 'data': 'path/to/data.csv'}
# -

# actual analysis code...

R scripts:

# imports...

# + tag=["parameters"]
upstream = list('some_task', 'another_task')
product = list(nb='path/to/executed/nb.ipynb', data='path/to/data.csv')
# -

# actual analysis code...

Notebook (Python or R):

https://ploomber.io/doc/ipynb-parameters-cell.png

SQL scripts:

{% set product = SQLRelation(['schema', 'name', 'table']) %}

DROP TABLE IF EXISTS {{product}};

CREATE TABLE {{product}} AS
SELECT * FROM {{upstream['some_task']}}
JOIN {{upstream['another_task']}}
USING (some_column)

Ploomber uses Jinja to generate SQL on the fly. You can leverage existing Jinja features to improve SQL code reusability; for example, you can define a SQL snippet in one file and import it in another using {{placeholders}}.
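
Here is a sketch of that pattern (the macros.sql file and the count_by macro are hypothetical, and it assumes your SQL sources are loaded through a Jinja environment that can locate macros.sql, e.g. via SourceLoader):

-- macros.sql
{% macro count_by(table, column) %}
SELECT {{column}}, COUNT(*) AS n
FROM {{table}}
GROUP BY {{column}}
{% endmacro %}

-- in a task's SQL script
{% import 'macros.sql' as m %}
DROP TABLE IF EXISTS {{product}};
CREATE TABLE {{product}} AS
{{m.count_by(upstream['some_task'], 'some_column')}}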

How it works

  1. Ploomber extracts dependencies from your code to infer execution order

  2. Replaces the original upstream variable with a dictionary that maps task names to their products (Python/R, see the sketch after this list); in SQL scripts, replaces placeholders with the actual table names

  3. Tasks are executed

  4. Each script (Python/R) generates an executed notebook for you to review results visually
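
To make step 2 concrete, the parameters cell in the Python example above might be replaced with something like this (a sketch; the paths are the hypothetical products of each upstream task, and the exact cell Ploomber injects may differ):

# + tag=["injected-parameters"]
upstream = {'some_task': {'nb': 'path/to/some_task/nb.ipynb',
                          'data': 'path/to/some_task/data.csv'},
            'another_task': {'nb': 'path/to/another_task/nb.ipynb',
                             'data': 'path/to/another_task/data.csv'}}
product = {'nb': 'path/to/executed/nb.ipynb', 'data': 'path/to/data.csv'}
# -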

Example

https://ploomber.io/doc/python/diag.png

Demo

https://asciinema.org/a/346484.svg

Try it out

ploomber new
# follow instructions
cd {project-name}
ploomber build
# see output in the output/ directory

Note: The demo project requires pandas and matplotlib.

Try out the hosted demo (no installation required).


Python API

There is also a Python API for advanced use cases. This API allows you to build flexible abstractions such as dynamic pipelines, where the exact number of tasks is determined by its parameters. More information in the documentation.
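
For illustration, here is a minimal sketch of a dynamic pipeline built with the Python API (the process_chunk task function, the make_dag helper and the file names are hypothetical):

from pathlib import Path

from ploomber import DAG
from ploomber.tasks import PythonCallable
from ploomber.products import File


def process_chunk(product, chunk_id):
    # hypothetical task body: write one output file per chunk
    Path(str(product)).parent.mkdir(parents=True, exist_ok=True)
    Path(str(product)).write_text('processed chunk {}'.format(chunk_id))


def make_dag(n_chunks):
    # the number of tasks depends on the n_chunks parameter
    dag = DAG()
    for i in range(n_chunks):
        PythonCallable(process_chunk,
                       File('output/chunk_{}.txt'.format(i)),
                       dag,
                       name='chunk_{}'.format(i),
                       params={'chunk_id': i})
    return dag


dag = make_dag(n_chunks=3)
dag.build()

Calling make_dag with a different n_chunks yields a different number of tasks, which is what makes the pipeline dynamic.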

CHANGELOG

0.7.2 (2020-08-17)

  • New guides: parametrized pipelines, SQL templating, pipeline testing and debugging

  • NotebookRunner.debug(kind='pm') for post-mortem debugging

  • Fixes bug in Jupyter extension when the pipeline has a task whose source is not a file (e.g. SQLDump)

  • Fixes a bug in the CLI custom arg parser that caused dynamic params not to show up

  • DAGspec now supports SourceLoader

  • Docstring (from dotted path entry point) is shown in the CLI summary

  • Customized sphinx build to execute guides from notebooks

0.7.1 (2020-08-06)

  • Support for R

  • Adding section on R pipeline to the documentation

  • Construct pipeline from a directory (no need to write a pipeline.yaml file)

  • Improved error messages when DAG fails to initialize (jupyter notebook app)

  • Bug fixes

  • CLI accepts factory function parameters as positional arguments, types are inferred using type hints, displayed when calling --help

  • CLI accepts env variables (if any), displayed when calling --help

0.7 (2020-07-30)

  • Simplified CLI (breaking changes)

  • Refactors internal API for notebook conversion, adds tests for common formats

  • Metadata is deleted when saving a script from the Jupyter notebook app to make sure the task runs in the next pipeline build

  • SQLAlchemyClient now supports custom tokens to split source

0.6.3 (2020-07-24)

  • Adding --log option to CLI commands

  • Fixes a bug that caused the dag variable not to be exposed during interactive sessions

  • Fixes ploomber task forced run

  • Adds SQL pipeline tutorial to get started docs

  • Minor CSS changes to docs

0.6.2 (2020-07-22)

  • Support for env.yaml in pipeline.yaml

  • Improved CLI. Adds plot, report and task commands

0.6.1 (2020-07-20)

  • Changes pipeline.yaml default (extract_product: True)

  • Documentation re-design

  • Simplified “ploomber new” generated files

  • Ability to define “product” in SQL scripts

  • Products are resolved to absolute paths to avoid ambiguity

  • Bug fixes

0.6 (2020-07-08)

  • Adds Jupyter notebook extension to inject parameters when opening a task

  • Improved CLI ploomber new, ploomber add and ploomber entry

  • Spec API documentation additions

  • Support for on_finish, on_failure and on_render hooks in spec API

  • Improved validation for DAG specs

  • Several bug fixes

0.5.1 (2020-06-30)

  • Reduces the number of required dependencies

  • A new option in DBAPIClient to split source with a custom separator

0.5 (2020-06-27)

  • Adds CLI

  • New spec API to instantiate DAGs using YAML files

  • NotebookRunner.debug() for debugging and .develop() for interactive development

  • Bug fixes

0.4.1 (2020-05-19)

  • PythonCallable.debug() now works in Jupyter notebooks

0.4.0 (2020-05-18)

  • PythonCallable.debug() now uses IPython debugger by default

  • Improvements to Task.build() public API

  • Moves hook triggering logic to Task to simplify executors implementation

  • Adds DAGBuildEarlyStop exception to signal DAG execution stop

  • New option in Serial executor to turn warnings and exceptions capture off

  • Adds Product.prepare_metadata hook

  • Implements hot reload for notebooks and python callables

  • General clean ups for old __str__ and __repr__ in several modules

  • Refactored ploomber.sources module and ploomber.placeholders (previously ploomber.templates)

  • Adds NotebookRunner.debug() and NotebookRunner.develop()

  • NotebookRunner: now has an option to run static analysis on render

  • Adds documentation for DAG-level hooks

  • Bug fixes

0.3.5 (2020-05-03)

  • Bug fixes #88, #89, #90, #84, #91

  • Modifies Env API: Env() is now Env.load(), Env.start() is now Env()

  • New advanced Env guide added to docs

  • Env can now be used with a context manager

  • Improved DAGConfigurator API

  • Deletes logger configuration in executors constructors, logging is available via DAGConfigurator

0.3.4 (2020-04-25)

  • Dependencies cleanup

  • Removed numpydoc as a required dependency, now optional

  • A few bug fixes: #79, #71

  • All warnings are captured and shown at the end (Serial executor)

  • Moves differ parameter from DAG constructor to DAGConfigurator

0.3.3 (2020-04-23)

  • Cleaned up some modules, deprecated some rarely used functionality

  • Improves documentation aimed to developers looking to extend ploomber

  • Introduces DAGConfigurator for advanced DAG configuration [Experimental API]

  • Adds task to upload files to S3 (ploomber.tasks.UploadToS3), requires boto3

  • Adds DAG-level on_finish and on_failure hooks

  • Support for enabling logging in entry points (via --logging)

  • Support for starting an interactive session using entry points (via python -i -m)

  • Improved support for database drivers that can only send one query at a time

  • Improved repr for SQLAlchemyClient, shows URI (but hides password)

  • PythonCallable now validates signature against params at render time

  • Bug fixes

0.3.2 (2020-04-07)

  • Faster Product status checking, now performed at rendering time

  • New products: GenericProduct and GenericSQLRelation for Products that do not have a specific implementation (e.g. you can use Hive with the DBAPI client + GenericSQLRelation)

  • Improved DAG build reports, subselect columns, transform to pandas.DataFrame and dict

  • Parallel executor now returns build reports, just like the Serial executor

0.3.1 (2020-04-01)

  • DAG parallel executor

  • Interact with pipelines from the command line (entry module)

  • Bug fixes

  • Refactored access to Product.metadata

0.3 (2020-03-20)

  • New Quickstart and User Guide section in documentation

  • DAG rendering and build now continue until no more tasks can render/build (instead of failing at the first exception)

  • New @with_env and @load_env decorators for managing environments

  • Env expansion ({{user}} expands to the current user, also {{git}} and {{version}} available)

  • Task.name is now optional when Task is initialized with a source that has __name__ attribute (Python functions) or a name attribute (like Placeholders returned from SourceLoader)

  • New Task.on_render hook

  • Bug fixes

  • A lot of new tests

  • Now compatible with Python 3.5 and higher

0.2.1 (2020-02-20)

  • Adds integration with pdb via PythonCallable.debug

  • Env.start now accepts a filename to look for

  • Improvements to data_frame_validator

0.2 (2020-02-13)

  • Simplifies installation

  • Deletes BashCommand, use ShellScript

  • More examples added

  • Refactored env module

  • Renames SQLStore to SourceLoader

  • Improvements to SQLStore

  • Improved documentation

  • Renamed PostgresCopy to PostgresCopyFrom

  • SQLUpload and PostgresCopy now have the same API

  • A few fixes to PostgresCopy (#1, #2)

0.1

  • First release
