Spend your time discovering insights from data, not writing plumbing code. Declare your pipeline in a short YAML file and Ploomber will take care of the rest.

Ploomber

Point Ploomber to your Python and SQL scripts in a pipeline.yaml file and it will figure out the execution order by extracting dependencies from them.

It also keeps track of source code changes to speed up builds by skipping up-to-date tasks. This makes it easy to develop projects interactively, sync work with your team, and quickly recover from crashes (just fix the bug and build again).

Try out the live demo (no installation required).

Click here for documentation.

Our blog.

Works with Python 3.5 and higher.

pipeline.yaml example

# pipeline.yaml

# clean data from the raw table
- source: clean.sql
  product: clean_data
  # function that returns a db client
  client: db.get_client

# aggregate clean data
- source: aggregate.sql
  product: agg_data
  client: db.get_client

# dump data to a csv file
- class: SQLDump
  source: dump_agg_data.sql
  product: output/data.csv
  client: db.get_client

# visualize data from csv file
- source: plot.py
  product:
    # where to save the executed notebook
    nb: output/executed-notebook-plot.ipynb
    # tasks can generate other outputs
    data: output/some_data.csv
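The client: db.get_client entries above assume a db.py module next to pipeline.yaml that exposes a function returning a database client. A minimal sketch (connection details are placeholders; adjust for your database):

```python
# db.py - sketch of a client factory referenced from pipeline.yaml
# (connection details are placeholders)

def make_uri(user, password, host, db):
    """Build a SQLAlchemy-style connection URI."""
    return f'postgresql://{user}:{password}@{host}/{db}'

def get_client():
    # lazy import so this sketch can be read without a database at hand
    from ploomber.clients import SQLAlchemyClient
    return SQLAlchemyClient(make_uri('user', 'password', 'localhost', 'mydb'))
```

The same function is reused by every SQL task, so all of them share one connection configuration.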

Python script example

# annotated Python file (it will be converted to a notebook during execution)
import pandas as pd

# + tags=["parameters"]
# this script depends on the output generated by a task named "clean"
upstream = {'clean': None}
product = None

# during execution, a new cell is added here

# +
df = pd.read_csv(upstream['clean'])
# do data processing...
df.to_csv(product['data'])
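The "a new cell is added here" comment marks where Ploomber injects concrete values at build time. Conceptually, the injected cell looks like this (paths are illustrative; the tag name follows papermill's convention):

```python
# + tags=["injected-parameters"]
# sketch of the cell Ploomber adds at build time (paths are illustrative)
upstream = {'clean': 'output/clean_data.csv'}
product = {'nb': 'output/executed-notebook.ipynb',
           'data': 'output/some_data.csv'}
```

Because the injected cell comes after the one tagged "parameters", the placeholder values declared in the script are overridden during execution.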

SQL script example

DROP TABLE IF EXISTS {{product}};

CREATE TABLE {{product}} AS
-- this task depends on the output generated by a task named "clean"
SELECT * FROM {{upstream['clean']}}
WHERE x > 10
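The {{product}} and {{upstream['clean']}} placeholders are jinja2-style templates that Ploomber resolves before sending the query to the database. A conceptual sketch of the rendering step using plain string formatting (table names are illustrative):

```python
# conceptual sketch of placeholder rendering (Ploomber uses jinja2
# templates internally; table names here are illustrative)
template = (
    "DROP TABLE IF EXISTS {product};\n"
    "CREATE TABLE {product} AS\n"
    "SELECT * FROM {upstream_clean}\n"
    "WHERE x > 10"
)
rendered = template.format(product='agg_data', upstream_clean='clean_data')
print(rendered)
```

This is also how dependencies are inferred: referencing {{upstream['clean']}} tells Ploomber this task must run after the task named "clean".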

To run your pipeline:

ploomber entry pipeline.yaml

If you build again, tasks whose source code has not changed (and whose upstream dependencies are also unchanged) are skipped.

Start an interactive session (note the double dash):

ipython -i -m ploomber.entry pipeline.yaml -- --action status

During an interactive session:

# visualize dependencies
dag.plot()

# develop your Python script interactively
dag['task'].develop()

# line by line debugging
dag['task'].debug()

Install

pip install ploomber

To install Ploomber along with all optional dependencies:

pip install "ploomber[all]"

graphviz is required for plotting pipelines:

# if you use conda (recommended)
conda install graphviz
# if you use homebrew
brew install graphviz
# for more options, see: https://www.graphviz.org/download/

Create a project with basic structure

ploomber new

Python API

There is also a Python API for advanced use cases. It allows you to build flexible abstractions such as dynamic pipelines, where the exact number of tasks is determined by the pipeline's parameters.
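For illustration, the dynamic-pipeline idea can be sketched with plain dictionaries: a parameter controls how many tasks the pipeline ends up with (task and file names are hypothetical; the real Python API builds DAG objects rather than dicts):

```python
# sketch: the number of tasks is determined by a parameter
# (task/file names are hypothetical)
def make_tasks(n_partitions):
    """Generate one processing task per data partition."""
    return [{'source': 'process.py',
             'name': f'process-{i}',
             'product': f'output/partition-{i}.csv',
             'params': {'partition': i}}
            for i in range(n_partitions)]

tasks = make_tasks(3)  # three tasks, one per partition
```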

CHANGELOG

0.5.1dev

  • Experimental PythonCallable.develop()

0.5 (2020-06-27)

  • Adds CLI

  • New spec API to instantiate DAGs using YAML files

  • NotebookRunner.debug() for debugging and .develop() for interactive development

  • Bug fixes

0.4.1 (2020-05-19)

  • PythonCallable.debug() now works in Jupyter notebooks

0.4.0 (2020-05-18)

  • PythonCallable.debug() now uses IPython debugger by default

  • Improvements to Task.build() public API

  • Moves hook triggering logic to Task to simplify executors implementation

  • Adds DAGBuildEarlyStop exception to signal DAG execution stop

  • New option in Serial executor to turn warnings and exceptions capture off

  • Adds Product.prepare_metadata hook

  • Implements hot reload for notebooks and python callables

  • General clean ups for old __str__ and __repr__ in several modules

  • Refactored ploomber.sources module and ploomber.placeholders (previously ploomber.templates)

  • Adds NotebookRunner.debug() and NotebookRunner.develop()

  • NotebookRunner: now has an option to run static analysis on render

  • Adds documentation for DAG-level hooks

  • Bug fixes

0.3.5 (2020-05-03)

  • Bug fixes #88, #89, #90, #84, #91

  • Modifies Env API: Env() is now Env.load(), Env.start() is now Env()

  • New advanced Env guide added to docs

  • Env can now be used with a context manager

  • Improved DAGConfigurator API

  • Deletes logger configuration in executors constructors, logging is available via DAGConfigurator

0.3.4 (2020-04-25)

  • Dependencies cleanup

  • Removed numpydoc as a dependency, now optional

  • A few bug fixes: #79, #71

  • All warnings are captured and shown at the end (Serial executor)

  • Moves differ parameter from DAG constructor to DAGConfigurator

0.3.3 (2020-04-23)

  • Cleaned up some modules, deprecated some rarely used functionality

  • Improves documentation aimed to developers looking to extend ploomber

  • Introduces DAGConfigurator for advanced DAG configuration [Experimental API]

  • Adds task to upload files to S3 (ploomber.tasks.UploadToS3), requires boto3

  • Adds DAG-level on_finish and on_failure hooks

  • Support for enabling logging in entry points (via --logging)

  • Support for starting an interactive session using entry points (via python -i -m)

  • Improved support for database drivers that can only send one query at a time

  • Improved repr for SQLAlchemyClient, shows URI (but hides password)

  • PythonCallable now validates signature against params at render time

  • Bug fixes

0.3.2 (2020-04-07)

  • Faster Product status checking, now performed at rendering time

  • New products: GenericProduct and GenericSQLRelation for Products that do not have a specific implementation (e.g. you can use Hive with the DBAPI client + GenericSQLRelation)

  • Improved DAG build reports, subselect columns, transform to pandas.DataFrame and dict

  • Parallel executor now returns build reports, just like the Serial executor

0.3.1 (2020-04-01)

  • DAG parallel executor

  • Interact with pipelines from the command line (entry module)

  • Bug fixes

  • Refactored access to Product.metadata

0.3 (2020-03-20)

  • New Quickstart and User Guide section in documentation

  • DAG rendering and build now continue until no more tasks can render/build (instead of failing at the first exception)

  • New @with_env and @load_env decorators for managing environments

  • Env expansion ({{user}} expands to the current user; {{git}} and {{version}} are also available)

  • Task.name is now optional when Task is initialized with a source that has __name__ attribute (Python functions) or a name attribute (like Placeholders returned from SourceLoader)

  • New Task.on_render hook

  • Bug fixes

  • A lot of new tests

  • Now compatible with Python 3.5 and higher

0.2.1 (2020-02-20)

  • Adds integration with pdb via PythonCallable.debug

  • Env.start now accepts a filename to look for

  • Improvements to data_frame_validator

0.2 (2020-02-13)

  • Simplifies installation

  • Deletes BashCommand, use ShellScript

  • More examples added

  • Refactored env module

  • Renames SQLStore to SourceLoader

  • Improvements to SQLStore

  • Improved documentation

  • Renamed PostgresCopy to PostgresCopyFrom

  • SQLUpload and PostgresCopy now have the same API

  • A few fixes to PostgresCopy (#1, #2)

0.1

  • First release
