
Write better data pipelines without having to learn a specialized framework. By adopting a convention-over-configuration philosophy, Ploomber streamlines pipeline execution, allowing teams to confidently develop data products.


Ploomber


Coding an entire analysis pipeline in a single notebook file allows you to develop your code interactively, but it creates an unmaintainable monolith that easily breaks. Ploomber allows you to modularize your analysis into smaller tasks without losing the power of an interactive notebook.

Imagine you have a pipeline that gets (get.ipynb), cleans (clean.ipynb) and plots (plot.ipynb) data. All you have to do to turn this into a data pipeline is to declare a special cell at the top of your notebook with dependencies and output files:

# top cell in clean.ipynb

# get.ipynb must run before clean.ipynb
upstream = ['get']
# output files generated by clean.ipynb
product = {'data': 'output/clean.csv'}

That’s it! Execute ploomber build and your pipeline tasks will execute in the right order.
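To illustrate why declaring upstream in each notebook is enough to determine the execution order, here is a minimal sketch. It is not Ploomber's implementation: it orders the three example tasks with the standard library's graphlib module (Python 3.9+). Only the upstream entry for clean comes from the example above; the entries for get and plot, and the helper name execution_order, are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical upstream declarations, one per notebook. Only the entry for
# 'clean' comes from the example above; the other two are assumed.
upstream = {
    'get': [],
    'clean': ['get'],
    'plot': ['clean'],
}


def execution_order(upstream):
    """Return task names so that every task comes after its dependencies."""
    # TopologicalSorter takes a mapping of node -> predecessors
    return list(TopologicalSorter(upstream).static_order())


order = execution_order(upstream)
print(order)
```

Because each task only names what it depends on, adding a new notebook means writing one more upstream list rather than editing a central schedule.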

Main features

1. Jupyter integration. When you open your notebooks, Ploomber will automatically inject a new cell with the location of your input files, as inferred from your upstream variable. If you open a Python or R script, it will be converted to a notebook on the fly.

2. Incremental builds. Speed up execution by skipping tasks whose source code hasn’t changed.

3. Pipeline testing. Run tests upon task execution to verify that the output data has the right properties (e.g. values within expected range).

4. Pipeline inspection. Start an interactive session with ploomber interact to debug your pipeline. Call dag['task_name'].debug() to start a debugging session.
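The idea behind incremental builds (feature 2) can be sketched as follows, assuming a task is identified by a hash of its source code and that the hash from the last successful run is kept in a small JSON file. The names run_if_changed and metadata.json are hypothetical, and this is not how Ploomber stores its metadata; it only shows the skip-if-unchanged logic.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def source_hash(source):
    """Fingerprint a task's source code."""
    return hashlib.sha256(source.encode('utf-8')).hexdigest()


def run_if_changed(name, source, run, metadata_path):
    """Run the task only if its source differs from the last recorded run.

    Returns True if the task ran, False if it was skipped.
    """
    metadata_path = Path(metadata_path)
    stored = (json.loads(metadata_path.read_text())
              if metadata_path.exists() else {})
    current = source_hash(source)

    if stored.get(name) == current:
        return False  # source unchanged since the last run: skip

    run()
    # record the hash only after the task finished without raising
    stored[name] = current
    metadata_path.write_text(json.dumps(stored))
    return True


with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / 'metadata.json'  # hypothetical metadata location
    calls = []

    def task():
        calls.append('clean')

    first = run_if_changed('clean', 'df = df.dropna()', task, meta)
    second = run_if_changed('clean', 'df = df.dropna()', task, meta)
    third = run_if_changed('clean', 'df = df.fillna(0)', task, meta)

print(first, second, third, len(calls))
```

The second call is skipped because the source is identical to the first one; editing the source changes the hash and triggers a new run.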

Try it out

# clone the sample projects
git clone https://github.com/ploomber/projects

# move to the machine learning pipeline example
cd projects/ml-basic

# install dependencies
# 1) if you have conda installed
conda env create -f environment.yml
conda activate ml-basic
# 2) if you don't have conda
pip install ploomber pandas scikit-learn pyarrow sklearn-evaluation

# create output folder
mkdir output

# run the pipeline
ploomber build

When execution finishes, you’ll see the output in the output/ folder.

Installation

pip install ploomber

Compatible with Python 3.6 and higher.


CHANGELOG

0.8.2 (2020-10-31)

  • Removes matplotlib from dependencies, now using IPython.display for inline plotting

  • Fixes bug that caused custom args to {PythonCallable, NotebookRunner}.develop(args="--arg=value") not to be sent correctly to the subprocess

  • NotebookRunner (initialized from ipynb) only considers the actual code as its source, ignores the rest of the JSON contents

  • Fixes bug when EnvDict was initialized from another EnvDict

  • PythonCallableSource can be initialized with dotted paths

  • DAGSpec loads env.yaml when initialized with a YAML spec and there is an env.yaml file in the spec's parent folder

  • DAGSpec converts relative paths in sources so that they are relative to the project’s root folder

  • Adds lazy_import to DAGSpec, to avoid importing PythonCallable sources (passes the dotted paths as strings instead)
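As background for the dotted-path entries above, a dotted path such as package.module.function is a string that names an importable object. The sketch below shows one common way to resolve such a string with importlib. The helper load_dotted_path is hypothetical and is not Ploomber's API; keeping the string unresolved until it is needed is the general idea behind a lazy import.

```python
import importlib


def load_dotted_path(dotted_path):
    """Import the module part of a dotted path and return the named attribute."""
    module_name, _, attribute = dotted_path.rpartition('.')

    if not module_name:
        raise ValueError(
            f'Expected a path like "module.name", got: {dotted_path!r}')

    # the import happens here, not when the string was written down
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


dumps = load_dotted_path('json.dumps')
print(dumps({'product': 'output/clean.csv'}))
```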

0.8.1 (2020-10-18)

  • ploomber interact allows switching DAG parameters, just like ploomber build

  • Adds PythonCallable.develop() to develop Python functions interactively

  • NotebookRunner.develop() now also works with JupyterLab

0.8 (2020-10-15)

  • Dropping support for Python 3.5

  • Removes DAGSpec.from_file, loading from a file is now handled directly by the DAGSpec constructor

  • Performance improvements, DAG does not fetch metadata when it doesn’t need to

  • Factory functions: Bool parameters with default values are now represented as flags when called from the CLI

  • CLI arguments to replace values from env.yaml are now built with double hyphens instead of double underscores

  • NotebookRunner creates parent folders for output file if they don’t exist

  • Bug fixes

0.7.5 (2020-10-02)

  • NotebookRunner.develop accepts passing arguments to jupyter notebook

  • Spec API now supports PythonCallable (by passing a dotted path)

  • Upstream dependencies of PythonCallables can be inferred via the extract_upstream option in the Spec API

  • Faster DAG.render(force=True) (avoid checking metadata when possible)

  • Faster notebook rendering when using the extension thanks to the improvement above

  • data_frame_validator improvement: validate_schema can now validate optional columns dtypes

  • Bug fixes

0.7.4 (2020-09-14)

  • Improved __repr__ methods in PythonCallableSource and NotebookSource

  • Improved output layout for tables

  • Support for nbconvert>=6

  • “Docstrings” are parsed from notebooks and displayed in DAG status table (#242)

  • Jupyter extension now works for DAGs defined via directories (via ENTRY_POINT env variable)

  • Adds Jupyter integration guide to documentation

  • Several bug fixes

0.7.3 (2020-08-19)

  • Improved support for R notebooks (.Rmd)

  • New section for testing.sql module in the documentation

0.7.2 (2020-08-17)

  • New guides: parametrized pipelines, SQL templating, pipeline testing and debugging

  • NotebookRunner.debug(kind='pm') for post-mortem debugging

  • Fixes bug in Jupyter extension when the pipeline has a task whose source is not a file (e.g. SQLDump)

  • Fixes a bug in the CLI custom arg parser that caused dynamic params not to show up

  • DAGSpec now supports SourceLoader

  • Docstring (from dotted path entry point) is shown in the CLI summary

  • Customized sphinx build to execute guides from notebooks

0.7.1 (2020-08-06)

  • Support for R

  • Adding section on R pipeline to the documentation

  • Construct pipeline from a directory (no need to write a pipeline.yaml file)

  • Improved error messages when DAG fails to initialize (jupyter notebook app)

  • Bug fixes

  • CLI accepts factory function parameters as positional arguments, types are inferred using type hints, displayed when calling --help

  • CLI accepts env variables (if any), displayed when calling --help

0.7 (2020-07-30)

  • Simplified CLI (breaking changes)

  • Refactors internal API for notebook conversion, adds tests for common formats

  • Metadata is deleted when saving a script from the Jupyter notebook app to make sure the task runs in the next pipeline build

  • SQLAlchemyClient now supports custom tokens to split source

0.6.3 (2020-07-24)

  • Adding --log option to CLI commands

  • Fixes a bug that caused the dag variable not to be exposed during interactive sessions

  • Fixes ploomber task forced run

  • Adds SQL pipeline tutorial to get started docs

  • Minor CSS changes to docs

0.6.2 (2020-07-22)

  • Support for env.yaml in pipeline.yaml

  • Improved CLI. Adds plot, report and task commands

0.6.1 (2020-07-20)

  • Changes pipeline.yaml default (extract_product: True)

  • Documentation re-design

  • Simplified “ploomber new” generated files

  • Ability to define “product” in SQL scripts

  • Products are resolved to absolute paths to avoid ambiguity

  • Bug fixes

0.6 (2020-07-08)

  • Adds Jupyter notebook extension to inject parameters when opening a task

  • Improved CLI ploomber new, ploomber add and ploomber entry

  • Spec API documentation additions

  • Support for on_finish, on_failure and on_render hooks in spec API

  • Improved validation for DAG specs

  • Several bug fixes

0.5.1 (2020-06-30)

  • Reduces the number of required dependencies

  • A new option in DBAPIClient to split source with a custom separator

0.5 (2020-06-27)

  • Adds CLI

  • New spec API to instantiate DAGs using YAML files

  • NotebookRunner.debug() for debugging and .develop() for interactive development

  • Bug fixes

0.4.1 (2020-05-19)

  • PythonCallable.debug() now works in Jupyter notebooks

0.4.0 (2020-05-18)

  • PythonCallable.debug() now uses IPython debugger by default

  • Improvements to Task.build() public API

  • Moves hook triggering logic to Task to simplify executors implementation

  • Adds DAGBuildEarlyStop exception to signal DAG execution stop

  • New option in Serial executor to turn warnings and exceptions capture off

  • Adds Product.prepare_metadata hook

  • Implements hot reload for notebooks and python callables

  • General clean ups for old __str__ and __repr__ in several modules

  • Refactored ploomber.sources module and ploomber.placeholders (previously ploomber.templates)

  • Adds NotebookRunner.debug() and NotebookRunner.develop()

  • NotebookRunner: now has an option to run static analysis on render

  • Adds documentation for DAG-level hooks

  • Bug fixes

0.3.5 (2020-05-03)

  • Bug fixes #88, #89, #90, #84, #91

  • Modifies Env API: Env() is now Env.load(), Env.start() is now Env()

  • New advanced Env guide added to docs

  • Env can now be used with a context manager

  • Improved DAGConfigurator API

  • Deletes logger configuration in executors constructors, logging is available via DAGConfigurator

0.3.4 (2020-04-25)

  • Dependencies cleanup

  • Removed (numpydoc) as dependency, now optional

  • A few bug fixes: #79, #71

  • All warnings are captured and shown at the end (Serial executor)

  • Moves differ parameter from DAG constructor to DAGConfigurator

0.3.3 (2020-04-23)

  • Cleaned up some modules, deprecated some rarely used functionality

  • Improves documentation aimed at developers looking to extend ploomber

  • Introduces DAGConfigurator for advanced DAG configuration [Experimental API]

  • Adds task to upload files to S3 (ploomber.tasks.UploadToS3), requires boto3

  • Adds DAG-level on_finish and on_failure hooks

  • Support for enabling logging in entry points (via --logging)

  • Support for starting an interactive session using entry points (via python -i -m)

  • Improved support for database drivers that can only send one query at a time

  • Improved repr for SQLAlchemyClient, shows URI (but hides password)

  • PythonCallable now validates signature against params at render time

  • Bug fixes

0.3.2 (2020-04-07)

  • Faster Product status checking, now performed at rendering time

  • New products: GenericProduct and GenericSQLRelation for Products that do not have a specific implementation (e.g. you can use Hive with the DBAPI client + GenericSQLRelation)

  • Improved DAG build reports, subselect columns, transform to pandas.DataFrame and dict

  • Parallel executor now returns build reports, just like the Serial executor

0.3.1 (2020-04-01)

  • DAG parallel executor

  • Interact with pipelines from the command line (entry module)

  • Bug fixes

  • Refactored access to Product.metadata

0.3 (2020-03-20)

  • New Quickstart and User Guide section in documentation

  • DAG rendering and build now continue until no more tasks can render/build (instead of failing at the first exception)

  • New @with_env and @load_env decorators for managing environments

  • Env expansion ({{user}} expands to the current user; {{git}} and {{version}} are also available)

  • Task.name is now optional when Task is initialized with a source that has __name__ attribute (Python functions) or a name attribute (like Placeholders returned from SourceLoader)

  • New Task.on_render hook

  • Bug fixes

  • A lot of new tests

  • Now compatible with Python 3.5 and higher

0.2.1 (2020-02-20)

  • Adds integration with pdb via PythonCallable.debug

  • Env.start now accepts a filename to look for

  • Improvements to data_frame_validator

0.2 (2020-02-13)

  • Simplifies installation

  • Deletes BashCommand, use ShellScript

  • More examples added

  • Refactored env module

  • Renames SQLStore to SourceLoader

  • Improvements to SQLStore

  • Improved documentation

  • Renamed PostgresCopy to PostgresCopyFrom

  • SQLUpload and PostgresCopy now have the same API

  • A few fixes to PostgresCopy (#1, #2)

0.1

  • First release


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ploomber-0.8.2.tar.gz (146.3 kB)

Uploaded Source

Built Distribution

ploomber-0.8.2-py3-none-any.whl (203.7 kB)

Uploaded Python 3
