Ploomber
Write better data pipelines without having to learn a specialized framework. By adopting a convention over configuration philosophy, Ploomber streamlines pipeline execution, allowing teams to confidently develop data products.
Installation
pip install ploomber
Compatible with Python 3.5 and higher.
Workflow
Assume you have a collection of scripts, where each one is a task in your pipeline.
To execute your pipeline end-to-end:
Inside each script, state dependencies (other scripts) via an upstream variable
Use a product variable to declare output file(s) that the next script will use as inputs
Run ploomber build --entry-point path/to/your/scripts/
Optional: List your tasks in a pipeline.yaml file for more flexibility.
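As a sketch (script names are hypothetical), a minimal pipeline.yaml can simply list the task scripts; dependencies and products are still read from the upstream and product variables declared inside each one:

```yaml
# Hypothetical minimal pipeline.yaml: one entry per task script.
# Upstream dependencies and products are extracted from the scripts
# themselves, so only the sources need to be listed here.
tasks:
  - source: scripts/get_data.py
  - source: scripts/clean_data.py
  - source: scripts/plot.py
```

Listing tasks explicitly like this also lets you add per-task options later without changing the scripts.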
What you get
Pipeline end-to-end execution
Incremental builds (skip up-to-date tasks)
Integration with Jupyter
Seamlessly integrate SQL with Python/R (i.e. extract data with SQL, plot it with Python/R)
Parametrized pipelines with automatic command line interface generation
What it looks like
In Python scripts, declare your parameters like this:
# imports...
# + tag=["parameters"]
upstream = ['some_task', 'another_task']
product = {'nb': 'path/to/executed/nb.ipynb', 'data': 'path/to/data.csv'}
# -
# actual analysis code...
R scripts:
# imports...
# + tag=["parameters"]
upstream = list('some_task', 'another_task')
product = list(nb='path/to/executed/nb.ipynb', data='path/to/data.csv')
# -
# actual analysis code...
Notebooks (Python or R) use the same tagged parameters cell shown above.
SQL scripts:
{% set product = SQLRelation(['schema', 'name', 'table']) %}
DROP TABLE IF EXISTS {{product}};
CREATE TABLE {{product}} AS
SELECT * FROM {{upstream['some_task']}}
JOIN {{upstream['another_task']}}
USING (some_column)
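To illustrate (table names are hypothetical): if some_task and another_task produced tables schema.some_task and schema.another_task, the rendered statement would look roughly like this:

```sql
-- Placeholders replaced at render time (hypothetical names):
-- {{product}} becomes the declared relation, {{upstream[...]}} becomes
-- the table produced by each upstream task.
DROP TABLE IF EXISTS schema.name;
CREATE TABLE schema.name AS
SELECT * FROM schema.some_task
JOIN schema.another_task
USING (some_column)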
How it works
Ploomber extracts dependencies from your code to infer execution order
Replaces the original upstream variable with one that maps each task to its products (Python/R, see example below); in SQL scripts, placeholders are replaced with the actual table names
Tasks are executed
Each script (Python/R) generates an executed notebook for you to review results visually
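The upstream replacement in step two can be sketched like this (all paths here are hypothetical, chosen only to illustrate the shape of the data):

```python
# In the source script, you declare dependencies as a plain list of task names:
upstream = ['some_task', 'another_task']

# Before execution, Ploomber replaces that variable with a dict mapping each
# upstream task name to that task's declared products:
upstream = {
    'some_task': {'nb': 'output/some_task.ipynb', 'data': 'output/some_task.csv'},
    'another_task': {'nb': 'output/another_task.ipynb', 'data': 'output/another_task.csv'},
}

# So downstream code can read its inputs by key:
input_path = upstream['some_task']['data']
```

This is why the script only needs to name its dependencies: the concrete file paths come from each upstream task's own product declaration.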
Example
Try it out
ploomber new
# follow instructions
cd {project-name}
ploomber build
# see output in the output/ directory
Note: The demo project requires pandas and matplotlib.
External resources
Python API
There is also a Python API for advanced use cases. This API allows you to build flexible abstractions such as dynamic pipelines, where the exact number of tasks is determined by the pipeline's parameters. More information in the documentation.
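A conceptual sketch of a dynamic pipeline (with the real Python API you would instantiate Task objects inside the loop; plain dicts are used here so the example runs standalone, and all names are hypothetical):

```python
def make_pipeline(n_partitions):
    """Build one task spec per data partition: the pipeline's shape
    is determined by a runtime parameter, not a static task list."""
    return [
        {'name': f'process_partition_{i}',
         'product': f'output/partition_{i}.csv'}
        for i in range(n_partitions)
    ]

# Changing the parameter changes how many tasks the pipeline contains:
pipeline = make_pipeline(3)
```

This is the kind of construct the YAML spec cannot express directly, which is where the Python API comes in.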
CHANGELOG
0.7.5 (2020-10-02)
NotebookRunner.develop accepts passing arguments to jupyter notebook
Spec API now supports PythonCallable (by passing a dotted path)
Upstream dependencies of PythonCallables can be inferred via the extract_upstream option in the Spec API
Faster DAG.render(force=True) (avoid checking metadata when possible)
Faster notebook rendering when using the extension thanks to the improvement above
data_frame_validator improvement: validate_schema can now validate optional columns dtypes
Bug fixes
0.7.4 (2020-09-14)
Improved __repr__ methods in PythonCallableSource and NotebookSource
Improved output layout for tables
Support for nbconvert>=6
“Docstrings” are parsed from notebooks and displayed in DAG status table (#242)
Jupyter extension now works for DAGs defined via directories (via ENTRY_POINT env variable)
Adds Jupyter integration guide to documentation
Several bug fixes
0.7.3 (2020-08-19)
Improved support for R notebooks (.Rmd)
New section for testing.sql module in the documentation
0.7.2 (2020-08-17)
New guides: parametrized pipelines, SQL templating, pipeline testing and debugging
NotebookRunner.debug(kind='pm') for post-mortem debugging
Fixes bug in Jupyter extension when the pipeline has a task whose source is not a file (e.g. SQLDump)
Fixes a bug in the CLI custom arg parser that caused dynamic params not to show up
DAGspec now supports SourceLoader
Docstring (from dotted path entry point) is shown in the CLI summary
Customized sphinx build to execute guides from notebooks
0.7.1 (2020-08-06)
Support for R
Adding section on R pipeline to the documentation
Construct pipeline from a directory (no need to write a pipeline.yaml file)
Improved error messages when DAG fails to initialize (jupyter notebook app)
Bug fixes
CLI accepts factory function parameters as positional arguments, types are inferred using type hints, displayed when calling --help
CLI accepts env variables (if any), displayed when calling --help
0.7 (2020-07-30)
Simplified CLI (breaking changes)
Refactors internal API for notebook conversion, adds tests for common formats
Metadata is deleted when saving a script from the Jupyter notebook app to make sure the task runs in the next pipeline build
SQLAlchemyClient now supports custom tokens to split source
0.6.3 (2020-07-24)
Adding --log option to CLI commands
Fixes a bug that caused the dag variable not to be exposed during interactive sessions
Fixes ploomber task forced run
Adds SQL pipeline tutorial to get started docs
Minor CSS changes to docs
0.6.2 (2020-07-22)
Support for env.yaml in pipeline.yaml
Improved CLI. Adds plot, report and task commands
0.6.1 (2020-07-20)
Changes pipeline.yaml default (extract_product: True)
Documentation re-design
Simplified “ploomber new” generated files
Ability to define “product” in SQL scripts
Products are resolved to absolute paths to avoid ambiguity
Bug fixes
0.6 (2020-07-08)
Adds Jupyter notebook extension to inject parameters when opening a task
Improved CLI ploomber new, ploomber add and ploomber entry
Spec API documentation additions
Support for on_finish, on_failure and on_render hooks in spec API
Improved validation for DAG specs
Several bug fixes
0.5.1 (2020-06-30)
Reduces the number of required dependencies
A new option in DBAPIClient to split source with a custom separator
0.5 (2020-06-27)
Adds CLI
New spec API to instantiate DAGs using YAML files
NotebookRunner.debug() for debugging and .develop() for interactive development
Bug fixes
0.4.1 (2020-05-19)
PythonCallable.debug() now works in Jupyter notebooks
0.4.0 (2020-05-18)
PythonCallable.debug() now uses IPython debugger by default
Improvements to Task.build() public API
Moves hook triggering logic to Task to simplify executors implementation
Adds DAGBuildEarlyStop exception to signal DAG execution stop
New option in Serial executor to turn warnings and exceptions capture off
Adds Product.prepare_metadata hook
Implements hot reload for notebooks and python callables
General clean ups for old __str__ and __repr__ in several modules
Refactored ploomber.sources module and ploomber.placeholders (previously ploomber.templates)
Adds NotebookRunner.debug() and NotebookRunner.develop()
NotebookRunner: now has an option to run static analysis on render
Adds documentation for DAG-level hooks
Bug fixes
0.3.5 (2020-05-03)
Bug fixes #88, #89, #90, #84, #91
Modifies Env API: Env() is now Env.load(), Env.start() is now Env()
New advanced Env guide added to docs
Env can now be used with a context manager
Improved DAGConfigurator API
Deletes logger configuration in executors constructors, logging is available via DAGConfigurator
0.3.4 (2020-04-25)
Dependencies cleanup
Removed (numpydoc) as dependency, now optional
A few bug fixes: #79, #71
All warnings are captured and shown at the end (Serial executor)
Moves differ parameter from DAG constructor to DAGConfigurator
0.3.3 (2020-04-23)
Cleaned up some modules, deprecated some rarely used functionality
Improves documentation aimed to developers looking to extend ploomber
Introduces DAGConfigurator for advanced DAG configuration [Experimental API]
Adds task to upload files to S3 (ploomber.tasks.UploadToS3), requires boto3
Adds DAG-level on_finish and on_failure hooks
Support for enabling logging in entry points (via --logging)
Support for starting an interactive session using entry points (via python -i -m)
Improved support for database drivers that can only send one query at a time
Improved repr for SQLAlchemyClient, shows URI (but hides password)
PythonCallable now validates signature against params at render time
Bug fixes
0.3.2 (2020-04-07)
Faster Product status checking, now performed at rendering time
New products: GenericProduct and GenericSQLRelation for Products that do not have a specific implementation (e.g. you can use Hive with the DBAPI client + GenericSQLRelation)
Improved DAG build reports, subselect columns, transform to pandas.DataFrame and dict
Parallel executor now returns build reports, just like the Serial executor
0.3.1 (2020-04-01)
DAG parallel executor
Interact with pipelines from the command line (entry module)
Bug fixes
Refactored access to Product.metadata
0.3 (2020-03-20)
New Quickstart and User Guide section in documentation
DAG rendering and build now continue until no more tasks can render/build (instead of failing at the first exception)
New @with_env and @load_env decorators for managing environments
Env expansion ({{user}} expands to the current, also {{git}} and {{version}} available)
Task.name is now optional when Task is initialized with a source that has __name__ attribute (Python functions) or a name attribute (like Placeholders returned from SourceLoader)
New Task.on_render hook
Bug fixes
A lot of new tests
Now compatible with Python 3.5 and higher
0.2.1 (2020-02-20)
Adds integration with pdb via PythonCallable.debug
Env.start now accepts a filename to look for
Improvements to data_frame_validator
0.2 (2020-02-13)
Simplifies installation
Deletes BashCommand, use ShellScript
More examples added
Refactored env module
Renames SQLStore to SourceLoader
Improvements to SQLStore
Improved documentation
Renamed PostgresCopy to PostgresCopyFrom
SQLUpload and PostgresCopy have now the same API
A few fixes to PostgresCopy (#1, #2)
0.1
First release