Citrine Informatics ETL pipeline
Project description
Piperoni
Piperoni is a lightweight ETL framework for any data type that lets you make, track, and visualize atomic data transformations. Unlike some ETL tools, Piperoni transforms data in memory, which makes it well suited to complex, diverse datasets that are not "big data".
Piperoni also checks that the expected types are passed from one transformation to the next and lets you inspect the state of the data at any point in the pipeline. This makes it a good fit for collaborative data pipelines, where visibility into data transformations is key.
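To make the idea concrete, here is a minimal sketch in plain Python of atomic, type-checked steps whose intermediate states are all recorded. It does not use Piperoni's API: `Step` and `run_steps` are hypothetical names invented for this illustration, and the real framework may structure things differently.

```python
from typing import Any, Callable, List, NamedTuple


class Step(NamedTuple):
    """One atomic transformation plus the types it expects and produces."""

    name: str
    func: Callable[[Any], Any]
    input_type: type
    output_type: type


def run_steps(data: Any, steps: List[Step]) -> List[Any]:
    """Run each step in order, checking types and recording every state."""
    history = [data]  # history[0] is the raw input
    for step in steps:
        if not isinstance(data, step.input_type):
            raise TypeError(
                f"{step.name}: expected {step.input_type.__name__}, "
                f"got {type(data).__name__}"
            )
        data = step.func(data)
        if not isinstance(data, step.output_type):
            raise TypeError(
                f"{step.name}: should return {step.output_type.__name__}, "
                f"returned {type(data).__name__}"
            )
        history.append(data)  # keep the state after this step for inspection
    return history


# Each step does exactly one thing, so a failure points at one transformation.
steps = [
    Step("split_fields", lambda text: text.split(","), str, list),
    Step("parse_floats", lambda items: [float(x) for x in items], list, list),
    Step("total", sum, list, float),
]

history = run_steps("1.5,2.5,3.0", steps)
for state in history:
    print(state)
```

Because every intermediate state is kept in `history`, a collaborator can look at the data after any step without re-running the earlier ones.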
Getting Started
piperoni is a framework for ETL and data pipeline work. To get started, first install piperoni:
pip install git+ssh://git@github.com/CitrineInformatics/piperoni.git
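After installing, you may want to confirm that the package is visible to your Python environment. The sketch below uses only the standard library; `installed_version` is a hypothetical helper written for this example, and it assumes the distribution is registered under the name `piperoni`.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name matches the package name "piperoni".
version = installed_version("piperoni")
print(version if version is not None else "piperoni is not installed")
```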
Documentation
Detailed instructions on installation and usage can be found in the complete piperoni docs.
Contributing
The following best practices are required for contributing.
In this repo, we follow PEP 8 standards (using Black) and include docstrings in all of our work.
All functions should have unit tests.
Best Practices
- Never use branching code (e.g. if, else) in a Pipeline without an explicit warning or failure. In particular, do not branch when the branches give rise to the same or similar data.
- Do not use deepcopy() in any operators; this will cause unexpected behavior.
- Keep transforms atomic! This is the reason for Piperoni. Don't be lazy.
- Stuck? Piperoni logs every transformation! Just set it to debug mode!
- Use Checkpoints to make intermediate states optionally available as output.
- Do not use nested types when defining types in your Operators (e.g. use Dict, not Dict[str, str]).
- Avoid hidden state; adopt functional programming practices whenever possible.
- Avoid keeping multiple versions of a file to handle different options. Use argparse or similar instead whenever possible.
- Use named arguments in function calls, and either avoid optional arguments or fill them in explicitly.
- Do not hard-code column names or similar, even when the function only ever applies to a single column or instance.
- Keep a trusted reference. Always compare against it after changes to the pipeline, and update it as needed.
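Several of these practices can be illustrated together in plain Python. The sketch below is not Piperoni code: `parse_args` and `to_float_column` are hypothetical helpers, and the rows and reference are made-up values chosen only for the illustration. It shows options passed through argparse, a column name supplied as a keyword-only argument rather than hard-coded, a branch that either warns or fails explicitly, and a final comparison against a trusted reference.

```python
import argparse
import warnings
from typing import Any, Dict, List, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Expose options on the command line instead of keeping variant scripts."""
    parser = argparse.ArgumentParser(description="Options for a hypothetical pipeline run")
    parser.add_argument("--column", required=True, help="name of the column to convert")
    parser.add_argument("--strict", action="store_true", help="fail instead of warning")
    return parser.parse_args(argv)


def to_float_column(
    rows: List[Dict[str, Any]], *, column: str, strict: bool
) -> List[Dict[str, Any]]:
    """Atomic transform: convert one column to float and return new rows."""
    converted = []
    for row in rows:
        new_row = dict(row)  # build a new row so the input is left untouched
        try:
            new_row[column] = float(row[column])
        except ValueError:
            # The fallback branch is never silent: it fails or warns.
            if strict:
                raise
            warnings.warn(
                f"could not convert {row[column]!r} in column {column!r}; storing None"
            )
            new_row[column] = None
        converted.append(new_row)
    return converted


# Made-up input and trusted reference, for illustration only.
rows = [{"sample": "a", "value": "1.5"}, {"sample": "b", "value": "n/a"}]
trusted_reference = [{"sample": "a", "value": 1.5}, {"sample": "b", "value": None}]

args = parse_args(["--column", "value"])
result = to_float_column(rows, column=args.column, strict=args.strict)
print("matches trusted reference:", result == trusted_reference)
```

Making `column` and `strict` keyword-only forces every call site to name them, which keeps calls readable when the function is reused on a different column later.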
Flagging Bugs and Requesting New Features
We funnel bug reports and feature requests through GitHub issues. Create a new issue and select Bug Report or Feature Request (if you have neither a bug nor a feature request, open a regular issue). Add a concise title, fill in the template, and submit the issue.
Citations
The band gap data used in the example are from: Strehlow, W. H., & Cook, E. L. (1973). Compilation of energy band gaps in elemental and binary compound semiconductors and insulators. Journal of Physical and Chemical Reference Data, 2(1), 163-200.
File details
Details for the file piperoni-3.0.7.tar.gz.
File metadata
- Download URL: piperoni-3.0.7.tar.gz
- Upload date:
- Size: 23.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.0.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 13d6a4ad7eaaa4e8fa2d65f5997e418ba417ccf42e42998ecd3131fdb4e008a5
MD5 | fa9d84c6dbd994f17dbeada92c2846b8
BLAKE2b-256 | 4318ee830ba56c447ead4db9e96dd960e0cdfb599dad671a24ee2e14fbb89313
File details
Details for the file piperoni-3.0.7-py3-none-any.whl.
File metadata
- Download URL: piperoni-3.0.7-py3-none-any.whl
- Upload date:
- Size: 29.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.0.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | af9c5cb8ff1d07da4d676255a81e2aa1d31c82aa2026063cd9edebc0e913386f
MD5 | 8ccb623d1b667f1d603367fc3d60e5bd
BLAKE2b-256 | c3db9a877b50893fcdbb4fc63fe68326871922f728e2d2154144ace97c75f0db