a package to write data pipelines for data science systematically
Project description
SchemaFlow
This is a package to systematically write data pipelines for data science in Python. Thanks for checking it out.
The problem that this package solves
A major challenge in creating a robust data pipeline is guaranteeing interoperability between pipes: how do we guarantee that a pipe someone else wrote is compatible with my pipeline, without running the whole pipeline multiple times until it works?
The solution this package adopts
This package declares an interface for defining a stateful data transformation that lets the developer declare what comes in, what comes out, and what state is modified by each pipe, and therefore by the whole pipeline.
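The snippet below is only an illustrative sketch of this idea in plain Python, not SchemaFlow's actual API: the class ScalePipe, the attribute names (transform_requires, transform_modifies, fit_parameters) and the check function are hypothetical, chosen to show how declaring each pipe's inputs, outputs, and fitted state makes compatibility checkable without pushing any data through the pipeline.

```python
class ScalePipe:
    # Hypothetical schema declarations: what must come in, what comes out,
    # and what fitted state the pipe holds.
    transform_requires = {'x': list}
    transform_modifies = {'x_scaled': list}
    fit_parameters = {'mean': float, 'std': float}

    def fit(self, data):
        xs = data['x']
        mean = sum(xs) / len(xs)
        std = (sum((v - mean) ** 2 for v in xs) / len(xs)) ** 0.5
        self.state = {'mean': mean, 'std': std}

    def transform(self, data):
        mean, std = self.state['mean'], self.state['std']
        data['x_scaled'] = [(v - mean) / std for v in data['x']]
        return data


def check(upstream, downstream):
    """Return the keys the downstream pipe requires but the upstream one does not produce."""
    return set(downstream.transform_requires) - set(upstream.transform_modifies)
```

Because the declarations are plain class attributes, `check(upstream, downstream)` can validate an entire chain of pipes before any data is loaded or any model is fitted.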
Install
# git clone the repository
pip install .
Run tests
pip install -r requirements_tests.txt
python -m unittest discover
Build documentation
pip install -r requirements_docs.txt
cd docs && make html && cd ..
open docs/build/html/index.html
Use cases
You have a Hadoop cluster with CSV and other files, use PySpark to process them, and fit a model. There are multiple processing steps developed by many people; a sketch of one such step is shown below.
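As a hedged sketch of that scenario (again, not SchemaFlow's API): the step below declares the columns it requires and validates them against the DataFrame read from the cluster before doing any work. The HDFS path and the column names are made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical declaration of the columns this processing step requires.
REQUIRED_COLUMNS = {'price', 'area'}

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('hdfs:///data/listings/*.csv', header=True, inferSchema=True)

# Fail fast if the upstream data does not satisfy the declared schema.
missing = REQUIRED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f'Upstream data is missing columns: {missing}')

# The actual transformation only runs once the declared schema is satisfied.
df = df.withColumn('price_per_area', F.col('price') / F.col('area'))
```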
Hashes for schemaflow-0.1.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 7432f613f636fc0af9f3a30c66ae2225d38b3f44c25567f090ac4c7dd0f434fe |
| MD5 | 3a2b098bdb8030a081261271e6217775 |
| BLAKE2b-256 | 13833bbb5ba3cffcbf6189621dd4119bbd0c030df49b527b8a4b5acad2bdba0e |