
A package to write schema-aware data pipelines



SchemaFlow

This is a package for systematically writing data pipelines for data science in Python. Thanks for checking it out.

Check out the very comprehensive documentation here.

The problem that this package solves

A major challenge in creating a robust data pipeline is guaranteeing interoperability between pipes: how do we guarantee that a pipe someone wrote is compatible with others' pipes without running the whole pipeline multiple times until we get it right?

The solution that this package adopts

This package declares an API for defining a stateful data transformation. It lets the developer declare what comes in, what comes out, and which states are modified by each pipe, and therefore by the whole pipeline. For usage, see tests/test_pipeline.py or examples/end_to_end_kaggle.py.
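To make the idea concrete, here is a minimal, self-contained sketch of schema-aware pipes in plain Python. It does not use SchemaFlow and is not its API: the names `Pipe`, `requires`, `produces`, `AddAgeSquared`, `Model`, and `check_pipeline` are all hypothetical, chosen only to illustrate how declared inputs and outputs allow a pipeline to be checked without running it.

```python
from typing import Dict, List


class Pipe:
    """Hypothetical base class for illustration; NOT SchemaFlow's actual API."""

    # Column name -> type this pipe expects in the incoming data.
    requires: Dict[str, type] = {}
    # Column name -> type this pipe adds to (or overwrites in) the data.
    produces: Dict[str, type] = {}

    def transform(self, data: dict) -> dict:
        raise NotImplementedError


class AddAgeSquared(Pipe):
    requires = {"age": float}
    produces = {"age_squared": float}

    def transform(self, data: dict) -> dict:
        return {**data, "age_squared": [a * a for a in data["age"]]}


class Model(Pipe):
    requires = {"age_squared": float, "fare": float}
    produces = {"prediction": float}

    def transform(self, data: dict) -> dict:
        pairs = zip(data["age_squared"], data["fare"])
        return {**data, "prediction": [a + f for a, f in pairs]}


def check_pipeline(pipes: List[Pipe], schema: Dict[str, type]) -> List[str]:
    """Propagate the declared schema through the pipes without running them."""
    schema = dict(schema)  # work on a copy so the caller's schema is untouched
    errors = []
    for pipe in pipes:
        name = type(pipe).__name__
        for column, expected in pipe.requires.items():
            if column not in schema:
                errors.append(f"{name}: missing '{column}'")
            elif schema[column] is not expected:
                errors.append(
                    f"{name}: '{column}' is {schema[column].__name__}, "
                    f"expected {expected.__name__}"
                )
        # Whatever this pipe declares as output is available to later pipes.
        schema.update(pipe.produces)
    return errors


pipeline = [AddAgeSquared(), Model()]
print(check_pipeline(pipeline, {"age": float}))
# ["Model: missing 'fare'"]
print(check_pipeline(pipeline, {"age": float, "fare": float}))
# []
```

In this sketch the incompatibility between the two pipes is reported from their declarations alone, before any data is transformed, which is the kind of guarantee the package aims for. State modified during fitting is left out here to keep the example short.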

Install

pip install schemaflow

Or install the latest version from source (recommended for now):

git clone https://github.com/jorgecarleitao/schemaflow
cd schemaflow && pip install -e .

Run examples

We provide one example that demonstrates the use of SchemaFlow's API to develop an end-to-end pipeline, applied to one of Kaggle's exercises.

To run it, download the data for that exercise to examples/all/ and run

pip install -r examples/requirements.txt
python examples/end_to_end_kaggle.py

You should see output printed to the console and three files generated in examples/: two plots and one submission.txt.

Run tests

pip install -r tests/requirements.txt
python -m unittest discover

Build documentation

pip install -r docs/requirements.txt
cd docs && make html && cd ..
open docs/build/html/index.html

