
Lightweight Data Pipeline with code-based stage caching


Lightweight Data Pipeline (LWDP)

LWDP attempts to fill the niche of structuring pure-Python data transformations, with robust data- and code-based caching across a few storage locations.

Because sometimes Spark or Dask or AWS Glue or anything other than a 5kb library and some dumbly hashed files is just too much.

LWDP is meant for the case where you're doing a few data transformations, possibly across multiple input file types (CSV, Excel, Parquet, etc.). Each of these files can generally (although not strictly) be held in memory. 25 CSVs with structured transformations that you'd like to keep organized and possibly streamline with caching?

LWDP could be the answer.

If the data changes or your code changes, you want to be able to refresh the data pipeline once - and, ideally, refresh only those parts of the pipeline that need it.

Installation

You should be able to install from PyPI with pip install lwdp

Usage

Decorate functions to represent stages, and chain those functions together to make a pipeline.

# read from a raw file and cache
from lwdp import stage
import pandas as pd


@stage(some_raw_file="raw/input.csv", cache=True)
def stg_read_format_raw(**kwargs) -> pd.DataFrame:
    pdf = pd.read_csv(kwargs.get('some_raw_file'))
    # some stuff to clean it
    return pdf


# read from a previous stage and cache
@stage(basic_raw=stg_read_format_raw, cache=True, format='parquet')
def stg_format_more(**kwargs) -> pd.DataFrame:
    raw = kwargs.get('basic_raw')
    raw['new_analysis_column'] = 3
    return raw


# read from a previous stage without cacheing
@stage(formatted_src=stg_format_more)
def stg_final_process(**kwargs) -> pd.DataFrame:
    result = kwargs.get("formatted_src")
    result['wizard'] = 5
    return result


stg_final_process()

Just call the last stage in the pipeline (as you would any other function) to run all of its ancestors, reading from and writing to cached stages as needed.

How it works

Each stage has a hash computed from its code (excluding whitespace and docstrings), its "raw" ancestors, and its stage ancestors. Hash computation for a stage is recursive, so if any stage's code changes, all of its downstream stages get new hashes.
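
As a rough illustration (not LWDP's actual internals; the helper names and the attributes on the parent stage objects are assumptions), the recursive hash could look something like this:

import hashlib
import inspect


def _normalized_source(func) -> str:
    """Collapse a function's source to a whitespace-insensitive string.

    (Docstring stripping is omitted here for brevity.)
    """
    return "".join(inspect.getsource(func).split())


def stage_hash(func, raw_inputs, stage_inputs) -> str:
    """Hash a stage from its code, its raw file paths, and its ancestor stages."""
    h = hashlib.sha256()
    h.update(_normalized_source(func).encode())
    # "Raw" ancestors are hashed by filename only (see the TODO below).
    for name, path in sorted(raw_inputs.items()):
        h.update(f"{name}={path}".encode())
    # Stage ancestors are hashed recursively, so a code change anywhere
    # upstream gives every downstream stage a new hash.
    # Each parent is assumed to expose .func, .raw_inputs and .stage_inputs.
    for name, parent in sorted(stage_inputs.items()):
        h.update(stage_hash(parent.func, parent.raw_inputs, parent.stage_inputs).encode())
    return h.hexdigest()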

Stages can optionally be cached; if so, a format supported by pandas (anything with matching to_<format> and read_<format> methods) can be specified. If a stage is cached and a file with the specified hash already exists, we read that file instead of recomputing the stage.
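
Conceptually, the cache check boils down to something like the sketch below (the cache directory and helper name are made up for illustration):

import os

import pandas as pd


def run_with_cache(stage_func, stage_hash, fmt="csv", cache_dir=".lwdp_cache", **inputs):
    """Illustrative only: read a cached result if one exists, otherwise compute and write it."""
    cache_path = os.path.join(cache_dir, f"{stage_hash}.{fmt}")
    if os.path.exists(cache_path):
        # Cache hit: the hash already encodes the code and ancestors, so the file is still valid.
        return getattr(pd, f"read_{fmt}")(cache_path)
    result = stage_func(**inputs)
    os.makedirs(cache_dir, exist_ok=True)
    # pandas provides matching to_<format>/read_<format> pairs, e.g. to_parquet/read_parquet.
    getattr(result, f"to_{fmt}")(cache_path)
    return result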

Ideally we could do this using distributed persistent storage (e.g. on S3), which is what I'd like to work on next. Then teams working on a data pipeline can read from a common source of "raw" files (and cached computations!).

TODO

  • Deleting cached files after some TTL
  • Using S3
  • Hashing the actual data in raw files and using that as part of the "raw" data hash, instead of just the filename (see the sketch below)
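
A possible approach to that last item (purely a sketch; none of this exists in LWDP yet):

import hashlib


def raw_file_hash(path, chunk_size=1 << 20):
    """Hash a raw input file's bytes, so changing the data (not just the filename) invalidates the cache."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()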

