dPipes - Pythonic Data Pipelines
About
dPipes is a Python package for creating reusable, modular, and composable data pipelines. It's a small project that came out of the desire to turn this:
import pandas as pd

data = (
    data.pipe(func_1)
    .pipe(func_2)
    .pipe(func_3)
)
into this:
from dpipes.processor import PipeProcessor

ps = PipeProcessor(
    funcs=[func_1, func_2, func_3]
)
data = ps(data)
Now, arguably, there is not much functional difference between the two implementations. They both accomplish the same task with roughly the same amount of code.
But, what happens if you want to apply the same pipeline of functions to a different data object?
Using the first method, you'd need to re-write (copy/paste) your method-chaining pipeline:
new_data = (
    new_data.pipe(func_1)
    .pipe(func_2)
    .pipe(func_3)
)
Using the latter method, you'd only need to pass in a different object to the pipeline:
new_data = ps(new_data)
Under the Hood
dPipes uses two functions from Python's functools module: reduce and partial. The reduce function enables function composition; the partial function enables use of arbitrary kwargs.
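As a rough illustration of that mechanism (this is not the actual dPipes source; `add` and `scale` are made-up functions), composing a list of functions with reduce, with kwargs bound up front by partial, might look like:

```python
from functools import partial, reduce

def add(x, amount=0):
    return x + amount

def scale(x, factor=1):
    return x * factor

# partial binds the kwargs ahead of time; reduce then threads the
# value through each function, left to right
funcs = [partial(add, amount=3), partial(scale, factor=2)]
result = reduce(lambda acc, f: f(acc), funcs, 5)  # (5 + 3) * 2 == 16
```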
Generalization
Although dPipes initially addressed pd.DataFrame.pipe method-chaining, it's extensible to any API that implements a pandas-like DataFrame.pipe method (e.g. Polars). Further, the dpipes.pipeline module extends this composition to any arbitrary Python function.
That is, this:
result = func_3(func_2(func_1(x)))
or this:
result = func_1(x)
result = func_2(result)
result = func_3(result)
becomes this:
from dpipes.pipeline import Pipeline
pl = Pipeline(funcs=[func_1, func_2, func_3])
result = pl(x)
which is, arguably, more readable and, once again, easier to apply to other objects.
Installation
dPipes can be installed via pip:
pip install dpipes
We recommend setting up a virtual environment with Python >= 3.8.
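For instance, a typical setup might look like this (assumes a POSIX shell; the `.venv` directory name is arbitrary):

```shell
# Create and activate an isolated environment (Python >= 3.8)
python3 -m venv .venv
source .venv/bin/activate

# Install dPipes from PyPI
pip install dpipes
```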
Benefits
Reusable Pipelines
As you'll see in the tutorials, one of the key benefits of using dPipes is the reusable pipeline object that can be called on multiple datasets (provided their schemas are similar):
for ds in [split_1, split_2, split_3]:
    result_b = ps(ds)
    pd.testing.assert_frame_equal(result_a, result_b)
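Here is a runnable sketch of that pattern, using plain pandas and a reduce-based stand-in for the `PipeProcessor` object `ps` (`drop_nulls` and `add_total` are hypothetical tasks):

```python
from functools import reduce

import pandas as pd

def drop_nulls(df):
    # hypothetical task: remove rows with missing values
    return df.dropna()

def add_total(df):
    # hypothetical task: derive a total column
    return df.assign(total=df["qty"] * df["price"])

def ps(df, funcs=(drop_nulls, add_total)):
    # stand-in for a PipeProcessor: thread df through each task in order
    return reduce(lambda acc, f: f(acc), funcs, df)

# One pipeline object, applied to several similarly-shaped datasets
splits = [
    pd.DataFrame({"qty": [1, 2], "price": [3.0, 4.0]}),
    pd.DataFrame({"qty": [5, None], "price": [6.0, 7.0]}),
]
results = [ps(ds) for ds in splits]
```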
Modular Pipelines
Another is the ability to create modularized pipelines that can easily be imported and used elsewhere in code:
"""My pipeline module."""
from dpipes.processor import PipeProcessor
def task_1(...):
...
def task_2(...):
...
def task_3(...):
...
def task_4(...):
...
my_pipeline = PipeProcessor([task_1, task_2, task_3, task_4])
from my_module import my_pipeline
my_pipeline(my_data)
Composable Pipelines
Finally, you can compose large, complex processing pipelines using an arbitrary number of sub-pipelines:
ps = PipeProcessor([
    task_1,
    task_2,
    task_3,
    task_4,
])

col_ps_single = ColumnPipeProcessor(
    funcs=[task_5, task_6],
    cols="customer_id",
)

col_ps_multi = ColumnPipeProcessor(
    funcs=[task_7, task_8],
    cols=["customer_id", "invoice"],
)

col_ps_nested = ColumnPipeProcessor(
    funcs=[task_9, task_10],
    cols=[
        ["quantity", "price"],
        ["invoice"],
    ],
)

pipeline = PipeProcessor([
    ps,
    col_ps_single,
    col_ps_multi,
    col_ps_nested,
])

result = pipeline(data)
File details
Details for the file dpipes-0.1.1.tar.gz.
File metadata
- Download URL: dpipes-0.1.1.tar.gz
- Upload date:
- Size: 5.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes

Algorithm | Hash digest
---|---
SHA256 | 51bf634245c5e600d825c3535db261c782f69da0c8ff07ae02eb107c38b8ec88
MD5 | b551bab51c381e910533b8a18f49c844
BLAKE2b-256 | 6c510074eb7c34cf32d99558e7700279ac3483eef898e6c66385b336bc8dee2f
File details
Details for the file dpipes-0.1.1-py3-none-any.whl.
File metadata
- Download URL: dpipes-0.1.1-py3-none-any.whl
- Upload date:
- Size: 4.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes

Algorithm | Hash digest
---|---
SHA256 | 73e801054e71f83518cb3c861f6b88e1c3eccadbc1413aa997819befe65c550f
MD5 | df2c7b02815d5ab5936b53cb132763c2
BLAKE2b-256 | d28e29a2540226618b418468fb757dc598aab538d0ab52bcc568d9b383e65811