dPipes - Pythonic Data Pipelines

About

dPipes is a Python package for creating reusable, modular, and composable data pipelines. It's a small project that came out of the desire to turn this:

import pandas as pd

data = (data.pipe(func_1)
        .pipe(func_2)
        .pipe(func_3)
)

into this:

from dpipes.processor import PipeProcessor

ps = PipeProcessor(
    funcs=[func_1, func_2, func_3]
)

data = ps(data)

Now, arguably, there is not much functional difference between the two implementations. They both accomplish the same task with roughly the same amount of code.

But what happens if you want to apply the same pipeline of functions to a different data object?

Using the first method, you'd need to rewrite (copy/paste) your method-chaining pipeline:

new_data = (new_data.pipe(func_1)
        .pipe(func_2)
        .pipe(func_3)
)

Using the latter method, you'd only need to pass in a different object to the pipeline:

new_data = ps(new_data)

Under the Hood

dPipes uses two functions from Python's functools module: reduce and partial. The reduce function enables function composition; the partial function enables use of arbitrary kwargs.
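
To illustrate how these two functions fit together, here is a minimal sketch using only the standard library. It is not the dPipes source code; the names compose, add, and scale are hypothetical and exist only for this example.

```python
from functools import partial, reduce


def compose(funcs):
    """Return a callable that applies each function in order, left to right."""
    def run(data):
        # reduce threads the running result through every function in turn.
        return reduce(lambda acc, func: func(acc), funcs, data)
    return run


# Hypothetical example steps that accept keyword arguments.
def add(x, amount=0):
    return x + amount


def scale(x, factor=1):
    return x * factor


# partial pre-binds the keyword arguments, so each step takes one argument.
pipeline = compose([
    partial(add, amount=2),
    partial(scale, factor=10),
])

result = pipeline(1)  # (1 + 2) * 10 = 30
```

Because partial fixes the kwargs ahead of time, every step ends up with the same one-argument shape, which is what lets reduce chain them.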

Generalization

Although dPipes initially addressed pd.DataFrame.pipe method-chaining, it's extensible to any API that implements a pandas-like DataFrame.pipe method (e.g. Polars). Further, the dpipes.pipeline module extends this composition to any arbitrary Python function.

That is, this:

result = func_3(func_2(func_1(x)))

or this:

result = func_1(x)
result = func_2(result)
result = func_3(result)

becomes this:

from dpipes.pipeline import Pipeline

pl = Pipeline(funcs=[func_1, func_2, func_3])
result = pl(x)

which is, arguably, more readable and, once again, easier to apply to other objects.
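
The earlier point about pipe-style APIs can also be sketched without pandas. The example below assumes only that an object exposes a pipe method with the same contract as DataFrame.pipe (call the function with the object as its first argument). Box, drop_negatives, double, and pipe_all are hypothetical names for illustration, not part of dPipes.

```python
from functools import reduce


class Box:
    """Hypothetical container with a pandas-like pipe method."""

    def __init__(self, values):
        self.values = list(values)

    def pipe(self, func, *args, **kwargs):
        # Same contract as DataFrame.pipe: pass self as the first argument.
        return func(self, *args, **kwargs)


def drop_negatives(box):
    return Box(v for v in box.values if v >= 0)


def double(box):
    return Box(v * 2 for v in box.values)


def pipe_all(obj, funcs):
    """Equivalent to obj.pipe(funcs[0]).pipe(funcs[1])... for any pipe-able object."""
    return reduce(lambda acc, func: acc.pipe(func), funcs, obj)


result = pipe_all(Box([3, -1, 2]), [drop_negatives, double])
print(result.values)  # [6, 4]
```

Nothing in pipe_all depends on the concrete type, which is why the same approach can carry over to other libraries that offer a pipe method.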

Installation

dPipes can be installed via pip:

pip install dpipes

We recommend setting up a virtual environment with Python >= 3.8.

Benefits

Reusable Pipelines

As you'll see in the tutorials, one of the key benefits of using dPipes is the reusable pipeline object that can be called on multiple datasets (provided their schemas are similar):

for ds in [split_1, split_2, split_3]:
    result_b = ps(ds)

pd.testing.assert_frame_equal(result_a, result_b)

Modular Pipelines

Another is the ability to create modularized pipelines that can easily be imported and used elsewhere in code:

"""My pipeline module."""

from dpipes.processor import PipeProcessor


# Each task takes the data object as its first argument and returns the result.
def task_1(df):
    ...


def task_2(df):
    ...


def task_3(df):
    ...


def task_4(df):
    ...


my_pipeline = PipeProcessor([task_1, task_2, task_3, task_4])

Then, in another module:

from my_module import my_pipeline

my_pipeline(my_data)

Composable Pipelines

Finally, you can compose large, complex processing pipelines using an arbitrary number of sub-pipelines:

ps = PipeProcessor([
    task_1,
    task_2,
    task_3,
    task_4,
])

col_ps_single = ColumnPipeProcessor(
    funcs=[task_5, task_6],
    cols="customer_id"
)

col_ps_multi = ColumnPipeProcessor(
    funcs=[task_7, task_8],
    cols=["customer_id", "invoice"]
)

col_ps_nested = ColumnPipeProcessor(
    funcs=[task_9, task_10],
    cols=[
        ["quantity", "price"],
        ["invoice"],
    ]
)

pipeline = PipeProcessor([
    ps,
    col_ps_single,
    col_ps_multi,
    col_ps_nested,
])

result = pipeline(data)
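
One way to see why this kind of nesting can work: if a pipeline object is itself callable, it can sit in another pipeline's function list like any other step. The sketch below uses a hypothetical MiniPipeline class as a stand-in; it is an assumption about the general pattern, not the dPipes implementation.

```python
from functools import reduce


class MiniPipeline:
    """Hypothetical stand-in for a pipeline object (not the dPipes class)."""

    def __init__(self, funcs):
        self.funcs = list(funcs)

    def __call__(self, data):
        # Apply each step in order; a step may itself be a MiniPipeline.
        return reduce(lambda acc, func: func(acc), self.funcs, data)


# Two small sub-pipelines built from standard-library callables.
clean = MiniPipeline([str.strip, str.lower])
tokenize = MiniPipeline([str.split, sorted])

# The outer pipeline treats each sub-pipeline as a single step.
full = MiniPipeline([clean, tokenize])

words = full("  B a C ")  # ['a', 'b', 'c']
```

Calling full gives the same result as calling tokenize(clean(...)) by hand, so sub-pipelines can be developed and tested on their own before being combined.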
