A Kedro plugin to utilise pandas dropins (like cuDF or modin) in place of the pandas datasets

kedro-dataframe-dropin

How do I get started?

$ pip install kedro-dataframe-dropin --upgrade

Then what?

Replace your pandas.*DataSet in your catalog.yml with

kedro_dataframe_dropin.[rapids|modin].*DataSet

and reap the benefits. This works as long as your node and pipeline code is compatible with the cudf/modin API (each of which aims to replicate the pandas API as closely as possible) and your data format is supported by the respective library (for example, cudf doesn't support the read_excel method).

What is kedro-dataframe-dropin?

kedro-dataframe-dropin is a Kedro plugin that provides modified versions of Kedro's pandas.* dataset definitions (e.g. pandas.CSVDataSet), where the pandas backend has been replaced with one of its drop-in replacements.

For example kedro_dataframe_dropin.modin.CSVDataSet replicates pandas.CSVDataSet but with the modin.pandas package replacing pandas. Likewise, kedro_dataframe_dropin.rapids.CSVDataSet provides a cuDF-backed version of the CSVDataSet.

Why does this exist?

There are several reasons you might consider a drop-in replacement for pandas. The use cases are outlined in various places, such as the modin documentation and the RAPIDS website.

However, the only dataframe-backed datasets that Kedro has out of the box are the pandas and pyspark ones. If you wanted to use, say, a modin dataframe backed by Dask or Ray, you'd need to write a custom dataset for each file format (.csv, .xls, etc...).

This lets you swap out your catalog.yml from:

# conf/base/catalog.yml [before]
rockets:
    type: pandas.CSVDataSet
    filepath: data/01_raw/rockets.csv

reviews:
    type: pandas.ExcelDataSet
    filepath: data/01_raw/reviews.xlsx

to:

# conf/base/catalog.yml [after]
rockets:
    type: kedro_dataframe_dropin.rapids.CSVDataSet
    filepath: data/01_raw/rockets.csv

reviews:
    type: kedro_dataframe_dropin.modin.ExcelDataSet
    filepath: data/01_raw/reviews.xlsx

and as long as the code within your nodes fits within modin's or cudf's implementation of the pandas API subset, you're done!
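To make that concrete, here is a sketch of what such a node might look like. This is an illustrative example, not from the plugin itself: the function and column names are hypothetical, and plain pandas stands in for whichever drop-in backs the catalog entry, since groupby/agg behave the same across the shared API subset.

```python
import pandas as pd  # could equally be modin.pandas or cudf

def summarise_rockets(rockets):
    # Uses only the common pandas API surface, so the same node code
    # works whether `rockets` is a pandas, modin, or cudf DataFrame.
    return rockets.groupby("manufacturer").agg({"payload_kg": "mean"})

demo = pd.DataFrame(
    {"manufacturer": ["astra", "astra", "orbit"],
     "payload_kg": [100.0, 300.0, 500.0]}
)
print(summarise_rockets(demo))
```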

What dropins are currently supported?

dropin       supported
-----------  ---------
modin[ray]   ✅
modin[dask]  ✅
cudf         ✅
dask         🟠
dask-cudf    🟠

✅: compatible. 🟠: no Kedro versioning, and some datasets (like SQLTableDataSet) don't work despite being available in both Kedro and the drop-in.

What happens when Kedro adds or changes a pandas dataset?

The beauty of this design is that the plugin stays in complete sync with Kedro's pandas.* library without any code changes or releases: it is implemented by hot-swapping the pandas module for the replacement you specify.
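The hot-swap idea can be illustrated with a self-contained toy (this is not the plugin's actual code; the module name and function here are hypothetical). Registering a replacement in sys.modules before anything imports it means all later imports of that name transparently receive the replacement:

```python
import sys
import types

# Build a tiny stand-in module that mimics the bit of the pandas API we need.
dropin = types.ModuleType("pandas_dropin_demo")
dropin.read_csv = lambda path: f"loaded {path} via drop-in"

# Registering it in sys.modules is the essence of a hot swap: any code that
# later does `import pandas_dropin_demo` receives this object, unchanged code
# included.
sys.modules["pandas_dropin_demo"] = dropin

import pandas_dropin_demo as pd  # resolves to the drop-in above
print(pd.read_csv("data/01_raw/rockets.csv"))
```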

Examples

As an example of why you might want to use this, here are the results of some very rough and preliminary benchmarking. These were conducted on a Google Colaboratory notebook (thanks Google!) with a Tesla T4 GPU and a 2-core CPU. The data used was a 5-million-row CSV, weighing in at around 100 MB, downloaded from here.

# conf/base/catalog.yml
cudf:
  type: kedro_dataframe_dropin.rapids.CSVDataSet
  filepath: data/01_raw/data.csv

pandas:
  type: pandas.CSVDataSet
  filepath: data/01_raw/data.csv

Using the two datasets within the kedro ipython console shows a world of difference: reading the file in is roughly 10x faster, a groupby around 6x faster, and taking the mean over 4x faster.

This helps shorten:

  • The feedback loop when prototyping and exploring your data within a kedro ipython or a kedro jupyter session
  • The feedback loop when running your pipelines in development and debugging/experimenting with various different methodologies
  • Your production runtime

In [1]: %timeit gdf = catalog.load("cudf")
702 ms ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %timeit df = catalog.load("pandas")
8.22 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit gdf.groupby("Region")
4.75 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: %timeit df.groupby("Region")
26.7 µs ± 397 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit df["Total Revenue"].mean()
11.8 ms ± 87.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit gdf["Total Revenue"].mean()
2.71 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Any additional benchmarks you do and want to share back would be much appreciated. Feel free to open an issue!

Some special notes on RAPIDS

The rest of the cu* ecosystem

Your data-processing step gets faster (assuming you have the right conditions) by plugging in the cudf module from RAPIDS in place of pandas, but it doesn't end there.

You can continue to make use of your GPU speedup in the rest of your pipeline lifecycle (predictions, ML, graph, etc...) by using the rest of the RAPIDS ecosystem of tools (cuML and its ilk) in place of tools like sklearn.

Why are some data formats missing?

With the way this plugin was designed, it only hot-swaps cudf in place of pandas where a Kedro pandas dataset exists.

So as it stands today, with the Kedro codebase not having an ORCDataSet for example, this plugin won't have it either. You'll need to build your own custom dataset.

Or better yet, head over to the Kedro codebase and contribute the pandas version of it to their codebase. This plugin will then automatically pick it up and provide a cudf-equivalent.

Some special notes on dask-cuDF and dask

Note that dask and dask-cuDF delay compute: operations across nodes are actually building up a computation graph. The graph is only executed, parallelised across your CPU/GPU, when a .compute() is triggered (for example by len, or by saving to disk via a non-memory dataset in the catalog).
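The deferred-compute semantics can be illustrated with a toy wrapper (this is not dask itself, just a minimal sketch of the "build a graph now, execute on .compute()" behaviour described above):

```python
class Lazy:
    """Toy deferred computation: chained operations only build a graph."""

    def __init__(self, fn):
        self.fn = fn  # a thunk producing the value when finally called

    def map(self, g):
        # Returns a new node in the graph; nothing is executed yet.
        return Lazy(lambda: g(self.fn()))

    def compute(self):
        # Only here does any actual work happen.
        return self.fn()

graph = Lazy(lambda: list(range(5))).map(lambda xs: [x * 2 for x in xs])
# No work has happened yet; this call triggers the whole chain:
print(graph.compute())  # → [0, 2, 4, 6, 8]
```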

Note that Kedro versioning won't be possible with these datasets. Kedro normally owns the I/O and passes the file handle down, but dask/dask-cuDF can't accept one, since file handles can't be shared across (CPU or GPU) workers. Instead, this plugin extracts the filepath and passes it to dask, which also uses fsspec, so you keep full remote-layer interoperability with the benefit of parallelised compute.

Consider giving Matthew Rocklin's blog post on dask-cuDF, and the philosophy of it simply being a different "engine" for dask.DataFrame, a read.

Caveats

Keep in mind that, to remain consistent with the adage of not copying memory, these dataframes are not copied when passed between nodes; they are passed through as the same underlying Python object. If you perform mutating operations on them in different nodes, keep that in mind.
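A minimal illustration of this caveat, using plain pandas and hypothetical node functions outside of any Kedro pipeline:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

def node_a(frame):
    # Mutates the frame it was handed; no copy is made.
    frame["x"] = frame["x"] * 10
    return frame

def node_b(frame):
    return frame["x"].sum()

out = node_b(node_a(df))
# df itself was mutated too: df["x"] is now [10, 20, 30], because node_a
# received the very same object, not a copy.
```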
