Kedro-Accelerator speeds up pipelines by parallelizing I/O in the background.

Kedro-Accelerator

Kedro pipelines consist of nodes, where an output from one node A can be an input to another node B. The Data Catalog defines where and how Kedro loads these inputs and saves these outputs. By default, a sequential Kedro pipeline:

  1. runs node A
  2. persists the output of A, often to remote storage like Amazon S3
  3. potentially runs other nodes
  4. fetches the output of A, loading it back into memory
  5. runs node B

Persisting intermediate data sets enables partial pipeline runs (e.g. running node B without rerunning node A) and analysis or debugging of these data sets. However, the I/O in steps 2 and 4 above is not necessary to run node B, because the required data is already in memory after step 1. Kedro-Accelerator speeds up pipelines by parallelizing this I/O in the background.
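The idea can be illustrated with a minimal, standalone sketch. It does not use Kedro and is not Kedro-Accelerator's actual implementation: STORE, DELAY, slow_save, slow_load, and both run_* functions are hypothetical names invented for this example. It contrasts the default save-then-reload flow with handing node A's in-memory output directly to node B while the save runs in a background thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor

STORE = {}   # hypothetical stand-in for remote storage such as Amazon S3
DELAY = 0.2  # hypothetical I/O latency in seconds


def slow_save(name, data):
    """Simulate a slow write to storage."""
    time.sleep(DELAY)
    STORE[name] = data


def slow_load(name):
    """Simulate a slow read from storage."""
    time.sleep(DELAY)
    return STORE[name]


def node_a():
    return [1, 2, 3]


def node_b(values):
    return sum(values)


def run_default():
    """Run A, persist its output, load it back, then run B."""
    slow_save("a_out", node_a())
    return node_b(slow_load("a_out"))


def run_with_background_save():
    """Run A, start the save in a thread, and pass the in-memory data to B."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        data = node_a()
        pending = pool.submit(slow_save, "a_out", data)
        result = node_b(data)  # no load needed: the data is still in memory
        pending.result()       # wait until the output has been persisted
    return result


if __name__ == "__main__":
    for runner in (run_default, run_with_background_save):
        start = time.perf_counter()
        result = runner()
        elapsed = time.perf_counter() - start
        print(f"{runner.__name__}: result={result}, elapsed={elapsed:.2f}s")
```

In both variants the output of A still ends up in storage, so partial runs and debugging remain possible; only the waiting on the critical path changes.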

How do I install Kedro-Accelerator?

Kedro-Accelerator is a Python plugin. To install it:

pip install kedro-accelerator

How do I use Kedro-Accelerator?

As of Kedro 0.16.4, TeePlugin (the core of Kedro-Accelerator) is auto-discovered upon installation. In older versions, hook implementations must be registered with Kedro through the ProjectContext. Hooks were introduced in Kedro 0.16.0, so earlier releases are not supported.

Prerequisites

The following conditions must be true for Kedro-Accelerator to speed up your pipeline:

  • Your pipeline must not use transcoding.
  • Your project must use SequentialRunner.
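As a rough way to check the first prerequisite, the sketch below scans a list of data set names for Kedro's transcoding syntax, in which a single data set is referenced as name@format (for example my_dataframe@spark and my_dataframe@pandas). The helper find_transcoded and the sample names are hypothetical and are not part of Kedro or Kedro-Accelerator.

```python
def find_transcoded(dataset_names):
    """Return, in sorted order, the names that use the ``name@format`` syntax."""
    return sorted(name for name in dataset_names if "@" in name)


if __name__ == "__main__":
    # Illustrative names only; substitute the entries from your own catalog.
    names = ["example_iris_data", "my_dataframe@spark", "my_dataframe@pandas"]
    transcoded = find_transcoded(names)
    if transcoded:
        print("Transcoded data sets found:", ", ".join(transcoded))
    else:
        print("No transcoded data sets found.")
```

An empty result suggests the pipeline does not rely on transcoding; the runner still has to be checked separately.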

Example

The Kedro-Accelerator repository includes the Iris data set example pipeline generated using Kedro 0.16.1. Intermediate data sets have been replaced with custom SlowDataSet instances to simulate a slow filesystem. You can try different load and save delays by modifying catalog.yml.

To get started, create and activate a new virtual environment. Then, clone the repository and install the requirements with pip:

git clone https://github.com/deepyaman/kedro-accelerator.git
cd kedro-accelerator
KEDRO_VERSION=0.16.5 pip install -r src/requirements.txt  # Specify your desired Kedro version.

You can compare pipeline execution times with and without TeePlugin. Kedro-Accelerator also provides CachePlugin so that you can test performance using CachedDataSet in asynchronous mode. Assuming parametrized load and save delays of 10 seconds for intermediate data sets, you should see the following results:

Strategy | Command | Total time
Baseline (i.e. no caching/plugins) | kedro run | 2 minutes
TeePlugin | kedro run --hooks src.kedro_accelerator.plugins.TeePlugin | 10 seconds (saving all outputs)
CachePlugin (i.e. CachedDataSet) with is_async=True | kedro run --async --hooks src.kedro_accelerator.plugins.CachePlugin | 30 seconds (saving split_data, train_model, and predict node outputs)

For a more complete discussion of the above benchmarks, see quantumblacklabs/kedro#420 (comment).

What license do you use?

Kedro-Accelerator is licensed under the MIT License.

