Kedro-Accelerator speeds up pipelines by parallelizing I/O in the background.
Kedro pipelines consist of nodes, where an output from one node A can be an input to another node B. The Data Catalog defines where and how Kedro loads and saves these inputs and outputs, respectively. By default, a sequential Kedro pipeline:
1. runs node A
2. persists the output of A, often to remote storage like Amazon S3
3. potentially runs other nodes
4. fetches the output of A, loading it back into memory
5. runs node B
Persisting intermediate data sets enables partial pipeline runs (e.g. running node B without rerunning node A) and analysis/debugging of these data sets. However, the I/O in steps 2 and 4 above is not necessary to run node B, because the required data is already in memory after step 1. Kedro-Accelerator speeds up pipelines by parallelizing this I/O in the background.
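To make the idea concrete, here is a minimal sketch of the general technique, not the actual TeePlugin implementation. It uses only the Python standard library. `SlowStore`, `node_a`, `node_b`, and the `DELAY` value are hypothetical names introduced for illustration; they are not part of Kedro or Kedro-Accelerator.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # simulated I/O latency in seconds (hypothetical value)


class SlowStore:
    """Hypothetical stand-in for slow storage such as a remote bucket."""

    def __init__(self):
        self._data = {}

    def save(self, name, value):
        time.sleep(DELAY)  # pretend the write is slow
        self._data[name] = value

    def load(self, name):
        time.sleep(DELAY)  # pretend the read is slow
        return self._data[name]


def node_a():
    return [1, 2, 3]


def node_b(values):
    return sum(values)


def run_sequential(store):
    """Save A's output, then load it back before running B."""
    store.save("a_out", node_a())
    return node_b(store.load("a_out"))


def run_with_background_save(store):
    """Pass A's in-memory output to B while the save runs in a thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        a_out = node_a()
        pending = pool.submit(store.save, "a_out", a_out)
        result = node_b(a_out)  # no load needed: the data is already in memory
        pending.result()  # wait, so the intermediate output is still persisted
    return result


def timed(fn):
    """Run fn against a fresh store and return (result, seconds, store)."""
    store = SlowStore()
    start = time.perf_counter()
    result = fn(store)
    return result, time.perf_counter() - start, store


if __name__ == "__main__":
    for fn in (run_sequential, run_with_background_save):
        result, elapsed, _ = timed(fn)
        print(f"{fn.__name__}: result={result}, elapsed={elapsed:.2f}s")
```

Both functions return the same result and both persist the intermediate output, so partial runs and debugging remain possible. The second simply avoids blocking node B on the save and skips the reload.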
How do I install Kedro-Accelerator?
Kedro-Accelerator is a Python plugin. To install it:
pip install kedro-accelerator
How do I use Kedro-Accelerator?
As of Kedro 0.16.4, TeePlugin (the core of Kedro-Accelerator) will be auto-discovered upon installation. In older versions, hook implementations should be registered with Kedro through the ProjectContext. Hooks were introduced in Kedro 0.16.0.
The following conditions must be true for Kedro-Accelerator to speed up your pipeline:
- Your pipeline must not use transcoding.
- Your project must use
The Kedro-Accelerator repository includes the Iris data set example pipeline generated using Kedro 0.16.1. Intermediate data sets have been replaced with custom SlowDataSet instances to simulate a slow filesystem. You can try different load and save delays by modifying
To get started, create and activate a new virtual environment. Then, clone the repository and pip install requirements:
git clone https://github.com/deepyaman/kedro-accelerator.git
cd kedro-accelerator
KEDRO_VERSION=0.16.5 pip install -r src/requirements.txt  # Specify your desired Kedro version.
You can compare pipeline execution times with and without TeePlugin. Kedro-Accelerator also provides CachePlugin so that you can test performance using CachedDataSet in asynchronous mode. Assuming parametrized load and save delays of 10 seconds for intermediate datasets, you should see the following results:
|Baseline (i.e. no caching/plugins)||
||10 seconds (saving all outputs)||Log|
||30 seconds (saving
For a more complete discussion of the above benchmarks, see quantumblacklabs/kedro#420 (comment).
What license do you use?
Kedro-Accelerator is licensed under the MIT License.