A kedro-plugin that adds caching to kedro pipelines

Project description

Kedro Cache

:warning: This plugin is still under active development and not fully tested. Do not use it in any production systems. Please report any issues that you find.

📝 Description

kedro-cache is a kedro plugin that enables the caching of data sets. If a node's input data sets and code have not changed, its outputs are loaded from the data catalog instead of being recomputed; if the inputs or the code have changed, the outputs are recomputed and the data catalog is updated. The plugin works out of the box with any kedro project, without having to change the code. The logic for determining whether the cached data set in the catalog should be used is described in the flow chart below.

Caching Flowchart

Disclaimer: The caching strategy determines whether a node function has changed by looking only at the immediate function body. It does not take into account other things, such as called functions or global variables, that might also have changed.
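To make the disclaimer concrete, here is a minimal, hypothetical sketch of the kind of check described above. This is not the plugin's actual implementation; the function names and the record layout are assumptions for illustration only.

```python
import hashlib
import inspect


def function_body_hash(func) -> str:
    # Hash only the immediate source of the node function. As noted in the
    # disclaimer, this ignores called functions and global variables that
    # might also have changed.
    source = inspect.getsource(func)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def should_use_cache(func, input_hashes: dict, stored_record: dict) -> bool:
    # stored_record is a hypothetical dict holding the hashes recorded on
    # the last successful run of this node; None means it never ran.
    if stored_record is None:
        return False  # nothing cached yet: compute and store
    return (
        stored_record.get("func") == function_body_hash(func)
        and stored_record.get("inputs") == input_hashes
    )
```

If either the function-body hash or any input hash differs from the stored record, the node is recomputed and the catalog entry is refreshed; otherwise the cached output is loaded.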

🏆 Features

  • Caching of node outputs in catalog
  • No change to kedro project needed
  • Integration with kedro data catalog
  • Configuration via cache.yml file

🏗 Installation

The plugin can be installed with pip:

pip install kedro-cache

🚀 Enable Caching

In the root directory of your kedro project, run

kedro cache init

This will create a new file cache.yml in the conf directory of your kedro project, in which you can configure the kedro-cache module. This step is optional, as the plugin comes with default configurations.

Next, let's assume you have the following kedro pipeline to which you want to add caching. It has two nodes: one that reads data from an input dataset, does some computation, and writes to an intermediate dataset, and one that reads from the intermediate dataset and writes to an output dataset.

# pipeline.py
from typing import Dict

from kedro.pipeline import Pipeline, node, pipeline


def register_pipelines() -> Dict[str, Pipeline]:
    default_pipeline = pipeline(
        [
            node(
                func=lambda x: x,
                inputs="input",
                outputs="intermediate",
            ),
            node(
                func=lambda x: x,
                inputs="intermediate",
                outputs="output",
            ),
        ],
    )
    return {"__default__": default_pipeline}

To enable caching, we simply have to register all used data sets in the data catalog. If a node's recomputation is skipped, its cached outputs need to be loaded from the data catalog, which is only possible if they were stored there.

# catalog.yml

input:
  type: pandas.CSVDataSet
  filepath: input.csv

intermediate:
  type: pandas.CSVDataSet
  filepath: intermediate.csv

output:
  type: pandas.CSVDataSet
  filepath: output.csv

And that's it: just by adding all data sets to the catalog, you have enabled caching.

Project details


Download files

Download the file for your platform.

Source Distribution

kedro_cache-0.1.1.tar.gz (11.7 kB)

Uploaded Source

Built Distribution

kedro_cache-0.1.1-py3-none-any.whl (12.9 kB)

Uploaded Python 3

File details

Details for the file kedro_cache-0.1.1.tar.gz.

File metadata

  • Download URL: kedro_cache-0.1.1.tar.gz
  • Upload date:
  • Size: 11.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.8.11 Linux/5.15.0-1022-azure

File hashes

Hashes for kedro_cache-0.1.1.tar.gz
Algorithm Hash digest
SHA256 0f7bceb236a7ce01ade6ed3c2baa93d8d5efe2488a577f9c1d73a2effaec99dd
MD5 1c92fa0c4e1961635b949d6b3e0a7dce
BLAKE2b-256 7be43db661b16975d1263b705d8ff9528c058775b852b21e04bed0590570f843


File details

Details for the file kedro_cache-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: kedro_cache-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 12.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.8.11 Linux/5.15.0-1022-azure

File hashes

Hashes for kedro_cache-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 69bb0a264f94aa838fe68e1e889dec1b7425112787f31ac7af6a3a3aea213c90
MD5 d5695edacc896f1aa5d89a9bbdf00c57
BLAKE2b-256 bf8d8c96e94762f27c9d48a3247fc2696895f16a81ce905f57681e8afc45583e

