
kedro-expectations

A tool to better integrate Kedro and Great Expectations

Introduction

Kedro Expectations is a tool designed to make the use of Great Expectations (GX, a data validation tool) within Kedro data science pipelines easier. It is composed of a couple of commands and a hook, allowing the user to create expectation suites and run validations on the Kedro DataCatalog on-the-fly. Check out our blog post for a deeper dive into the workings and motivation behind this project!

Features

  • ⏳ Initialization of GX without having to worry about datasources
  • 🎯 Creation of GX suites automatically, using the Data Assistant profiler
  • 🚀 Running validations within the Kedro pipeline on-the-fly
  • ⚡ Optional: run validations in parallel to avoid blocking the Kedro pipeline
  • 🔔 Custom notification setup to keep up-to-date about validations

Installation

You can install the plugin via PyPI:

pip install kedro-expectations

Usage

CLI Usage

As a first step, run the following command to create an expectation suite for a given Kedro dataset:

kedro expectations create-suite

The command guides you through a dialog and automatically analyzes the dataset using a Data Assistant profiler. Expectation suites can be created for non-Spark dataframe objects (there is no need to worry about the file type, since Kedro Expectations uses the information from the Kedro data catalog) and for partitioned datasets. Within partitioned datasets, you can create either generic expectations, which apply to all partitions, or specific expectations, which apply only to the specified partition.

Besides creating the expectation suite, the command also creates the base GX folder and the datasources / assets needed to run Great Expectations, if they do not already exist.
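For orientation, the generated folder might look roughly like the sketch below. This is illustrative only; the exact layout depends on your GX and plugin versions, and all names other than the expectations path described later in this README are assumptions:

```
gx/
├── great_expectations.yml      # GX project configuration
├── expectations/
│   └── <dataset_name>/
│       └── <suite_name>.json   # suites created by `kedro expectations create-suite`
└── uncommitted/
    └── data_docs/              # rendered validation results
```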

Hook Usage

To enable the hook capabilities, you only need to register it in the settings.py file inside your Kedro project.

(inside src/your_project_name/settings.py)

from kedro_expectations import KedroExpectationsHooks

HOOKS = (
    KedroExpectationsHooks(
        on_failure="raise_fast",
    ),
)

There you can specify the parameters you want to start the hook with. Alternatively, you can specify hook_options in your parameters.yml (conf/base/parameters.yml), which Kedro generates automatically. These options take precedence over the ones given in the settings.py registration and will override them once Kedro starts up:

hook_options:
  on_failure: raise_later
  parallel_validation: True
  check_orphan_expectation_suites: True
  single_datasource_check: True
  expectation_tags: null
  notify_config: 
    module: kedro_expectations.notification
    class: EmailNotifier
    kwargs:
      recipients: ["john.doe@anacision.de"]
      smtp_address: smtp.testserver.com
      smtp_port: 123
      sender_login: dummylogin
      sender_password: dummypassword
      security_protocol: None

Parameters

The hook allows for different parameters in order to customize it to your desired behavior.

on_failure: gives more control over the pipeline. It defines whether an expectation's validation failure breaks the pipeline run immediately (on_failure="raise_fast"), at the end (on_failure="raise_later"), or not at all (on_failure="continue"). Its default value is "continue".

parallel_validation: controls whether expectation validations run in a separate process. This is useful because some validations may take a long time while their result is not relevant for the continuation of the pipeline; running them in a parallel process lets the Kedro pipeline keep going. Consequently, this option is NOT available in combination with on_failure="raise_fast".

check_orphan_expectation_suites: controls whether to check (and potentially raise errors) for defined expectation suites that do not have a corresponding data source.

single_datasource_check: controls whether the same datasource is validated every time it is used in a Kedro pipeline or only at its first encounter.

expectation_tags: List of tags used to filter which expectation suites will be used for validation.

Notification

With the notify_config argument you can set up automatic notifications about the validation run. It uses the GX checkpoint actions to render and send the notification. Currently, only notification via email is supported. To set it up, add the following argument to the KedroExpectationsHooks object within settings.py and modify the addresses and credentials according to your SMTP server and needs. Alternatively, you can use the parameters.yml as shown in the example above.

from kedro_expectations import KedroExpectationsHooks
from kedro_expectations.notification import EmailNotifier

HOOKS = (
    KedroExpectationsHooks(
        notify_config=EmailNotifier(
            recipients=["john_doe@nobody.io"],
            sender_login="login",
            sender_password="password",
            smtp_address="smtp.address",
            smtp_port="465",
        )
    ),
)

Example

To make things clearer, the following example walks you through usage of the plugin, from setup to creating an expectation suite to finally running the pipeline. It was done using the Spaceflights Starter project provided by Kedro.

To start using the plugin, make sure you are in your project's root folder and your pipeline is executing correctly.

Considering you have the plugin installed and the conditions right above are true, the main steps are:

  • Create one or more suites depending on your needs
  • Make sure to insert the KedroExpectationsHooks in your project settings' HOOKS list
  • Execute the Kedro Pipeline as usual
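In shell form, the steps above boil down to two commands (illustrative only: create-suite is an interactive dialog, and the run assumes the hook is registered in settings.py):

```
# 1. Create one or more expectation suites (interactive dialog)
kedro expectations create-suite

# 2. With KedroExpectationsHooks registered in settings.py,
#    run the pipeline as usual -- validations happen on the fly
kedro run
```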

Suite Creation

You can start using the plugin directly by running the command for creating an expectation suite: "kedro expectations create-suite". You will be prompted to choose between (1) suites for generic datasets and (2) suites for partitioned datasets. In this example we choose (1) to create a suite for a generic dataset.

After that, you will be asked to enter the dataset name. Enter the exact name of your dataset as defined in the catalog.yml of your Kedro project. In the next step, you can freely choose a name for the expectation suite that is about to be created. The plugin will load the dataset and display all its available columns. You can then exclude columns from your new expectation suite by typing their names one by one into the terminal. If you want to include every column, just input '0' to proceed with the creation.

After this step is done, the plugin will automatically create an expectation suite for the specified dataset based on the data currently present inside it.

You should be able to find this newly generated expectation suite in your project structure under gx/expectations/"dataset_name"/"ex_suite_name".json.
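Opened in an editor, such a suite file is plain Great Expectations JSON. A minimal illustrative sketch is shown below; the suite name, column names, and concrete expectations here are made up for this example, though the expectation types are standard GX ones:

```json
{
  "expectation_suite_name": "my_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "company_rating"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "engines", "min_value": 0, "max_value": 4}
    }
  ],
  "meta": {}
}
```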

Adding the Hook

Now, to be able to test, we only need to add a few lines of code to our settings.py file, as shown above.

For more information about the functionality of Kedro Hooks, please refer to the Kedro Hooks documentation.

Running the Kedro project

After adding the Hook there is no more extra step required. You can simply run the project by using the default "kedro run" command. Whenever a dataset with an existing expectation suite is called by the pipeline, kedro-expectations will validate it, add the results to the data_docs and (optionally) notify you.

Contribution

Based on work from Joao Gabriel Pampanin de Abreu. Extended and updated by anacision GmbH since 2023. For details about how to contribute or to report issues, reach out to us via tech@anacision.de or to any of the people listed below.

Main Developers:
