
Combine Kedro data science pipelines with Great Expectations data validations.



kedro-expectations

A tool to better integrate Kedro and Great Expectations

Introduction

Kedro Expectations is a tool designed to make it easier to use Great Expectations (GX, a data validation tool) within Kedro data science pipelines. It consists of a few CLI commands and a hook, allowing the user to create expectation suites and run validations against the Kedro DataCatalog on the fly. Check out our blog post for a deeper dive into the workings and motivation behind this project!

Features

  • ⏳ Initialization of GX without having to worry about datasources
  • 🎯 Creation of GX suites automatically, using the Data Assistant profiler
  • 🚀 Running validations within the Kedro pipeline on-the-fly
  • ⚡ Optional: Run validations in parallel so they don't block the Kedro pipeline
  • 🔔 Custom notification setup to stay up to date on validation results

Installation

You can install the plugin via PyPI:

pip install kedro-expectations

Usage

CLI Usage

As a first step, run the following command to create an expectation suite for a given Kedro dataset:

kedro expectations create-suite

A dialog guides you through the process, and the script analyzes the dataset automatically using a Data Assistant profiler. Expectation suites can be created for non-Spark dataframe objects (there is no need to worry about the file type, since Kedro Expectations uses the information from the Kedro data catalog) as well as for partitioned datasets. Within partitioned datasets, you can create either generic expectations, which apply to all partitions, or specific expectations, which apply only to the specified partition.

Besides creating the expectation suite, the command also creates the base GX folder and the datasources / assets needed to run Great Expectations, if they do not already exist.
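
For orientation, the generated folder typically follows the standard GX project layout. The tree below is a sketch based on GX defaults rather than guaranteed plugin output; exact contents depend on your GX version:

gx/
├── great_expectations.yml      # GX project configuration, incl. datasources
├── expectations/               # one folder per dataset, one JSON file per suite
├── checkpoints/
└── uncommitted/
    └── data_docs/              # rendered validation results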

Hook Usage

To enable the hook capabilities, you only need to register it in the settings.py file inside your Kedro project.

(inside src/your_project_name/settings.py)

from kedro_expectations import KedroExpectationsHooks

HOOKS = (
    KedroExpectationsHooks(on_failure="raise_fast"),
)

There you can specify the parameters you want to start your hook with. Additionally, you can specify the hook_options in your parameters.yml (conf/base/parameters.yml), which Kedro generates automatically. These options are preferred over the ones specified in the settings.py registration and will override them once Kedro starts up:

hook_options:
  on_failure: raise_later
  parallel_validation: True
  check_orphan_expectation_suites: True
  single_datasource_check: True
  expectation_tags: null
  notify_config: 
    module: kedro_expectations.notification
    class: EmailNotifier
    kwargs:
      recipients: ["john.doe@anacision.de"]
      smtp_address: smtp.testserver.com
      smtp_port: 123
      sender_login: dummylogin
      sender_password: dummypassword
      security_protocol: None
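
To illustrate the precedence described above: with the registration below, the hook starts with the settings.py value, but once Kedro loads the hook_options from parameters.yml, on_failure effectively becomes raise_later. A minimal sketch, assuming the options shown in the YAML above:

from kedro_expectations import KedroExpectationsHooks

HOOKS = (
    # These are only the startup defaults: any hook_options found in
    # conf/base/parameters.yml override them once Kedro starts up.
    KedroExpectationsHooks(on_failure="raise_fast"),
)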

Parameters

The hook allows for different parameters in order to customize it to your desired behavior.

on_failure: gives more control over the pipeline. It defines whether an expectation's validation failure breaks the pipeline run immediately (on_failure="raise_fast"), at the end (on_failure="raise_later"), or not at all (on_failure="continue"). Its default value is "continue".

parallel_validation: controls whether expectation validations run in a separate process. This is useful because some validations may take a long time while their result is not relevant for the further continuation of the pipeline, so a parallel validation process allows the Kedro pipeline to keep running. Logically, this option is NOT available for the on_failure="raise_fast" mode.

max_processes: Maximum number of processes that can run concurrently for the parallel validation mode. Defaults to the number of CPU cores in your system.

check_orphan_expectation_suites: controls whether to check (and potentially raise errors) for defined expectation suites that do not have a corresponding data source.

single_datasource_check: controls whether the same datasource is validated every time it is used in a Kedro pipeline or only at its first encounter.

expectation_tags: List of tags used to filter which expectation suites will be used for validation.
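
Putting the documented options together, a settings.py registration might look like the sketch below. It assumes the hook accepts the options above as keyword arguments, as the YAML example suggests; the tag name and process count are purely illustrative:

from kedro_expectations import KedroExpectationsHooks

HOOKS = (
    KedroExpectationsHooks(
        on_failure="raise_later",              # collect failures, raise at the end of the run
        parallel_validation=True,              # validate in a separate process
        max_processes=4,                       # cap on concurrent validation processes
        check_orphan_expectation_suites=True,  # error on suites without a datasource
        single_datasource_check=True,          # validate each datasource only once per run
        expectation_tags=["critical"],         # illustrative tag filter
    ),
)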

Notification

With the notify_config argument, you can set up automatic notifications about the validation run. It uses the GX checkpoint actions to render and send the notification. Currently, only notification via email is supported. To set it up, add the following argument to the KedroExpectationsHooks object within settings.py and modify the addresses and credentials according to your SMTP server and needs. Alternatively, you can use the parameters.yml as shown in the example above.

from kedro_expectations import KedroExpectationsHooks
from kedro_expectations.notification import EmailNotifier

HOOKS = (
    KedroExpectationsHooks(
        notify_config=EmailNotifier(
            recipients=["john_doe@nobody.io"],
            sender_login="login",
            sender_password="password",
            smtp_address="smtp.address",
            smtp_port="465",
        )
    ),
)

Example

To make things clearer, the following example will walk you through the usage of the plugin, from setup to creating an expectation suite to finally running the pipeline. It uses the Spaceflights starter project provided by Kedro.

To start using the plugin, make sure you are in your project's root folder and your pipeline is executing correctly.

Assuming the plugin is installed and the conditions above hold, the main steps are:

  • Create one or more suites depending on your needs
  • Make sure to insert the KedroExpectationsHooks in your project settings' HOOKS list
  • Execute the Kedro Pipeline as usual

Suite Creation

You can start using the plugin directly by running the command for creating an expectation suite: "kedro expectations create-suite". You will be prompted to choose between (1) suites for generic datasets and (2) suites for partitioned datasets. In this example we choose (1) to create a suite for a generic dataset.

After that, you will be asked to enter the dataset name. Enter the exact name of your dataset as defined in the catalog.yml of your Kedro project. In the next step, you can freely choose a name for the expectation suite that is about to be created. The plugin will load the dataset and display all its available columns. Now you can choose which columns to exclude from your new expectation suite by typing their names one by one into the terminal. If you want to include every column, just input '0' to proceed directly with the creation.

After this step is done, the plugin will automatically create an expectation suite for the specified dataset based on the data currently present inside it.

You should be able to find the newly generated expectation suite in your project structure under gx/expectations/<dataset_name>/<suite_name>.json.
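
For reference, the generated suite is a plain GX JSON file. A minimal sketch of its shape, assuming the standard GX suite format (the expectation types, columns, and values will differ depending on your data):

{
  "expectation_suite_name": "my_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "engine_type"}
    }
  ],
  "meta": {}
}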

Adding the Hook

Now, to be able to test it, we only need to add a few lines of code to our settings.py file, as shown above.

For more information about the functionality of Kedro hooks, please refer to the Kedro Hooks documentation.

Running the Kedro project

After adding the hook, no extra step is required. You can simply run the project with the default "kedro run" command. Whenever a dataset with an existing expectation suite is consumed by the pipeline, kedro-expectations will validate it, add the results to the data_docs, and (optionally) notify you.
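
For example (the data docs path below assumes the default GX layout; adjust it if your setup differs):

kedro run
# afterwards, open the rendered validation results, e.g.:
# gx/uncommitted/data_docs/local_site/index.html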


Contribution

Based on work by Joao Gabriel Pampanin de Abreu. Extended and updated by anacision GmbH since 2023. For details on how to contribute or to report issues, reach out to us via tech@anacision.de or to any of the people listed below.

Main Developers:
