Combine Kedro data science pipelines with Great Expectations data validations.

kedro-expectations

A tool to better integrate Kedro and Great Expectations

Introduction

Kedro Expectations is a tool designed to make it easier to use Great Expectations (GX, a data validation tool) within Kedro data science pipelines. It consists of a few CLI commands and a hook that let the user create expectation suites and run validations on the Kedro DataCatalog on the fly.

Features

  • ⏳ Initialization of GX without having to worry about datasources
  • 🎯 Creation of GX suites automatically, using the Data Assistant profiler
  • 🚀 Running validations within the Kedro pipeline on-the-fly

Installation

You can install the plugin via PyPI:

pip install kedro-expectations

Usage

CLI Usage

The first step to use the plugin is running the init command. This command creates the base GX folder and the single datasource the plugin needs:

kedro expectations init
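For reference, a standard GX project scaffold typically looks like the sketch below (the exact contents depend on your GX version); the expectations folder referenced later in this README lives inside it:

great_expectations/
├── great_expectations.yml    # GX project configuration
├── expectations/             # expectation suites created by the plugin
├── checkpoints/
├── plugins/
└── uncommitted/              # local-only files, e.g. rendered data docs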

After the init command, the plugin is ready to create expectation suites automatically using a Data Assistant profiler. It is possible to create expectation suites for non-Spark dataframe objects (there is no need to worry about the file type, since Kedro Expectations uses the information from the Kedro data catalog) and for partitioned datasets.

Within partitioned datasets, it is possible to create either generic expectations, which apply to every partition, or specific expectations, which apply only to the specified partition.

Run the following command to create an expectation suite for a given Kedro dataset automatically:

kedro expectations create-suite

Hook Usage

In order to enable the hook capabilities, you only need to register it in the settings.py file inside your Kedro project

(inside src/your_project_name/settings.py)

from kedro_expectations import KedroExpectationsHooks

HOOKS = (
    KedroExpectationsHooks(
        on_failure="raise_fast",
    ),
)

Fail Fast

on_failure is a parameter added to give more control over the pipeline run. It defines whether a failed expectation validation breaks the pipeline immediately (on_failure="raise_fast"), at the end of the run (on_failure="raise_later"), or not at all (on_failure="continue"). Its default value is "continue".
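For example, to collect all validation failures and only raise once the pipeline has finished, the hook could be registered like this (a minimal sketch using the values described above):

from kedro_expectations import KedroExpectationsHooks

HOOKS = (
    KedroExpectationsHooks(
        on_failure="raise_later",  # report all failed validations at the end of the run
    ),
)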

Notification

With the notify_config argument, you can set up automatic notifications about the validation run. It uses the GX checkpoint actions to render and send the notification. Currently, only notification via email is supported. To set it up, add the following argument to the KedroExpectationsHooks object within settings.py and adjust the addresses and credentials to match your SMTP server and needs.

from kedro_expectations import KedroExpectationsHooks
from kedro_expectations.notification import EmailNotifier

HOOKS = (
    KedroExpectationsHooks(
        notify_config=EmailNotifier(
            recipients=["john_doe@nobody.io"],
            sender_login="login",
            sender_password="password",
            smtp_address="smtp.address",
            smtp_port="465",
        ),
    ),
)
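Rather than hard-coding credentials in settings.py, you could, for example, read them from environment variables (a sketch; the variable names below are arbitrary and not part of the plugin):

import os

from kedro_expectations import KedroExpectationsHooks
from kedro_expectations.notification import EmailNotifier

HOOKS = (
    KedroExpectationsHooks(
        notify_config=EmailNotifier(
            recipients=["john_doe@nobody.io"],
            sender_login=os.environ["SMTP_LOGIN"],        # hypothetical variable name
            sender_password=os.environ["SMTP_PASSWORD"],  # hypothetical variable name
            smtp_address="smtp.address",
            smtp_port="465",
        ),
    ),
)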

Example

To make things clearer, the following example walks through the most complex usage of the plugin: creating a specific expectation for a partitioned dataset. It was done using the Partitioned Iris Starter.

To start using the plugin, make sure you are in your project's root folder and your pipeline is executing correctly.

Assuming the plugin is installed and the conditions above are met, the main steps are:

  • Run the init command
  • Create one or more suites depending on your needs
  • Make sure to enable the KedroExpectationsHooks in your project's settings
  • Execute the Kedro Pipeline normally

Init and Suite Creation

The first step is to run the "kedro expectations init" command. Below we can see the expected result:

As soon as the GX folder is created, we can run the second command: "kedro expectations create-suite". You will be prompted to choose between (1) suites for generic datasets and (2) suites for partitioned datasets:

Then we can choose between a generic or specific expectation. In this example, we will press (2) to create a specific one:

Now the plugin will ask three questions. The first two must be answered based on your project, and the last one is a name of your choice for the expectation suite.

Our partitioned dataset structure inside the project:

Questions asked by the CLI:

The last step is to decide whether we want to exclude some columns from the expectation suite. Once you have selected your desired columns, type 0:

Then your dataset will be validated automatically, and the resulting expectation suite can be found at great_expectations/expectations/"your_dataset_name"/"your_expectation_name".

Adding the Hook

Now, to be able to test it, we only need to add a few lines of code to our settings.py file, as shown above.

For more information about the functionality of Kedro Hooks, please refer to the Kedro Hooks documentation.

Running the Kedro project

After adding the hook, there is no extra step: you can simply run the project with the default "kedro run" command. Whenever a dataset with an existing expectation suite is used by the pipeline, Kedro Expectations will validate it, add the results to the data_docs, and (optionally) notify you.
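After a run, the rendered validation results can usually be inspected via GX's local data docs site. Assuming the default scaffold created by the init command, a minimal sketch to open it from the project root:

import webbrowser
from pathlib import Path

# Default GX data docs location; adjust if great_expectations.yml configures a different site.
docs_index = Path("great_expectations/uncommitted/data_docs/local_site/index.html")
webbrowser.open(docs_index.resolve().as_uri())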

Contribution

Based on work from Joao Gabriel Pampanin de Abreu. Extended and updated by anacision GmbH since 2023. Main Developer:

