Wrapper for Great Expectations to fit the requirements of the Gemeente Amsterdam.

Project description

Introduction

This repository contains functions that ease the use of Great Expectations. Users can input data and data quality rules, and get validation results in return.

DISCLAIMER: The package is in the MVP phase.

Getting started

Install the dq suite on your compute, for example by running the following command in your workspace:

pip install dq-suite-amsterdam

To validate your first table:

  • define dq_rule_json_path as the path to a JSON file, similar to the one shown in dq_rules_example.json in this repo
  • define table_name as the name of the table for which a data quality check is required. This name should also occur in the JSON file
  • load the table requiring a data quality check into a PySpark dataframe df (e.g. via spark.read.csv or spark.read.table)
import dq_suite

validation_settings_obj = dq_suite.ValidationSettings(spark_session=spark, 
                                                      catalog_name="dpxx_dev",
                                                      table_name=table_name,
                                                      check_name="name_of_check_goes_here")
dq_suite.run(json_path=dq_rule_json_path, df=df, validation_settings_obj=validation_settings_obj)

Looping over multiple dataframes may require redefining the json_path and validation_settings variables between iterations.
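For example, such a loop could look roughly like the sketch below (the table_dataframes dictionary and the check naming are illustrative, not part of the library; the imports and variables from the snippet above are assumed to be defined):

# Illustrative sketch: table_dataframes maps table names to already-loaded PySpark
# dataframes, and each table is assumed to have an entry in the same rules JSON file.
for table_name, df in table_dataframes.items():
    validation_settings_obj = dq_suite.ValidationSettings(
        spark_session=spark,
        catalog_name="dpxx_dev",
        table_name=table_name,
        check_name=f"check_{table_name}",
    )
    dq_suite.run(
        json_path=dq_rule_json_path,
        df=df,
        validation_settings_obj=validation_settings_obj,
    )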

See the documentation of ValidationSettings for the other parameters that can be passed upon initialisation (e.g. Slack or MS Teams webhooks for notifications, the location for storing GX, etc.).

Create the data quality schema and tables (in the respective catalog of the data team)

Before running your first dq check, create the data quality schema and tables using the notebook at repo path scripts/data_quality_tables.sql:

  • Open the notebook and connect it to a cluster.
  • Select the catalog of the data team and execute the notebook. It will create the schema and tables if they do not yet exist.
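For reference, the gist of what the notebook sets up can be sketched as follows (an approximation only; the actual schema and table definitions live in scripts/data_quality_tables.sql, and the catalog name below is the example one used earlier):

# Sketch, assuming the example catalog "dpxx_dev"; the notebook also creates the
# result tables inside the "dataquality" schema, which are omitted here.
spark.sql("CREATE SCHEMA IF NOT EXISTS dpxx_dev.dataquality")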

Export the schema from Unity Catalog to the Input Form

To output the schema from Unity Catalog, use the following commands (with the required schema name):

schema_output = dq_suite.schema_to_json_string('schema_name', spark)
print(schema_output)

Copy the string to the Input Form to quickly ingest the schema in Excel.

Validate the schema of a table

It is possible to validate the schema of an entire table against a schema definition from Amsterdam Schema in one go. This is done by adding two fields to the "dq_rules" JSON when describing the table (see https://github.com/Amsterdam/dq-suite-amsterdam/blob/main/dq_rules_example.json).

You will need:

  • validate_table_schema: the id field of the table from Amsterdam Schema
  • validate_table_schema_url: the URL of the table or dataset from Amsterdam Schema

The schema definition is converted into column-level expectations (expect_column_values_to_be_of_type) at run time.
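For illustration, the table entry in the rules JSON would then contain something along these lines (a hypothetical fragment with placeholder values; see dq_rules_example.json for the exact structure):

{
    ...,
    "validate_table_schema": "<id of the table in Amsterdam Schema>",
    "validate_table_schema_url": "<URL of the table or dataset in Amsterdam Schema>"
}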

Known exceptions

  • The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster. Using a Shared Compute Cluster will result in an error, as it does not have the permissions that Great Expectations requires.

  • Since this project requires Python >= 3.10, Databricks Runtime (DBR) >= 13.3 is needed. Older versions of DBR will result in errors upon installation of the dq-suite-amsterdam library.

  • At the time of writing (late August 2024), Great Expectations v1.0.0 has just been released and is not (yet) compatible with Python 3.12. Hence, make sure you are using the correct version of Python as the interpreter for your project.

  • The run_time is defined separately from Great Expectations in df_checker. We plan to fix this once Great Expectations has documented how to access it from the RunIdentifier object.

Contributing to this library

See the separate developers' readme.

Updates

Version 0.1: Run a DQ check for a dataframe

Version 0.2: Run a DQ check for multiple dataframes

Version 0.3: Refactored I/O

Version 0.4: Added schema validation with Amsterdam Schema per table

Version 0.5: Export schema from Unity Catalog

Version 0.6: The results are written to tables in the "dataquality" schema

Version 0.7: Refactored the solution

Version 0.8: Implemented output historization

Version 0.9: Added dataset descriptions

Version 0.10: Switched to GX 1.0

