Wrapper for Great Expectations to fit the requirements of the Gemeente Amsterdam.
About dq-suite-amsterdam
This repository aims to be an easy-to-use wrapper for the data quality library Great Expectations (GX). All that is needed to get started is an in-memory Spark dataframe and a set of data quality rules - specified in a JSON file of particular formatting.
While the results of all validations are written to a data_quality schema in Unity Catalog, users can also choose to get notified via Slack or Microsoft Teams.
DISCLAIMER: The package is in MVP phase, so watch your step.
How to contribute
Want to help out? Great! Feel free to create a pull request addressing one of the open issues. Some notes for developers are located here.
Found a bug, or need a new feature? Add a new issue describing what you need.
Getting started
Following GX, we recommend installing dq-suite-amsterdam in a virtual environment. This could be either locally via your IDE, on your compute via a notebook in Databricks, or as part of a workflow.
- Run the following command:
  pip install dq-suite-amsterdam
- Create the data_quality schema (and tables all results will be written to) by running the SQL notebook located here. All it needs is the name of the catalog - and the rights to create a schema within that catalog :)
- Get ready to validate your first table. To do so, define
  - dq_rule_json_path as a path to a JSON file, formatted in this way
  - df as a Spark dataframe containing the table that needs to be validated (e.g. via spark.read.csv or spark.read.table)
  - spark as a SparkSession object (in Databricks notebooks, this is by default called spark)
  - catalog_name as the name of your catalog ('dpxx_dev' or 'dpxx_prd')
  - table_name as the name of the table for which a data quality check is required. This name should also occur in the JSON file at dq_rule_json_path
- Finally, perform the validation by running
import dq_suite

dq_suite.validation.run(
    json_path=dq_rule_json_path,
    df=df,
    spark_session=spark,
    catalog_name=catalog_name,
    table_name=table_name,
    validation_name="my_validation_name",
)
See the documentation of dq_suite.validation.run for what other parameters can be passed.
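Since table_name should also occur in the JSON file at dq_rule_json_path, it can help to check this before starting a validation. The sketch below is not part of dq-suite-amsterdam: the helper names contains_string and table_name_in_rule_file are hypothetical, and the check deliberately makes no assumption about the layout of the rule file. It only tests whether the table name appears as a string value anywhere in the parsed JSON.

```python
import json
from pathlib import Path
from typing import Any


def contains_string(node: Any, target: str) -> bool:
    """Return True if `target` occurs as a string value anywhere in parsed JSON.

    Only values are inspected, not dictionary keys.
    """
    if isinstance(node, str):
        return node == target
    if isinstance(node, dict):
        return any(contains_string(value, target) for value in node.values())
    if isinstance(node, list):
        return any(contains_string(item, target) for item in node)
    # Numbers, booleans and null can never match a table name.
    return False


def table_name_in_rule_file(json_path: str, table_name: str) -> bool:
    """Hypothetical pre-flight check: does the rule file mention `table_name`?"""
    with Path(json_path).open(encoding="utf-8") as handle:
        return contains_string(json.load(handle), table_name)
```

Calling table_name_in_rule_file(dq_rule_json_path, table_name) before dq_suite.validation.run gives an early, readable failure when the table name is misspelled. It does not verify that the file follows the expected rule format.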
Known exceptions / issues
- The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster. Using a Shared Compute Cluster will result in an error, as it does not have the permissions that Great Expectations requires.
- Since this project requires Python >= 3.10, Databricks Runtime (DBR) >= 13.3 is needed. Older versions of DBR will result in errors upon install of the dq-suite-amsterdam library.
- At time of writing (late Aug 2024), Great Expectations v1.0.0 has just been released, and is not (yet) compatible with Python 3.12. Hence, make sure you are using the correct version of Python as interpreter for your project.
- The run_time value is defined separately from Great Expectations in validation.py. We plan on fixing this when Great Expectations has documented how to access it from the RunIdentifier object.
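The Python version constraints above (at least 3.10, and not yet 3.12 at the time of writing) can be checked before installing. This is a minimal sketch, not part of the package: the function name python_version_supported and the two constants are hypothetical, and the upper bound reflects the late Aug 2024 situation described above, so it may no longer apply to newer releases.

```python
import sys
from typing import Sequence

# Bounds taken from the notes above; the upper bound is an assumption
# that only holds while Great Expectations lacks Python 3.12 support.
MIN_SUPPORTED = (3, 10)
FIRST_UNSUPPORTED = (3, 12)


def python_version_supported(version: Sequence[int] = sys.version_info) -> bool:
    """Return True if the (major, minor) version lies in [3.10, 3.12)."""
    major_minor = tuple(version[:2])
    return MIN_SUPPORTED <= major_minor < FIRST_UNSUPPORTED


if __name__ == "__main__":
    if not python_version_supported():
        print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
              "is outside the range assumed in this sketch.")
```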
Updates
Version 0.1: Run a DQ check for a dataframe
Version 0.2: Run a DQ check for multiple dataframes
Version 0.3: Refactored I/O
Version 0.4: Added schema validation with Amsterdam Schema per table
Version 0.5: Export schema from Unity Catalog
Version 0.6: The results are written to tables in the "dataquality" schema
Version 0.7: Refactored the solution
Version 0.8: Implemented output historization
Version 0.9: Added dataset descriptions
Version 0.10: Switched to GX 1.0
Version 0.11: Stability and testability improvements