Wrapper for Great Expectations to fit the requirements of the Gemeente Amsterdam.

Project description

About dq-suite-amsterdam

This repository aims to be an easy-to-use wrapper for the data quality library Great Expectations (GX). All that is needed to get started is an in-memory Spark dataframe and a set of data quality rules, specified in a JSON file with a particular format.

By default, all validation results are written to the Unity Catalog of DMT (dpd1_prd). The data team user or service principal (SPN) that runs the jobs/notebooks will be given access to the DMT catalog to write results to the data_quality schema. Based on these results, DQ reports can be viewed in Power BI reports hosted by DMT. Alternatively, one can disable writing to the DMT catalog and instead write to a data_quality schema in another catalog, which has to be created once per catalog via this notebook. Additionally, users can choose to get notified via Slack or Microsoft Teams.

DISCLAIMER: The package is in MVP phase, so watch your step.

How to contribute

Want to help out? Great! Feel free to create a pull request addressing one of the open issues. Some notes for developers are located here.

Found a bug, or need a new feature? Add a new issue describing what you need.

Getting started

Following GX, we recommend installing dq-suite-amsterdam in a virtual environment. This can be done locally via your IDE, on your compute via a Databricks notebook, or as part of a workflow.

  1. Run the following command:
pip install dq-suite-amsterdam
  2. Create the data_quality schema (and the tables all results will be written to) by running the SQL notebook located here. All it needs is the name of the catalog - and the rights to create a schema within that catalog :)

  3. Get ready to validate your first table. To do so, define

  • dq_rule_json_path as the path to a JSON file, formatted in this way. A detailed description of how to define the JSON can be found here
  • df as a Spark dataframe containing the table that needs to be validated (e.g. via spark.read.csv or spark.read.table)
  • spark as a SparkSession object (in Databricks notebooks, this is by default called spark)
  • catalog_name as the name of catalog where output of dq suite will be stored ('dpd1_dev' or 'dpd1_prd')
  • table_name as the name of the table for which a data quality check is required. This name should also occur in the JSON file at dq_rule_json_path
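Putting the definitions above together, a minimal notebook setup might look like the sketch below. All names and paths are illustrative, not prescribed by dq-suite:

```python
# Hypothetical setup for a Databricks notebook; names and paths are illustrative.
dq_rule_json_path = "/Workspace/dq/my_dataset_rules.json"  # rules JSON for this table
catalog_name = "dpd1_dev"   # catalog where the dq-suite output is stored
table_name = "my_table"     # must also occur in the rules JSON

# In a Databricks notebook, `spark` is already defined as a SparkSession.
# The dataframe to validate could then be read with, for example:
# df = spark.read.table(f"{catalog_name}.my_schema.{table_name}")
```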
  4. Finally, perform the validation by running (note: the library is imported as dq_suite, not as dq_suite_amsterdam!)
from dq_suite.validation import run_validation

run_validation(
    json_path=dq_rule_json_path,
    df=df, 
    spark_session=spark,
    catalog_name=catalog_name,
    table_name=table_name,
)

Note: run_validation now returns a tuple (validation_result, highest_severity_level):

validation_result → Boolean flag indicating overall success (True if all checks pass, False otherwise).

highest_severity_level → String indicating the highest severity among failed checks: one of 'fatal', 'error', or 'warning', or 'ok' when no checks failed.
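The returned tuple can be used, for example, to decide whether a workflow should fail. The threshold logic below is a sketch, not part of the dq-suite API:

```python
def should_fail_pipeline(validation_result, highest_severity_level,
                         fail_on=("fatal", "error")):
    """Return True if the run should be aborted: at least one check failed
    and the highest severity is in `fail_on` (sketch, not dq-suite API)."""
    return (not validation_result) and highest_severity_level in fail_on

# validation_result, highest_severity_level = run_validation(...)
should_fail_pipeline(False, "fatal")    # a fatal failure aborts the run
should_fail_pipeline(False, "warning")  # warnings alone do not
should_fail_pipeline(True, "ok")        # all checks passed
```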

See the documentation of dq_suite.validation.run_validation for what other parameters can be passed.

Geo Validation

Geo validation enables geometric checks using Databricks ST geospatial functions. It is fully integrated into the existing validation flow, allowing generic and geo rules to be applied together on the same table.

Geo validation can be used to validate, among others:

  • Whether geometry values are present and non-empty
  • Whether geometries are structurally valid (e.g. no invalid polygons)
  • Whether geometry values are of a specific geometry type (e.g. POINT, POLYGON)
  1. Your Databricks cluster must run Databricks Runtime 17.1 or above, as ST geospatial functions are only fully supported from this version onwards. For more details, see https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-st-geospatial-functions

  2. When defining rules in Getting started → Step 3, you can enable geo validation by adding the parameter "rule_type": "geo" to your JSON. An example can be found here

  3. Results of geo validation will be written into the same data_quality schema as generic validation. If a table includes both generic and geo rules, all results will be combined in the output tables.
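To illustrate the three categories of geo checks in isolation, here is a plain-Python stand-in operating on WKT strings. The actual geo validation uses the Databricks ST geospatial functions (e.g. st_isvalid, st_geometrytype), not this code:

```python
# Plain-Python illustration of the geo check categories; real geo validation
# relies on Databricks ST functions and runs inside the dq-suite flow.

def wkt_is_present(value):
    """Presence check: the geometry value is neither null nor empty."""
    return value is not None and value.strip() != ""

def wkt_geometry_type(value):
    """Type check: the WKT type is the leading keyword, e.g. 'POINT (1 2)' -> 'POINT'."""
    return value.strip().split("(")[0].strip().upper()

rows = ["POINT (4.89 52.37)", "POLYGON ((0 0, 1 0, 1 1, 0 0))", "", None]
present = [v for v in rows if wkt_is_present(v)]          # drops "" and None
points = [v for v in present if wkt_geometry_type(v) == "POINT"]
```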

Profiling

Profiling is the process of analyzing a dataset to understand its structure, patterns, and data quality characteristics (such as completeness, uniqueness, or value distributions).

The profiling functionality in dq_suite generates profiling results and automatically produces a rules.json file, which can be used as input for the validation, making it easier to gain insights and validate data quality.

  1. Run the following command:
pip install dq-suite-amsterdam
  2. Create the data_quality schema (and the profiling tables that store profiling results) by running the SQL notebook located here. All it needs is the name of the catalog and the rights to create a schema within that catalog. The catalog allows flexible usage across environments (e.g. dev, test, prod). This step will create the required profiling tables, including:
  • profilingtabel (table-level profiling results)
  • profilingattribuut (attribute-level profiling results)
  3. Get ready to profile your first table. To do so, define
  • df as a pandas dataframe containing the table that needs to be profiled (e.g. via pd.read_csv)
  • generate_rules as a Boolean to generate the dq_rule_json. Set to False if you only want profiling without rule generation
  • spark as a SparkSession object (in Databricks notebooks, this is by default called spark)
  • dq_rule_json_path as the path to a JSON file, which will be formatted in this way after running the profiling function
  • dataset_name as the name of the dataset the table belongs to. This name will be placed in the JSON file at dq_rule_json_path
  • table_name as the name of the table for which a data quality check is required. This name will be placed in the JSON file at dq_rule_json_path
  • catalog_name as the name of your catalog ('dpxx_dev' or 'dpxx_prd')
  4. Finally, perform the profiling by running
from dq_suite.profile.profile import profile_and_create_rules

profile_and_create_rules(
    df=df,
    dataset_name=dataset_name,
    table_name=table_name,
    catalog_name=catalog_name,
    spark_session=spark,
    generate_rules=True,
    rule_path=dq_rule_json_path
)

Result of profiling

Profiling results are created as an HTML view. The rules.json file is created at the specified path (if generate_rules=True). This file can be edited to refine the rules according to your data validation needs, and can then be used as input for dq_suite validation. Profiling tables are created at the table level and include the attributes of each table. Geo rules, as described in the Geo Validation section, are automatically generated for geometry columns.
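Refining the generated rules file can be done with plain JSON tooling. The rule structure below is purely illustrative (the actual schema is defined by dq-suite, see the linked rule documentation), as is the rule entry being appended:

```python
import json

# Illustrative only: the real structure of the generated rules file is
# defined by dq-suite; see the linked documentation for the actual schema.
generated = {"tables": [{"table_name": "my_table", "rules": []}]}

rules = json.loads(json.dumps(generated))   # stand-in for reading the file from disk
rules["tables"][0]["rules"].append(
    {"rule_name": "ExpectColumnToExist",    # hypothetical rule entry
     "parameters": [{"column": "id"}]}
)
edited_json = json.dumps(rules, indent=2)   # write this back to dq_rule_json_path
```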

For further documentation, see:

Known exceptions / issues

  • The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster. Using a Shared Compute Cluster will result in an error, as it does not have the permissions that Great Expectations requires.

  • Since this project requires Python >= 3.10, Databricks Runtime (DBR) >= 13.3 is needed. Older versions of DBR will result in errors upon installing the dq-suite-amsterdam library.

  • At time of writing (late Aug 2024), Great Expectations v1.0.0 has just been released, and is not (yet) compatible with Python 3.12. Hence, make sure you are using the correct version of Python as interpreter for your project.

  • The run_time value is defined separately from Great Expectations in validation.py. We plan on fixing this when Great Expectations has documented how to access it from the RunIdentifier object.

  • Profiling rules / rule condition logic: current profiling-based rule conditions are placeholders and should be defined and validated by the data teams to ensure they are generic and reusable.

  • When using Great Expectations with ResultFormat.COMPLETE, the unexpected_list is limited to a maximum of 200 values per expectation. This is a limitation imposed by Great Expectations.

Project details


Release history

Download files


Source Distribution

dq_suite_amsterdam-0.14.1.tar.gz (52.6 kB)

Uploaded Source

Built Distribution


dq_suite_amsterdam-0.14.1-py3-none-any.whl (41.7 kB)

Uploaded Python 3

File details

Details for the file dq_suite_amsterdam-0.14.1.tar.gz.

File metadata

  • Download URL: dq_suite_amsterdam-0.14.1.tar.gz
  • Upload date:
  • Size: 52.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dq_suite_amsterdam-0.14.1.tar.gz
Algorithm Hash digest
SHA256 c828014e74b6bb7c79cc27f38795d358fbeebde7229615a3462a7aa0879b256b
MD5 2adca2dc822a7d27802b0aff447eadc5
BLAKE2b-256 2ae2c581cf4f98c9dca63957e548dd44b2406f95c62995a8f0b65348b52e9620


Provenance

The following attestation bundles were made for dq_suite_amsterdam-0.14.1.tar.gz:

Publisher: publish-to-pypi.yml on Amsterdam/dq-suite-amsterdam


File details

Details for the file dq_suite_amsterdam-0.14.1-py3-none-any.whl.

File metadata

File hashes

Hashes for dq_suite_amsterdam-0.14.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d6ba7425e5603ee878f42d46a1d67fb8f98802bce0f8a85d9fad160f289e987f
MD5 a627a90244cdfc3c4d828e7cdca96f6e
BLAKE2b-256 3650199b8f700f6bc5fd83f4bf5312f204b38698f8e73a46036c8bf38a889c19


Provenance

The following attestation bundles were made for dq_suite_amsterdam-0.14.1-py3-none-any.whl:

Publisher: publish-to-pypi.yml on Amsterdam/dq-suite-amsterdam

