PyDeequ - Unit Tests for Data

PyDeequ

PyDeequ is a Python API for Deequ, a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. PyDeequ is written to support usage of Deequ in Python.

Coverage

Deequ has four main components:

Metrics Computation:
- Profiles leverage Analyzers to analyze each column of a dataset.
- Analyzers serve here as a foundational module that computes metrics for data profiling and validation at scale.
Constraint Suggestion:
- Specify rules for various groups of Analyzers to run over a dataset; the run returns a collection of suggested constraints to use in a Verification Suite.
Constraint Verification:
- Perform data validation on a dataset with respect to various constraints set by you.
Metrics Repository:
- Allows for persistence and tracking of Deequ runs over time.

🎉 Announcements 🎉

NEW!!! The 1.4.0 release of Python Deequ has been published to PyPI: https://pypi.org/project/pydeequ/. This release adds support for Spark 3.5.0.
The latest version of Deequ, 2.0.7, is available with Python Deequ 1.3.0.
The 1.1.0 release of Python Deequ has been published to PyPI: https://pypi.org/project/pydeequ/. This release brings many recent upgrades, including support up to Spark 3.3.0! Feedback is welcome through GitHub issues.
With PyDeequ v0.1.8+, we now officially support Spark 3! Just make sure you set the environment variable SPARK_VERSION to specify your Spark version.
We've released a blog post on integrating PyDeequ with AWS, leveraging services such as AWS Glue, Athena, and SageMaker! Check it out: Monitor data quality in your data lake using PyDeequ and AWS Glue.
Check out the PyDeequ Release Announcement blog post, with a tutorial that walks through the Amazon Reviews dataset!
Join the PyDeequ community on PyDeequ Slack to chat with the devs!

Quickstart

The following sections cover basic usage. For more in-depth examples, see the executable Jupyter notebooks for each module in the tutorials/ directory. For the supported interfaces, view the documentation.

Installation

You can install PyDeequ via pip.

pip install pydeequ
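
As the announcements note, PyDeequ expects the SPARK_VERSION environment variable to name your Spark version. A minimal sketch of setting it from Python follows. The value "3.3" is only an illustrative assumption, and setting the variable before importing pydeequ is the cautious ordering rather than something this page states.

```python
import os

# Illustrative value only: replace "3.3" with the Spark version you actually run.
# setdefault leaves an already exported SPARK_VERSION untouched.
os.environ.setdefault("SPARK_VERSION", "3.3")

# Import pydeequ only after the variable is set, for example:
# import pydeequ

print("SPARK_VERSION =", os.environ["SPARK_VERSION"])
```

Exporting the variable in your shell before starting Python achieves the same thing.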

Set up a PySpark session

from pyspark.sql import SparkSession, Row
import pydeequ

spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
            Row(a="foo", b=1, c=5),
            Row(a="bar", b=2, c=6),
            Row(a="baz", b=3, c=None)]).toDF()

Analyzers

from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
                    .onData(df) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("b")) \
                    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
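
For intuition about what the two analyzers above report, the same numbers can be worked out by hand on the three sample rows. This is a plain-Python sketch, not Deequ's implementation. It assumes only that Size is the row count and that Completeness is the fraction of non-null values in a column. The `rows` list and the `completeness` helper are hypothetical names introduced for illustration.

```python
# Plain-Python illustration of the Size and Completeness metrics.
# `rows` mirrors the sample DataFrame; `completeness` is a hypothetical helper.
rows = [
    {"a": "foo", "b": 1, "c": 5},
    {"a": "bar", "b": 2, "c": 6},
    {"a": "baz", "b": 3, "c": None},
]


def completeness(rows, column):
    """Fraction of rows whose value in `column` is not None."""
    non_null = sum(1 for row in rows if row[column] is not None)
    return non_null / len(rows)


size = len(rows)
print("Size:", size)
print("Completeness(b):", completeness(rows, "b"))
print("Completeness(c):", completeness(rows, "c"))
```

Column b has no missing values, so its completeness is 1.0, while column c has one null in three rows.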

Profile

from pydeequ.profiles import *

result = ColumnProfilerRunner(spark) \
    .onData(df) \
    .run()

for col, profile in result.profiles.items():
    print(profile)

Constraint Suggestions

from pydeequ.suggestions import *

suggestionResult = ConstraintSuggestionRunner(spark) \
             .onData(df) \
             .addConstraintRule(DEFAULT()) \
             .run()

# Constraint Suggestions in JSON format
print(suggestionResult)

Constraint Verification

from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x >= 3) \
        .hasMin("b", lambda x: x == 0) \
        .isComplete("c")  \
        .isUnique("a")  \
        .isContainedIn("a", ["foo", "bar", "baz"]) \
        .isNonNegative("b")) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
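
Not every constraint in the check above holds for the sample data, which makes the result table more interesting to read. The sketch below re-evaluates each constraint by hand in plain Python. It is an illustration of what the constraints mean on these three rows, not Deequ's evaluation logic, and the `outcomes` dictionary and its labels are hypothetical.

```python
# Hand evaluation of the six constraints on the sample rows (illustration only).
rows = [
    {"a": "foo", "b": 1, "c": 5},
    {"a": "bar", "b": 2, "c": 6},
    {"a": "baz", "b": 3, "c": None},
]
a_values = [row["a"] for row in rows]
b_values = [row["b"] for row in rows]

outcomes = {
    "hasSize(>= 3)": len(rows) >= 3,
    'hasMin("b", == 0)': min(b_values) == 0,
    'isComplete("c")': all(row["c"] is not None for row in rows),
    'isUnique("a")': len(set(a_values)) == len(a_values),
    'isContainedIn("a", ...)': all(v in {"foo", "bar", "baz"} for v in a_values),
    'isNonNegative("b")': all(v >= 0 for v in b_values),
}

for name, passed in outcomes.items():
    print(f"{name}: {'holds' if passed else 'does not hold'}")
```

Two constraints do not hold here: the minimum of b is 1 rather than 0, and column c contains a null.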

Repository

Save to a Metrics Repository by adding the useRepository() and saveOrAppendResult() calls to your AnalysisRunner.

from pydeequ.repository import *
from pydeequ.analyzers import *

metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
repository = FileSystemMetricsRepository(spark, metrics_file)
key_tags = {'tag': 'pydeequ hello world'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(ApproxCountDistinct('b')) \
    .useRepository(repository) \
    .saveOrAppendResult(resultKey) \
    .run()

To load the results of previous runs, use the repository object:

result_metrep_df = repository.load() \
    .before(ResultKey.current_milli_time()) \
    .forAnalyzers([ApproxCountDistinct('b')]) \
    .getSuccessMetricsAsDataFrame()
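
The idea behind the result key is that each run is stored under a millisecond timestamp plus a set of tags, so later queries can select runs by time and by tag. The toy model below sketches that idea in plain Python. Everything in it is an assumption made for illustration: `stored_runs`, `load_before`, and the metric values are hypothetical, and whether the real before() bound is inclusive is not stated here, so the sketch only uses timestamps strictly earlier than the cutoff.

```python
import time


def current_milli_time():
    """Current time in whole milliseconds since the epoch."""
    return int(round(time.time() * 1000))


# Hypothetical in-memory stand-in for stored runs; metric values are illustrative.
now = current_milli_time()
stored_runs = [
    {"timestamp": now - 2000, "tags": {"tag": "pydeequ hello world"},
     "metrics": {"ApproxCountDistinct(b)": 3.0}},
    {"timestamp": now - 1000, "tags": {"tag": "another run"},
     "metrics": {"ApproxCountDistinct(b)": 3.0}},
]


def load_before(runs, cutoff_ms, tag_values=None):
    """Return runs recorded earlier than `cutoff_ms`, optionally matching all given tags."""
    selected = []
    for run in runs:
        if run["timestamp"] >= cutoff_ms:
            continue
        if tag_values and any(run["tags"].get(k) != v for k, v in tag_values.items()):
            continue
        selected.append(run)
    return selected


all_earlier = load_before(stored_runs, now)
tagged = load_before(stored_runs, now, {"tag": "pydeequ hello world"})
print(len(all_earlier), len(tagged))
```

Using descriptive tags when saving makes it easier to pick out a specific dataset or pipeline stage later.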

Wrapping up

After you've run your jobs with PyDeequ, be sure to shut down your Spark session to prevent any hanging processes.

spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()

Contributing

Please refer to the contributing doc for how to contribute to PyDeequ.

License

This library is licensed under the Apache 2.0 License.

Contributing Developer Setup

Setup SDKMAN
Setup Java
Setup Apache Spark
Install Poetry
Run tests locally

Setup SDKMAN

SDKMAN is a tool for managing parallel versions of multiple Software Development Kits on any Unix-based system. It provides a convenient command-line interface for installing, switching, removing, and listing candidates. SDKMAN! installs smoothly on Mac OSX, Linux, WSL, Cygwin, and similar environments, and it supports the Bash and ZSH shells. See the documentation on the SDKMAN! website.

Open your favourite terminal and enter the following:

$ curl -s https://get.sdkman.io | bash

If the environment needs tweaking for SDKMAN to be installed, the installer will prompt you accordingly and ask you to restart.

Next, open a new terminal or enter:

$ source "$HOME/.sdkman/bin/sdkman-init.sh"

Lastly, run the following code snippet to ensure that installation succeeded:

$ sdk version

Setup Java

Open your favourite terminal and enter the following:

List the AdoptOpenJDK OpenJDK versions:
$ sdk list java

To install Java 11:
$ sdk install java 11.0.10.hs-adpt

To install Java 8:
$ sdk install java 8.0.292.hs-adpt

Setup Apache Spark

Open your favourite terminal and enter the following:

List the Apache Spark versions:
$ sdk list spark

To install Spark 3:
$ sdk install spark 3.0.2

Poetry

Poetry Commands

poetry install

poetry update

# --tree: List the dependencies as a tree.
# --latest (-l): Show the latest version.
# --outdated (-o): Show the latest version but only for packages that are outdated.
poetry show -o

Running Tests Locally

Take a look at tests in tests/dataquality and tests/jobs

$ poetry run pytest

Running Tests Locally (Docker)

If you have issues installing the dependencies listed above, another way to run the tests and verify your changes is through Docker. There is a Dockerfile that will install the required dependencies and run the tests in a container.

docker build . -t spark-3.3-docker-test
docker run spark-3.3-docker-test

Release history

1.5.0 (this version): Apr 1, 2025
1.4.0: Jul 2, 2024
1.3.0: Apr 26, 2024
1.2.0: Dec 13, 2023
1.1.1: Sep 21, 2023
1.1.0: Jul 6, 2023
1.1.0rc0 (pre-release): Jun 19, 2023
1.0.1: Jul 29, 2021
1.0.0: Jul 22, 2021
0.1.8: Jul 19, 2021
0.1.7: May 20, 2021
0.1.6: May 11, 2021
0.1.5: Nov 13, 2020
0.1.2: Nov 9, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pydeequ-1.5.0.tar.gz (35.5 kB view details)

Uploaded Apr 1, 2025 Source

Built Distribution

pydeequ-1.5.0-py3-none-any.whl (37.7 kB view details)

Uploaded Apr 1, 2025 Python 3

File details

Details for the file pydeequ-1.5.0.tar.gz.

File metadata

Download URL: pydeequ-1.5.0.tar.gz
Upload date: Apr 1, 2025
Size: 35.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

Hashes for pydeequ-1.5.0.tar.gz
Algorithm	Hash digest
SHA256	`81f943e93723bba3258cdfde4641de5cca28e228775c87ad84ee2f74fcdacce8`
MD5	`6692fa7beefa937d4a983485da3e4e40`
BLAKE2b-256	`034a6388c746fd93dce87a473fe817190498801beb2556a55bce4f020a1e58be`

File details

Details for the file pydeequ-1.5.0-py3-none-any.whl.

File metadata

Download URL: pydeequ-1.5.0-py3-none-any.whl
Upload date: Apr 1, 2025
Size: 37.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

Hashes for pydeequ-1.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1735256c6506ca9ecac9406fce3a0a0ed6bce4daef2fca1abe901d96a4ae3edd`
MD5	`0b464200c1286200af898f82d9894507`
BLAKE2b-256	`e566ab5c84ec4ab22923addc5d1126231af1a05e767ac29bd13cb4c6a7eb2b1d`
