Python library designed to validate Pandas and PySpark DataFrames using customizable, reusable expectations

🎯 DataFrameExpectations

Python 3.10+ | License: Apache 2.0

DataFrameExpectations is a Python library designed to validate Pandas and PySpark DataFrames using customizable, reusable expectations. It simplifies testing in data pipelines and end-to-end workflows by providing a standardized framework for DataFrame validation.

Rather than each team or pipeline maintaining its own ad hoc DataFrame checks, the library offers a single shared set of expectations. Any contribution made here, such as a new expectation, is therefore available to every user of the library.

📚 View Documentation | 📋 List of Expectations

Installation:

pip install dataframe-expectations

Development setup

To set up the development environment:

# 1. Clone the repository
git clone https://github.com/getyourguide/dataframe-expectations.git
cd dataframe-expectations

# 2. Install the uv package manager
pip install uv

# 3. Install development dependencies (this will automatically create a virtual environment)
uv sync --group dev

# 4. (Optional) To explicitly activate the virtual environment:
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# 5. Run tests (this will run the tests in the virtual environment)
uv run pytest tests/ --cov=dataframe_expectations

Using the library

Pandas example:

from dataframe_expectations.expectations_suite import DataFrameExpectationsSuite

suite = (
    DataFrameExpectationsSuite()
    .expect_value_greater_than("age", 18)
    .expect_value_less_than("age", 40)
)

# Create a Pandas DataFrame
import pandas as pd
test_pandas_df = pd.DataFrame({"age": [20, 15, 30], "name": ["Alice", "Bob", "Charlie"]})

suite.run(test_pandas_df)

PySpark example:

from dataframe_expectations.expectations_suite import DataFrameExpectationsSuite

suite = (
    DataFrameExpectationsSuite()
    .expect_value_greater_than("age", 18)
    .expect_value_less_than("age", 40)
)

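# Create a PySpark DataFrame (reuses the active SparkSession, or starts one if none exists)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
test_spark_df = spark.createDataFrame(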
    [
        {"name": "Alice", "age": 20},
        {"name": "Bob", "age": 15},
        {"name": "Charlie", "age": 30},
    ]
)

suite.run(test_spark_df)

Output:

========================== Running expectations suite ==========================
ExpectationValueGreaterThan ('age' greater than 18) ... FAIL
ExpectationValueLessThan ('age' less than 40) ... OK
============================ 1 success, 1 failures =============================

ExpectationSuiteFailure: (1/2) expectations failed.

================================================================================
List of violations:
--------------------------------------------------------------------------------
[Failed 1/1] ExpectationValueGreaterThan ('age' greater than 18): Found 1 row(s) where 'age' is not greater than 18.
Some examples of violations:
+-----+------+
| age | name |
+-----+------+
| 15  | Bob  |
+-----+------+
================================================================================
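
To make the report above concrete, here is a library-independent sketch in plain Python of the same two checks on the same three rows. The helper `find_violations` and the row dictionaries are hypothetical illustrations only; they are not part of DataFrameExpectations, whose internals may work differently.

```python
from typing import Any, Callable

Row = dict[str, Any]

# Same three rows as in the Pandas and PySpark examples.
rows: list[Row] = [
    {"name": "Alice", "age": 20},
    {"name": "Bob", "age": 15},
    {"name": "Charlie", "age": 30},
]


def find_violations(rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
    """Return the rows that do NOT satisfy the predicate (hypothetical helper)."""
    return [row for row in rows if not predicate(row)]


# 'age' greater than 18: collect the rows that break this rule.
greater_than_18 = find_violations(rows, lambda row: row["age"] > 18)

# 'age' less than 40: collect the rows that break this rule.
less_than_40 = find_violations(rows, lambda row: row["age"] < 40)

print(greater_than_18)
print(less_than_40)
```

An expectation fails when its list of violating rows is non-empty, which matches the single failing row (Bob, age 15) shown in the report.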

How to contribute?

Contributions are welcome! You can enhance the library by adding new expectations, refining existing ones, or improving the testing framework.

Versioning

This project follows Semantic Versioning (SemVer) and uses Release Please for automated version management.

Versions are automatically determined based on Conventional Commits:

  • feat: - New feature → MINOR version bump (0.1.0 → 0.2.0)
  • fix: - Bug fix → PATCH version bump (0.1.0 → 0.1.1)
  • feat!: or BREAKING CHANGE: - Breaking change → MAJOR version bump (0.1.0 → 1.0.0)
  • chore:, docs:, style:, refactor:, test:, ci: - No version bump

Example commits:

git commit -m "feat: add new expectation for null values"
git commit -m "fix: correct validation logic in expect_value_greater_than"
git commit -m "feat!: remove deprecated API methods"
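
The mapping from commit messages to version bumps listed above can be sketched in a few lines of Python. This is an illustration only, not how Release Please is implemented; its actual behaviour depends on its configuration. The function names `bump_for_commit` and `apply_bump` are hypothetical, and the optional `(scope)` in the pattern is an assumption taken from the Conventional Commits format rather than from this project.

```python
import re

# Matches "type", an optional "(scope)", an optional "!", then ": description".
HEADER_RE = re.compile(r"^(?P<type>\w+)(?:\([^)]*\))?(?P<breaking>!)?: .+")


def bump_for_commit(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for one commit message."""
    lines = message.splitlines()
    header = lines[0] if lines else ""
    match = HEADER_RE.match(header)
    if match is None:
        return "none"  # not a conventional commit
    if match["breaking"] or "BREAKING CHANGE:" in message:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return "none"  # chore, docs, style, refactor, test, ci, ...


def apply_bump(version: str, bump: str) -> str:
    """Apply a bump kind to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version


for msg in (
    "feat: add new expectation for null values",
    "fix: correct validation logic in expect_value_greater_than",
    "feat!: remove deprecated API methods",
):
    print(msg, "->", apply_bump("0.1.0", bump_for_commit(msg)))
```

Starting from 0.1.0, the three example commits map to 0.2.0, 0.1.1, and 1.0.0 respectively, mirroring the bullet list above.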

When changes are pushed to the main branch, Release Please automatically:

  1. Creates or updates a Release PR with version bump and changelog
  2. When merged, creates a GitHub Release and publishes to PyPI

No manual version updates are needed; just use Conventional Commit messages.

Security

For security issues, please contact security@getyourguide.com.

Legal

dataframe-expectations is licensed under the Apache License, Version 2.0. See LICENSE for the full text.

