Python library designed to validate Pandas and PySpark DataFrames using customizable, reusable expectations
🎯 DataFrameExpectations
DataFrameExpectations is a Python library designed to validate Pandas and PySpark DataFrames using customizable, reusable expectations. It simplifies testing in data pipelines and end-to-end workflows by providing a standardized framework for DataFrame validation.
Rather than each team maintaining its own ad hoc DataFrame checks, the library offers one shared set of expectations. Any contribution made here, such as a new expectation, is therefore available to every user of the library.
📚 View Documentation | 📋 List of Expectations
Installation:
pip install dataframe-expectations
Development setup
To set up the development environment:
# 1. Clone the repository
git clone https://github.com/getyourguide/dataframe-expectations.git
cd dataframe-expectations
# 2. Install the uv package manager
pip install uv
# 3. Install development dependencies (this will automatically create a virtual environment)
uv sync --group dev
# 4. (Optional) To explicitly activate the virtual environment:
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# 5. Run tests (this will run the tests in the virtual environment)
uv run pytest tests/ --cov=dataframe_expectations
Using the library
Pandas example:
import pandas as pd

from dataframe_expectations.expectations_suite import DataFrameExpectationsSuite

# Build a suite of chained expectations on the "age" column
suite = (
    DataFrameExpectationsSuite()
    .expect_value_greater_than("age", 18)
    .expect_value_less_than("age", 40)
)

# Create a Pandas DataFrame
test_pandas_df = pd.DataFrame({"age": [20, 15, 30], "name": ["Alice", "Bob", "Charlie"]})

# Run every expectation in the suite against the DataFrame
suite.run(test_pandas_df)
PySpark example:
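from pyspark.sql import SparkSession

from dataframe_expectations.expectations_suite import DataFrameExpectationsSuite

# Reuse the active Spark session (for example in a notebook) or create one
spark = SparkSession.builder.getOrCreate()

# Build a suite of chained expectations on the "age" column
suite = (
    DataFrameExpectationsSuite()
    .expect_value_greater_than("age", 18)
    .expect_value_less_than("age", 40)
)

# Create a PySpark DataFrame
test_spark_df = spark.createDataFrame(
    [
        {"name": "Alice", "age": 20},
        {"name": "Bob", "age": 15},
        {"name": "Charlie", "age": 30},
    ]
)

# Run every expectation in the suite against the DataFrame
suite.run(test_spark_df)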
Output:
========================== Running expectations suite ==========================
ExpectationValueGreaterThan ('age' greater than 18) ... FAIL
ExpectationValueLessThan ('age' less than 40) ... OK
============================ 1 success, 1 failures =============================
ExpectationSuiteFailure: (1/2) expectations failed.
================================================================================
List of violations:
--------------------------------------------------------------------------------
[Failed 1/1] ExpectationValueGreaterThan ('age' greater than 18): Found 1 row(s) where 'age' is not greater than 18.
Some examples of violations:
+-----+------+
| age | name |
+-----+------+
| 15 | Bob |
+-----+------+
================================================================================
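To make the report above easier to read, here is a minimal sketch of the idea behind a chained expectations suite, written in plain Python over a list of dictionaries. It is not the library's implementation: the class name MiniSuite and its return format are hypothetical and exist only for illustration, and only the two method names are borrowed from the examples above.

```python
# Illustrative stand-in only; this is NOT the dataframe-expectations implementation.
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


class MiniSuite:
    """Hypothetical fluent suite that checks rows given as plain dictionaries."""

    def __init__(self) -> None:
        # Each entry pairs a human-readable description with a per-row predicate.
        self._checks: List[Tuple[str, Callable[[Row], bool]]] = []

    def expect_value_greater_than(self, column: str, value: float) -> "MiniSuite":
        self._checks.append(
            (f"'{column}' greater than {value}", lambda row: row[column] > value)
        )
        return self  # returning self is what makes the calls chainable

    def expect_value_less_than(self, column: str, value: float) -> "MiniSuite":
        self._checks.append(
            (f"'{column}' less than {value}", lambda row: row[column] < value)
        )
        return self

    def run(self, rows: List[Row]) -> Dict[str, List[Row]]:
        """Return the violating rows for every expectation that failed."""
        failures: Dict[str, List[Row]] = {}
        for description, predicate in self._checks:
            violations = [row for row in rows if not predicate(row)]
            if violations:
                failures[description] = violations
        return failures


rows = [
    {"age": 20, "name": "Alice"},
    {"age": 15, "name": "Bob"},
    {"age": 30, "name": "Charlie"},
]
failures = (
    MiniSuite()
    .expect_value_greater_than("age", 18)
    .expect_value_less_than("age", 40)
    .run(rows)
)
print(failures)
```

With the same three rows as the examples, only the "greater than 18" check reports a violation (Bob, age 15), which matches the single failure in the report.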
How to contribute?
Contributions are welcome! You can enhance the library by adding new expectations, refining existing ones, or improving the testing framework.
Versioning
This project follows Semantic Versioning (SemVer) and uses Release Please for automated version management.
Versions are automatically determined based on Conventional Commits:
- feat: - New feature → MINOR version bump (0.1.0 → 0.2.0)
- fix: - Bug fix → PATCH version bump (0.1.0 → 0.1.1)
- feat!: or BREAKING CHANGE: - Breaking change → MAJOR version bump (0.1.0 → 1.0.0)
Example commits:
git commit -m "feat: add new expectation for null values"
git commit -m "fix: correct validation logic in expect_value_greater_than"
git commit -m "feat!: remove deprecated API methods"
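The bump rules listed above can be sketched as a small function. This is only an illustration of the mapping described in this section, not Release Please's actual logic (its behavior can depend on repository configuration). The names bump_kind and next_version are hypothetical, and treating every other commit type as "no bump" is an assumption.

```python
import re
from typing import Optional

# Matches a Conventional Commits header such as "feat(api)!: message".
_HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: ")


def bump_kind(message: str) -> Optional[str]:
    """Classify a commit message as 'major', 'minor', 'patch', or None."""
    header = message.splitlines()[0] if message else ""
    match = _HEADER.match(header)
    if match is None:
        return None
    # A "!" after the type or a BREAKING CHANGE footer marks a breaking change.
    if match.group("bang") or "BREAKING CHANGE:" in message:
        return "major"
    if match.group("type") == "feat":
        return "minor"
    if match.group("type") == "fix":
        return "patch"
    return None  # assumption: other types (docs, chore, ...) do not bump


def next_version(version: str, message: str) -> str:
    """Apply the bump implied by one commit message to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    kind = bump_kind(message)
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version


for msg in (
    "feat: add new expectation for null values",
    "fix: correct validation logic in expect_value_greater_than",
    "feat!: remove deprecated API methods",
):
    print(msg, "->", next_version("0.1.0", msg))
```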
When changes are pushed to the main branch, Release Please automatically:
- Creates or updates a Release PR with version bump and changelog
- When merged, creates a GitHub Release and publishes to PyPI
No manual version updates are needed; just use Conventional Commit messages.
Security
For security issues, please contact security@getyourguide.com.
Legal
dataframe-expectations is licensed under the Apache License, Version 2.0. See LICENSE for the full text.