dataframe-expectations

Python library designed to validate Pandas and PySpark DataFrames using customizable, reusable expectations

🎯 DataFrameExpectations

DataFrameExpectations is a Python library designed to validate Pandas and PySpark DataFrames using customizable, reusable expectations. It simplifies testing in data pipelines and end-to-end workflows by providing a standardized framework for DataFrame validation.

Instead of each pipeline using its own validation approach for DataFrames, this library provides one standardized solution. As a result, any contribution made here, such as a new expectation, can be used by all users of the library.

📚 View Documentation | 📋 List of Expectations

Installation

pip install dataframe-expectations

Requirements

Python 3.10+
pandas >= 1.5.0
pydantic >= 2.12.4
pyspark >= 3.3.0
tabulate >= 0.8.9

Development setup

To set up the development environment:

# 1. Clone the repository
git clone https://github.com/getyourguide/dataframe-expectations.git
cd dataframe-expectations

# 2. Install the uv package manager
pip install uv

# 3. Install development dependencies (this will automatically create a virtual environment)
uv sync --group dev

# 4. (Optional) To explicitly activate the virtual environment:
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# 5. Run tests (this will run the tests in the virtual environment)
uv run pytest tests/ --cov=dataframe_expectations

Using the library

Basic usage with Pandas:

from dataframe_expectations.suite import DataFrameExpectationsSuite
import pandas as pd

# Build a suite with expectations
suite = (
    DataFrameExpectationsSuite()
    .expect_min_rows(min_rows=3)
    .expect_max_rows(max_rows=10)
    .expect_value_greater_than(column_name="age", value=18)
    .expect_value_less_than(column_name="salary", value=100000)
    .expect_value_not_null(column_name="name")
)

# Create a runner
runner = suite.build()

# Validate a DataFrame
df = pd.DataFrame({
    "age": [25, 15, 45, 22],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [50000, 60000, 80000, 45000]
})
runner.run(df)  # Fails: Bob's age (15) is not greater than 18 (see Output below)

PySpark example:

from dataframe_expectations.suite import DataFrameExpectationsSuite
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Build a validation suite (same API as Pandas!)
suite = (
    DataFrameExpectationsSuite()
    .expect_min_rows(min_rows=3)
    .expect_max_rows(max_rows=10)
    .expect_value_greater_than(column_name="age", value=18)
    .expect_value_less_than(column_name="salary", value=100000)
    .expect_value_not_null(column_name="name")
)

# Build the runner
runner = suite.build()

# Create a PySpark DataFrame
data = [
    {"age": 25, "name": "Alice", "salary": 50000},
    {"age": 15, "name": "Bob", "salary": 60000},
    {"age": 45, "name": "Charlie", "salary": 80000},
    {"age": 22, "name": "Diana", "salary": 45000}
]
df = spark.createDataFrame(data)

# Validate
runner.run(df)  # Fails: Bob's age (15) is not greater than 18 (see Output below)

Decorator pattern for automatic validation:

from dataframe_expectations.suite import DataFrameExpectationsSuite
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

suite = (
    DataFrameExpectationsSuite()
    .expect_min_rows(min_rows=3)
    .expect_max_rows(max_rows=10)
    .expect_value_greater_than(column_name="age", value=18)
    .expect_value_less_than(column_name="salary", value=100000)
    .expect_value_not_null(column_name="name")
)

# Build the runner
runner = suite.build()

# Apply decorator to automatically validate function output
@runner.validate
def load_employee_data():
    """Load and return employee data - automatically validated."""
    return spark.createDataFrame(
        [
            {"age": 25, "name": "Alice", "salary": 50000},
            {"age": 15, "name": "Bob", "salary": 60000},
            {"age": 45, "name": "Charlie", "salary": 80000},
            {"age": 22, "name": "Diana", "salary": 45000}
        ]
    )

# Function execution automatically validates the returned DataFrame
df = load_employee_data()  # Raises DataFrameExpectationsSuiteFailure if validation fails

# Allow functions that may return None
@runner.validate(allow_none=True)
def conditional_load(should_load: bool):
    """Conditionally load data - validation only runs when DataFrame is returned."""
    if should_load:
        return spark.createDataFrame([{"age": 25, "name": "Alice", "salary": 50000}])
    return None  # No validation when None is returned
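
The two decorator forms above (bare `@runner.validate` and called `@runner.validate(allow_none=True)`) follow a common Python pattern. The sketch below shows one way such a dual-form decorator can be written. It is not the library's implementation: `make_validator`, `require_rows`, and the choice to raise `ValueError` when `None` is returned without `allow_none` are all assumptions made for illustration, and plain lists stand in for DataFrames.

```python
import functools


def make_validator(check):
    """Build a decorator usable as @validate or @validate(allow_none=True).

    `check` is a hypothetical callable that raises if the value is invalid.
    """

    def validate(func=None, *, allow_none=False):
        def decorator(f):
            @functools.wraps(f)
            def wrapper(*args, **kwargs):
                result = f(*args, **kwargs)
                if result is None:
                    if allow_none:
                        return None  # skip validation entirely
                    # Assumption: a None result is an error unless allowed.
                    raise ValueError(f"{f.__name__} returned None")
                check(result)  # raises on failure
                return result

            return wrapper

        # Bare form (@validate): Python passes the function directly.
        if func is not None:
            return decorator(func)
        # Called form (@validate(allow_none=True)): return the decorator.
        return decorator

    return validate


def require_rows(rows):
    """Hypothetical check: at least one row must be present."""
    if len(rows) < 1:
        raise ValueError("expected at least 1 row")


validate = make_validator(require_rows)


@validate
def load():
    return [{"age": 25}]


@validate(allow_none=True)
def maybe_load(should_load):
    return [{"age": 25}] if should_load else None


print(load())             # validated, then returned unchanged
print(maybe_load(False))  # None, validation skipped
```

The keyword-only `allow_none` argument is what lets one function serve both forms: a positional argument can only be the decorated function itself.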

Output:

========================== Running expectations suite ==========================
ExpectationMinRows (DataFrame contains at least 3 rows) ... OK
ExpectationMaxRows (DataFrame contains at most 10 rows) ... OK
ExpectationValueGreaterThan ('age' is greater than 18) ... FAIL
ExpectationValueLessThan ('salary' is less than 100000) ... OK
ExpectationValueNotNull ('name' is not null) ... OK
============================ 4 success, 1 failures =============================

ExpectationSuiteFailure: (1/5) expectations failed.

================================================================================
List of violations:
--------------------------------------------------------------------------------
[Failed 1/1] ExpectationValueGreaterThan ('age' is greater than 18): Found 1 row(s) where 'age' is not greater than 18.
Some examples of violations:
+-----+------+--------+
| age | name | salary |
+-----+------+--------+
| 15  | Bob  | 60000  |
+-----+------+--------+
================================================================================
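
To make the reported violation concrete, the check can be reproduced by hand on the same four records. This is a minimal sketch using plain Python dictionaries rather than a DataFrame; `violations_greater_than` is a hypothetical helper, and the strict comparison is an assumption based on the "is not greater than 18" wording in the output.

```python
rows = [
    {"age": 25, "name": "Alice", "salary": 50000},
    {"age": 15, "name": "Bob", "salary": 60000},
    {"age": 45, "name": "Charlie", "salary": 80000},
    {"age": 22, "name": "Diana", "salary": 45000},
]


def violations_greater_than(rows, column_name, value):
    """Return the rows whose column is NOT strictly greater than value."""
    return [row for row in rows if not row[column_name] > value]


failed = violations_greater_than(rows, "age", 18)
print(len(failed), failed)
# 1 [{'age': 15, 'name': 'Bob', 'salary': 60000}]
```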

Tag-based filtering for selective execution:

from dataframe_expectations import DataFrameExpectationsSuite, TagMatchMode

# Tag expectations with priorities and environments
suite = (
    DataFrameExpectationsSuite()
    .expect_value_greater_than(column_name="age", value=18, tags=["priority:high", "env:prod"])
    .expect_value_not_null(column_name="name", tags=["priority:high"])
    .expect_min_rows(min_rows=1, tags=["priority:low", "env:test"])
)

# Run only high-priority checks (OR logic - matches ANY tag)
runner = suite.build(tags=["priority:high"], tag_match_mode=TagMatchMode.ANY)
runner.run(df)

# Run production-critical checks (AND logic - matches ALL tags)
runner = suite.build(tags=["priority:high", "env:prod"], tag_match_mode=TagMatchMode.ALL)
runner.run(df)
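
The difference between the two match modes can be sketched with plain set operations. The code below only illustrates the ANY/ALL semantics described in the comments above and is not the library's code: `matches`, `select`, and the string modes `"any"` and `"all"` are hypothetical stand-ins for the real filtering and `TagMatchMode`.

```python
def matches(expectation_tags, requested_tags, mode):
    """Decide whether one expectation is selected by the requested tags."""
    own = set(expectation_tags)
    requested = set(requested_tags)
    if mode == "any":
        # OR logic: at least one requested tag is present.
        return bool(own & requested)
    if mode == "all":
        # AND logic: every requested tag is present.
        return requested <= own
    raise ValueError(f"unknown mode: {mode!r}")


# Same tags as in the suite above, keyed by expectation name.
tagged = {
    "expect_value_greater_than": ["priority:high", "env:prod"],
    "expect_value_not_null": ["priority:high"],
    "expect_min_rows": ["priority:low", "env:test"],
}


def select(requested_tags, mode):
    return [name for name, tags in tagged.items()
            if matches(tags, requested_tags, mode)]


print(select(["priority:high"], "any"))
# ['expect_value_greater_than', 'expect_value_not_null']
print(select(["priority:high", "env:prod"], "all"))
# ['expect_value_greater_than']
```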

Programmatic result inspection:

# Get detailed results without raising exceptions
result = runner.run(df, raise_on_failure=False)

# Inspect validation outcomes
print(f"Total: {result.total_expectations}, Passed: {result.total_passed}, Failed: {result.total_failed}")
print(f"Pass rate: {result.pass_rate:.2%}")
print(f"Duration: {result.total_duration_seconds:.2f}s")
print(f"Applied filters: {result.applied_filters}")

# Access individual results
for exp_result in result.results:
    if exp_result.status == "failed":
        print(f"Failed: {exp_result.description} - {exp_result.violation_count} violations")
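
As a rough illustration of how the summary numbers relate to the individual results, the sketch below recomputes them from stand-in objects matching the five expectations in the output shown earlier. `StubResult` is hypothetical and only mirrors the attribute names used above; the `"passed"` status string and the reading of the pass rate as a fraction between 0 and 1 are assumptions.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for one expectation result."""
    description: str
    status: str  # "failed" appears in the README; "passed" is assumed
    violation_count: int = 0


results = [
    StubResult("DataFrame contains at least 3 rows", "passed"),
    StubResult("DataFrame contains at most 10 rows", "passed"),
    StubResult("'age' is greater than 18", "failed", violation_count=1),
    StubResult("'salary' is less than 100000", "passed"),
    StubResult("'name' is not null", "passed"),
]

total = len(results)
failed = [r for r in results if r.status == "failed"]
pass_rate = (total - len(failed)) / total

print(f"Total: {total}, Passed: {total - len(failed)}, Failed: {len(failed)}")
print(f"Pass rate: {pass_rate:.2%}")
for r in failed:
    print(f"Failed: {r.description} - {r.violation_count} violations")
```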

How to contribute?

Contributions are welcome! You can enhance the library by adding new expectations, refining existing ones, or improving the testing framework.

Versioning

This project follows Semantic Versioning (SemVer) and uses Release Please for automated version management.

Versions are automatically determined based on Conventional Commits:

feat: - New feature → MINOR version bump (0.1.0 → 0.2.0)
fix: - Bug fix → PATCH version bump (0.1.0 → 0.1.1)
feat!: or BREAKING CHANGE: - Breaking change → MAJOR version bump (0.1.0 → 1.0.0)
chore:, docs:, style:, refactor:, test:, ci: - No version bump
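
The mapping above can be sketched as a small function. This is only an illustration of the rules listed here, not how Release Please is implemented: `bump_for` and `apply_bump` are hypothetical helpers, and the sketch treats `!` after any commit type as breaking, as the Conventional Commits convention allows.

```python
import re

# Header shape: type, optional (scope), optional "!", then ": ".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: ")


def bump_for(message):
    """Return 'major', 'minor', 'patch', or None for one commit message."""
    match = HEADER.match(message)
    if match is None:
        return None  # not a conventional commit header
    if match["bang"] or "BREAKING CHANGE:" in message:
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match["type"])


def apply_bump(version, bump):
    """Apply a bump to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version  # no bump


print(apply_bump("0.1.0", bump_for("feat: add new expectation")))   # 0.2.0
print(apply_bump("0.1.0", bump_for("fix: correct validation")))     # 0.1.1
print(apply_bump("0.1.0", bump_for("feat!: remove deprecated API")))  # 1.0.0
print(apply_bump("0.1.0", bump_for("docs: update README")))         # 0.1.0
```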

Example commits:

git commit -m "feat: add new expectation for null values"
git commit -m "fix: correct validation logic in expect_value_greater_than"
git commit -m "feat!: remove deprecated API methods"

When changes are pushed to the main branch, Release Please automatically:

Creates or updates a Release PR with version bump and changelog
When merged, creates a GitHub Release and publishes to PyPI

No manual version updates needed - just use conventional commit messages!

Security

For security issues, please contact security@getyourguide.com.

Legal

dataframe-expectations is licensed under the Apache License, Version 2.0. See LICENSE for the full text.

Release history

0.7.0 (May 7, 2026)
0.6.0 (Mar 18, 2026)
0.5.2 (Mar 16, 2026)
0.5.1 (Jan 28, 2026)
0.5.0 (Nov 22, 2025), this version
0.4.0 (Nov 10, 2025)
0.3.0 (Nov 9, 2025)
0.2.0 (Nov 8, 2025)
0.1.1 (Oct 31, 2025)
