
Python library designed to validate Pandas, PySpark, and Polars DataFrames using customizable, reusable expectations


🎯 DataFrameExpectations


DataFrameExpectations is a Python library designed to validate Pandas, PySpark, and Polars DataFrames using customizable, reusable expectations. It simplifies testing in data pipelines and end-to-end workflows by providing a standardized framework for DataFrame validation.

Rather than mixing different validation approaches for DataFrames, you get one standardized solution. As a result, contributions made here, such as new expectations, benefit every user of the library.

📚 View Documentation | 📋 List of Expectations

Installation

# pandas only (PySpark/Polars not required)
pip install dataframe-expectations

# with PySpark support (quotes keep shells such as zsh from expanding the brackets)
pip install "dataframe-expectations[pyspark]"

# with Polars support
pip install "dataframe-expectations[polars]"

# with both PySpark and Polars support
pip install "dataframe-expectations[pyspark,polars]"

Using a managed PySpark environment (Databricks, EMR, etc.)? PySpark is already available in your runtime, so install without the extra to avoid reinstalling it.
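
If you are unsure which backends your environment already provides, a small standard-library check can tell you before you pick an extra. This is a minimal sketch, not part of dataframe-expectations; the helper name `available_backends` is hypothetical.

```python
from importlib.util import find_spec

# Backends named in the install options above; pandas is always required.
BACKENDS = ("pandas", "pyspark", "polars")


def available_backends() -> dict[str, bool]:
    """Report which DataFrame backends can be imported in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return {name: find_spec(name) is not None for name in BACKENDS}


if __name__ == "__main__":
    for name, present in available_backends().items():
        print(f"{name}: {'available' if present else 'not installed'}")
```

If `pyspark` already shows as available (as on a managed runtime), the plain install without the extra should be enough.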

Requirements

  • Python 3.10+
  • pandas >= 1.5.0
  • pydantic >= 2.12.4
  • tabulate >= 0.8.9
  • pyspark >= 3.3.0 (optional — install with [pyspark] extra or provide your own)
  • polars >= 1.40.1 (optional — install with [polars] extra)

Quick Start

Pandas Example

from dataframe_expectations.suite import DataFrameExpectationsSuite
import pandas as pd

# Build a suite with expectations
suite = (
    DataFrameExpectationsSuite()
    .expect_min_rows(min_rows=3)
    .expect_max_rows(max_rows=10)
    .expect_value_greater_than(column_name="age", value=18)
    .expect_value_less_than(column_name="salary", value=100000)
    .expect_value_not_null(column_name="name")
)

# Create a runner
runner = suite.build()

# Validate a DataFrame (raises here: Bob's age of 15 fails the age > 18 expectation)
df = pd.DataFrame({
    "age": [25, 15, 45, 22],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [50000, 60000, 80000, 45000]
})
runner.run(df)

PySpark Example

from dataframe_expectations.suite import DataFrameExpectationsSuite
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Build a validation suite (same API as Pandas!)
suite = (
    DataFrameExpectationsSuite()
    .expect_min_rows(min_rows=3)
    .expect_max_rows(max_rows=10)
    .expect_value_greater_than(column_name="age", value=18)
    .expect_value_less_than(column_name="salary", value=100000)
    .expect_value_not_null(column_name="name")
)

# Build the runner
runner = suite.build()

# Create a PySpark DataFrame
data = [
    {"age": 25, "name": "Alice", "salary": 50000},
    {"age": 15, "name": "Bob", "salary": 60000},
    {"age": 45, "name": "Charlie", "salary": 80000},
    {"age": 22, "name": "Diana", "salary": 45000}
]
df = spark.createDataFrame(data)

# Validate (raises here: Bob's age of 15 fails the age > 18 expectation)
runner.run(df)

Polars Example

from dataframe_expectations.suite import DataFrameExpectationsSuite
import polars as pl

# Build a validation suite (same API as Pandas and PySpark!)
suite = (
    DataFrameExpectationsSuite()
    .expect_min_rows(min_rows=3)
    .expect_max_rows(max_rows=10)
    .expect_value_greater_than(column_name="age", value=18)
    .expect_value_less_than(column_name="salary", value=100000)
    .expect_value_not_null(column_name="name")
)

# Build the runner
runner = suite.build()

# Create a Polars DataFrame
df = pl.DataFrame({
    "age": [25, 15, 45, 22],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [50000, 60000, 80000, 45000]
})

# Validate (raises here: Bob's age of 15 fails the age > 18 expectation)
runner.run(df)

Validation Patterns

Manual Validation

Use runner.run() to explicitly validate DataFrames:

# Run validation and raise exception on failure
runner.run(df)

# Run validation without raising exception
result = runner.run(df, raise_on_failure=False)

Decorator-Based Validation

Automatically validate function return values using decorators:

from dataframe_expectations.suite import DataFrameExpectationsSuite
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

suite = (
    DataFrameExpectationsSuite()
    .expect_min_rows(min_rows=3)
    .expect_max_rows(max_rows=10)
    .expect_value_greater_than(column_name="age", value=18)
    .expect_value_less_than(column_name="salary", value=100000)
    .expect_value_not_null(column_name="name")
)

# Build the runner
runner = suite.build()

# Apply decorator to automatically validate function output
@runner.validate
def load_employee_data():
    """Load and return employee data - automatically validated."""
    return spark.createDataFrame(
        [
            {"age": 25, "name": "Alice", "salary": 50000},
            {"age": 15, "name": "Bob", "salary": 60000},
            {"age": 45, "name": "Charlie", "salary": 80000},
            {"age": 22, "name": "Diana", "salary": 45000}
        ]
    )

# Function execution automatically validates the returned DataFrame
df = load_employee_data()  # Raises DataFrameExpectationsSuiteFailure if validation fails

# Allow functions that may return None
@runner.validate(allow_none=True)
def conditional_load(should_load: bool):
    """Conditionally load data - validation only runs when DataFrame is returned."""
    if should_load:
        return spark.createDataFrame([{"age": 25, "name": "Alice", "salary": 50000}])
    return None  # No validation when None is returned
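
For readers curious how one decorator can be used both bare (`@runner.validate`) and with arguments (`@runner.validate(allow_none=True)`), the general Python pattern looks roughly like the sketch below. It is not the library's implementation: `make_validator`, `require_three_rows`, and the `ValueError` raised for an unexpected `None` are all assumptions made for illustration.

```python
import functools
from typing import Any, Callable, Optional


def make_validator(check: Callable[[Any], None]):
    """Build a decorator usable as @validate or @validate(allow_none=True)."""

    def validate(func: Optional[Callable] = None, *, allow_none: bool = False):
        def decorate(target: Callable) -> Callable:
            @functools.wraps(target)
            def wrapper(*args, **kwargs):
                result = target(*args, **kwargs)
                if result is None:
                    if allow_none:
                        return None  # skip validation when None is allowed
                    # Assumption: an unexpected None is treated as an error.
                    raise ValueError(f"{target.__name__} returned None")
                check(result)  # raises if the returned value is invalid
                return result

            return wrapper

        # Bare @validate receives the function directly; @validate(...) does not.
        return decorate(func) if func is not None else decorate

    return validate


def require_three_rows(rows):
    """Hypothetical check standing in for a full expectations suite."""
    if len(rows) < 3:
        raise ValueError("expected at least 3 rows")


validate = make_validator(require_three_rows)


@validate
def load_rows():
    return [1, 2, 3]


@validate(allow_none=True)
def maybe_load(should_load: bool):
    return [1, 2, 3] if should_load else None
```

The keyword-only `allow_none` argument is what lets the same callable tell the two usages apart: a bare decorator passes the function positionally, while the parameterized form passes only keywords.
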

Validation Output

When validation runs, you'll see output like this:

========================== Running expectations suite ==========================
ExpectationMinRows (DataFrame contains at least 3 rows) ... OK
ExpectationMaxRows (DataFrame contains at most 10 rows) ... OK
ExpectationValueGreaterThan ('age' is greater than 18) ... FAIL
ExpectationValueLessThan ('salary' is less than 100000) ... OK
ExpectationValueNotNull ('name' is not null) ... OK
============================ 4 success, 1 failures =============================

ExpectationSuiteFailure: (1/5) expectations failed.

================================================================================
List of violations:
--------------------------------------------------------------------------------
[Failed 1/1] ExpectationValueGreaterThan ('age' is greater than 18): Found 1 row(s) where 'age' is not greater than 18.
Some examples of violations:
+-----+------+--------+
| age | name | salary |
+-----+------+--------+
| 15  | Bob  | 60000  |
+-----+------+--------+
================================================================================
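
To see why Bob's row is the one reported, the same check can be reproduced in plain Python on the sample data. This is only an illustration of what the violation message describes, assuming "greater than" means strictly greater; `find_violations` is a hypothetical helper, not a library function.

```python
# Sample rows from the Quick Start examples.
rows = [
    {"age": 25, "name": "Alice", "salary": 50000},
    {"age": 15, "name": "Bob", "salary": 60000},
    {"age": 45, "name": "Charlie", "salary": 80000},
    {"age": 22, "name": "Diana", "salary": 45000},
]


def find_violations(rows, column_name, value):
    """Return the rows whose column is NOT strictly greater than value."""
    return [row for row in rows if not row[column_name] > value]


violations = find_violations(rows, "age", 18)
print(f"Found {len(violations)} row(s) where 'age' is not greater than 18.")
```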

Programmatic Result Inspection

Get detailed validation results without raising exceptions:

# Get detailed results without raising exceptions
result = runner.run(df, raise_on_failure=False)

# Inspect validation outcomes
print(f"Total: {result.total_expectations}, Passed: {result.total_passed}, Failed: {result.total_failed}")
print(f"Pass rate: {result.pass_rate:.2%}")
print(f"Duration: {result.total_duration_seconds:.2f}s")
print(f"Applied filters: {result.applied_filters}")

# Access individual results
for exp_result in result.results:
    if exp_result.status == "failed":
        print(f"Failed: {exp_result.description} - {exp_result.violation_count} violations")
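
The `:.2%` format above suggests that `pass_rate` is a fraction between 0 and 1. Under that assumption, the arithmetic for the run shown in Validation Output (4 of 5 expectations passing) works out as in this sketch. `SuiteSummary` is a hypothetical stand-in that only mirrors the field names used above, and returning 0.0 for an empty suite is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SuiteSummary:
    """Hypothetical stand-in mirroring the counters printed above."""

    total_expectations: int
    total_passed: int
    total_failed: int

    @property
    def pass_rate(self) -> float:
        # Fraction in [0, 1]; an empty suite is reported as 0.0 (assumption).
        if self.total_expectations == 0:
            return 0.0
        return self.total_passed / self.total_expectations


summary = SuiteSummary(total_expectations=5, total_passed=4, total_failed=1)
line = f"Pass rate: {summary.pass_rate:.2%}"
print(line)  # Pass rate: 80.00%
```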

Advanced Features

Tag-Based Filtering

Filter which expectations to run using tags:

from dataframe_expectations import DataFrameExpectationsSuite, TagMatchMode

# Tag expectations with priorities and environments
suite = (
    DataFrameExpectationsSuite()
    .expect_value_greater_than(column_name="age", value=18, tags=["priority:high", "env:prod"])
    .expect_value_not_null(column_name="name", tags=["priority:high"])
    .expect_min_rows(min_rows=1, tags=["priority:low", "env:test"])
)

# Run only high-priority checks (OR logic - matches ANY tag)
runner = suite.build(tags=["priority:high"], tag_match_mode=TagMatchMode.ANY)
runner.run(df)

# Run production-critical checks (AND logic - matches ALL tags)
runner = suite.build(tags=["priority:high", "env:prod"], tag_match_mode=TagMatchMode.ALL)
runner.run(df)
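
The difference between the two match modes comes down to set logic: ANY selects an expectation that shares at least one tag with the filter, ALL selects it only if it carries every tag in the filter. The sketch below applies that logic to the three tagged expectations above. It is an assumption-labeled illustration, not the library's code; `MatchMode`, `is_selected`, and `select` are hypothetical names.

```python
from enum import Enum


class MatchMode(Enum):
    """Hypothetical stand-in for the two modes described above."""

    ANY = "any"
    ALL = "all"


def is_selected(expectation_tags, selected_tags, mode):
    """Decide whether an expectation with these tags passes the filter."""
    tags, wanted = set(expectation_tags), set(selected_tags)
    if mode is MatchMode.ANY:
        return bool(tags & wanted)  # at least one tag in common
    return wanted <= tags  # every requested tag is present


# Tags from the suite defined above, keyed by a short label.
suite_tags = {
    "value_greater_than(age)": ["priority:high", "env:prod"],
    "value_not_null(name)": ["priority:high"],
    "min_rows": ["priority:low", "env:test"],
}


def select(selected_tags, mode):
    """List the labels of expectations that the filter would keep."""
    return [
        name
        for name, tags in suite_tags.items()
        if is_selected(tags, selected_tags, mode)
    ]


any_high = select(["priority:high"], MatchMode.ANY)
all_high_prod = select(["priority:high", "env:prod"], MatchMode.ALL)
print(any_high)
print(all_high_prod)
```

With this logic, the first filter keeps both high-priority expectations, while the second keeps only the `age` check, since it is the only one tagged with both `priority:high` and `env:prod`.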

Development Setup

To set up the development environment:

# 1. Fork and clone the repository
git clone https://github.com/getyourguide/dataframe-expectations.git
cd dataframe-expectations

# 2. Install the uv package manager
pip install uv

# 3. Install development dependencies (this will automatically create a virtual environment)
uv sync --group dev

# 4. Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# 5. Verify your setup
uv run pytest tests/ -n auto --cov=dataframe_expectations

# 6. Install pre-commit hooks
pre-commit install
# This will automatically run checks before each commit

Contributing

We welcome contributions! Whether you're adding new expectations, fixing bugs, or improving documentation, your help is appreciated.

Please see CONTRIBUTING.md for:

  • Development setup instructions
  • How to add new expectations
  • Code style guidelines
  • Testing requirements
  • Pull request process

Security

For security vulnerabilities, please see our Security Policy or contact security@getyourguide.com.

Legal

dataframe-expectations is licensed under the Apache License, Version 2.0. See LICENSE for the full text.
