DataFrame schema drift detection and alerting

dfdrift

A DataFrame schema drift detection and alerting library for pandas DataFrames.

Features

  • Schema Tracking: Automatically save DataFrame schemas with location information (file:line)
  • Change Detection: Detect schema changes between executions and alert when differences are found
  • Configurable Storage: Local file storage and Google Cloud Storage, with an extensible interface for other storage backends
  • Configurable Alerting: Built-in stderr alerter and Slack integration, with an extensible interface for other integrations
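
The file:line location mentioned under Schema Tracking could be derived from the call stack. The following is a minimal sketch using only the standard library; `caller_location` and `track` are hypothetical names chosen for illustration, not dfdrift functions, and the library may build its keys differently.

```python
import inspect


def caller_location(depth: int = 1) -> str:
    """Return 'filename:line' for the frame `depth` levels above this function."""
    frame = inspect.stack()[depth]
    return f"{frame.filename}:{frame.lineno}"


def track(label: str) -> str:
    # depth=2 skips caller_location's own frame and track's frame,
    # so the key describes the line that called track().
    return caller_location(depth=2)


key = track("example")  # key refers to this line of this file
```

Because the key includes the line number, moving a call to a different line would produce a new key under this scheme.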

Installation

# Basic installation
pip install dfdrift

# With Slack support (quotes keep shells such as zsh from expanding the brackets)
pip install "dfdrift[slack]"

# With Google Cloud Storage support
pip install "dfdrift[gcs]"

# With all optional features
pip install "dfdrift[slack,gcs]"

# Development installation
uv pip install -e .

Usage

dfdrift offers two ways to validate DataFrames:

1. Import Replacement

Simply replace your pandas import with dfdrift.pandas:

import dfdrift.pandas as pd

# Configure validation (optional - uses default settings if omitted)
pd.configure_validation()

# All DataFrame operations are automatically validated
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['Tokyo', 'Osaka', 'Kyoto']
})
# Schema automatically saved with location info

2. Explicit Validation

import pandas as pd
import dfdrift

# Create a validator instance
validator = dfdrift.DfValidator()

# Validate a DataFrame manually
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['Tokyo', 'Osaka', 'Kyoto']
})

validator.validate(df)
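
Conceptually, a validation call can be pictured as load, compare, alert, save. The sketch below illustrates that loop with plain dictionaries and a JSON file; it is not dfdrift's implementation. The `validate` function, the simplified schema shape, and the `schemas.json` file name are assumptions made for the example.

```python
import json
import sys
import tempfile
from pathlib import Path


def validate(schema: dict, location_key: str, store: Path) -> bool:
    """Compare `schema` with the last stored one; return True if it drifted."""
    known = json.loads(store.read_text()) if store.exists() else {}
    previous = known.get(location_key)
    drifted = previous is not None and previous != schema
    if drifted:
        print(f"WARNING: DataFrame schema changed at {location_key}", file=sys.stderr)
    # Always store the latest schema so the next run compares against it.
    known[location_key] = schema
    store.write_text(json.dumps(known, indent=2))
    return drifted


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "schemas.json"
    key = "/path/to/file.py:25"
    first = validate({"age": "int64"}, key, store)    # first run: nothing to compare with
    second = validate({"age": "int64"}, key, store)   # unchanged schema
    third = validate({"age": "object"}, key, store)   # dtype changed since the last run
```

Under this sketch the first run can never alert, which matches the advice elsewhere in this README to run the samples twice.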

Configuration

Custom Storage

Local File Storage

import dfdrift
import dfdrift.pandas as pd

# Use custom local directory
pd.configure_validation(
    storage=dfdrift.LocalFileStorage("./my_schemas")
)

Google Cloud Storage

import dfdrift
import dfdrift.pandas as pd

# Configure GCS storage (requires: pip install dfdrift[gcs])
# Set GCS_BUCKET and optionally GCS_PREFIX environment variables
gcs_storage = dfdrift.GcsStorage()  # bucket and prefix from env vars
pd.configure_validation(storage=gcs_storage)

# Or pass parameters directly
gcs_storage = dfdrift.GcsStorage(
    bucket="my-dfdrift-bucket",
    prefix="schemas/production"  # Optional, defaults to "dfdrift"
)
pd.configure_validation(storage=gcs_storage)

GCS Authentication: Use one of the following methods:

  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a service account key file
  • Use Application Default Credentials: gcloud auth application-default login
  • Use Workload Identity in GKE/Cloud Run environments
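
The GCS examples above take the bucket and prefix either from arguments or from environment variables. A minimal sketch of one possible resolution order follows. The `resolve_setting` helper is hypothetical, and the precedence (explicit argument, then environment variable, then default) is an assumption rather than documented dfdrift behavior.

```python
import os
from typing import Optional


def resolve_setting(explicit: Optional[str], env_name: str,
                    default: Optional[str] = None) -> str:
    """Pick the explicit argument, then the environment variable, then the default."""
    value = explicit or os.environ.get(env_name) or default
    if value is None:
        raise ValueError(f"pass a value explicitly or set {env_name}")
    return value


# Set here only to make the example self-contained; normally done outside the program.
os.environ["GCS_BUCKET"] = "my-dfdrift-bucket"
os.environ.pop("GCS_PREFIX", None)

bucket = resolve_setting(None, "GCS_BUCKET")
prefix = resolve_setting(None, "GCS_PREFIX", default="dfdrift")
override = resolve_setting("schemas/production", "GCS_PREFIX", default="dfdrift")
```

Failing fast when neither source provides a value tends to surface misconfiguration at startup instead of at the first schema write.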

Custom Alerter

Stderr Alerter (Default)

import dfdrift
import dfdrift.pandas as pd

# Built-in stderr alerter (default)
pd.configure_validation(alerter=dfdrift.StderrAlerter())

Slack Alerter

import dfdrift
import dfdrift.pandas as pd

# Configure Slack alerts (requires: pip install dfdrift[slack])
# Set SLACK_BOT_TOKEN and SLACK_CHANNEL environment variables
slack_alerter = dfdrift.SlackAlerter()  # Uses env vars
pd.configure_validation(alerter=slack_alerter)

# Or specify channel argument (token from env var)
slack_alerter = dfdrift.SlackAlerter(channel="#data-alerts")
pd.configure_validation(alerter=slack_alerter)

# Or pass both token and channel directly (not recommended for production)
slack_alerter = dfdrift.SlackAlerter(
    channel="#data-alerts",
    token="xoxb-your-bot-token"
)
pd.configure_validation(alerter=slack_alerter)

Custom Alerter

import dfdrift
import dfdrift.pandas as pd

# Implement your own alerter
class CustomAlerter(dfdrift.Alerter):
    def alert(self, message, location_key, old_schema, new_schema):
        # Send to email, webhook, etc.
        pass

pd.configure_validation(alerter=CustomAlerter())
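
As a fuller illustration of the custom alerter above, here is a sketch of one that collects alerts in memory, which can be useful in tests. The `Alerter` base class is re-declared as a stand-in so the snippet runs without dfdrift installed; in real use you would presumably subclass dfdrift.Alerter instead. `CollectingAlerter` is a hypothetical name.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Alerter(ABC):
    """Stand-in with the alert() signature shown in this README."""

    @abstractmethod
    def alert(self, message: str, location_key: str,
              old_schema: Dict[str, Any], new_schema: Dict[str, Any]) -> None:
        ...


class CollectingAlerter(Alerter):
    """Keeps alerts in a list instead of sending them anywhere."""

    def __init__(self) -> None:
        self.alerts: List[Dict[str, Any]] = []

    def alert(self, message, location_key, old_schema, new_schema):
        old_cols = set(old_schema.get("columns", {}))
        new_cols = set(new_schema.get("columns", {}))
        self.alerts.append({
            "location": location_key,
            "message": message,
            "added": sorted(new_cols - old_cols),
            "removed": sorted(old_cols - new_cols),
        })


alerter = CollectingAlerter()
alerter.alert(
    "Added columns: ['new_col']",
    "/path/to/file.py:25",
    {"columns": {"age": {"dtype": "int64"}}},
    {"columns": {"age": {"dtype": "int64"}, "new_col": {"dtype": "object"}}},
)
```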

Schema Change Detection

When a DataFrame schema changes between executions, dfdrift detects the following kinds of change and sends an alert:

  • Added columns: New columns that weren't in the previous schema
  • Removed columns: Columns that existed before but are now missing
  • Type changes: When a column's dtype changes (e.g., int64 → object)
  • Shape changes: When the DataFrame dimensions change

Example alert output:

WARNING: DataFrame schema changed at /path/to/file.py:25. Changes: Added columns: ['new_col']; Column 'age' dtype changed: int64 → object
Location: /path/to/file.py:25

Examples

See the samples/ directory for usage examples:

  • samples/sample.py: Explicit validation
  • samples/sample_custom_path.py: Custom storage path
  • samples/sample_changing_schema.py: Schema change detection demo
  • samples/sample_pandas_import.py: Import replacement

Architecture

Storage Interface

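from abc import ABC
from typing import Any, Dict

class SchemaStorage(ABC):
    def save_schema(self, location_key: str, schema: Dict[str, Any]) -> None:
        # Persist the schema recorded at location_key ("file:line")
        pass

    def load_schemas(self) -> Dict[str, Any]:
        # Return all stored schemas, keyed by location
        pass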
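
To illustrate how a backend might satisfy the two methods above, here is a sketch of an in-memory storage. The base class is re-declared as a stand-in so the snippet runs without dfdrift installed, and `InMemoryStorage` is a hypothetical name. A real backend would persist the data somewhere durable.

```python
import copy
from abc import ABC, abstractmethod
from typing import Any, Dict


class SchemaStorage(ABC):
    """Stand-in with the two methods shown in this README."""

    @abstractmethod
    def save_schema(self, location_key: str, schema: Dict[str, Any]) -> None:
        ...

    @abstractmethod
    def load_schemas(self) -> Dict[str, Any]:
        ...


class InMemoryStorage(SchemaStorage):
    """Keeps schemas in a dict; nothing survives the process."""

    def __init__(self) -> None:
        self._schemas: Dict[str, Any] = {}

    def save_schema(self, location_key, schema):
        # Copy on the way in so later mutation by the caller is not stored.
        self._schemas[location_key] = copy.deepcopy(schema)

    def load_schemas(self):
        # Copy on the way out so callers cannot alter the stored state.
        return copy.deepcopy(self._schemas)


storage = InMemoryStorage()
storage.save_schema("/path/to/file.py:25", {"columns": {}, "shape": [0, 0]})
loaded = storage.load_schemas()
```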

Alerter Interface

from abc import ABC
from typing import Any, Dict

class Alerter(ABC):
    def alert(self, message: str, location_key: str, old_schema: Dict[str, Any], new_schema: Dict[str, Any]) -> None:
        pass

Schema Format

Schemas are stored as JSON with the following structure:

{
  "/path/to/file.py:line_number": {
    "columns": {
      "column_name": {
        "dtype": "int64",
        "null_count": 0,
        "total_count": 100
      }
    },
    "shape": [100, 3]
  }
}
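
For orientation, the following sketch builds an entry of the shape above from a pandas DataFrame. `extract_schema` is a hypothetical helper, not dfdrift's code, and it assumes that total_count means the number of rows.

```python
import json

import pandas as pd


def extract_schema(df: pd.DataFrame) -> dict:
    """Summarise a DataFrame as a per-location schema entry."""
    return {
        "columns": {
            str(name): {
                "dtype": str(df[name].dtype),
                "null_count": int(df[name].isna().sum()),
                "total_count": int(len(df)),  # assumption: row count
            }
            for name in df.columns
        },
        "shape": list(df.shape),
    }


df = pd.DataFrame({"age": [25, 30, 35], "score": [1.5, None, 3.0]})
schema = extract_schema(df)
# Wrap the entry under a location key, as in the stored JSON.
document = json.dumps({"/path/to/file.py:25": schema}, indent=2)
```

Converting counts with int() keeps NumPy integer types out of the result, so the dictionary serializes with the standard json module.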

Development

Run the samples to test functionality:

# Import replacement
uv run python samples/sample_pandas_import.py  # Run twice to see alerts

# Explicit validation
uv run python samples/sample.py

# Test schema change detection
uv run python samples/sample_changing_schema.py  # Run twice to see alerts
