
DataFrame schema drift detection and alerting


dfdrift

A schema drift detection and alerting library for pandas DataFrames.

Features

  • Schema Tracking: Automatically save DataFrame schemas with location information (file:line)
  • Change Detection: Detect schema changes between executions and alert when differences are found
  • Configurable Storage: Local file storage, with an extensible interface for future cloud backends (GCS, etc.)
  • Configurable Alerting: Built-in stderr alerter, with an extensible interface for future integrations (Slack, etc.)
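
The file:line location used by Schema Tracking could be captured along the following lines. This is a minimal sketch using the standard `inspect` module, not dfdrift's actual implementation; `caller_location_key` and `track` are hypothetical names introduced only for illustration.

```python
import inspect


def caller_location_key(depth: int = 1) -> str:
    """Return "file:line" for the frame `depth` levels above this function.

    Hypothetical helper for illustration; not part of dfdrift's API.
    """
    frame = inspect.stack()[depth]
    return f"{frame.filename}:{frame.lineno}"


def track(obj):
    # depth=2 skips this wrapper, so the key points at the code that called track()
    key = caller_location_key(depth=2)
    return key, obj


key, _ = track({"example": 1})
print(key)  # a "file:line" string for the line that called track()
```

Keying schemas by call site is what lets the same script hold several tracked DataFrames without naming each one explicitly.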

Installation

# Install in development (editable) mode from a local checkout
uv pip install -e .

Usage

dfdrift offers two ways to validate DataFrames:

1. Import Replacement

Simply replace your pandas import with dfdrift.pandas:

import dfdrift.pandas as pd

# Configure validation (optional - uses default settings if omitted)
pd.configure_validation()

# All DataFrame operations are automatically validated
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['Tokyo', 'Osaka', 'Kyoto']
})
# Schema automatically saved with location info

2. Explicit Validation

import pandas as pd
import dfdrift

# Create a validator instance
validator = dfdrift.DfValidator()

# Validate a DataFrame manually
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['Tokyo', 'Osaka', 'Kyoto']
})

validator.validate(df)

Configuration

Custom Storage Path

import dfdrift

# Method 1: Import replacement with custom storage
import dfdrift.pandas as pd
pd.configure_validation(
    storage=dfdrift.LocalFileStorage("./my_schemas")
)

# Method 2: Explicit validation with custom storage
validator = dfdrift.DfValidator(
    storage=dfdrift.LocalFileStorage("./my_schemas")
)

Custom Alerter

import dfdrift

# Built-in stderr alerter (default)
import dfdrift.pandas as pd
pd.configure_validation(alerter=dfdrift.StderrAlerter())

# Or implement your own alerter
class SlackAlerter(dfdrift.Alerter):
    def alert(self, message, location_key, old_schema, new_schema):
        # Send to Slack
        pass

pd.configure_validation(alerter=SlackAlerter())

Schema Change Detection

When a DataFrame schema changes between executions, dfdrift detects the change and raises an alert. It reports:

  • Added columns: New columns that weren't in the previous schema
  • Removed columns: Columns that existed before but are now missing
  • Type changes: When a column's dtype changes (e.g., int64 → object)
  • Shape changes: When the DataFrame dimensions change

Example alert output:

WARNING: DataFrame schema changed at /path/to/file.py:25. Changes: Added columns: ['new_col']; Column 'age' dtype changed: int64 → object
Location: /path/to/file.py:25

Examples

See the samples/ directory for usage examples:

  • samples/sample.py: Explicit validation
  • samples/sample_custom_path.py: Custom storage path
  • samples/sample_changing_schema.py: Schema change detection demo
  • samples/sample_pandas_import.py: Import replacement

Architecture

Storage Interface

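from abc import ABC
from typing import Any, Dict

class SchemaStorage(ABC):
    def save_schema(self, location_key: str, schema: Dict[str, Any]) -> None:
        # Persist the schema recorded for one location key
        pass

    def load_schemas(self) -> Dict[str, Any]:
        # Return the previously saved schemas
        pass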
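
A custom backend only has to provide these two methods. The sketch below keeps schemas in a dictionary, which may be handy in unit tests. To stay runnable without dfdrift installed it defines a local stand-in for the interface above; `InMemoryStorage` is a hypothetical class, and whether the real base class is exported as `dfdrift.SchemaStorage` is an assumption the text above does not confirm.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SchemaStorage(ABC):
    """Local stand-in mirroring the interface above, so the sketch runs without dfdrift."""

    @abstractmethod
    def save_schema(self, location_key: str, schema: Dict[str, Any]) -> None: ...

    @abstractmethod
    def load_schemas(self) -> Dict[str, Any]: ...


class InMemoryStorage(SchemaStorage):
    """Hypothetical backend that keeps schemas in a dict instead of a file."""

    def __init__(self) -> None:
        self._schemas: Dict[str, Any] = {}

    def save_schema(self, location_key: str, schema: Dict[str, Any]) -> None:
        # A later save for the same location replaces the earlier schema
        self._schemas[location_key] = schema

    def load_schemas(self) -> Dict[str, Any]:
        # Shallow copy: callers cannot add or remove entries in the stored mapping
        return dict(self._schemas)


storage = InMemoryStorage()
storage.save_schema("/path/to/file.py:25", {"columns": {}, "shape": [0, 0]})
print(storage.load_schemas())
```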

Alerter Interface

from abc import ABC
from typing import Any, Dict

class Alerter(ABC):
    def alert(self, message: str, location_key: str, old_schema: Dict[str, Any], new_schema: Dict[str, Any]) -> None:
        pass
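
As an illustration of the interface, here is a sketch of an alerter that forwards messages to Python's standard `logging` module. It uses a local stand-in base class so it runs without dfdrift; `LoggingAlerter` and the "dfdrift" logger name are hypothetical choices, not part of the library.

```python
import logging
from abc import ABC, abstractmethod
from typing import Any, Dict


class Alerter(ABC):
    """Local stand-in mirroring the interface above, so the sketch runs without dfdrift."""

    @abstractmethod
    def alert(
        self,
        message: str,
        location_key: str,
        old_schema: Dict[str, Any],
        new_schema: Dict[str, Any],
    ) -> None: ...


class LoggingAlerter(Alerter):
    """Hypothetical alerter that emits each drift message as a WARNING log record."""

    def __init__(self, logger_name: str = "dfdrift") -> None:
        self._logger = logging.getLogger(logger_name)

    def alert(
        self,
        message: str,
        location_key: str,
        old_schema: Dict[str, Any],
        new_schema: Dict[str, Any],
    ) -> None:
        # The schemas are accepted but unused here; a richer alerter could diff them
        self._logger.warning("%s (location: %s)", message, location_key)


LoggingAlerter().alert("schema changed", "/path/to/file.py:25", {}, {})
```

With the real library, a class like this would subclass `dfdrift.Alerter` and be passed through the `alerter=` argument, as in the Custom Alerter section.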

Schema Format

Schemas are stored as JSON with the following structure:

{
  "/path/to/file.py:line_number": {
    "columns": {
      "column_name": {
        "dtype": "int64",
        "null_count": 0,
        "total_count": 100
      }
    },
    "shape": [100, 3]
  }
}
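
For orientation, an entry of this shape could be derived from a DataFrame roughly as follows. This is a sketch, not dfdrift's extraction code: `extract_schema` is a hypothetical helper, and it assumes `total_count` means the number of rows including nulls, which the format above does not spell out.

```python
import json
from typing import Any, Dict

import pandas as pd


def extract_schema(df: pd.DataFrame) -> Dict[str, Any]:
    """Build a schema entry in the documented layout (hypothetical helper)."""
    columns = {
        str(name): {
            "dtype": str(df[name].dtype),
            "null_count": int(df[name].isna().sum()),
            # Assumption: total_count is the row count, nulls included
            "total_count": int(len(df)),
        }
        for name in df.columns
    }
    return {"columns": columns, "shape": list(df.shape)}


df = pd.DataFrame({"name": ["Alice", "Bob", None], "score": [1.5, 2.0, 3.5]})
schema = extract_schema(df)

# Key the entry by a location string, as in the stored format
print(json.dumps({"/path/to/file.py:25": schema}, indent=2))
```

Converting counts with `int()` and dtypes with `str()` keeps the result JSON-serializable, since NumPy scalar types are not accepted by `json.dumps`.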

Development

Run the samples to test functionality. An alert needs a previously stored schema to compare against, so some samples must be run twice:

# Import replacement
uv run python samples/sample_pandas_import.py  # Run twice to see alerts

# Explicit validation
uv run python samples/sample.py

# Test schema change detection
uv run python samples/sample_changing_schema.py  # Run twice to see alerts
