Pandas validation using Annotated types and decorators

pdval

pdval is a lightweight Python library for validating pandas DataFrames and Series using Python's Annotated types and decorators. It provides a clean, type-safe way to express data validation constraints directly in function signatures.

Features

  • 🎯 Type-safe validation - Uses Python's Annotated types for inline constraints
  • 🐼 Pandas-focused - Built specifically for pandas DataFrames and Series
  • Decorator-based - Simple @validated decorator for automatic validation
  • 🔧 Composable validators - Chain multiple validators together
  • 🎨 Clean syntax - Validation rules live in your type annotations
  • 🚀 Low runtime overhead - Validation can be switched off per call with validate=False

Installation

pip install pdval

Or with uv:

uv add pdval

Note: If you prefer using Pandera as the underlying validation engine (for more detailed error reporting and robustness), install the separate package:

pip install pdval-pandera

Quick Start

import pandas as pd
from pdval import validated, Validated, Finite, NonNaN

@validated
def calculate_returns(
    prices: Validated[pd.Series, Finite, NonNaN],
    validate: bool = True
) -> pd.Series:
    """Calculate percentage returns from prices."""
    return prices.pct_change()

# Valid data passes through
prices = pd.Series([100.0, 102.0, 101.0, 103.0])
returns = calculate_returns(prices)

# Invalid data raises ValueError
import numpy as np
bad_prices = pd.Series([100.0, np.inf, 101.0])
# Raises: ValueError: Data must be finite (no Inf, no NaN)
calculate_returns(bad_prices)

Available Validators

Value Validators

  • Finite - Ensures no Inf or NaN values
  • NonNaN - Ensures no NaN values (allows Inf)
  • NonNegative - Ensures all values >= 0
  • Positive - Ensures all values > 0
  • MonoUp - Ensures values are monotonically increasing
  • MonoDown - Ensures values are monotonically decreasing
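
To make the difference between these checks concrete, here is a dependency-free sketch of the same rules applied to a plain list of floats. It is an illustration only, not pdval's implementation: the function names are hypothetical, and because the list above does not say whether MonoUp is strict, the sketch assumes ties are allowed.

```python
import math

def is_finite(values):
    """True if no value is NaN or +/-Inf (the Finite rule)."""
    return all(math.isfinite(v) for v in values)

def is_non_nan(values):
    """True if no value is NaN; Inf is allowed (the NonNaN rule)."""
    return not any(math.isnan(v) for v in values)

def is_mono_up(values):
    """True if each value is >= the previous one (assumed non-strict)."""
    return all(a <= b for a, b in zip(values, values[1:]))

prices = [100.0, math.inf, 101.0]
print(is_finite(prices))   # False: Inf is not finite
print(is_non_nan(prices))  # True: Inf is not NaN
print(is_mono_up([1.0, 2.0, 2.0, 3.0]))  # True under the non-strict assumption
```

The practical consequence is that NonNaN alone lets Inf through, so pairing it with Finite (as the Quick Start does) is only redundant in one direction.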

Index Validators

  • DateTimeIndexed - Ensures index is a DatetimeIndex
  • MonotonicIndex - Ensures index is monotonically increasing

DataFrame Column Validators

  • HasColumns["col1", "col2"] - Ensures specified columns exist
  • Ge["high", "low"] - Ensures one column >= another column
  • Le["low", "high"] - Ensures one column <= another column
  • Gt["high", "low"] - Ensures one column > another column
  • Lt["low", "high"] - Ensures one column < another column
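
The subscript syntax used by these validators (for example HasColumns["col1", "col2"]) can be built in plain Python with `__class_getitem__`. The sketch below shows one way to do that; RequireKeys is a hypothetical class written for this example, it checks keys of a dict rather than DataFrame columns, and it makes no claim about how pdval implements HasColumns.

```python
class RequireKeys:
    """Hypothetical validator; subscripting the class builds an instance."""

    def __init__(self, *names):
        self.names = names

    def __class_getitem__(cls, item):
        # RequireKeys["a", "b"] receives the tuple ("a", "b");
        # RequireKeys["a"] receives the bare string "a".
        names = item if isinstance(item, tuple) else (item,)
        return cls(*names)

    def validate(self, table):
        missing = [n for n in self.names if n not in table]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        return table

check = RequireKeys["high", "low", "close"]
check.validate({"high": [102], "low": [100], "close": [101]})  # passes

try:
    check.validate({"high": [102], "close": [101]})
except ValueError as exc:
    print(exc)  # Missing columns: ['low']
```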

Column-Specific Validators

  • HasColumn["col"] - Checks that the DataFrame has the column (no value validation)
  • HasColumn["col", Validator, ...] - Checks that the column exists and applies the given Series validators to it

Examples

Basic Series Validation

import numpy as np
import pandas as pd
from pdval import validated, Validated, Positive

@validated
def calculate_log_returns(
    prices: Validated[pd.Series, Positive],
    validate: bool = True
) -> pd.Series:
    """Calculate log returns - prices must be positive."""
    return np.log(prices / prices.shift(1))

prices = pd.Series([100.0, 102.0, 101.0, 103.0])
log_returns = calculate_log_returns(prices)

DataFrame Column Validation

import pandas as pd
from pdval import validated, Validated, HasColumns, Ge, NonNaN

@validated
def calculate_true_range(
    data: Validated[pd.DataFrame, HasColumns["high", "low", "close"], Ge["high", "low"], NonNaN],
    validate: bool = True
) -> pd.Series:
    """Calculate True Range - requires OHLC data."""
    hl = data["high"] - data["low"]
    hc = abs(data["high"] - data["close"].shift(1))
    lc = abs(data["low"] - data["close"].shift(1))
    return pd.concat([hl, hc, lc], axis=1).max(axis=1)

# Valid OHLC data
ohlc = pd.DataFrame({
    "high": [102, 105, 104],
    "low": [100, 103, 101],
    "close": [101, 104, 102]
})
tr = calculate_true_range(ohlc)

# Missing column raises error
bad_data = pd.DataFrame({"high": [102], "close": [101]})
# Raises: ValueError: Missing columns: ['low']
calculate_true_range(bad_data)
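
As a cross-check on the arithmetic, the True Range for the valid OHLC rows above can be worked out by hand with plain lists. This is a dependency-free sketch of the same formula, not part of pdval. In the pandas version, shift(1) leaves the first row without a previous close and max(axis=1) skips the resulting NaN by default, so the first row reduces to high minus low; the loop below mirrors that.

```python
high = [102, 105, 104]
low = [100, 103, 101]
close = [101, 104, 102]

true_range = []
for i in range(len(high)):
    candidates = [high[i] - low[i]]
    if i > 0:  # the first row has no previous close
        candidates.append(abs(high[i] - close[i - 1]))
        candidates.append(abs(low[i] - close[i - 1]))
    true_range.append(max(candidates))

print(true_range)  # [2, 4, 3]
```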

Time Series Validation

import pandas as pd
from pdval import validated, Validated, DateTimeIndexed, MonotonicIndex, Finite

@validated
def resample_ohlc(
    data: Validated[pd.DataFrame, DateTimeIndexed, MonotonicIndex, Finite],
    freq: str = "1D",
    validate: bool = True
) -> pd.DataFrame:
    """Resample OHLC data to different frequency."""
    return data.resample(freq).agg({
        "open": "first",
        "high": "max",
        "low": "min",
        "close": "last"
    })

# Valid time series
# Lowercase "h" is the hourly alias; uppercase "H" is deprecated in recent pandas
dates = pd.date_range("2024-01-01", periods=10, freq="1h")
data = pd.DataFrame({
    "open": range(100, 110),
    "high": range(101, 111),
    "low": range(99, 109),
    "close": range(100, 110)
}, index=dates)
daily = resample_ohlc(data)

# Non-datetime index raises error
bad_data = data.copy()
bad_data.index = range(len(bad_data))
# Raises: ValueError: Index must be DatetimeIndex
resample_ohlc(bad_data)

Monotonic Value Validation

import pandas as pd
from pdval import validated, Validated, MonoUp, MonoDown

@validated
def calculate_cumulative_returns(
    prices: Validated[pd.Series, MonoUp],
    validate: bool = True
) -> pd.Series:
    """Calculate cumulative returns - prices must be monotonically increasing."""
    return (prices / prices.iloc[0]) - 1

@validated
def track_drawdown(
    equity: Validated[pd.Series, MonoDown],
    validate: bool = True
) -> pd.Series:
    """Track drawdown - equity must be monotonically decreasing."""
    return (equity / equity.iloc[0]) - 1

Column-Specific Validation with HasColumn

import pandas as pd
from pdval import validated, Validated, HasColumn, Finite, Positive, MonoUp

@validated
def process_trading_data(
    data: Validated[
        pd.DataFrame,
        HasColumn["price", Finite, Positive],
        HasColumn["volume", Finite, Positive],
        HasColumn["timestamp", MonoUp],
    ],
    validate: bool = True
) -> pd.DataFrame:
    """Process trading data with column-specific validation.

    - price: must exist, be finite and positive
    - volume: must exist, be finite and positive
    - timestamp: must exist and be monotonically increasing
    """
    return data.assign(
        notional=data["price"] * data["volume"]
    )

# Or just check column presence without validation:
@validated
def simple_check(
    data: Validated[pd.DataFrame, HasColumn["price"], HasColumn["volume"]],
    validate: bool = True
) -> float:
    """Just check columns exist, no value validation."""
    return (data["price"] * data["volume"]).sum()

Chaining Multiple Validators

import pandas as pd
from pdval import validated, Validated, Finite, Positive, DateTimeIndexed

@validated
def calculate_volume_profile(
    volume: Validated[pd.Series, DateTimeIndexed, Finite, Positive],
    validate: bool = True
) -> pd.Series:
    """Calculate volume profile - must be datetime-indexed, finite, positive."""
    return volume.groupby(volume.index.hour).sum()

Optional Validation

The validate parameter allows you to disable validation for performance:

# Validation enabled (default)
result = calculate_returns(prices, validate=True)

# Validation disabled for performance
result = calculate_returns(prices, validate=False)
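
For readers curious how a per-call switch like this can work, the following is a minimal sketch of a decorator that reads Annotated metadata and skips the checks when validate=False. It is not pdval's source: the names checked and all_finite are hypothetical, and it validates plain lists so that it runs with the standard library alone.

```python
import functools
import inspect
import math
from typing import Annotated, get_args, get_origin, get_type_hints

def all_finite(values):
    """Hypothetical check: reject NaN and Inf."""
    if not all(math.isfinite(v) for v in values):
        raise ValueError("Data must be finite (no Inf, no NaN)")

def checked(func):
    """Run the callables attached via Annotated unless validate=False."""
    hints = get_type_hints(func, include_extras=True)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        if bound.arguments.get("validate", True):
            for name, value in bound.arguments.items():
                hint = hints.get(name)
                if get_origin(hint) is Annotated:
                    # get_args returns (base_type, *metadata)
                    for check in get_args(hint)[1:]:
                        check(value)
        return func(*args, **kwargs)

    return wrapper

@checked
def total(values: Annotated[list, all_finite], validate: bool = True) -> float:
    return sum(values)

print(total([1.0, 2.0]))                       # 3.0
print(total([1.0, math.inf], validate=False))  # inf, because the check is skipped
```

With the flag off, the wrapper still binds the arguments, so the cost in this sketch is small but not literally zero.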

Custom Validators

Create your own validators by subclassing Validator:

from pdval import Validator, validated, Validated
import pandas as pd

class InRange(Validator):
    """Validator for values within a specific range."""

    def __init__(self, min_val: float, max_val: float):
        self.min_val = min_val
        self.max_val = max_val

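    def validate(self, data):
        if isinstance(data, (pd.Series, pd.DataFrame)):
            # Reduce via to_numpy() so a DataFrame also yields a single bool;
            # combining per-column .any() results with `or` is ambiguous.
            out_of_range = (data < self.min_val) | (data > self.max_val)
            if out_of_range.to_numpy().any():
                raise ValueError(f"Data must be in range [{self.min_val}, {self.max_val}]")
        return data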

@validated
def normalize_percentage(
    data: Validated[pd.Series, InRange(0, 100)],
    validate: bool = True
) -> pd.Series:
    """Normalize percentage data to [0, 1] range."""
    return data / 100

Type Checking

pdval includes a py.typed marker for full type checker support. Your IDE and type checkers (mypy, pyright, basedpyright) will understand the validation annotations.

How Type Checkers Handle Validated

According to PEP 593, Annotated[T, metadata] (which Validated is an alias for) is treated as equivalent to T for type checking purposes. This means:

import pandas as pd
from pdval import validated, Validated, Finite

@validated
def process(data: Validated[pd.Series, Finite], validate: bool = True) -> float:
    return data.sum()

# Type checkers understand that pd.Series is compatible with Validated[pd.Series, ...]
series = pd.Series([1, 2, 3])
result = process(series)  # ✓ Type checker is happy!

The validation metadata is:

  • Preserved at runtime - Used by the @validated decorator for validation
  • Ignored by type checkers - Validated[pd.Series, Finite] is treated as pd.Series

This gives you the best of both worlds: clean type checking and runtime validation.
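
The two views of the same annotation can be observed with the standard typing module. The sketch below uses typing.Annotated directly and a stand-in Finite class defined for the example (not pdval's Finite), so it shows only the general PEP 593 behaviour, not anything specific to pdval.

```python
from typing import Annotated, get_args, get_type_hints

class Finite:
    """Stand-in marker class for this sketch, not pdval's Finite."""

def process(data: Annotated[list, Finite]) -> float:
    return sum(data)

# Default view: metadata is stripped, which matches what type checkers reason about.
plain = get_type_hints(process)
# Runtime view: include_extras=True keeps the Annotated metadata.
full = get_type_hints(process, include_extras=True)

print(plain["data"])                       # <class 'list'>
print(get_args(full["data"])[0] is list)   # True
print(get_args(full["data"])[1] is Finite) # True
```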

Powered by Pandera

This branch of pdval uses Pandera as the underlying validation engine. This provides:

  • Robust Validation - Leverages Pandera's comprehensive schema validation
  • Detailed Errors - Granular error reporting for debugging
  • Schema Integration - Compatible with Pandera schemas

While slightly heavier than the lightweight version (available on master), it offers significantly more safety and features.

Performance

pdval is designed to be lightweight with minimal overhead:

  • Validation checks run only when validate=True; pass validate=False to skip them in performance-critical code
  • No schema compilation or complex preprocessing
  • Direct numpy/pandas operations for validation

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Why pdval?

Problem: When building data analysis pipelines with pandas, you often need to validate:

  • Data has no NaN or Inf values
  • DataFrames have required columns
  • Values are in expected ranges
  • Indices are properly formatted

Traditional approach: Add manual validation checks at the start of each function.

With pdval: Express validation constraints directly in type annotations using Validated[Type, Validator, ...] and get automatic validation with the @validated decorator.
