datawarden
Pandas validation using Annotated types and decorators
datawarden is a lightweight Python library for validating pandas DataFrames and Series using Python's Annotated types and decorators. It provides a clean, type-safe way to express data validation constraints directly in function signatures.
Features
- 🎯 Type-safe validation - Uses Python's `Annotated` types for inline constraints
- 🐼 Pandas-focused - Built specifically for pandas DataFrames and Series
- ⚡ Decorator-based - A simple `@validate` decorator for automatic validation
- 🔧 Composable validators - Chain multiple validators together
- 🎨 Clean syntax - Validation rules live in your type annotations
- 🚀 Low overhead - Validation can be disabled entirely for production
Installation
```bash
pip install datawarden
```
Or with uv:
```bash
uv add datawarden
```
Quick Start
```python
import pandas as pd

from datawarden import validate, Validated, Finite, NonNaN, NonEmpty

@validate
def calculate_returns(
    prices: Validated[pd.Series, Finite, NonNaN, NonEmpty],
) -> pd.Series:
    """Calculate percentage returns from prices.

    Data is explicitly checked for:
    - Not empty (NonEmpty)
    - No NaN values (NonNaN)
    - No infinite values (Finite)
    """
    return prices.pct_change()

# Valid data passes through
prices = pd.Series([100.0, 102.0, 101.0, 103.0])
returns = calculate_returns(prices)

# Invalid data raises ValueError
import numpy as np
bad_prices = pd.Series([100.0, np.inf, 101.0])
# Raises: ValueError: Data must be finite (contains Inf)
calculate_returns(bad_prices)
```
Available Validators
Value Validators (Series/Index)
- `Finite` - Ensures no Inf values (allows NaN)
- `StrictFinite` - Ensures no Inf AND no NaN values
- `NonNaN` - Ensures no NaN values (allows Inf)
- `NonNegative` - Ensures all values >= 0
- `Positive` - Ensures all values > 0
- `NonEmpty` - Ensures data is not empty
- `Unique` - Ensures all values are unique
- `MonoUp` - Ensures values are monotonically increasing
- `MonoDown` - Ensures values are monotonically decreasing
- `Datetime` - Ensures data is a DatetimeIndex
- `OneOf("a", "b", "c")` - Ensures values are in an allowed set (categorical)
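The distinction between `Finite` and `NonNaN` is easy to miss. As a rough sketch of the check semantics in plain pandas/numpy (for illustration only, not datawarden's internals):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Finite-style check: no Inf values (NaN is allowed)
finite_ok = not np.isinf(s).any()

# NonNaN-style check: no NaN values (Inf is allowed)
nonnan_ok = not s.isna().any()

print(finite_ok, nonnan_ok)  # True False
```

A series containing NaN but no Inf therefore passes `Finite` yet fails `NonNaN`; use `StrictFinite` to reject both.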
NaN-Tolerant Validation with IgnoringNaNs
The IgnoringNaNs wrapper allows validators to skip NaN values during validation. It can be used in two ways:
1. Explicit wrapping - wrap specific validators:
```python
from datawarden import validate, Validated, IgnoringNaNs, Ge, Lt
import pandas as pd

@validate
def process(
    data: Validated[pd.Series, IgnoringNaNs(Ge(0)), Lt(10)],
) -> pd.Series:
    # Ge(0) ignores NaNs: values >= 0 OR NaN are valid
    # Lt(10) still rejects NaNs (default behavior)
    return data
```
2. Marker mode - apply to all validators with IgnoringNaNs():
```python
@validate
def process(
    data: Validated[pd.Series, Ge(0), Lt(100), IgnoringNaNs()],
) -> pd.Series:
    # Equivalent to: IgnoringNaNs(Ge(0)), IgnoringNaNs(Lt(100))
    # All validators now ignore NaN values
    return data

# NaN values pass through, non-NaN values are validated
import numpy as np
data = pd.Series([10.0, np.nan, 50.0, np.nan, 90.0])
result = process(data)  # Works! NaNs are ignored
```
Works with: Ge, Le, Gt, Lt, Positive, NonNegative, Finite, and any value validator.
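The semantics can be approximated in plain pandas (a sketch, not datawarden's implementation): wrapping a validator in `IgnoringNaNs` amounts to validating only the non-NaN subset of the data.

```python
import numpy as np
import pandas as pd

data = pd.Series([10.0, np.nan, 50.0])

# IgnoringNaNs(Ge(0)) is roughly: validate only the non-NaN values
ignoring_ok = (data.dropna() >= 0).all()

# A plain Ge(0)-style check fails here, since NaN comparisons evaluate to False
strict_ok = (data >= 0).all()

print(ignoring_ok, strict_ok)  # True False
```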
Shape Validators
- `Shape(10, 5)` - Exact shape (10 rows, 5 columns)
- `Shape(Ge(10), Any)` - At least 10 rows, any number of columns
- `Shape(Any, Le(5))` - Any number of rows, at most 5 columns
- `Shape(Gt(0), Lt(100))` - More than 0 rows, fewer than 100 columns
- `Shape(100)` - For Series: exactly 100 rows
Index Wrapper
The Index() wrapper allows you to apply any Series/Index validator to the index of a Series or DataFrame:
- `Index(Datetime)` - Ensures index is a DatetimeIndex
- `Index(MonoUp)` - Ensures index is monotonically increasing
- `Index(Unique)` - Ensures index values are unique
- `Index(Datetime, MonoUp, Unique)` - Combine multiple validators
DataFrame Column Validators
- `HasColumns("col1", "col2")` - Ensures specified columns exist
- `Ge("high", "low")` - Ensures one column >= another column
- `Le("low", "high")` - Ensures one column <= another column
- `Gt("high", "low")` - Ensures one column > another column
- `Lt("low", "high")` - Ensures one column < another column
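The cross-column validators reduce to element-wise comparisons between two columns. A rough pandas equivalent of `Ge("high", "low")` (for illustration only, not datawarden code):

```python
import pandas as pd

df = pd.DataFrame({"high": [102, 105, 104], "low": [100, 103, 101]})

# Ge("high", "low") is roughly: every row satisfies high >= low
ge_ok = (df["high"] >= df["low"]).all()
print(ge_ok)  # True
```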
Column-Specific Validators
- `HasColumn("col")` - Check that DataFrame has a column
- `HasColumn("col", Validator, ...)` - Check the column exists and apply Series validators to it
Lambda Validators
- `Is(predicate, name=None)` - Element-wise predicate validation
- `Rows(predicate, name=None)` - Row-wise predicate validation for DataFrames
```python
from datawarden import validate, Validated, Is, Rows, HasColumn
import pandas as pd

# Element-wise: check all values satisfy condition
@validate
def process_values(
    data: Validated[pd.Series, Is(lambda x: (x >= 0) & (x <= 100))],
) -> pd.Series:
    return data

# Column-specific with Is
@validate
def process_roots(
    data: Validated[pd.DataFrame, HasColumn("root", Is(lambda x: x**2 < 2))],
) -> pd.DataFrame:
    return data

# Row-wise: check each row satisfies condition
@validate
def process_ohlc(
    data: Validated[pd.DataFrame, Rows(lambda row: row["high"] >= row["low"])],
) -> pd.DataFrame:
    return data

# With descriptive error name
@validate
def process_budget(
    data: Validated[pd.DataFrame, Rows(lambda row: row.sum() < 100, name="row sum must be < 100")],
) -> pd.DataFrame:
    return data
```
Gap Validators (Time Series)
- `NoTimeGaps` - Ensures no gaps in datetime values/index
- `MaxGap(timedelta)` - Ensures a maximum gap between consecutive datetime values
- `MaxDiff(value)` - Ensures a maximum difference between consecutive values
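The underlying checks can be sketched in plain pandas (an approximation of the semantics, not datawarden's implementation):

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=5, freq="1h")
ts = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=idx)

# MaxGap(timedelta)-style check: largest gap between consecutive timestamps
max_gap = ts.index.to_series().diff().max()
gap_ok = max_gap <= pd.Timedelta(hours=1)

# MaxDiff(value)-style check: largest jump between consecutive values
diff_ok = ts.diff().abs().max() <= 1.0

print(gap_ok, diff_ok)  # True True
```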
Examples
Basic Series Validation
```python
from datawarden import validate, Validated, Positive, NonNaN
import numpy as np
import pandas as pd

@validate
def calculate_log_returns(
    prices: Validated[pd.Series, Positive, NonNaN],
) -> pd.Series:
    """Calculate log returns - prices must be positive and not NaN."""
    return np.log(prices / prices.shift(1))

prices = pd.Series([100.0, 102.0, 101.0, 103.0])
log_returns = calculate_log_returns(prices)
```
DataFrame Column Validation
```python
from datawarden import validate, Validated, HasColumns, Ge, NonNaN
import pandas as pd

@validate
def calculate_true_range(
    data: Validated[pd.DataFrame, HasColumns("high", "low", "close"), Ge("high", "low"), NonNaN],
) -> pd.Series:
    """Calculate True Range - requires OHLC data."""
    hl = data["high"] - data["low"]
    hc = abs(data["high"] - data["close"].shift(1))
    lc = abs(data["low"] - data["close"].shift(1))
    return pd.concat([hl, hc, lc], axis=1).max(axis=1)

# Valid OHLC data
ohlc = pd.DataFrame({
    "high": [102, 105, 104],
    "low": [100, 103, 101],
    "close": [101, 104, 102]
})
tr = calculate_true_range(ohlc)

# Missing column raises error
bad_data = pd.DataFrame({"high": [102], "close": [101]})
# Raises: ValueError: Missing columns: ['low']
calculate_true_range(bad_data)
```
Time Series Validation with Index
```python
from datawarden import validate, Validated, Index, Datetime, MonoUp, Finite
import pandas as pd

@validate
def resample_ohlc(
    data: Validated[pd.DataFrame, Index(Datetime, MonoUp), Finite],
    freq: str = "1D",
) -> pd.DataFrame:
    """Resample OHLC data to a different frequency."""
    return data.resample(freq).agg({
        "open": "first",
        "high": "max",
        "low": "min",
        "close": "last"
    })

# Valid time series
dates = pd.date_range("2024-01-01", periods=10, freq="1h")
data = pd.DataFrame({
    "open": range(100, 110),
    "high": range(101, 111),
    "low": range(99, 109),
    "close": range(100, 110)
}, index=dates)
daily = resample_ohlc(data)

# Non-datetime index raises error
bad_data = data.copy()
bad_data.index = range(len(bad_data))
# Raises: ValueError: Index must be DatetimeIndex
resample_ohlc(bad_data)
```
Unique Values Validation
```python
from datawarden import validate, Validated, Index, Unique
import pandas as pd

@validate
def process_unique_ids(
    data: Validated[pd.DataFrame, Index(Unique)],
) -> pd.DataFrame:
    """Process data with unique index values."""
    return data.sort_index()

# Valid unique index
df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
result = process_unique_ids(df)

# Duplicate index values raise error
bad_df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "x", "z"])
# Raises: ValueError: Values must be unique
process_unique_ids(bad_df)
```
Categorical Values Validation
```python
from datawarden import validate, Validated, OneOf, HasColumn
import pandas as pd

@validate
def process_orders(
    data: Validated[pd.DataFrame, HasColumn("status", OneOf("pending", "shipped", "delivered"))],
) -> pd.DataFrame:
    """Process orders with a validated status column."""
    return data[data["status"] != "pending"]

# Valid data
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["pending", "shipped", "delivered"]
})
result = process_orders(orders)

# Invalid status raises error
bad_orders = pd.DataFrame({
    "order_id": [1, 2],
    "status": ["pending", "cancelled"]  # "cancelled" not in allowed values
})
# Raises: ValueError: Values must be one of {'pending', 'shipped', 'delivered'}, got invalid: {'cancelled'}
process_orders(bad_orders)
```
Monotonic Value Validation
```python
from datawarden import validate, Validated, MonoUp, MonoDown
import pandas as pd

@validate
def calculate_cumulative_returns(
    prices: Validated[pd.Series, MonoUp],
) -> pd.Series:
    """Calculate cumulative returns - prices must be monotonically increasing."""
    return (prices / prices.iloc[0]) - 1

@validate
def track_drawdown(
    equity: Validated[pd.Series, MonoDown],
) -> pd.Series:
    """Track drawdown - equity must be monotonically decreasing."""
    return (equity / equity.iloc[0]) - 1
```
Shape Validation
```python
from typing import Any

from datawarden import validate, Validated, Shape, Ge, Le
import pandas as pd

@validate
def process_batch(
    data: Validated[pd.DataFrame, Shape(Ge(10), Any)],
) -> pd.DataFrame:
    """Process data batch - must have at least 10 rows."""
    return data.describe()

# Valid data (10+ rows)
df = pd.DataFrame({"a": range(20), "b": range(20)})
result = process_batch(df)

# Too few rows raises error
small_df = pd.DataFrame({"a": [1, 2, 3]})
# Raises: ValueError: DataFrame must have >= 10 rows, got 3
process_batch(small_df)

# Constrain both dimensions
@validate
def process_matrix(
    data: Validated[pd.DataFrame, Shape(Ge(5), Le(10))],
) -> pd.DataFrame:
    """Process matrix - 5+ rows, max 10 columns."""
    return data

# Exact shape for Series
@validate
def process_vector(
    data: Validated[pd.Series, Shape(100)],
) -> pd.Series:
    """Process vector - must have exactly 100 elements."""
    return data
```
Column-Specific Validation with HasColumn
```python
from datawarden import validate, Validated, HasColumn, Finite, Positive, MonoUp
import pandas as pd

@validate
def process_trading_data(
    data: Validated[
        pd.DataFrame,
        HasColumn("price", Finite, Positive),
        HasColumn("volume", Finite, Positive),
        HasColumn("timestamp", MonoUp),
    ],
) -> pd.DataFrame:
    """Process trading data with column-specific validation.

    - price: must exist, be finite and positive
    - volume: must exist, be finite and positive
    - timestamp: must exist and be monotonically increasing
    """
    return data.assign(
        notional=data["price"] * data["volume"]
    )

# Or just check column presence:
@validate
def simple_check(
    data: Validated[pd.DataFrame, HasColumn("price"), HasColumn("volume")],
) -> float:
    """Just check that the columns exist."""
    return (data["price"] * data["volume"]).sum()
```
Chaining Multiple Index Validators
```python
from datawarden import validate, Validated, Index, Datetime, MonoUp, Unique, Finite, Positive
import pandas as pd

@validate
def calculate_volume_profile(
    volume: Validated[pd.Series, Index(Datetime, MonoUp, Unique), Finite, Positive],
) -> pd.Series:
    """Calculate volume profile - index must be datetime, monotonic, and unique; values finite and positive."""
    return volume.groupby(volume.index.hour).sum()
```
Optional Validation
Use skip_validation to disable validation for performance:
```python
# Validation enabled (default)
result = calculate_returns(prices)

# Validation disabled for performance
result = calculate_returns(prices, skip_validation=True)
```
Custom Validators
Create your own validators by subclassing Validator:
```python
from datawarden import Validator, validate, Validated
import pandas as pd

class InRange(Validator):
    """Validator for values within a specific range."""

    def __init__(self, min_val: float, max_val: float):
        self.min_val = min_val
        self.max_val = max_val

    def validate(self, data):
        if isinstance(data, (pd.Series, pd.DataFrame)):
            # .to_numpy() keeps the any() check unambiguous for DataFrames too
            values = data.to_numpy()
            if (values < self.min_val).any() or (values > self.max_val).any():
                raise ValueError(f"Data must be in range [{self.min_val}, {self.max_val}]")
        return data

@validate
def normalize_percentage(
    data: Validated[pd.Series, InRange(0, 100)],
) -> pd.Series:
    """Normalize percentage data to [0, 1] range."""
    return data / 100
```
Performance & Optimization
datawarden is designed for high-performance data pipelines:
Parallel Validation
When a function accepts multiple validated arguments, datawarden automatically validates them in parallel using a thread pool. This leverages the release of the GIL during pandas/numpy operations, providing significant speedups for large datasets.
```python
@validate
def process_large_data(
    source: Validated[pd.DataFrame, Finite, NonNaN],
    target: Validated[pd.DataFrame, Finite, NonNaN],
) -> pd.DataFrame:
    # Both 'source' and 'target' are validated concurrently
    return pd.merge(source, target, on="id")
```
Zero-Overhead Production Mode
For maximum performance in production critical paths, you can disable validation globally or per-call:
- Per-call: `func(data, skip_validation=True)`
- Defaults: Use `@validate(skip_validation_by_default=True)` for functions that should only be validated during development or debugging.
Cached Validator Compilation
Validation logic is pre-compiled at import time (when the decorator runs). The runtime overhead is minimal, consisting only of the necessary numpy/pandas checks.
Type Checking
datawarden includes a py.typed marker for full type checker support. Your IDE and type checkers (mypy, pyright, basedpyright) will understand the validation annotations.
How Type Checkers Handle Validated
According to PEP 593, Annotated[T, metadata] (which Validated is an alias for) is treated as equivalent to T for type checking purposes. This means:
```python
@validate
def process(data: Validated[pd.Series, Finite]) -> float:
    return data.sum()

# Type checkers understand that pd.Series is compatible with Validated[pd.Series, ...]
series = pd.Series([1, 2, 3])
result = process(series)  # ✓ Type checker is happy!
```
The validation metadata is:
- Preserved at runtime - Used by the `@validate` decorator for validation
- Ignored by type checkers - `Validated[pd.Series, Finite]` is treated as `pd.Series`
This gives you the best of both worlds: clean type checking and runtime validation.
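The runtime half of this relies on standard PEP 593 introspection. A minimal, library-free sketch with plain `Annotated` (the `process` function here is a hypothetical stand-in, not datawarden code):

```python
from typing import Annotated, get_type_hints

# A stand-in for datawarden's Validated alias: plain PEP 593 Annotated
def process(data: Annotated[list, "Finite"]) -> float:
    return float(sum(data))

# The metadata survives at runtime and can be introspected,
# which is what a decorator like @validate relies on
hints = get_type_hints(process, include_extras=True)
print(hints["data"].__metadata__)  # ('Finite',)
```

Without `include_extras=True`, `get_type_hints` strips the metadata and returns the bare type, which is exactly how type checkers see it.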
Opt-in Strictness
Validation in datawarden is opt-in. By default, arguments wrapped in Validated[...] are only checked for type compatibility (via other tools) unless validators are provided.
To enforce strict checks like "no NaNs" or "not empty", you must explicitly add the corresponding validators:
```python
from datawarden import validate, Validated, NonNaN, NonEmpty
import pandas as pd

@validate
def process_flexible(data: Validated[pd.Series, None]) -> float:
    """Accepts any Series (NaNs and empty allowed)."""
    if data.empty:
        return 0.0
    return data.sum()

@validate
def process_strict(data: Validated[pd.Series, NonNaN, NonEmpty]) -> float:
    """Rejects NaNs and empty data."""
    return data.sum()
```
Comparison with Pandera
While Pandera is excellent for comprehensive schema validation, datawarden offers a lighter-weight alternative focused on:
- Inline validation - Constraints live in function signatures
- Decorator simplicity - A single `@validate` decorator
- Type annotation syntax - Uses Python's native `Annotated` types
- Minimal overhead - Lightweight with no heavy dependencies
Use datawarden when you want simple, inline validation. Use Pandera when you need comprehensive schema management, complex validation logic, or data contracts.
Performance
datawarden is designed to be lightweight with minimal overhead:
- Validation checks run only when `skip_validation=False` (the default)
- No schema compilation or complex preprocessing
- Direct numpy/pandas operations for validation
- Validation can be disabled entirely for production performance
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.
Why datawarden?
Problem: When building data analysis pipelines with pandas, you often need to validate:
- Data has no NaN or Inf values
- DataFrames have required columns
- Values are in expected ranges
- Indices are properly formatted
Traditional approach: Add manual validation checks at the start of each function.
With datawarden: Express validation constraints directly in type annotations using Validated[Type, Validator, ...] and get automatic validation with the @validate decorator.