Declarative data validation library for Pandas
datawarden
High-performance, JIT-accelerated data validation for Pandas and NumPy.
datawarden provides a clean, type-safe way to express data validation constraints directly in function signatures. Validation rules are declared with Python type hints and then compiled into optimized machine code using Numba JIT for near-zero runtime overhead.
🚀 Why Datawarden?
- 🎯 Type-Safe Declarations: Use Annotated types (Validated[T, ...]) to define constraints directly in your function signatures.
- ⚡ Numba JIT Acceleration: Complex logical chains are fused and compiled, achieving up to 75x speedups over vectorized NumPy/Pandas for certain operations.
- 🧵 Parallel Execution: Automatically validates multiple function arguments in parallel using a thread pool.
- 📦 Memory Efficient: Supports chunked validation, allowing you to validate datasets larger than your RAM with O(1) memory overhead.
- 🔧 N-ary Comparisons: Compare multiple columns (e.g., Ge('high', 'low', 'open')) with zero-copy JIT execution.
- 🔄 Cross-Chunk Continuity: Built-in support for stateful sequence validation (e.g., monotonicity across streaming data chunks).
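The parallel-execution idea in the list above can be sketched with the standard library alone. This is a minimal illustration, not datawarden's implementation: `check_finite`, `check_positive`, and `validate_arguments` are hypothetical names, and the sketch ignores any row-count threshold. Threads only pay off when the underlying check releases the GIL, which compiled Numba kernels can do; the pure-Python checks here only show the control flow.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def check_finite(values):
    """Return an error message, or None when every value is finite."""
    if all(math.isfinite(v) for v in values):
        return None
    return "contains non-finite values (NaN/Inf)"


def check_positive(value):
    """Return an error message, or None when the value is > 0."""
    return None if value > 0 else f"{value!r} is not > 0"


def validate_arguments(checks, max_workers=4):
    """Run one check per argument on a thread pool and collect the failures.

    `checks` maps an argument name to a (check_function, value) pair.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every check first so they can run concurrently.
        futures = {
            name: pool.submit(check, value)
            for name, (check, value) in checks.items()
        }
        results = {name: future.result() for name, future in futures.items()}
    # Keep only the arguments that failed.
    return {name: msg for name, msg in results.items() if msg is not None}


errors = validate_arguments({
    "prices": (check_finite, [100.0, float("nan"), 102.0]),
    "threshold": (check_positive, 0.01),
})
print(errors)
```

Collecting all failures before reporting, rather than stopping at the first one, is a design choice of this sketch; it lets a single error report cover every argument.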
📦 Installation
pip install datawarden
Or with uv:
uv add datawarden
🛠️ Quick Start
import pandas as pd
import numpy as np
from datawarden import validate, Validated, Gt, Finite, NotEmpty
@validate
def calculate_returns(
    prices: Validated[pd.Series, NotEmpty, Finite],
    threshold: Validated[float, Gt(0)] = 0.01,
) -> pd.Series:
    """
    prices is validated to be NotEmpty and have only Finite values (no NaN/Inf).
    threshold is validated to be > 0.
    """
    return prices.pct_change()
# Valid data passes through
prices = pd.Series([100.0, 102.0, 101.0, 103.0])
returns = calculate_returns(prices)
# Invalid data raises ValidationError with a detailed report
bad_prices = pd.Series([100.0, np.nan, 102.0])
# Raises: ValidationError: Data contains non-finite values (NaN/Inf)
calculate_returns(bad_prices)
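For readers curious how a decorator can pick validation rules out of a signature, here is a minimal sketch built on the standard library's `typing.Annotated`. It is an assumption about the general technique, not datawarden's code: the `validate`, `Gt`, `Finite`, and `ValidationError` names below are local stand-ins defined in the snippet, they work on plain lists and floats, and nothing is JIT-compiled.

```python
import functools
import inspect
import math
from typing import Annotated, get_args, get_origin, get_type_hints


class ValidationError(ValueError):
    """Raised when an argument fails one of its declared checks."""


class Gt:
    """Stand-in validator: the value must be strictly greater than `bound`."""

    def __init__(self, bound):
        self.bound = bound

    def __call__(self, value):
        if not value > self.bound:
            raise ValidationError(f"value {value!r} is not > {self.bound!r}")


class Finite:
    """Stand-in validator: every element must be finite (no NaN/Inf)."""

    def __call__(self, values):
        if not all(math.isfinite(v) for v in values):
            raise ValidationError("Data contains non-finite values (NaN/Inf)")


def validate(func):
    """Run the checks found in each parameter's Annotated metadata."""
    hints = get_type_hints(func, include_extras=True)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()  # defaults are validated too
        for name, value in bound.arguments.items():
            hint = hints.get(name)
            if get_origin(hint) is Annotated:
                for check in get_args(hint)[1:]:
                    # Accept both `Finite` (class) and `Finite()` (instance).
                    if isinstance(check, type):
                        check = check()
                    check(value)
        return func(*args, **kwargs)

    return wrapper


@validate
def scale(
    prices: Annotated[list, Finite],
    factor: Annotated[float, Gt(0)] = 1.0,
) -> list:
    return [p * factor for p in prices]


print(scale([1.0, 2.0], 2.0))
```

Calling `scale([1.0, float("nan")])` or `scale([1.0], factor=0)` raises `ValidationError` before the function body runs.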
💎 Advanced Features
🔗 Logical Composition
Combine validators using standard Python logical operators. datawarden will fuse these into a single optimized pass.
from datawarden import Ge, Le, IsNaN
# Value must be between 0 and 1, or can be NaN
UnitValue = Validated[pd.Series, (Ge(0) & Le(1)) | IsNaN()]
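One way such composition can work is through operator overloading. The sketch below is a pure-Python illustration under that assumption: the `Check` class and the `Ge`, `Le`, and `IsNaN` factories are local stand-ins, not the library's classes, and the "single pass" here is an ordinary Python loop rather than a compiled kernel.

```python
import math


class Check:
    """Stand-in element-wise predicate that supports `&` and `|`."""

    def __init__(self, predicate, label):
        self.predicate = predicate
        self.label = label

    def __and__(self, other):
        return Check(
            lambda x: self.predicate(x) and other.predicate(x),
            f"({self.label} & {other.label})",
        )

    def __or__(self, other):
        return Check(
            lambda x: self.predicate(x) or other.predicate(x),
            f"({self.label} | {other.label})",
        )

    def failures(self, values):
        """Return the positions that fail, visiting each value once."""
        return [i for i, v in enumerate(values) if not self.predicate(v)]


def Ge(bound):
    return Check(lambda x: x >= bound, f"Ge({bound})")


def Le(bound):
    return Check(lambda x: x <= bound, f"Le({bound})")


def IsNaN():
    return Check(lambda x: math.isnan(x), "IsNaN()")


# Value must be between 0 and 1, or can be NaN
unit_value = (Ge(0) & Le(1)) | IsNaN()
print(unit_value.label)
print(unit_value.failures([0.0, 0.5, float("nan"), 1.5, -0.1]))
```

Because the combined predicate is a single callable, the data is traversed once no matter how many validators are chained.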
📊 N-ary Column Comparisons
Validate relationships across multiple columns in a DataFrame without manual iteration or heavy Pandas operations.
from datawarden import Ge
# Validates that 'max' >= 'min' AND 'min' >= 'base' for all rows
@validate
def check_bounds(df: Validated[pd.DataFrame, Ge('max', 'min', 'base')]):
    ...
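The chained semantics described in the comment above ('max' >= 'min' AND 'min' >= 'base') can be spelled out in plain Python. This is only a sketch of the rule being checked: `chain_ge_violations` is a hypothetical helper, and rows are modelled as dictionaries instead of a DataFrame.

```python
def chain_ge_violations(rows, *columns):
    """Return the indices of rows where any adjacent pair breaks a >= b."""
    bad = []
    for i, row in enumerate(rows):
        values = [row[c] for c in columns]
        # Compare each column with the next one in the chain.
        if not all(a >= b for a, b in zip(values, values[1:])):
            bad.append(i)
    return bad


rows = [
    {"max": 10, "min": 5, "base": 1},  # valid
    {"max": 4, "min": 6, "base": 2},   # max < min
    {"max": 9, "min": 3, "base": 7},   # min < base
]
print(chain_ge_violations(rows, "max", "min", "base"))
```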
📈 Stateful Sequence Validation
Maintain validation state across data chunks – essential for streaming pipelines.
from datawarden import Index, MonoUp, NoTimeGaps
# Ensure timestamps are strictly increasing and have no gaps across all chunks
@validate
def ingest_stream(
    chunk: Validated[pd.DataFrame, Index(MonoUp(strict=True) & NoTimeGaps("1min"))],
):
    ...
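To make "state across chunks" concrete, the following sketch carries the last value of one chunk into the check of the next. It assumes nothing about datawarden's internals: `StrictlyIncreasing` is a hypothetical class, and chunks are plain lists.

```python
class StrictlyIncreasing:
    """Stand-in stateful check that remembers the last value it has seen."""

    def __init__(self):
        self._last = None  # last value of the previous chunk, if any

    def check(self, chunk):
        previous = self._last
        for offset, value in enumerate(chunk):
            if previous is not None and not value > previous:
                raise ValueError(
                    f"value {value!r} at chunk offset {offset} "
                    f"is not greater than {previous!r}"
                )
            previous = value
        # Commit the new state only after the whole chunk has passed.
        self._last = previous


checker = StrictlyIncreasing()
checker.check([1, 2, 3])
checker.check([4, 5])

try:
    # Valid on its own, but 5 is not greater than the previous chunk's 5.
    checker.check([5, 6])
except ValueError as exc:
    print(exc)
```

The last call shows why the state matters: `[5, 6]` is increasing in isolation and only fails at the chunk boundary.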
⚡ Performance Benchmarks
datawarden is built for speed. By fusing operations and avoiding intermediate allocations, it significantly outperforms standard approaches on large datasets (~10M+ rows).
| Operation | Pandas/NumPy | Datawarden (JIT) | Improvement |
|---|---|---|---|
| Ge(0) & Le(1) | ~15ms | ~0.2ms | 75x |
| MonoUp (Monotonic) | ~24ms | ~8ms | 3x |
| Multi-column Ge | ~45ms | ~0.5ms | 90x |
[!NOTE] Benchmarks performed on a modern CPU with 10M rows. Numba fusion provides the biggest gains for complex logical chains.
🛠️ Configuration
Fine-tune the behavior of datawarden using the Overrides context manager or global config.
from datawarden import Overrides
# Process a massive dataset in chunks to save memory
with Overrides(chunk_size_rows=100_000, use_numba=True):
    my_heavy_function(massive_df)
# Disable validation for an entire module during import to avoid redundant checks
with Overrides(skip_validation=True):
    import sensitive_library_already_validated
[!NOTE]
Overrides(skip_validation=True) is particularly useful when you import a library that uses datawarden internally but have already validated the data upstream, or when you want to disable validation for performance in a production environment.
| Option | Default | Description |
|---|---|---|
| skip_validation | False | Globally disable validation for production hot-loops. |
| warn_only | False | Log a warning instead of raising ValidationError. |
| chunk_size_rows | None | Automatically split large data into chunks for memory efficiency. |
| use_numba | True | Enable/Disable JIT compilation via Numba. |
| parallel_threshold | 100,000 | Minimum row count to trigger parallel multi-argument validation. |
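A scoped override of this kind is commonly built from a context variable and a context manager. The sketch below shows that pattern using the option names and defaults from the table; `Config`, `overrides`, and `get_config` are hypothetical names for illustration and say nothing about how datawarden stores its settings.

```python
import contextlib
import contextvars
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Config:
    """Option names and defaults copied from the table above."""

    skip_validation: bool = False
    warn_only: bool = False
    chunk_size_rows: Optional[int] = None
    use_numba: bool = True
    parallel_threshold: int = 100_000


_current = contextvars.ContextVar("config", default=Config())


def get_config():
    """Return the configuration active in the current context."""
    return _current.get()


@contextlib.contextmanager
def overrides(**changes):
    """Apply `changes` inside the `with` block, then restore the old values."""
    token = _current.set(replace(_current.get(), **changes))
    try:
        yield _current.get()
    finally:
        _current.reset(token)


with overrides(chunk_size_rows=100_000):
    with overrides(skip_validation=True):
        inner = get_config()  # both overrides are visible here
    middle = get_config()     # only chunk_size_rows is still overridden

print(inner.skip_validation, inner.chunk_size_rows)
print(middle.skip_validation, middle.chunk_size_rows)
print(get_config() == Config())
```

Using `dataclasses.replace` means a misspelled option name fails immediately with a `TypeError` instead of being silently ignored.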
📖 Available Validators
Structural
- Index(validator): Apply any validator to the data index.
- Columns(validator): Validate column names/presence.
- Column(name, validator): Apply validator to a specific column.
- Shape(rows, cols): Validate container dimensions.
- NotEmpty / Empty: Check for content existence.
Numeric
- Gt, Ge, Lt, Le, Eq, Ne: Standard comparisons (with multi-column support).
- Finite: No NaN or Inf.
- NotNaN / IsNaN: Null checks.
- Positive / Negative / NonNegative / NonPositive: Sign checks.
Sequence & Stateful
- MonoUp / MonoDown: Monotonicity (strict or non-strict).
- NoTimeGaps(freq): Continuous time series check.
- MaxGap(limit): Maximum interval size check.
Value & Custom
- Between(low, high) / Outside(low, high): Range checks.
- OneOf(*values): Set membership.
- Is(predicate): Custom lambda/function element-wise check.
- Rows(predicate): Custom row-wise DataFrame check.
📜 License
MIT License. See LICENSE for more information.