
datawarden


High-performance, JIT-accelerated data validation for Pandas and NumPy.

datawarden is a high-performance validation library that provides a clean, type-safe way to express data validation constraints directly in function signatures. It uses Python type hints to declare validation rules, which are then compiled into optimized machine code with Numba's JIT compiler for near-zero runtime overhead.


🚀 Why Datawarden?

  • 🎯 Type-Safe Declarations: Use Annotated types (Validated[T, ...]) to define constraints directly in your function signatures.
  • ⚡ Numba JIT Acceleration: Complex logical chains are fused and compiled, achieving up to 75x speedups over vectorized NumPy/Pandas for certain operations.
  • 🧵 Parallel Execution: Automatically validates multiple function arguments in parallel using a thread pool.
  • 📦 Memory Efficient: Supports chunked validation, allowing you to validate datasets larger than your RAM with O(1) memory overhead.
  • 🔧 N-ary Comparisons: Compare multiple columns (e.g., Ge('high', 'low', 'open')) with zero-copy JIT execution.
  • 🔄 Cross-Chunk Continuity: Built-in support for stateful sequence validation (e.g., monotonicity across streaming data chunks).

📦 Installation

pip install datawarden

Or with uv:

uv add datawarden

🛠️ Quick Start

import pandas as pd
import numpy as np
from datawarden import validate, Validated, Gt, Finite, NotEmpty

@validate
def calculate_returns(
    prices: Validated[pd.Series, NotEmpty, Finite],
    threshold: Validated[float, Gt(0)] = 0.01
) -> pd.Series:
    """
    prices is validated to be NotEmpty and have only Finite values (no NaN/Inf).
    threshold is validated to be > 0.
    """
    return prices.pct_change()

# Valid data passes through
prices = pd.Series([100.0, 102.0, 101.0, 103.0])
returns = calculate_returns(prices)

# Invalid data raises ValidationError with a detailed report
bad_prices = pd.Series([100.0, np.nan, 102.0])
# Raises: ValidationError: Data contains non-finite values (NaN/Inf)
calculate_returns(bad_prices)
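Conceptually, the decorator's checks for this signature boil down to a few vectorized conditions. Here is a minimal plain-NumPy/Pandas sketch of the same constraints (the function name and error strings are illustrative, not datawarden internals):

```python
import numpy as np
import pandas as pd

def check_returns_args(prices: pd.Series, threshold: float) -> list[str]:
    """Re-create the Quick Start constraints by hand: NotEmpty, Finite, Gt(0)."""
    errors = []
    if len(prices) == 0:                             # NotEmpty
        errors.append("prices: series is empty")
    elif not np.isfinite(prices.to_numpy()).all():   # Finite (no NaN/Inf)
        errors.append("prices: contains non-finite values (NaN/Inf)")
    if not threshold > 0:                            # Gt(0)
        errors.append("threshold: must be > 0")
    return errors

print(check_returns_args(pd.Series([100.0, 102.0]), 0.01))   # []
print(check_returns_args(pd.Series([100.0, np.nan]), -1.0))  # two error messages
```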

💎 Advanced Features

🔗 Logical Composition

Combine validators using standard Python logical operators. datawarden will fuse these into a single optimized pass.

from datawarden import Validated, Ge, Le, IsNaN

# Value must be between 0 and 1, or can be NaN
UnitValue = Validated[pd.Series, (Ge(0) & Le(1)) | IsNaN()]
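The composed validator above expresses the same condition as this one-line Pandas check; the fused JIT version simply avoids materializing the intermediate boolean arrays:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 0.5, 1.0, np.nan, 1.5])

# (Ge(0) & Le(1)) | IsNaN() written out with ordinary Pandas operations
mask = ((s >= 0) & (s <= 1)) | s.isna()
print(mask.tolist())   # [True, True, True, True, False]
```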

📊 N-ary Column Comparisons

Validate relationships across multiple columns in a DataFrame without manual iteration or heavy Pandas operations.

from datawarden import Ge

# Validates that 'max' >= 'min' AND 'min' >= 'base' for all rows
@validate
def check_bounds(df: Validated[pd.DataFrame, Ge('max', 'min', 'base')]):
    ...
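For reference, Ge('max', 'min', 'base') asserts the same pairwise chain as this explicit Pandas expression, which allocates intermediate boolean columns that the zero-copy JIT path avoids:

```python
import pandas as pd

df = pd.DataFrame({
    "max":  [10, 5, 7],
    "min":  [3, 5, 8],   # third row violates 'max' >= 'min'
    "base": [1, 2, 3],
})

# 'max' >= 'min' AND 'min' >= 'base', row by row
ok = (df["max"] >= df["min"]) & (df["min"] >= df["base"])
print(ok.tolist())   # [True, True, False]
```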

📈 Stateful Sequence Validation

Maintain validation state across data chunks – essential for streaming pipelines.

from datawarden import Index, MonoUp, NoTimeGaps

# Ensure timestamps are strictly increasing and have no gaps across all chunks
@validate
def ingest_stream(chunk: Validated[pd.DataFrame, Index(MonoUp(strict=True) & NoTimeGaps("1min"))]):
    ...

⚡ Performance Benchmarks

datawarden is built for speed. By fusing operations and avoiding intermediate allocations, it significantly outperforms standard approaches on large datasets (~10M+ rows).

| Operation | Pandas/NumPy | datawarden (JIT) | Improvement |
|---|---|---|---|
| `Ge(0) & Le(1)` | ~15 ms | ~0.2 ms | 75x |
| `MonoUp` (monotonic) | ~24 ms | ~8 ms | 3x |
| Multi-column `Ge` | ~45 ms | ~0.5 ms | 90x |

[!NOTE] Benchmarks performed on a modern CPU with 10M rows. Numba fusion provides the biggest gains for complex logical chains.


🛠️ Configuration

Fine-tune the behavior of datawarden using the Overrides context manager or global config.

from datawarden import Overrides

# Process a massive dataset in chunks to save memory
with Overrides(chunk_size_rows=100_000, use_numba=True):
    my_heavy_function(massive_df)

# Disable validation for an entire module during import to avoid redundant checks
with Overrides(skip_validation=True):
    import sensitive_library_already_validated

[!NOTE] Overrides(skip_validation=True) is particularly useful when importing a library that uses datawarden internally but whose data you have already validated upstream, or when you want to disable validation entirely for performance in production.

| Option | Default | Description |
|---|---|---|
| `skip_validation` | `False` | Globally disable validation (e.g., for production hot loops). |
| `warn_only` | `False` | Log a warning instead of raising `ValidationError`. |
| `chunk_size_rows` | `None` | Split large inputs into chunks of this many rows for memory efficiency. |
| `use_numba` | `True` | Enable/disable JIT compilation via Numba. |
| `parallel_threshold` | `100_000` | Minimum row count that triggers parallel multi-argument validation. |
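The chunk_size_rows option corresponds to validating a frame window by window, so only one slice is materialized at a time. A rough sketch of the pattern in plain Pandas (the helper name is illustrative, not datawarden internals):

```python
import pandas as pd

def validate_in_chunks(df: pd.DataFrame, chunk_size: int, check) -> bool:
    """Apply `check` to successive row slices; peak memory stays O(chunk_size)."""
    for start in range(0, len(df), chunk_size):
        if not check(df.iloc[start:start + chunk_size]):
            return False
    return True

df = pd.DataFrame({"x": range(10)})
print(validate_in_chunks(df, 3, lambda c: (c["x"] >= 0).all()))  # True
print(validate_in_chunks(df, 3, lambda c: (c["x"] < 8).all()))   # False
```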

📖 Available Validators

Structural

  • Index(validator): Apply any validator to the data index.
  • Columns(validator): Validate column names/presence.
  • Column(name, validator): Apply validator to a specific column.
  • Shape(rows, cols): Validate container dimensions.
  • NotEmpty / Empty: Check for content existence.

Numeric

  • Gt, Ge, Lt, Le, Eq, Ne: Standard comparisons (with multi-column support).
  • Finite: No NaN or Inf.
  • NotNaN / IsNaN: Null checks.
  • Positive / Negative / NonNegative / NonPositive: Sign checks.

Sequence & Stateful

  • MonoUp / MonoDown: Monotonicity (strict or non-strict).
  • NoTimeGaps(freq): Continuous time series check.
  • MaxGap(limit): Maximum interval size check.
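A NoTimeGaps("1min")-style check is equivalent to asserting that every consecutive index difference equals the expected frequency. In plain Pandas (the helper name is illustrative):

```python
import pandas as pd

idx = pd.date_range("2024-01-01 09:00", periods=4, freq="1min")
gapped = idx.delete(2)   # drop 09:02, leaving a 2-minute gap

def has_no_gaps(index: pd.DatetimeIndex, freq: str) -> bool:
    diffs = pd.Series(index).diff().dropna()
    return bool((diffs == pd.Timedelta(freq)).all())

print(has_no_gaps(idx, "1min"))     # True
print(has_no_gaps(gapped, "1min"))  # False
```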

Value & Custom

  • Between(low, high) / Outside(low, high): Range checks.
  • OneOf(*values): Set membership.
  • Is(predicate): Custom lambda/function element-wise check.
  • Rows(predicate): Custom row-wise DataFrame check.

📜 License

MIT License. See LICENSE for more information.
