
datawarden


High-performance, JIT-accelerated data validation for Pandas and NumPy.

datawarden is a high-performance validation library that provides a clean, type-safe way to express data validation constraints directly in function signatures. It uses Python type hints to declare validation rules, which are then compiled into optimized machine code with Numba's JIT compiler for near-zero runtime overhead.


🚀 Why Datawarden?

  • 🎯 Type-Safe Declarations: Use Annotated types (Validated[T, ...]) to define constraints directly in your function signatures.
  • ⚡ Numba JIT Acceleration: Complex logical chains are fused and compiled, achieving up to 75x speedups over vectorized NumPy/Pandas for certain operations.
  • 🧵 Parallel Execution: Automatically validates multiple function arguments in parallel using a thread pool.
  • 📦 Memory Efficient: Supports chunked validation, allowing you to validate datasets larger than your RAM with O(1) memory overhead.
  • 🔧 N-ary Comparisons: Compare multiple columns (e.g., Ge('high', 'low', 'open')) with zero-copy JIT execution.
  • 🔄 Cross-Chunk Continuity: Built-in support for stateful sequence validation (e.g., monotonicity across streaming data chunks).

📦 Installation

pip install datawarden

Or with uv:

uv add datawarden

🛠️ Quick Start

import pandas as pd
import numpy as np
from datawarden import validate, Validated, Gt, Finite, NotEmpty

@validate
def calculate_returns(
    prices: Validated[pd.Series, NotEmpty, Finite],
    threshold: Validated[float, Gt(0)] = 0.01
) -> pd.Series:
    """
    prices is validated to be NotEmpty and have only Finite values (no NaN/Inf).
    threshold is validated to be > 0.
    """
    return prices.pct_change()

# Valid data passes through
prices = pd.Series([100.0, 102.0, 101.0, 103.0])
returns = calculate_returns(prices)

# Invalid data raises ValidationError with a detailed report
bad_prices = pd.Series([100.0, np.nan, 102.0])
# Raises: ValidationError: Data contains non-finite values (NaN/Inf)
calculate_returns(bad_prices)
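Conceptually, the decorator's checks for this signature boil down to a few vectorized conditions. Here is a minimal plain-NumPy/Pandas sketch of the same constraints (the function name and error strings are illustrative, not datawarden internals):

```python
import numpy as np
import pandas as pd

def check_returns_args(prices: pd.Series, threshold: float) -> list[str]:
    """Re-create the Quick Start constraints by hand: NotEmpty, Finite, Gt(0)."""
    errors = []
    if len(prices) == 0:                             # NotEmpty
        errors.append("prices: series is empty")
    elif not np.isfinite(prices.to_numpy()).all():   # Finite (no NaN/Inf)
        errors.append("prices: contains non-finite values (NaN/Inf)")
    if not threshold > 0:                            # Gt(0)
        errors.append("threshold: must be > 0")
    return errors

print(check_returns_args(pd.Series([100.0, 102.0]), 0.01))   # []
print(check_returns_args(pd.Series([100.0, np.nan]), -1.0))  # two error messages
```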

💎 Advanced Features

🔗 Logical Composition

Combine validators using standard Python logical operators. datawarden will fuse these into a single optimized pass.

from datawarden import Validated, Ge, Le, IsNaN

# Value must be between 0 and 1, or can be NaN
UnitValue = Validated[pd.Series, (Ge(0) & Le(1)) | IsNaN()]
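The composed validator above expresses the same condition as this one-line Pandas check; the fused JIT version simply avoids materializing the intermediate boolean arrays:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 0.5, 1.0, np.nan, 1.5])

# (Ge(0) & Le(1)) | IsNaN() written out with ordinary Pandas operations
mask = ((s >= 0) & (s <= 1)) | s.isna()
print(mask.tolist())   # [True, True, True, True, False]
```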

📊 N-ary Column Comparisons

Validate relationships across multiple columns in a DataFrame without manual iteration or heavy Pandas operations.

from datawarden import Ge

# Validates that 'max' >= 'min' AND 'min' >= 'base' for all rows
@validate
def check_bounds(df: Validated[pd.DataFrame, Ge('max', 'min', 'base')]):
    ...
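For reference, Ge('max', 'min', 'base') asserts the same pairwise chain as this explicit Pandas expression, which allocates intermediate boolean columns that the zero-copy JIT path avoids:

```python
import pandas as pd

df = pd.DataFrame({
    "max":  [10, 5, 7],
    "min":  [3, 5, 8],   # third row violates 'max' >= 'min'
    "base": [1, 2, 3],
})

# 'max' >= 'min' AND 'min' >= 'base', row by row
ok = (df["max"] >= df["min"]) & (df["min"] >= df["base"])
print(ok.tolist())   # [True, True, False]
```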

📈 Stateful Sequence Validation

Maintain validation state across data chunks – essential for streaming pipelines.

from datawarden import Index, MonoUp, NoTimeGaps

# Ensure timestamps are strictly increasing and have no gaps across all chunks
@validate
def ingest_stream(chunk: Validated[pd.DataFrame, Index(MonoUp(strict=True) & NoTimeGaps("1min"))]):
    ...

⚡ Performance Benchmarks

datawarden is built for speed. By fusing operations and avoiding intermediate allocations, it significantly outperforms standard approaches on large datasets (~10M+ rows).

| Operation | Pandas/NumPy | datawarden (JIT) | Improvement |
|---|---|---|---|
| `Ge(0) & Le(1)` | ~15 ms | ~0.2 ms | 75x |
| `MonoUp` (monotonic) | ~24 ms | ~8 ms | 3x |
| Multi-column `Ge` | ~45 ms | ~0.5 ms | 90x |

[!NOTE] Benchmarks performed on a modern CPU with 10M rows. Numba fusion provides the biggest gains for complex logical chains.


🛠️ Configuration

Fine-tune the behavior of datawarden using the Overrides context manager or global config.

from datawarden import Overrides

# Process a massive dataset in chunks to save memory
with Overrides(chunk_size_rows=100_000, use_numba=True):
    my_heavy_function(massive_df)

# Disable validation for an entire module during import to avoid redundant checks
with Overrides(skip_validation=True):
    import sensitive_library_already_validated

[!NOTE] Overrides(skip_validation=True) is particularly useful when importing a library that uses datawarden internally but whose data you have already validated upstream, or when you want to disable validation entirely for performance in production.

| Option | Default | Description |
|---|---|---|
| `skip_validation` | `False` | Globally disable validation (e.g., for production hot loops). |
| `warn_only` | `False` | Log a warning instead of raising `ValidationError`. |
| `chunk_size_rows` | `None` | Split large inputs into chunks of this many rows for memory efficiency. |
| `use_numba` | `True` | Enable/disable JIT compilation via Numba. |
| `parallel_threshold` | `100_000` | Minimum row count that triggers parallel multi-argument validation. |
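The chunk_size_rows option corresponds to validating a frame window by window, so only one slice is materialized at a time. A rough sketch of the pattern in plain Pandas (the helper name is illustrative, not datawarden internals):

```python
import pandas as pd

def validate_in_chunks(df: pd.DataFrame, chunk_size: int, check) -> bool:
    """Apply `check` to successive row slices; peak memory stays O(chunk_size)."""
    for start in range(0, len(df), chunk_size):
        if not check(df.iloc[start:start + chunk_size]):
            return False
    return True

df = pd.DataFrame({"x": range(10)})
print(validate_in_chunks(df, 3, lambda c: (c["x"] >= 0).all()))  # True
print(validate_in_chunks(df, 3, lambda c: (c["x"] < 8).all()))   # False
```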

📖 Available Validators

Structural

  • Index(validator): Apply any validator to the data index.
  • Columns(validator): Validate column names/presence.
  • Column(name, validator): Apply validator to a specific column.
  • Shape(rows, cols): Validate container dimensions.
  • NotEmpty / Empty: Check for content existence.

Numeric

  • Gt, Ge, Lt, Le, Eq, Ne: Standard comparisons (with multi-column support).
  • Finite: No NaN or Inf.
  • NotNaN / IsNaN: Null checks.
  • Positive / Negative / NonNegative / NonPositive: Sign checks.

Sequence & Stateful

  • MonoUp / MonoDown: Monotonicity (strict or non-strict).
  • NoTimeGaps(freq): Continuous time series check.
  • MaxGap(limit): Maximum interval size check.
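A NoTimeGaps("1min")-style check is equivalent to asserting that every consecutive index difference equals the expected frequency. In plain Pandas (the helper name is illustrative):

```python
import pandas as pd

idx = pd.date_range("2024-01-01 09:00", periods=4, freq="1min")
gapped = idx.delete(2)   # drop 09:02, leaving a 2-minute gap

def has_no_gaps(index: pd.DatetimeIndex, freq: str) -> bool:
    diffs = pd.Series(index).diff().dropna()
    return bool((diffs == pd.Timedelta(freq)).all())

print(has_no_gaps(idx, "1min"))     # True
print(has_no_gaps(gapped, "1min"))  # False
```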

Value & Custom

  • Between(low, high) / Outside(low, high): Range checks.
  • OneOf(*values): Set membership.
  • Is(predicate): Custom lambda/function element-wise check.
  • Rows(predicate): Custom row-wise DataFrame check.

📜 License

MIT License. See LICENSE for more information.
