
datacontracts v0.1.5 lets you define business-friendly data rules and reports every invalid row with clear messages.

datacontracts: Minimal Data Contracts for Pandas

A small Python library for enforcing explicit data contracts on pandas DataFrames.

datacontracts lets you define business rules for your data and:

  • fail fast when data is invalid
  • optionally auto-correct safe violations

Minimal, explicit, and predictable.


Why this exists

pandas is flexible, and that flexibility lets data quality issues slip in silently:

  • wrong types
  • out-of-range values
  • unexpected categories

These issues are usually discovered late: in dashboards, models, or production.

datacontracts stops bad data early.
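
To make the problem concrete, here is a small illustration in plain pandas (it does not use datacontracts). The `status` column and its values are hypothetical, added only to show an unexpected category.

```python
import pandas as pd

# One stray float silently turns a whole integer column into a float column.
prices = pd.Series([99, 120, 50])
prices_with_float = pd.Series([99.5, 120, 50])
print(prices.dtype)             # an integer dtype
print(prices_with_float.dtype)  # a float dtype

# Out-of-range values and unexpected categories load without any complaint.
df = pd.DataFrame({
    "stock": [5, 15, 0],                        # 15 and 0 may be out of range
    "status": ["active", "inactive", "actve"],  # hypothetical column with a typo
})
print(len(df))  # all three rows are present; nothing was flagged
```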


Installation

pip install datacontracts

Usage (v0.1.5)

The core workflow uses Python classes to define the contract, making it explicit and readable.

Quick Example: Fail Fast (Default)

By default, datacontracts runs in fail-fast mode: validation raises an error that reports every violation with a clear, row-level message.

1. Define a Contract

We use expressive, business-friendly rules like lt, gt, and between.

from datacontracts import Contract, Column
import pandas as pd

class ProductContract(Contract):
    # Must be less than 100
    price = Column(int, lt=100)
    
    # Must be between 1 and 9 (inclusive)
    stock = Column(int, between=(1, 9))

2. Validate (Fail Fast)

df = pd.DataFrame({
    "price": [99, 120, 50], # 120 is invalid
    "stock": [5, 15, 0]     # 15 and 0 are invalid
})

# This will raise a ContractError, reporting all three violations
ProductContract.validate(df) 
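
To show what "reporting all three violations" means for this DataFrame, the following is a rough plain-pandas sketch of the same two rules. It is not the library's implementation. The `collect_violations` helper and the message wording are assumptions made for illustration, and the messages datacontracts actually produces may differ.

```python
import pandas as pd

def collect_violations(df: pd.DataFrame) -> list[str]:
    """Return one message per invalid cell (hypothetical message format)."""
    messages = []
    # price must be strictly less than 100
    bad_price = df.loc[~(df["price"] < 100), "price"]
    for row, value in bad_price.items():
        messages.append(f"row {row}: price={value} must be < 100")
    # stock must lie in the inclusive range 1..9
    bad_stock = df.loc[~df["stock"].between(1, 9), "stock"]
    for row, value in bad_stock.items():
        messages.append(f"row {row}: stock={value} must be between 1 and 9")
    return messages

df = pd.DataFrame({"price": [99, 120, 50], "stock": [5, 15, 0]})
violations = collect_violations(df)
for message in violations:
    print(message)
```

The sketch collects every failing cell before reporting, rather than stopping at the first one, which matches the behaviour described above.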

New in v0.1.5: Validate and Auto-Correct

For safe, unambiguous violations (such as type coercion or clamping to a boundary), v0.1.5 introduces an optional auto-correction mode. It lets data keep flowing while ensuring it conforms to the contract.

3. Validate and Fix

Pass fix=True to the validate method. The method will return the corrected DataFrame and log any changes made.

# Example data with a type violation (float instead of int) and a range violation
df_to_fix = pd.DataFrame({
    "price": [99.5, 120, 50], # 99.5 (type violation), 120 (range violation)
    "stock": [5, 15, 0]
})

# This returns a corrected DataFrame and logs the changes
corrected_df = ProductContract.validate(df_to_fix, fix=True)

# corrected_df will now have:
# price: [99, 100, 50] (99.5 coerced to 99, 120 clamped to 100)
# stock: [5, 9, 1] (15 clamped to 9, 0 clamped to 1)

Note: Auto-correction is only applied to violations where the fix is explicit and safe (e.g., clamping a value to a defined boundary, or coercing a float to an integer). Ambiguous violations (like missing values or unexpected categories) will still raise an error unless explicitly handled.
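
The two kinds of fix named here, coercion and clamping, can be sketched in plain pandas as follows. This only reproduces the corrected values shown in the example above and is not how datacontracts is implemented. In particular, the example clamps 120 to 100 even though lt=100 reads as a strict bound; the sketch copies that documented result and does not settle which boundary value the library actually uses.

```python
import pandas as pd

df_to_fix = pd.DataFrame({
    "price": [99.5, 120, 50],
    "stock": [5, 15, 0],
})

fixed = df_to_fix.copy()
# Coerce floats to int (astype truncates toward zero: 99.5 -> 99), then cap at 100.
fixed["price"] = fixed["price"].astype(int).clip(upper=100)
# Clamp stock into the inclusive range 1..9.
fixed["stock"] = fixed["stock"].clip(lower=1, upper=9)

print(fixed["price"].tolist())  # [99, 100, 50]
print(fixed["stock"].tolist())  # [5, 9, 1]
```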


Contract Specification Details

The Column object supports the following constraints:

| Constraint | Type | Description |
|---|---|---|
| Type | type (e.g., int, str, float) | The required Python type for the column's values. Coercible types can be fixed with fix=True. |
| lt | Number | Less than (e.g., lt=100). Violations can be clamped with fix=True. |
| gt | Number | Greater than (e.g., gt=50). Violations can be clamped with fix=True. |
| between | Tuple[Number, Number] | Inclusive range (e.g., between=(1, 9)). Violations can be clamped with fix=True. |
| allowed | list or set | A collection of all permissible categorical values. |
| unique | bool | If True, all values in the column must be unique (no duplicates). |
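
The allowed and unique constraints are not shown in the examples above. As a rough guide to what they check, here is a plain-pandas sketch (again not the library's code). The `sku` and `status` columns and the permitted set are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "sku": ["A1", "B2", "A1"],                 # hypothetical column: "A1" repeats
    "status": ["active", "retired", "actve"],  # hypothetical column: typo in last row
})

allowed_status = {"active", "retired"}  # hypothetical permitted values

# allowed: flag values that fall outside the permitted set
not_allowed = df.loc[~df["status"].isin(allowed_status), "status"]
# unique: flag every row that takes part in a duplicate
duplicated = df.loc[df["sku"].duplicated(keep=False), "sku"]

print(not_allowed.to_dict())  # {2: 'actve'}
print(duplicated.to_dict())   # {0: 'A1', 2: 'A1'}
```

Neither violation has an obvious deterministic repair, which is consistent with the table marking only the type, lt, gt, and between constraints as fixable.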

Scope and Philosophy

Correctness Before Convenience

The introduction of fix=True does not compromise the library's core philosophy.

  • Explicit Control: Auto-correction is opt-in. The default remains fail fast.
  • Safe Violations Only: Only violations with clear, deterministic fixes (clamping, type coercion) are corrected. Violations that require business logic (e.g., unexpected categories) still raise an error.
  • Transparency: All corrections are logged, ensuring a clear audit trail of data modifications.
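
As an illustration of the audit-trail idea, the sketch below compares a DataFrame before and after correction and logs each changed cell. The `log_corrections` helper, the logger name, and the log format are assumptions for this example; the README does not specify how datacontracts formats its own log output.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("contract_fixes")  # hypothetical logger name

def log_corrections(before: pd.DataFrame, after: pd.DataFrame) -> list[tuple]:
    """Log and return (row, column, old, new) for every cell that changed."""
    changes = []
    for column in before.columns:
        changed = (before[column] != after[column]).to_numpy()
        for row in before.index[changed]:
            change = (row, column, before.at[row, column], after.at[row, column])
            logger.info("row %s, column %s: %s -> %s", *change)
            changes.append(change)
    return changes

before = pd.DataFrame({"price": [99.5, 120, 50], "stock": [5, 15, 0]})
after = pd.DataFrame({"price": [99, 100, 50], "stock": [5, 9, 1]})
changes = log_corrections(before, after)
```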

What this library does NOT do

  • SQL or database-level validation
  • Spark or distributed data processing
  • Statistical drift detection or complex profiling
  • Schema inference (contracts must be explicit)

Development

Run tests:

python -m pytest

License

MIT

