Create validation classes for your data

ValFrame: Schema-Validated DataFrames for Robust Data Pipelines

ValFrame is a Python library for creating self-validating DataFrame types using Pandera schemas, supporting both in-memory and out-of-core (folder-based) data.

The core motivation is to use Python's type system to enforce data validity at runtime. A ValFrame type validates its contents when it is instantiated, so a function that is decorated with @beartype and annotated with that type only ever receives data that has already passed the schema. This catches malformed data at the boundary, before it can cause downstream errors, and makes data pipelines more reliable.


Quick Start

Install ValFrame from PyPI:

pip install valframe

Define a schema and create a validated DataFrame type:

import pandas as pd
import pandera.pandas as pa
from valframe import create_valframe_type

# Define a schema for your data
UserSchema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.ge(0)),
    "name": pa.Column(str)
})

# Create a validated DataFrame type
UserDataFrame = create_valframe_type("UserDataFrame", UserSchema, library="pandas")

# This succeeds
valid_df = UserDataFrame(pd.DataFrame({"user_id": [1, 2], "name": ["Alice", "Bob"]}))

# This will raise a pandera.errors.SchemaError
invalid_df = UserDataFrame(pd.DataFrame({"user_id": [-1, 0], "name": ["Carl", "Eve"]}))

Features

  • Schema-First Validation: Build DataFrame types directly from Pandera schemas.
  • In-Memory Validation: Create DataFrame objects that validate their contents upon instantiation.
  • Folder-Based Virtual Frames: Treat a directory of data files as a single, indexable DataFrame without loading the entire dataset into memory.
  • Pandas & Polars Support: Works with both pandas and Polars DataFrames.
  • Lazy Validation: Defer validation on folder-based frames until data is accessed for faster initialization.
  • Type System Integration: Designed to work with runtime type checkers such as beartype, so that data contracts are enforced at function boundaries.
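
The folder-based and lazy-validation features can be pictured with a small, dependency-free sketch. This is not ValFrame's API or implementation: the class `LazyFolderFrame`, the `validate_row` callback, and the one-chunk-per-file layout are hypothetical names and assumptions chosen for illustration. The sketch only lists the CSV files at construction time, then parses and validates each file the first time it is accessed.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable


class LazyFolderFrame:
    """Hypothetical stand-in: treats each CSV file in a folder as one chunk."""

    def __init__(self, folder: Path, validate_row: Callable[[dict], None]):
        # Only list the files here; nothing is read or validated yet.
        self._files = sorted(folder.glob("*.csv"))
        self._validate_row = validate_row
        self._cache: dict[int, list[dict]] = {}

    def __len__(self) -> int:
        return len(self._files)

    def __getitem__(self, index: int) -> list[dict]:
        # Read and validate a chunk the first time it is requested.
        if index not in self._cache:
            with self._files[index].open(newline="") as handle:
                rows = list(csv.DictReader(handle))
            for row in rows:
                self._validate_row(row)
            self._cache[index] = rows
        return self._cache[index]


def check_user(row: dict) -> None:
    """Mirror the quick-start rule: user_id must be a non-negative integer."""
    if int(row["user_id"]) < 0:
        raise ValueError(f"user_id must be >= 0, got {row['user_id']}")


def build_demo_folder() -> Path:
    """Write one valid and one invalid CSV file into a temporary folder."""
    folder = Path(tempfile.mkdtemp())
    (folder / "part_0.csv").write_text("user_id,name\n1,Alice\n2,Bob\n")
    (folder / "part_1.csv").write_text("user_id,name\n-1,Carl\n")
    return folder


frame = LazyFolderFrame(build_demo_folder(), check_user)  # no file is parsed here
first_chunk = frame[0]  # part_0.csv is read and validated now
try:
    frame[1]  # part_1.csv fails validation only when it is accessed
    second_chunk_error = None
except ValueError as exc:
    second_chunk_error = str(exc)
```

The trade-off in this sketch is that construction stays cheap however many files the folder holds, but an invalid file is only discovered when it is first touched.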

Supported Formats

ValFrame's folder-based mode supports reading from the following file formats:

  • csv
  • parquet

Relative Positioning

ValFrame aims for a balance between strong data-integrity guarantees and moderate processing efficiency:

  • Unlike pydantic-pandas, it uses vectorized validation via Pandera, making it significantly more performant on large datasets, especially with Polars.
  • Compared to high-scale tools like Polars (lazy mode) or Dask, ValFrame's integrity guarantee is inherent and automatic, whereas in lazy frameworks, validation is a manual step that must be explicitly added to the computation graph.
  • While orchestration frameworks like Dagster provide pipeline-level integrity, ValFrame offers a lightweight, low-complexity solution for "medium data" problems: datasets too large for memory but too simple to require a full data engineering framework.

Installation

Install the package directly from PyPI:

pip install valframe

Dependencies

  • Python 3.10+
  • pandera[polars]
  • pandas
  • polars
  • beartype
  • numpy

In-Depth Example: Data Integrity with beartype

This example demonstrates how to combine valframe and beartype to create a function that only accepts data already validated against a schema, so that invalid data fails before the function body runs.

import pandas as pd
import pandera.pandas as pa
from beartype import beartype
from valframe import create_valframe_type

# 1. Define a strict schema for transaction data
TransactionSchema = pa.DataFrameSchema(
    {
        "transaction_id": pa.Column(str, pa.Check.str_startswith("txn_")),
        "amount_usd": pa.Column(float, pa.Check.gt(0)),
        "seller_id": pa.Column(int, pa.Check.ge(1000)),
    },
    strict=True,  # Disallow any columns not defined in the schema
    ordered=True, # Enforce column order
)

# 2. Create a specific, validated DataFrame type for this schema
TransactionDataFrame = create_valframe_type(
    "TransactionDataFrame", TransactionSchema, library="pandas"
)

# 3. Use @beartype to enforce that our function ONLY accepts this type
@beartype
def process_payouts(transactions: TransactionDataFrame) -> float:
    """
    Calculates the total payout amount from a validated DataFrame of transactions.

    @beartype rejects any argument that is not a TransactionDataFrame, and
    a TransactionDataFrame is validated against TransactionSchema when it
    is created, so `transactions` has already passed the schema.
    """
    print("Payout processing started on valid data...")
    total_payout = transactions["amount_usd"].sum()
    return total_payout

# --- Main execution ---
if __name__ == "__main__":
    # a) Create a valid DataFrame
    valid_data = pd.DataFrame({
        "transaction_id": ["txn_123", "txn_456"],
        "amount_usd": [150.50, 75.00],
        "seller_id": [1001, 1024],
    })

    # Instantiate our validated type. This succeeds.
    validated_transactions = TransactionDataFrame(valid_data)
    total = process_payouts(validated_transactions)
    print(f"Total payout is: ${total:.2f}") # Output: Total payout is: $225.50

    print("-" * 20)

    # b) Create an invalid DataFrame
    invalid_data = pd.DataFrame({
        "transaction_id": ["txn_789", "inv_000"], # "inv_000" is invalid
        "amount_usd": [99.99, 50.00],
        "seller_id": [1050, 999], # 999 is invalid
    })

    try:
        # This line will fail immediately upon instantiation,
        # preventing the invalid data from ever reaching our function.
        invalid_transactions = TransactionDataFrame(invalid_data)
        process_payouts(invalid_transactions)
    except pa.errors.SchemaError as e:
        print("Failed to create DataFrame due to validation errors:")
        print(e)
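
The division of labour in the example above can be reduced to a dependency-free sketch: the data is validated once, when the typed object is constructed, and the function boundary only checks the argument's type. The names `ValidatedAmounts` and `enforce_types` are hypothetical, and `enforce_types` is a deliberately simplified stand-in for the kind of check a runtime type checker such as beartype performs. It is not beartype's or ValFrame's implementation.

```python
import functools
import inspect


class ValidatedAmounts(list):
    """Hypothetical validated type: a list of strictly positive floats."""

    def __init__(self, values):
        values = list(values)
        # Validation happens once, at construction time.
        for value in values:
            if not isinstance(value, float) or value <= 0:
                raise ValueError(f"invalid amount: {value!r}")
        super().__init__(values)


def enforce_types(func):
    """Simplified check: isinstance() against plain class annotations.

    Handles ordinary positional/keyword parameters only (no *args/**kwargs).
    """
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = signature.parameters[name].annotation
            if expected is inspect.Parameter.empty:
                continue  # unannotated parameter: nothing to check
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(f"{name} must be {expected.__name__}")
        return func(*args, **kwargs)

    return wrapper


@enforce_types
def total_payout(amounts: ValidatedAmounts) -> float:
    # Safe to sum: every element was checked when `amounts` was built.
    return sum(amounts)


ok_total = total_payout(ValidatedAmounts([150.50, 75.00]))

try:
    total_payout([150.50, 75.00])  # valid values, but not the validated type
    raw_list_error = None
except TypeError as exc:
    raw_list_error = str(exc)

try:
    ValidatedAmounts([99.99, -5.0])  # rejected before any function is called
    bad_data_error = None
except ValueError as exc:
    bad_data_error = str(exc)
```

In this sketch the raw list is rejected even though its values are valid: the check at the function boundary confirms that the data went through the validating constructor, not what the data contains.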

License

Apache 2.0
