Create validation classes for your data
ValFrame: Schema-Validated DataFrames for Robust Data Pipelines
ValFrame is a Python library for creating self-validating DataFrame types using Pandera schemas, supporting both in-memory and out-of-core (folder-based) data.
The core motivation is to use Python's type annotations, enforced at runtime, to guarantee data validity. By creating specific, validated ValFrame types and decorating your functions with @beartype, you can write functions that only ever receive data with the correct shape and characteristics. This prevents downstream errors and makes your data pipelines more robust and reliable.
Quick Start
Install ValFrame from PyPI:
pip install valframe
Define a schema and create a validated DataFrame type:
import pandas as pd
import pandera.pandas as pa
from valframe import create_valframe_type

# Define a schema for your data
UserSchema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.ge(0)),
    "name": pa.Column(str),
})

# Create a validated DataFrame type
UserDataFrame = create_valframe_type("UserDataFrame", UserSchema, library="pandas")

# This succeeds
valid_df = UserDataFrame(pd.DataFrame({"user_id": [1, 2], "name": ["Alice", "Bob"]}))

# This will raise a pandera.errors.SchemaError
invalid_df = UserDataFrame(pd.DataFrame({"user_id": [-1, 0], "name": ["Carl", "Eve"]}))
Features
- Schema-First Validation: Build DataFrame types directly from Pandera schemas.
- In-Memory Validation: Create DataFrame objects that validate their contents upon instantiation.
- Folder-Based Virtual Frames: Treat a directory of data files as a single, indexable DataFrame without loading the entire dataset into memory.
- Pandas & Polars Support: Works seamlessly with both major DataFrame libraries.
- Lazy Validation: Defer validation on folder-based frames until data is accessed for faster initialization.
- Type System Integration: Designed to work with runtime type checkers like beartype to provide strong runtime guarantees about data contracts.
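The folder-based and lazy-validation features can be pictured with a small, dependency-free sketch. Everything in it (the `LazyFolderFrame` class, the `check_user` function, the file names) is hypothetical and is not ValFrame's API or implementation; it only illustrates the idea of listing files at construction time and reading and validating each file when it is first accessed.

```python
import csv
import tempfile
from pathlib import Path


class LazyFolderFrame:
    """Hypothetical sketch: a folder of CSV files seen as one indexable table."""

    def __init__(self, folder, validate_row):
        # Only file names are collected here; no file contents are read yet.
        self.files = sorted(Path(folder).glob("*.csv"))
        self.validate_row = validate_row
        self.files_read = 0

    def __len__(self):
        return len(self.files)

    def __getitem__(self, i):
        # Reading and validation happen on access, one file at a time.
        with self.files[i].open(newline="") as fh:
            rows = list(csv.DictReader(fh))
        self.files_read += 1
        for row in rows:
            self.validate_row(row)
        return rows


def check_user(row):
    """Hypothetical row check mirroring the user_id >= 0 rule."""
    if int(row["user_id"]) < 0:
        raise ValueError(f"negative user_id: {row['user_id']}")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "part_0.csv").write_text("user_id,name\n1,Alice\n2,Bob\n")
        Path(tmp, "part_1.csv").write_text("user_id,name\n-1,Carl\n")
        frame = LazyFolderFrame(tmp, check_user)
        read_at_init = frame.files_read  # nothing has been read yet
        first = frame[0]                 # valid file: read and validated now
        try:
            frame[1]                     # invalid file: rejected on access
            bad_rejected = False
        except ValueError:
            bad_rejected = True
        return read_at_init, len(first), bad_rejected


if __name__ == "__main__":
    print(demo())
```

The trade-off this illustrates is that initialization stays cheap, but an invalid file is only discovered when it is touched.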
Supported Formats
ValFrame's folder-based mode supports reading from the following file formats:
- csv
- parquet
Relative Positioning
ValFrame occupies a unique niche by providing a balance of high data integrity and moderate processing efficiency.
- Unlike pydantic-pandas, it uses vectorized validation via Pandera, making it significantly more performant on large datasets, especially with Polars.
- Compared to high-scale tools like Polars (lazy mode) or Dask, ValFrame's integrity guarantee is inherent and automatic, whereas in lazy frameworks, validation is a manual step that must be explicitly added to the computation graph.
- While orchestration frameworks like Dagster provide pipeline-level integrity, ValFrame offers a lightweight, low-complexity solution perfect for "medium data" problems—datasets too large for memory but too simple to require a full data engineering framework.
Installation
Install the package directly from PyPI:
pip install valframe
Dependencies
- Python 3.10+
- pandera[polars]
- pandas
- polars
- beartype
- numpy
In-Depth Example: Data Integrity with beartype
This example demonstrates how to combine valframe and beartype to create a function that is guaranteed to receive valid data, preventing runtime errors.
import pandas as pd
import pandera.pandas as pa
from beartype import beartype
from valframe import create_valframe_type

# 1. Define a strict schema for transaction data
TransactionSchema = pa.DataFrameSchema(
    {
        "transaction_id": pa.Column(str, pa.Check.str_startswith("txn_")),
        "amount_usd": pa.Column(float, pa.Check.gt(0)),
        "seller_id": pa.Column(int, pa.Check.ge(1000)),
    },
    strict=True,   # Disallow any columns not defined in the schema
    ordered=True,  # Enforce column order
)

# 2. Create a specific, validated DataFrame type for this schema
TransactionDataFrame = create_valframe_type(
    "TransactionDataFrame", TransactionSchema, library="pandas"
)

# 3. Use @beartype to enforce that our function ONLY accepts this type
@beartype
def process_payouts(transactions: TransactionDataFrame) -> float:
    """
    Calculates the total payout amount from a validated DataFrame of transactions.

    Because of the @beartype decorator and the TransactionDataFrame type,
    we are 100% certain that the `transactions` argument is a pandas DataFrame
    and that its contents conform to the TransactionSchema.
    """
    print("Payout processing started on valid data...")
    total_payout = transactions["amount_usd"].sum()
    return total_payout

# --- Main execution ---
if __name__ == "__main__":
    # a) Create a valid DataFrame
    valid_data = pd.DataFrame({
        "transaction_id": ["txn_123", "txn_456"],
        "amount_usd": [150.50, 75.00],
        "seller_id": [1001, 1024],
    })

    # Instantiate our validated type. This succeeds.
    validated_transactions = TransactionDataFrame(valid_data)
    total = process_payouts(validated_transactions)
    print(f"Total payout is: ${total:.2f}")  # Output: Total payout is: $225.50

    print("-" * 20)

    # b) Create an invalid DataFrame
    invalid_data = pd.DataFrame({
        "transaction_id": ["txn_789", "inv_000"],  # "inv_000" is invalid
        "amount_usd": [99.99, 50.00],
        "seller_id": [1050, 999],  # 999 is invalid
    })

    try:
        # This line will fail immediately upon instantiation,
        # preventing the invalid data from ever reaching our function.
        invalid_transactions = TransactionDataFrame(invalid_data)
        process_payouts(invalid_transactions)
    except pa.errors.SchemaError as e:
        print("Failed to create DataFrame due to validation errors:")
        print(e)
License
Apache 2.0