Type-safe DataFrame library with schema validation for pandas
PandasSchemaster
Overview
PandasSchemaster provides a strongly-typed interface to pandas DataFrames with automatic validation, type conversion, and schema-based column access. Use df[MySchema.COLUMN] instead of df['column'] for type-safe, IDE-friendly DataFrame operations that inherit all pandas DataFrame functionality.
Key Features
- 🛡️ Type Safety: Schema-based column access prevents runtime errors
- 🔧 IDE Support: Autocompletion and error detection for column names
- ✅ Validation: Automatic data validation based on schema definitions
- 🔄 Auto-casting: Seamless data type conversions
- 🐼 Full DataFrame Compatibility: Inherits from pandas.DataFrame, so all methods work
- 📖 Self-documenting: Clear, readable code with schema column references
Quick Start
Installation
pip install pandasschemaster
Basic Usage
import pandas as pd
import numpy as np
from pandasschemaster import SchemaColumn, SchemaDataFrame, BaseSchema
# Define your schema
class SensorSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    HUMIDITY = SchemaColumn("humidity", np.float64)
    SENSOR_ID = SchemaColumn("sensor_id", np.int64, nullable=False)
# Create data
data = {
    'timestamp': [pd.Timestamp.now()],
    'temperature': [23.5],
    'humidity': [45.2],
    'sensor_id': [1001]
}
# Create validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema, validate=True, auto_cast=True)
# Use schema columns for type-safe operations
temperature = df[SensorSchema.TEMPERATURE] # Instead of df['temperature']
fahrenheit = df[SensorSchema.TEMPERATURE] * 9/5 + 32
hot_readings = df[df[SensorSchema.TEMPERATURE] > 25]
# Multi-column selection
subset = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]]
# Assignment with automatic type casting
df[SensorSchema.TEMPERATURE] = [24.1]
Schema Column Benefits
✅ Type-Safe Access
# Type-safe schema column access
temperature = df[SensorSchema.TEMPERATURE]
# vs traditional string access (error-prone)
temperature = df['temperature'] # Typos not caught until runtime
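To make the difference concrete, here is a minimal sketch in plain pandas. The Columns class is a hypothetical stand-in for a schema class and is not part of PandasSchemaster; it only illustrates why a misspelled attribute fails at the point of use, while a misspelled string key on assignment can slip through unnoticed.

```python
import pandas as pd


class Columns:
    """Hypothetical constants holder, standing in for a schema class."""
    TEMPERATURE = "temperature"


df = pd.DataFrame({"temperature": [23.5, 26.0]})

# A misspelled string key on assignment silently creates a new column.
df["temprature"] = df["temperature"] + 1.0
silently_added = "temprature" in df.columns

# A misspelled attribute raises as soon as it is evaluated
# (and static checkers or IDEs can flag it before the code runs).
try:
    df[Columns.TEMPRATURE] = 0.0
    attribute_typo_caught = False
except AttributeError:
    attribute_typo_caught = True
```

In this sketch the frame ends up with an unintended extra column from the string typo, while the attribute typo never reaches the DataFrame at all.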
🔧 IDE Support
- Autocompletion: Typing SensorSchema. shows available columns
- Error Detection: Invalid column names highlighted
- Go-to-Definition: Jump to schema definition
🔄 Refactoring Safety
# Rename a schema attribute with your IDE's rename refactoring and every reference is updated
class SensorSchema(BaseSchema):
    TEMP_CELSIUS = SchemaColumn("temperature_celsius", np.float64)  # Renamed
# All df[SensorSchema.TEMP_CELSIUS] references keep working
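The underlying idea can be sketched without the library: code that refers to a column through a named constant does not need to change when the stored column name changes. SchemaV1, SchemaV2, and to_fahrenheit below are hypothetical names used only for illustration.

```python
import pandas as pd


class SchemaV1:
    TEMPERATURE = "temperature"


class SchemaV2:
    # Only the underlying column name changed; the attribute is the same.
    TEMPERATURE = "temperature_celsius"


def to_fahrenheit(frame, schema):
    """Convert the schema's temperature column from Celsius to Fahrenheit."""
    return frame[schema.TEMPERATURE] * 9 / 5 + 32


old = pd.DataFrame({"temperature": [25.0]})
new = old.rename(columns={SchemaV1.TEMPERATURE: SchemaV2.TEMPERATURE})

# The same function body works against both layouts.
result_old = to_fahrenheit(old, SchemaV1).tolist()
result_new = to_fahrenheit(new, SchemaV2).tolist()
```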
🐼 Full DataFrame Compatibility
SchemaDataFrame inherits directly from pandas.DataFrame, so all DataFrame methods work seamlessly:
# Create schema-validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema)
# Use all pandas DataFrame methods directly
print(df.shape) # (rows, columns), e.g. (100, 4) for 100 readings
print(df.head()) # First 5 rows
summary = df.describe() # Statistical summary
grouped = df.groupby(SensorSchema.SENSOR_ID.name).mean()
# Mathematical operations
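df_scaled = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]] * 2 # Scale numeric columns only; timestamps cannot be multiplied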
df_filtered = df[df[SensorSchema.TEMPERATURE] > 25]
# All pandas operations work while maintaining schema validation
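For readers curious how a DataFrame subclass can keep its own type across operations, the following is a minimal sketch using pandas' documented _constructor hook. TaggedFrame is a hypothetical class; this is not a description of how SchemaDataFrame is actually implemented.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass; not PandasSchemaster's implementation."""

    @property
    def _constructor(self):
        # pandas uses this hook when an operation needs to build a new frame,
        # so results keep the subclass type instead of falling back to DataFrame.
        return TaggedFrame


frame = TaggedFrame({"temperature": [20.0, 30.0], "humidity": [40.0, 50.0]})

head = frame.head(1)                     # slicing
scaled = frame * 2                       # arithmetic
hot = frame[frame["temperature"] > 25]   # boolean filtering
```

Each of the three results is still a TaggedFrame, which is the property that lets subclass-specific behaviour survive ordinary pandas calls.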
Advanced Features
Schema Column Types and Validation
class AdvancedSchema(BaseSchema):
    # Basic column with nullable control
    PRESSURE = SchemaColumn("pressure", np.float64, nullable=False)
    # Column with default value
    STATUS = SchemaColumn("status", np.dtype('object'),
                          default="UNKNOWN", nullable=True)
    # Column with description
    MACHINE_ID = SchemaColumn("machine_id", np.int64,
                              description="Unique machine identifier")
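As a rough mental model of what such column options carry, here is a small sketch built on a dataclass. ColumnSpec and fill_default are hypothetical; the field names mirror the options shown above, but how PandasSchemaster applies defaults internally is an assumption here.

```python
from dataclasses import dataclass
from typing import Any, Optional

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical column description; fields mirror the options above."""
    name: str
    dtype: Any
    nullable: bool = True
    default: Optional[Any] = None
    description: str = ""


def fill_default(series: pd.Series, spec: ColumnSpec) -> pd.Series:
    """Replace missing values with the spec's default, if one is set."""
    if spec.default is None:
        return series
    return series.fillna(spec.default)


status = ColumnSpec("status", np.dtype("object"), default="UNKNOWN")
raw = pd.Series(["RUNNING", None, "STOPPED"], dtype=object, name=status.name)
filled = fill_default(raw, status)
```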
Data Type Casting and Conversion
# Auto-casting handles string to numeric conversion
data = {
    'temperature': ["23.5", "24.1"], # String values
    'sensor_id': ["1001", "1002"] # String values
}
df = SchemaDataFrame(data, schema_class=SensorSchema,
                     validate=True, auto_cast=True)
# Values are automatically cast to schema types
print(df.dtypes)
# temperature float64
# sensor_id Int64
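The kind of conversion described here can be approximated in plain pandas. The helper below, cast_columns, is a hypothetical sketch and not the library's casting logic; it parses string columns with pd.to_numeric and then pins each column to a target dtype.

```python
import numpy as np
import pandas as pd


def cast_columns(frame: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Return a copy with each listed column parsed as numeric and cast."""
    out = frame.copy()
    for name, dtype in dtypes.items():
        if name in out.columns:
            # to_numeric parses the strings; astype pins the exact target dtype.
            out[name] = pd.to_numeric(out[name]).astype(dtype)
    return out


raw = pd.DataFrame({"temperature": ["23.5", "24.1"],
                    "sensor_id": ["1001", "1002"]})
typed = cast_columns(raw, {"temperature": np.float64, "sensor_id": np.int64})
```

Note that this sketch produces NumPy int64 for sensor_id; a nullable Int64 result, as in the output above, would need the pandas extension dtype instead.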
Real-World Example
# Industrial IoT sensor data processing
class IndustrialSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    MACHINE_ID = SchemaColumn("machine_id", np.int64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    PRESSURE = SchemaColumn("pressure", np.float64)
    STATUS = SchemaColumn("status", np.dtype('object'))
# Load and validate data
df = SchemaDataFrame(sensor_data, schema_class=IndustrialSchema, validate=True)
# Type-safe analysis using schema columns
avg_temp_by_machine = df.groupby(IndustrialSchema.MACHINE_ID.name)[
    IndustrialSchema.TEMPERATURE.name
].mean()
overheating = df[df[IndustrialSchema.TEMPERATURE] > 150]
efficiency = df[IndustrialSchema.PRESSURE] / df[IndustrialSchema.TEMPERATURE]
# Filter by status using schema column
running_machines = df[df[IndustrialSchema.STATUS] == 'RUNNING']
# Complex multi-column operations
subset = df.select_columns([IndustrialSchema.TEMPERATURE, IndustrialSchema.PRESSURE])
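The example above leaves sensor_data undefined. To see what the analysis steps compute, here is the same sequence in plain pandas with string column names, run on a few made-up rows that are purely illustrative and not real measurements.

```python
import pandas as pd

# Made-up sample rows, for illustration only.
sensor_data = pd.DataFrame({
    "machine_id": [1, 1, 2],
    "temperature": [100.0, 160.0, 120.0],
    "pressure": [50.0, 80.0, 60.0],
    "status": ["RUNNING", "RUNNING", "STOPPED"],
})

# Mean temperature per machine
avg_temp_by_machine = sensor_data.groupby("machine_id")["temperature"].mean()
# Rows above the overheating threshold
overheating = sensor_data[sensor_data["temperature"] > 150]
# Element-wise ratio of two columns
efficiency = sensor_data["pressure"] / sensor_data["temperature"]
# Rows for machines that are currently running
running_machines = sensor_data[sensor_data["status"] == "RUNNING"]
```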
Key Features Demonstrated in Tests
Column Resolution and Access
# The library handles both string and SchemaColumn access
temp1 = df['temperature'] # Traditional string access
temp2 = df[SensorSchema.TEMPERATURE] # Schema column access
assert temp1.equals(temp2) # Both work identically
# Multi-column selection with mixed types
subset = df[[SensorSchema.TEMPERATURE, 'humidity']] # Mixed access works
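One plausible way to support both access styles is to translate column objects into names before delegating to pandas. The sketch below uses hypothetical names (Col, resolve, KeyedFrame) and is an assumption about the general technique, not the library's actual code.

```python
import pandas as pd


class Col:
    """Hypothetical stand-in for a schema column: it only carries a name."""

    def __init__(self, name):
        self.name = name


def resolve(key):
    """Translate Col objects (or lists containing them) into column names."""
    if isinstance(key, Col):
        return key.name
    if isinstance(key, list):
        return [k.name if isinstance(k, Col) else k for k in key]
    return key  # strings, boolean masks, etc. pass through untouched


class KeyedFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return KeyedFrame

    def __getitem__(self, key):
        return super().__getitem__(resolve(key))

    def __setitem__(self, key, value):
        super().__setitem__(resolve(key), value)


TEMP = Col("temperature")
frame = KeyedFrame({"temperature": [20.0, 30.0], "humidity": [40.0, 50.0]})

by_object = frame[TEMP]              # column-object access
by_string = frame["temperature"]     # traditional string access
mixed = frame[[TEMP, "humidity"]]    # mixed list of objects and strings
frame[TEMP] = [21.0, 31.0]           # assignment through a column object
```

Because resolve leaves every other key type alone, boolean masks and plain strings behave exactly as they do on a regular DataFrame.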
Schema Validation
# Validation catches missing required columns
class StrictSchema(BaseSchema):
    REQUIRED_COL = SchemaColumn("required", np.float64, nullable=False)
# This will raise validation errors
errors = StrictSchema.validate_dataframe(incomplete_df)
print(errors) # ['Required column required is missing']
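A validator that returns a list of messages, rather than raising on the first problem, can be sketched as follows. validate_frame and its simplified schema mapping (column name to nullable flag) are hypothetical; the missing-column message mirrors the one shown above, while the null-value message is invented for this sketch.

```python
import pandas as pd


def validate_frame(frame: pd.DataFrame, schema: dict) -> list:
    """Return a list of problems; an empty list means the frame is valid.

    `schema` maps column name -> nullable flag (a simplified, hypothetical schema).
    """
    errors = []
    for name, nullable in schema.items():
        if name not in frame.columns:
            # Only non-nullable columns are treated as required here.
            if not nullable:
                errors.append(f"Required column {name} is missing")
            continue
        if not nullable and frame[name].isna().any():
            errors.append(f"Column {name} contains null values")
    return errors


strict = {"required": False}  # "required" must be present and non-null

missing = validate_frame(pd.DataFrame({"other": [1.0]}), strict)
with_nulls = validate_frame(pd.DataFrame({"required": [1.0, None]}), strict)
valid = validate_frame(pd.DataFrame({"required": [1.0, 2.0]}), strict)
```

Collecting all problems in one pass lets a caller report every schema violation at once instead of fixing them one error at a time.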
Mathematical Operations
# All mathematical operations work with schema columns
celsius = df[SensorSchema.TEMPERATURE]
fahrenheit = celsius * 9/5 + 32
hot_mask = celsius > 25
comfort_index = celsius + df[SensorSchema.HUMIDITY] / 10
Core Components
SchemaColumn
Defines a typed column with validation and transformation capabilities.
# Basic column definition
temp_col = SchemaColumn("temperature", np.float64, nullable=True)
# Column with all options
advanced_col = SchemaColumn(
    name="pressure",
    dtype=np.float64,
    nullable=False,
    default=0.0,
    description="Atmospheric pressure in hPa"
)
BaseSchema
Abstract base class for defining DataFrame schemas with class methods for validation.
class MySchema(BaseSchema):
    COL1 = SchemaColumn("col1", np.float64)
    COL2 = SchemaColumn("col2", np.int64)
# Get schema information
columns = MySchema.get_columns() # Dict of column definitions
names = MySchema.get_column_names() # List of column names
errors = MySchema.validate_dataframe(df) # Validation error list
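Class methods like these are typically built on class introspection. The sketch below shows one way a base class could discover column attributes declared on its subclasses; Column and Schema are hypothetical names, and keying the dictionary by attribute name is an assumption rather than documented BaseSchema behaviour.

```python
class Column:
    """Hypothetical column marker holding only a name."""

    def __init__(self, name):
        self.name = name


class Schema:
    """Hypothetical base class that discovers Column attributes on subclasses."""

    @classmethod
    def get_columns(cls):
        # Walk the MRO from base to subclass so inherited columns are included
        # and subclass definitions override base ones.
        found = {}
        for klass in reversed(cls.__mro__):
            for attr, value in vars(klass).items():
                if isinstance(value, Column):
                    found[attr] = value
        return found

    @classmethod
    def get_column_names(cls):
        return [col.name for col in cls.get_columns().values()]


class MySchema(Schema):
    COL1 = Column("col1")
    COL2 = Column("col2")


columns = MySchema.get_columns()
names = MySchema.get_column_names()
```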
SchemaDataFrame
pandas.DataFrame subclass with schema validation and type-safe column access.
# All pandas DataFrame methods work
df = SchemaDataFrame(data, schema_class=MySchema)
print(df.shape) # Shape
print(df.head()) # First rows
summary = df.describe() # Statistics
filtered = df[df['col1'] > 5] # Filtering
# Plus schema-specific features
subset = df.select_columns([MySchema.COL1]) # Schema-based selection
print(df.schema) # Access to schema class
Requirements
- Python 3.8+
- pandas >= 2.0.0
- numpy >= 1.24.0
License
MIT License. See LICENSE for details.
Contributing
Contributions welcome! Please read our contributing guidelines and submit pull requests.
Support
- 🐛 Issues: GitHub Issues
- 💡 Questions: Use GitHub Discussions
Testing
The library includes comprehensive tests covering:
- Basic SchemaColumn functionality and type casting
- BaseSchema validation and column management
- SchemaDataFrame operations and pandas compatibility
- Mathematical operations and filtering with schema columns
- Column access resolution and multi-column selection
Run tests with:
python -m pytest tests/
Use df[MySchema.COLUMN] for type-safe DataFrame operations! 🚀