Type-safe DataFrame library with schema validation for pandas
🐼 PandasSchemaster
Type-safe DataFrame operations with schema validation for pandas
Transform your pandas DataFrames from df['column'] to df[Schema.COLUMN] for bulletproof, IDE-friendly data operations!
🎯 Why PandasSchemaster?
Before: Error-prone string-based column access
df['temprature'] # Typo: KeyError, but only when this line runs! 😱
df['temperatuur'] # Wrong column name: invisible to the IDE, fails only at runtime! 💥
After: Type-safe schema-based column access
df[SensorSchema.TEMPERATURE] # IDE autocomplete + static checking of the column name! ✨
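The difference can be sketched without pandas at all. In the minimal example below, `SensorColumns` is a hypothetical stand-in for a schema class (it is not part of PandasSchemaster) and a plain dict plays the role of the DataFrame. Python raises both errors at runtime; the practical gain of the attribute form is that editors and static analyzers can resolve the name against the class definition and flag the misspelling before the code runs.

```python
class SensorColumns:
    """Hypothetical stand-in for a schema class: one attribute per column."""
    TEMPERATURE = "temperature"
    HUMIDITY = "humidity"


row = {"temperature": 23.5, "humidity": 45.2}

# A misspelled string key is just data; nothing can check it before this line runs.
try:
    row["temprature"]
except KeyError as exc:
    string_failure = type(exc).__name__

# A misspelled attribute is a name that tools can resolve against the class
# definition, so an IDE or linter can highlight it while you type.
try:
    row[SensorColumns.TEMPRATURE]
except AttributeError as exc:
    attribute_failure = type(exc).__name__

# The correctly spelled attribute resolves to the real key.
value = row[SensorColumns.TEMPERATURE]
```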
✨ Key Features
- 🛡️ Type Safety: Schema-based column access prevents runtime errors
- 🔧 IDE Support: Full autocompletion and error detection for column names
- ✅ Validation: Automatic data validation based on schema definitions
- 🔄 Auto-casting: Seamless data type conversions
- 🐼 Full DataFrame Compatibility: Inherits from pandas.DataFrame - all methods work
- 📖 Self-documenting: Clear, readable code with schema column references
Quick Start
Installation
pip install pandasschemaster
Basic Usage
import pandas as pd
import numpy as np
from pandasschemaster import SchemaColumn, SchemaDataFrame, BaseSchema
# Define your schema
class SensorSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    HUMIDITY = SchemaColumn("humidity", np.float64)
    SENSOR_ID = SchemaColumn("sensor_id", np.int64, nullable=False)
# Create data
data = {
    'timestamp': [pd.Timestamp.now()],
    'temperature': [23.5],
    'humidity': [45.2],
    'sensor_id': [1001]
}
# Create validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema, validate=True, auto_cast=True)
# Use schema columns for type-safe operations
temperature = df[SensorSchema.TEMPERATURE] # Instead of df['temperature']
fahrenheit = df[SensorSchema.TEMPERATURE] * 9/5 + 32
hot_readings = df[df[SensorSchema.TEMPERATURE] > 25]
# Multi-column selection
subset = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]]
# Assignment with automatic type casting
df[SensorSchema.TEMPERATURE] = [24.1]
Command-Line Schema Generator
PandasSchemaster includes a powerful CLI tool to automatically generate schema classes from your data files:
# Generate schema from CSV and print to console
python Scripts/generate_schema.py data.csv
# Save schema to file with custom class name
python Scripts/generate_schema.py data.csv -o my_schema.py -c CustomerSchema
# Sample large files for faster processing
python Scripts/generate_schema.py large_data.csv -s 1000 -v
# On Windows, you can also use the batch file
Scripts\generate_schema.bat data.csv -o schema.py -c MySchema
# On Unix/Linux, you can use the shell script
./Scripts/generate_schema.sh data.csv -o schema.py -c MySchema
Supported File Formats
- CSV (.csv) - Comma-separated values
- Excel (.xlsx, .xls) - Microsoft Excel files
- JSON (.json) - JavaScript Object Notation
- Parquet (.parquet) - Apache Parquet format
- TSV/TXT (.tsv, .txt) - Tab-separated values
CLI Options
| Option | Description | Example |
|---|---|---|
| input_file | Path to data file (required) | data.csv |
| -o, --output | Output file for schema | -o schema.py |
| -c, --class-name | Custom class name | -c CustomerSchema |
| -s, --sample-size | Number of rows to analyze | -s 1000 |
| --no-nullable | Disable nullable inference | --no-nullable |
| -v, --verbose | Enable detailed logging | -v |
The generator automatically detects data types (numeric, boolean, datetime, string) and creates properly typed schema classes. For detailed usage examples, see CLI_USAGE.md.
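How such type detection might work can be sketched with the standard library alone. The code below is an illustrative assumption, not the logic in generate_schema.py: `infer_kind` and `infer_csv_kinds` are hypothetical names, and the real generator may apply different rules (for example, other datetime formats or boolean spellings).

```python
import csv
import io
from datetime import datetime
from itertools import islice


def infer_kind(values):
    """Classify raw strings as boolean, integer, float, datetime or string."""
    present = [v.strip() for v in values if v and v.strip()]
    if not present:
        return "string"  # nothing to inspect, fall back to the widest type
    if all(v.lower() in ("true", "false") for v in present):
        return "boolean"
    # Try the narrowest parser first; the first one that accepts every value wins.
    for kind, parse in (("integer", int), ("float", float),
                        ("datetime", datetime.fromisoformat)):
        try:
            for v in present:
                parse(v)
        except ValueError:
            continue
        return kind
    return "string"


def infer_csv_kinds(handle, sample_size=None):
    """Infer a kind per column from the first sample_size rows (all rows if None)."""
    rows = list(islice(csv.DictReader(handle), sample_size))
    return {name: infer_kind([row[name] for row in rows])
            for name in (rows[0].keys() if rows else [])}


sample = io.StringIO(
    "timestamp,temperature,sensor_id,active,label\n"
    "2024-01-01 12:00:00,23.5,1001,true,a\n"
    "2024-01-01 12:05:00,24,1002,false,b\n"
)
kinds = infer_csv_kinds(sample, sample_size=1000)
```

Sampling only the first rows, as the -s option does, trades accuracy for speed: a column whose later rows contain text would still be classified from the sampled values alone.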
Schema Column Benefits
✅ Type-Safe Access
# Type-safe schema column access
temperature = df[SensorSchema.TEMPERATURE]
# vs traditional string access (error-prone)
temperature = df['temperature'] # Typos not caught until runtime
🔧 IDE Support
- Autocompletion: SensorSchema. shows available columns
- Error Detection: Invalid column names highlighted
- Go-to-Definition: Jump to schema definition
🔄 Refactoring Safety
# Rename a schema attribute with your IDE's rename refactoring and every reference is updated
class SensorSchema(BaseSchema):
    TEMP_CELSIUS = SchemaColumn("temperature_celsius", np.float64) # Renamed
# Code now reads df[SensorSchema.TEMP_CELSIUS]; any leftover use of the old attribute name is flagged by the IDE
🐼 Full DataFrame Compatibility
SchemaDataFrame inherits directly from pandas.DataFrame, so all DataFrame methods work seamlessly:
# Create schema-validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema)
# Use all pandas DataFrame methods directly
print(df.shape) # (rows, columns), e.g. (100, 4) for 100 readings
print(df.head()) # First 5 rows
summary = df.describe() # Statistical summary
grouped = df.groupby(SensorSchema.SENSOR_ID.name).mean()
# Mathematical operations
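df_scaled = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]] * 2 # numeric columns only; a datetime column cannot be multiplied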
df_filtered = df[df[SensorSchema.TEMPERATURE] > 25]
# All pandas operations work while maintaining schema validation
Advanced Features
Schema Column Types and Validation
class AdvancedSchema(BaseSchema):
    # Basic column with nullable control
    PRESSURE = SchemaColumn("pressure", np.float64, nullable=False)
    # Column with default value
    STATUS = SchemaColumn("status", np.dtype('object'),
                          default="UNKNOWN", nullable=True)
    # Column with description
    MACHINE_ID = SchemaColumn("machine_id", np.int64,
                              description="Unique machine identifier")
Data Type Casting and Conversion
# Auto-casting handles string to numeric conversion
data = {
    'temperature': ["23.5", "24.1"], # String values
    'sensor_id': ["1001", "1002"] # String values
}
df = SchemaDataFrame(data, schema_class=SensorSchema,
                     validate=True, auto_cast=True)
# Values are automatically cast to schema types
print(df.dtypes)
# temperature float64
# sensor_id Int64
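Conceptually, auto-casting applies each column's declared type to the incoming values. The sketch below shows that idea on a plain dict of lists, using only the standard library. `cast_columns` and the `declared` mapping are hypothetical illustrations; they say nothing about how PandasSchemaster performs the conversion internally.

```python
def cast_columns(table, declared):
    """Return a copy of table with each declared column converted to its type.

    Missing values (None) are passed through; columns without a declared
    type are copied unchanged.
    """
    result = {}
    for name, values in table.items():
        convert = declared.get(name)
        if convert is None:
            result[name] = list(values)
        else:
            result[name] = [None if v is None else convert(v) for v in values]
    return result


raw = {
    "temperature": ["23.5", "24.1"],  # strings, as read from a text file
    "sensor_id": ["1001", "1002"],
}
declared = {"temperature": float, "sensor_id": int}

typed = cast_columns(raw, declared)
```

Returning a new mapping instead of mutating the input keeps the raw data available for error reporting if a conversion fails.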
Real-World Example
# Industrial IoT sensor data processing
class IndustrialSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    MACHINE_ID = SchemaColumn("machine_id", np.int64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    PRESSURE = SchemaColumn("pressure", np.float64)
    STATUS = SchemaColumn("status", np.dtype('object'))
# Load and validate data (sensor_data stands for your raw input data)
df = SchemaDataFrame(sensor_data, schema_class=IndustrialSchema, validate=True)
# Type-safe analysis using schema columns
avg_temp_by_machine = df.groupby(IndustrialSchema.MACHINE_ID.name)[
    IndustrialSchema.TEMPERATURE.name
].mean()
overheating = df[df[IndustrialSchema.TEMPERATURE] > 150]
efficiency = df[IndustrialSchema.PRESSURE] / df[IndustrialSchema.TEMPERATURE]
# Filter by status using schema column
running_machines = df[df[IndustrialSchema.STATUS] == 'RUNNING']
# Complex multi-column operations
subset = df.select_columns([IndustrialSchema.TEMPERATURE, IndustrialSchema.PRESSURE])
Key Features Demonstrated in Tests
Column Resolution and Access
# The library handles both string and SchemaColumn access
temp1 = df['temperature'] # Traditional string access
temp2 = df[SensorSchema.TEMPERATURE] # Schema column access
assert temp1.equals(temp2) # Both work identically
# Multi-column selection with mixed types
subset = df[[SensorSchema.TEMPERATURE, 'humidity']] # Mixed access works
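One way to picture this dual access is a container that translates column objects into their string names before looking anything up. The toy classes below are a sketch under that assumption: `Column` and `Table` are hypothetical, use a plain dict instead of a DataFrame, and are not PandasSchemaster's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical column object that only carries the underlying name."""
    name: str


class Table:
    """Toy column store that accepts Column objects or plain strings as keys."""

    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}

    @staticmethod
    def _resolve(key):
        # Column objects are translated to their string name; strings pass through.
        return key.name if isinstance(key, Column) else key

    def __getitem__(self, key):
        if isinstance(key, list):  # multi-column selection, mixed key types allowed
            names = [self._resolve(k) for k in key]
            return {name: self._data[name] for name in names}
        return self._data[self._resolve(key)]

    def __setitem__(self, key, values):
        self._data[self._resolve(key)] = list(values)


TEMPERATURE = Column("temperature")
table = Table({"temperature": [23.5], "humidity": [45.2]})

same = table[TEMPERATURE] == table["temperature"]
subset = table[[TEMPERATURE, "humidity"]]
table[TEMPERATURE] = [24.1]
```

Because resolution happens in one place, string and schema-based keys cannot drift apart: both end up as the same dictionary key.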
Schema Validation
# Validation catches missing required columns
class StrictSchema(BaseSchema):
    REQUIRED_COL = SchemaColumn("required", np.float64, nullable=False)
# validate_dataframe returns a list of error messages (incomplete_df lacks the "required" column)
errors = StrictSchema.validate_dataframe(incomplete_df)
print(errors) # ['Required column required is missing']
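The collect-all-errors style shown above can be sketched in a few lines of plain Python. Everything here is a hypothetical illustration: `ColumnRule` and `validate` are invented names, treating nullable=False as "required" is an assumption, and only the first message copies the wording of the example output; the second message is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRule:
    """Hypothetical rule: a column name plus whether nulls or absence are allowed."""
    name: str
    nullable: bool = True


def validate(table, rules):
    """Collect human-readable problems instead of raising on the first one."""
    errors = []
    for rule in rules:
        if rule.name not in table:
            if not rule.nullable:
                errors.append(f"Required column {rule.name} is missing")
            continue
        if not rule.nullable and any(v is None for v in table[rule.name]):
            errors.append(f"Column {rule.name} contains null values")
    return errors


rules = [ColumnRule("required", nullable=False), ColumnRule("optional")]

missing = validate({"optional": [1.0]}, rules)
with_nulls = validate({"required": [1.0, None]}, rules)
clean = validate({"required": [1.0, 2.0]}, rules)
```

Returning a list lets the caller report every problem in one pass, which is usually more helpful for data files than stopping at the first failure.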
Mathematical Operations
# All mathematical operations work with schema columns
celsius = df[SensorSchema.TEMPERATURE]
fahrenheit = celsius * 9/5 + 32
hot_mask = celsius > 25
comfort_index = celsius + df[SensorSchema.HUMIDITY] / 10
Core Components
SchemaColumn
Defines a typed column with validation and transformation capabilities.
# Basic column definition
temp_col = SchemaColumn("temperature", np.float64, nullable=True)
# Column with all options
advanced_col = SchemaColumn(
    name="pressure",
    dtype=np.float64,
    nullable=False,
    default=0.0,
    description="Atmospheric pressure in hPa"
)
BaseSchema
Abstract base class for defining DataFrame schemas with class methods for validation.
class MySchema(BaseSchema):
    COL1 = SchemaColumn("col1", np.float64)
    COL2 = SchemaColumn("col2", np.int64)
# Get schema information
columns = MySchema.get_columns() # Dict of column definitions
names = MySchema.get_column_names() # List of column names
errors = MySchema.validate_dataframe(df) # Validation error list
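Class-level helpers like these can be built by scanning a class for attributes of the column type. The sketch below shows one possible approach; `Column` and `SchemaBase` are hypothetical stand-ins, and keying the dictionary by column name is an assumption rather than documented BaseSchema behavior.

```python
class Column:
    """Hypothetical column definition holding a name and a type."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype


class SchemaBase:
    """Toy base class that discovers Column attributes declared on subclasses."""

    @classmethod
    def get_columns(cls):
        found = {}
        # Walk base classes first so subclasses can add or override columns.
        for klass in reversed(cls.__mro__):
            for value in vars(klass).values():
                if isinstance(value, Column):
                    found[value.name] = value
        return found

    @classmethod
    def get_column_names(cls):
        return list(cls.get_columns())


class MySchema(SchemaBase):
    COL1 = Column("col1", float)
    COL2 = Column("col2", int)


names = MySchema.get_column_names()
columns = MySchema.get_columns()
```

Since class bodies preserve definition order, the names come back in the order the columns were declared.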
SchemaDataFrame
pandas.DataFrame subclass with schema validation and type-safe column access.
# All pandas DataFrame methods work
df = SchemaDataFrame(data, schema_class=MySchema)
print(df.shape) # Shape
print(df.head()) # First rows
summary = df.describe() # Statistics
filtered = df[df['col1'] > 5] # Filtering
# Plus schema-specific features
subset = df.select_columns([MySchema.COL1]) # Schema-based selection
print(df.schema) # Access to schema class
📚 Documentation
| Document | Description |
|---|---|
| 🚀 Quick Start | Get started in 30 seconds |
| 📖 API Reference | Complete API documentation |
| 🔧 CLI Usage Guide | Command-line tool documentation |
| 🎯 Examples & Tutorials | Real-world examples and patterns |
| 🤝 Contributing | How to contribute |
| 📋 Changelog | Version history |
📞 Support & Community
- 🐛 Bug Reports: GitHub Issues
- 💡 Feature Requests: GitHub Issues
- 💬 Questions: GitHub Discussions
- 📧 Email: your.email@example.com
🏆 Why Choose PandasSchemaster?
| Feature | PandasSchemaster | Regular Pandas |
|---|---|---|
| Type Safety | ✅ Static column checking via schema attributes | ❌ Runtime string errors |
| IDE Support | ✅ Full autocompletion | ❌ No column suggestions |
| Refactoring | ✅ Safe column renaming | ❌ Manual find-replace |
| Validation | ✅ Automatic data validation | ❌ Manual validation required |
| Self-Documentation | ✅ Schema as documentation | ❌ Requires external docs |
| Auto-Generation | ✅ Generate schemas from data | ❌ Manual schema creation |
Testing
The library includes comprehensive tests covering:
- Basic SchemaColumn functionality and type casting
- BaseSchema validation and column management
- SchemaDataFrame operations and pandas compatibility
- Mathematical operations and filtering with schema columns
- Column access resolution and multi-column selection
Run tests with:
python -m pytest tests/
🔧 Requirements
- Python: 3.8+ (3.9+ recommended)
- pandas: >= 2.0.0
- numpy: >= 1.24.0
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details on:
- Setting up the development environment
- Code style guidelines
- Testing requirements
- Pull request process
🙏 Acknowledgments
- Built on top of the amazing pandas library
- Inspired by Entity Framework's code-first approach
- Thanks to all contributors
⭐ Star this repo if PandasSchemaster helps you write better, safer pandas code!
🔗 Share with your data science team and help them discover type-safe DataFrames!
Use df[MySchema.COLUMN] for type-safe DataFrame operations! 🚀
Made with ❤️ by @gzocche