Type-safe DataFrame library with schema validation for pandas

🐼 PandasSchemaster

Type-safe DataFrame operations with schema validation for pandas

Transform your pandas DataFrames from df['column'] to df[Schema.COLUMN] for bulletproof, IDE-friendly data operations!

🎯 Why PandasSchemaster?

Before: Error-prone string-based column access

df['temprature']  # Typo - KeyError, but only at runtime! 😱
df['temperatuur'] = 0  # Misspelled name on assignment - silently creates a new column! 💥

After: Type-safe schema-based column access

df[SensorSchema.TEMPERATURE]  # IDE autocomplete + static checking of the column name! ✨
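
The difference can be reproduced with plain pandas. The sketch below is illustrative only: `SensorColumns` is a hypothetical stand-in that holds column names as class attributes, not the library's `BaseSchema`.

```python
import pandas as pd

df = pd.DataFrame({"temperature": [23.5, 26.0]})

# Reading a misspelled column fails, but only when the line actually runs.
try:
    df["temprature"]
    read_failed = False
except KeyError:
    read_failed = True

# Writing to a misspelled column does not fail: pandas adds a new column.
df["temperatuur"] = [0.0, 0.0]
columns_after_typo = list(df.columns)


class SensorColumns:
    """Hypothetical stand-in for a schema class: names live in one place."""
    TEMPERATURE = "temperature"


# A misspelled attribute is flagged by IDEs and linters, and raises
# AttributeError at runtime instead of creating a stray column.
try:
    df[SensorColumns.TEMPRATURE] = [0.0, 0.0]
    attribute_failed = False
except AttributeError:
    attribute_failed = True

print(read_failed, columns_after_typo, attribute_failed)
```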

✨ Key Features

  • 🛡️ Type Safety: Schema-based column access catches misspelled column names before they reach your data
  • 🔧 IDE Support: Full autocompletion and error detection for column names
  • Validation: Automatic data validation based on schema definitions
  • 🔄 Auto-casting: Seamless data type conversions
  • Full DataFrame Compatibility: Inherits from pandas.DataFrame - all methods work
  • 📖 Self-documenting: Clear, readable code with schema column references

Quick Start

Installation

pip install pandasschemaster

Basic Usage

import pandas as pd
import numpy as np
from pandasschemaster import SchemaColumn, SchemaDataFrame, BaseSchema

# Define your schema
class SensorSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    HUMIDITY = SchemaColumn("humidity", np.float64)
    SENSOR_ID = SchemaColumn("sensor_id", np.int64, nullable=False)

# Create data
data = {
    'timestamp': [pd.Timestamp.now()],
    'temperature': [23.5],
    'humidity': [45.2],
    'sensor_id': [1001]
}

# Create validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema, validate=True, auto_cast=True)

# Use schema columns for type-safe operations
temperature = df[SensorSchema.TEMPERATURE]  # Instead of df['temperature']
fahrenheit = df[SensorSchema.TEMPERATURE] * 9/5 + 32
hot_readings = df[df[SensorSchema.TEMPERATURE] > 25]

# Multi-column selection
subset = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]]

# Assignment with automatic type casting
df[SensorSchema.TEMPERATURE] = [24.1]

Command-Line Schema Generator

PandasSchemaster includes a powerful CLI tool to automatically generate schema classes from your data files:

# Generate schema from CSV and print to console
python Scripts/generate_schema.py data.csv

# Save schema to file with custom class name
python Scripts/generate_schema.py data.csv -o my_schema.py -c CustomerSchema

# Sample large files for faster processing
python Scripts/generate_schema.py large_data.csv -s 1000 -v

# On Windows, you can also use the batch file
Scripts\generate_schema.bat data.csv -o schema.py -c MySchema

# On Unix/Linux, you can use the shell script
./Scripts/generate_schema.sh data.csv -o schema.py -c MySchema

Supported File Formats

  • CSV (.csv) - Comma-separated values
  • Excel (.xlsx, .xls) - Microsoft Excel files
  • JSON (.json) - JavaScript Object Notation
  • Parquet (.parquet) - Apache Parquet format
  • TSV/TXT (.tsv, .txt) - Tab-separated values

CLI Options

Option             Description                    Example
input_file         Path to data file (required)   data.csv
-o, --output       Output file for schema         -o schema.py
-c, --class-name   Custom class name              -c CustomerSchema
-s, --sample-size  Number of rows to analyze      -s 1000
--no-nullable      Disable nullable inference     --no-nullable
-v, --verbose      Enable detailed logging        -v

The generator automatically detects data types (numeric, boolean, datetime, string) and creates properly typed schema classes. For detailed usage examples, see CLI_USAGE.md.
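
To give a feel for this kind of inference, here is a rough sketch that reads a sample of rows with pandas and prints a schema class as source text. It is not the generator's actual code: `sketch_schema_source` and its naming rules are assumptions made for illustration, and `nrows` only mirrors the idea behind `-s`.

```python
import io

import pandas as pd


def sketch_schema_source(df, class_name="GeneratedSchema"):
    """Build schema source text from inferred dtypes (hypothetical helper)."""
    lines = [f"class {class_name}(BaseSchema):"]
    for name, dtype in df.dtypes.items():
        # Treat a column as nullable if the sample contains any missing value.
        nullable = bool(df[name].isna().any())
        attr = str(name).upper().replace(" ", "_")
        lines.append(
            f'    {attr} = SchemaColumn("{name}", np.dtype("{dtype}"), nullable={nullable})'
        )
    return "\n".join(lines)


csv_text = "sensor_id,temperature,active\n1001,23.5,True\n1002,,False\n"

# Read at most 1000 rows, similar in spirit to sampling a large file.
sample = pd.read_csv(io.StringIO(csv_text), nrows=1000)
source = sketch_schema_source(sample, "SensorSchema")
print(source)
```

The output of the real tool may differ in naming, dtype choices, and nullable handling.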

Schema Column Benefits

✅ Type-Safe Access

# Type-safe schema column access
temperature = df[SensorSchema.TEMPERATURE]

# vs traditional string access (error-prone)
temperature = df['temperature']  # Typos not caught until runtime

🔧 IDE Support

  • Autocompletion: Typing SensorSchema. lists the available columns
  • Error Detection: Invalid column names highlighted
  • Go-to-Definition: Jump to schema definition

🔄 Refactoring Safety

# Rename a schema attribute with your IDE's rename refactoring and every reference updates with it
class SensorSchema(BaseSchema):
    TEMP_CELSIUS = SchemaColumn("temperature_celsius", np.float64)  # Renamed
    # Code now uses df[SensorSchema.TEMP_CELSIUS]; leftover uses of the old attribute are flagged by the IDE

🐼 Full DataFrame Compatibility

SchemaDataFrame inherits directly from pandas.DataFrame, so all DataFrame methods work seamlessly:

# Create schema-validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema)

# Use all pandas DataFrame methods directly
print(df.shape)  # (rows, columns), e.g. (1, 4) for the Quick Start data
print(df.head())  # First 5 rows
summary = df.describe()  # Statistical summary
grouped = df.groupby(SensorSchema.SENSOR_ID.name).mean()

# Mathematical operations
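df_scaled = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]] * 2  # Numeric columns only; timestamps cannot be multiplied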
df_filtered = df[df[SensorSchema.TEMPERATURE] > 25]

# All pandas operations work while maintaining schema validation
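
For readers curious how a DataFrame subclass stays a subclass across operations, pandas documents `_constructor` and `_metadata` as the subclassing hooks. The sketch below uses a hypothetical `TaggedFrame` with a `schema_name` attribute; it shows the general mechanism and makes no claim about how SchemaDataFrame is implemented internally.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that carries one extra attribute."""

    # Attributes listed here are the ones pandas is meant to carry over
    # to the results of operations.
    _metadata = ["schema_name"]

    @property
    def _constructor(self):
        # Tell pandas to build results as TaggedFrame, not plain DataFrame.
        return TaggedFrame


df = TaggedFrame({"sensor_id": [1, 1, 2], "temperature": [20.0, 22.0, 30.0]})
df.schema_name = "SensorSchema"

# Regular pandas operations keep working on the subclass.
filtered = df[df["temperature"] > 21]
means = df.groupby("sensor_id")["temperature"].mean()

print(type(filtered).__name__)
print(means)
```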

Advanced Features

Schema Column Types and Validation

class AdvancedSchema(BaseSchema):
    # Basic column with nullable control
    PRESSURE = SchemaColumn("pressure", np.float64, nullable=False)
    
    # Column with default value
    STATUS = SchemaColumn("status", np.dtype('object'), 
                         default="UNKNOWN", nullable=True)
    
    # Column with description
    MACHINE_ID = SchemaColumn("machine_id", np.int64, 
                             description="Unique machine identifier")

Data Type Casting and Conversion

# Auto-casting handles string to numeric conversion
data = {
    'temperature': ["23.5", "24.1"],  # String values
    'sensor_id': ["1001", "1002"]     # String values  
}

df = SchemaDataFrame(data, schema_class=SensorSchema, 
                    validate=True, auto_cast=True)

# Values are automatically cast to schema types
print(df.dtypes)
# temperature    float64
# sensor_id      Int64
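
For comparison, the same conversion can be written by hand in plain pandas. This is only an approximation of what auto-casting saves you from writing; the `target_dtypes` mapping is a hypothetical stand-in for the dtypes a schema would supply.

```python
import pandas as pd

raw = pd.DataFrame({
    "temperature": ["23.5", "24.1"],  # String values
    "sensor_id": ["1001", "1002"],    # String values
})

# Hypothetical mapping of column name to target dtype.
target_dtypes = {"temperature": "float64", "sensor_id": "Int64"}

cast = raw.copy()
for column, dtype in target_dtypes.items():
    # Parse the strings as numbers first, then settle on the target dtype.
    cast[column] = pd.to_numeric(cast[column]).astype(dtype)

print(cast.dtypes)
```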

Real-World Example

# Industrial IoT sensor data processing
class IndustrialSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    MACHINE_ID = SchemaColumn("machine_id", np.int64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    PRESSURE = SchemaColumn("pressure", np.float64)
    STATUS = SchemaColumn("status", np.dtype('object'))

# Load and validate data (sensor_data holds your raw records, e.g. a dict of lists or a DataFrame)
df = SchemaDataFrame(sensor_data, schema_class=IndustrialSchema, validate=True)

# Type-safe analysis using schema columns
avg_temp_by_machine = df.groupby(IndustrialSchema.MACHINE_ID.name)[
    IndustrialSchema.TEMPERATURE.name
].mean()

overheating = df[df[IndustrialSchema.TEMPERATURE] > 150]
efficiency = df[IndustrialSchema.PRESSURE] / df[IndustrialSchema.TEMPERATURE]

# Filter by status using schema column
running_machines = df[df[IndustrialSchema.STATUS] == 'RUNNING']

# Complex multi-column operations
subset = df.select_columns([IndustrialSchema.TEMPERATURE, IndustrialSchema.PRESSURE])

Key Features Demonstrated in Tests

Column Resolution and Access

# The library handles both string and SchemaColumn access
temp1 = df['temperature']                    # Traditional string access
temp2 = df[SensorSchema.TEMPERATURE]         # Schema column access
assert temp1.equals(temp2)                   # Both work identically

# Multi-column selection with mixed types
subset = df[[SensorSchema.TEMPERATURE, 'humidity']]  # Mixed access works
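
One way such mixed access could be implemented is to translate column objects into their names before handing the key to pandas. The sketch below assumes exactly that; `Col`, `ColFrame`, and `_resolve` are hypothetical names, and the library's own resolution logic may differ.

```python
import pandas as pd


class Col:
    """Hypothetical stand-in for a schema column: it only carries a name."""

    def __init__(self, name):
        self.name = name


def _resolve(key):
    """Turn Col objects (alone or inside a list) into plain column names."""
    if isinstance(key, Col):
        return key.name
    if isinstance(key, list):
        return [k.name if isinstance(k, Col) else k for k in key]
    return key  # strings, boolean masks, slices: pass through untouched


class ColFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return ColFrame

    def __getitem__(self, key):
        return super().__getitem__(_resolve(key))

    def __setitem__(self, key, value):
        super().__setitem__(_resolve(key), value)


TEMP = Col("temperature")
df = ColFrame({"temperature": [23.5, 26.0], "humidity": [45.2, 50.1]})

same = df["temperature"].equals(df[TEMP])  # string and Col access agree
hot = df[df[TEMP] > 25]                    # boolean mask passes through
subset = df[[TEMP, "humidity"]]            # mixed list is resolved
df[TEMP] = [24.1, 27.0]                    # assignment is resolved as well
```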

Schema Validation

# Validation catches missing required columns
class StrictSchema(BaseSchema):
    REQUIRED_COL = SchemaColumn("required", np.float64, nullable=False)

# incomplete_df lacks the "required" column
incomplete_df = pd.DataFrame({"other": [1.0]})

# validate_dataframe returns a list of error messages
errors = StrictSchema.validate_dataframe(incomplete_df)
print(errors)  # ['Required column required is missing']
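
As a rough illustration of list-returning validation, the sketch below checks a DataFrame against a small specification and collects messages instead of raising. The `spec` format, the function name, and the wording of the null and dtype messages are all assumptions for this example, not the library's behavior.

```python
import pandas as pd


def find_schema_errors(df, spec):
    """Collect validation messages.

    spec maps column name -> (dtype string, nullable flag); this format is
    hypothetical and exists only for this sketch.
    """
    errors = []
    for name, (dtype, nullable) in spec.items():
        if name not in df.columns:
            errors.append(f"Required column {name} is missing")
            continue
        if not nullable and df[name].isna().any():
            errors.append(f"Column {name} contains null values")
        if str(df[name].dtype) != dtype:
            errors.append(
                f"Column {name} has dtype {df[name].dtype}, expected {dtype}"
            )
    return errors


spec = {"required": ("float64", False)}

missing = find_schema_errors(pd.DataFrame({"other": [1.0]}), spec)
clean = find_schema_errors(pd.DataFrame({"required": [1.0, 2.0]}), spec)
with_null = find_schema_errors(pd.DataFrame({"required": [1.0, None]}), spec)

print(missing)
print(clean)
print(with_null)
```

Returning a list rather than raising on the first problem lets a caller report every issue in one pass.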

Mathematical Operations

# All mathematical operations work with schema columns
celsius = df[SensorSchema.TEMPERATURE]
fahrenheit = celsius * 9/5 + 32
hot_mask = celsius > 25
comfort_index = celsius + df[SensorSchema.HUMIDITY] / 10

Core Components

SchemaColumn

Defines a typed column with validation and transformation capabilities.

# Basic column definition
temp_col = SchemaColumn("temperature", np.float64, nullable=True)

# Column with all options
advanced_col = SchemaColumn(
    name="pressure",
    dtype=np.float64,
    nullable=False,
    default=0.0,
    description="Atmospheric pressure in hPa"
)
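
To make the idea of a column definition concrete, here is a minimal sketch of a similar object built with a frozen dataclass. `ColumnSpec` and its `cast` method are hypothetical and only handle numeric columns; the real SchemaColumn may treat defaults and casting differently.

```python
from dataclasses import dataclass
from typing import Any, Optional

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical column definition; frozen so instances are hashable."""

    name: str
    dtype: Any
    nullable: bool = True
    default: Any = None
    description: Optional[str] = None

    def cast(self, series: pd.Series) -> pd.Series:
        # Numeric columns only: parse values, fill gaps with the default
        # (if one is set), then convert to the declared dtype.
        numeric = pd.to_numeric(series)
        if self.default is not None:
            numeric = numeric.fillna(self.default)
        return numeric.astype(self.dtype)


pressure = ColumnSpec(
    name="pressure",
    dtype=np.float64,
    nullable=False,
    default=0.0,
    description="Atmospheric pressure in hPa",
)

raw = pd.Series(["1013.2", None, "998.7"], name="pressure")
clean = pressure.cast(raw)
print(clean)
```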

BaseSchema

Abstract base class for defining DataFrame schemas with class methods for validation.

class MySchema(BaseSchema):
    COL1 = SchemaColumn("col1", np.float64)
    COL2 = SchemaColumn("col2", np.int64)

# Get schema information
columns = MySchema.get_columns()          # Dict of column definitions
names = MySchema.get_column_names()       # List of column names
errors = MySchema.validate_dataframe(df)  # Validation error list
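
Class-level helpers like these can be built by scanning the class attributes for column objects. The sketch below shows that pattern with hypothetical `Column` and `SchemaBase` classes; keying the dictionary by column name is an assumption, since the description above does not say which keys the real `get_columns()` uses.

```python
class Column:
    """Hypothetical column object carrying only a name."""

    def __init__(self, name):
        self.name = name


class SchemaBase:
    """Hypothetical base class that discovers Column attributes."""

    @classmethod
    def get_columns(cls):
        found = {}
        # Walk base classes first so subclasses can override columns.
        for klass in reversed(cls.__mro__):
            for value in vars(klass).values():
                if isinstance(value, Column):
                    found[value.name] = value
        return found

    @classmethod
    def get_column_names(cls):
        return list(cls.get_columns())


class MySchema(SchemaBase):
    COL1 = Column("col1")
    COL2 = Column("col2")


names = MySchema.get_column_names()
columns = MySchema.get_columns()
print(names)
```

Because class bodies preserve definition order, the names come back in the order the columns were declared.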

SchemaDataFrame

pandas DataFrame subclass with schema validation and type-safe column access.

# All pandas DataFrame methods work
df = SchemaDataFrame(data, schema_class=MySchema)
print(df.shape)                    # Shape
print(df.head())                   # First rows
summary = df.describe()            # Statistics
filtered = df[df['col1'] > 5]      # Filtering

# Plus schema-specific features
subset = df.select_columns([MySchema.COL1])  # Schema-based selection
print(df.schema)                             # Access to schema class

📚 Documentation

Document                  Description
🚀 Quick Start            Get started in 30 seconds
📖 API Reference          Complete API documentation
🔧 CLI Usage Guide        Command-line tool documentation
🎯 Examples & Tutorials   Real-world examples and patterns
🤝 Contributing           How to contribute
📋 Changelog              Version history

🏆 Why Choose PandasSchemaster?

Feature             PandasSchemaster                           Regular Pandas
Type Safety         ✅ Column names checked by IDE and linter   ❌ Runtime string errors
IDE Support         ✅ Full autocompletion                      ❌ No column suggestions
Refactoring         ✅ Safe column renaming                     ❌ Manual find-replace
Validation          ✅ Automatic data validation                ❌ Manual validation required
Self-Documentation  ✅ Schema as documentation                  ❌ Requires external docs
Auto-Generation     ✅ Generate schemas from data               ❌ Manual schema creation

Testing

The library includes comprehensive tests covering:

  • Basic SchemaColumn functionality and type casting
  • BaseSchema validation and column management
  • SchemaDataFrame operations and pandas compatibility
  • Mathematical operations and filtering with schema columns
  • Column access resolution and multi-column selection

Run tests with:

python -m pytest tests/

🔧 Requirements

  • Python: 3.8+ (3.9+ recommended)
  • pandas: >= 2.0.0
  • numpy: >= 1.24.0

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

  • Setting up the development environment
  • Code style guidelines
  • Testing requirements
  • Pull request process

🙏 Acknowledgments

  • Built on top of the amazing pandas library
  • Inspired by Entity Framework's code-first approach
  • Thanks to all contributors

⭐ Star this repo if PandasSchemaster helps you write better, safer pandas code!

🔗 Share with your data science team and help them discover type-safe DataFrames!


Use df[MySchema.COLUMN] for type-safe DataFrame operations! 🚀

Made with ❤️ by @gzocche

