Type-safe DataFrame library with schema validation for pandas

PandasSchemaster

Python 3.8+ | License: MIT

Overview

PandasSchemaster provides a strongly-typed interface to pandas DataFrames with automatic validation, type conversion, and schema-based column access. Use df[MySchema.COLUMN] instead of df['column'] for type-safe, IDE-friendly DataFrame operations that inherit all pandas DataFrame functionality.

Key Features

  • 🛡️ Type Safety: Schema-based column access helps catch misspelled column names before they become runtime errors
  • 🔧 IDE Support: Autocompletion and error detection for column names
  • Validation: Automatic data validation based on schema definitions
  • 🔄 Auto-casting: Seamless data type conversions
  • Full DataFrame Compatibility: Inherits from pandas.DataFrame - all methods work
  • 📖 Self-documenting: Clear, readable code with schema column references

Quick Start

Installation

pip install pandasschemaster

Basic Usage

import pandas as pd
import numpy as np
from pandasschemaster import SchemaColumn, SchemaDataFrame, BaseSchema

# Define your schema
class SensorSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    HUMIDITY = SchemaColumn("humidity", np.float64)
    SENSOR_ID = SchemaColumn("sensor_id", np.int64, nullable=False)

# Create data
data = {
    'timestamp': [pd.Timestamp.now()],
    'temperature': [23.5],
    'humidity': [45.2],
    'sensor_id': [1001]
}

# Create validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema, validate=True, auto_cast=True)

# Use schema columns for type-safe operations
temperature = df[SensorSchema.TEMPERATURE]  # Instead of df['temperature']
fahrenheit = df[SensorSchema.TEMPERATURE] * 9/5 + 32
hot_readings = df[df[SensorSchema.TEMPERATURE] > 25]

# Multi-column selection
subset = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]]

# Assignment with automatic type casting
df[SensorSchema.TEMPERATURE] = [24.1]

Schema Column Benefits

✅ Type-Safe Access

# Type-safe schema column access
temperature = df[SensorSchema.TEMPERATURE]

# vs traditional string access (error-prone)
temperature = df['temperature']  # Typos not caught until runtime
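
To make the difference concrete, here is a small sketch in plain pandas. The `Columns` class is a hypothetical stand-in for a schema, not PandasSchemaster's API. A misspelled string key is just a string, so nothing can object until the lookup runs. A misspelled attribute also fails when executed, but because it is an attribute reference, an IDE or static checker can report it without running the code.

```python
import pandas as pd


class Columns:
    """Hypothetical stand-in for a schema: one attribute per column name."""
    TEMPERATURE = "temperature"


df = pd.DataFrame({"temperature": [23.5, 26.0]})

# Misspelled string key: nothing complains until the lookup executes.
try:
    df["temperatrue"]
    string_typo_error = None
except KeyError as exc:
    string_typo_error = type(exc).__name__

# Misspelled attribute: also an error when executed, but tooling can
# flag the unknown attribute before the program ever runs.
try:
    df[Columns.TEMPERATRUE]
    attribute_typo_error = None
except AttributeError as exc:
    attribute_typo_error = type(exc).__name__

print(string_typo_error, attribute_typo_error)  # KeyError AttributeError
```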

🔧 IDE Support

  • Autocompletion: SensorSchema. shows available columns
  • Error Detection: Invalid column names highlighted
  • Go-to-Definition: Jump to schema definition

🔄 Refactoring Safety

# Rename a schema attribute with your IDE's rename refactoring and every reference is updated with it
class SensorSchema(BaseSchema):
    TEMP_CELSIUS = SchemaColumn("temperature_celsius", np.float64)  # Renamed
    # df[SensorSchema.TEMP_CELSIUS] resolves to the "temperature_celsius" column
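
A related benefit shows up when only the underlying column string changes. The sketch below uses plain pandas and a hypothetical `SensorColumns` class (an assumption for illustration, not the library's `BaseSchema`): the string is edited in one place and the call site is untouched.

```python
import pandas as pd


class SensorColumns:
    """Hypothetical stand-in for a schema class."""
    # Only this string changes when the underlying column is renamed.
    TEMPERATURE = "temperature_celsius"


def mean_temperature(frame: pd.DataFrame) -> float:
    # Call sites refer to the attribute, never to the raw string.
    return float(frame[SensorColumns.TEMPERATURE].mean())


df = pd.DataFrame({"temperature_celsius": [20.0, 30.0]})
print(mean_temperature(df))  # 25.0
```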

🐼 Full DataFrame Compatibility

SchemaDataFrame inherits directly from pandas.DataFrame, so all DataFrame methods work seamlessly:

# Create schema-validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema)

# Use all pandas DataFrame methods directly
print(df.shape)  # e.g. (100, 4) for 100 rows and 4 schema columns
print(df.head())  # First 5 rows
summary = df.describe()  # Statistical summary
grouped = df.groupby(SensorSchema.SENSOR_ID.name).mean()

# Mathematical operations
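df_scaled = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]] * 2  # Numeric columns only; timestamps cannot be multiplied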
df_filtered = df[df[SensorSchema.TEMPERATURE] > 25]

# All pandas operations work while maintaining schema validation
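
pandas supports this kind of inheritance through its documented subclassing hooks. As a rough sketch of the mechanism (the `TaggedFrame` class is hypothetical, and PandasSchemaster's own implementation may differ), overriding `_constructor` is what lets the results of pandas operations keep the subclass type:

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass used only for illustration."""

    @property
    def _constructor(self):
        # pandas uses this to build the results of most operations,
        # so they come back as TaggedFrame instead of plain DataFrame.
        return TaggedFrame


df = TaggedFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

column_subset = df[["a", "b"]]  # column selection
row_slice = df[1:2]             # row slicing

print(type(column_subset).__name__, type(row_slice).__name__)
print(row_slice.shape)  # (1, 2)
```

Without the `_constructor` override, the same operations would return plain `pandas.DataFrame` objects and any subclass behaviour would be lost after the first step.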

Advanced Features

Schema Column Types and Validation

class AdvancedSchema(BaseSchema):
    # Basic column with nullable control
    PRESSURE = SchemaColumn("pressure", np.float64, nullable=False)
    
    # Column with default value
    STATUS = SchemaColumn("status", np.dtype('object'), 
                         default="UNKNOWN", nullable=True)
    
    # Column with description
    MACHINE_ID = SchemaColumn("machine_id", np.int64, 
                             description="Unique machine identifier")

Data Type Casting and Conversion

# Auto-casting converts string values to the numeric dtypes declared in the schema
data = {
    'temperature': ["23.5", "24.1"],  # String values
    'sensor_id': ["1001", "1002"]     # String values
}

df = SchemaDataFrame(data, schema_class=SensorSchema, 
                    validate=True, auto_cast=True)

# Values are automatically cast to schema types
print(df.dtypes)
# temperature    float64
# sensor_id      Int64
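
For comparison, the same kind of conversion can be written by hand in plain pandas. This is only a sketch of the idea, with a hypothetical `target_dtypes` mapping playing the role of the schema; it says nothing about how `auto_cast` is implemented internally.

```python
import pandas as pd

# Hypothetical name -> dtype mapping, playing the role of a schema.
target_dtypes = {"temperature": "float64", "sensor_id": "int64"}

raw = pd.DataFrame({
    "temperature": ["23.5", "24.1"],
    "sensor_id": ["1001", "1002"],
})

# astype accepts a per-column mapping; values that cannot be parsed
# as the target dtype raise an error instead of being silently kept.
cast = raw.astype(target_dtypes)

print(cast.dtypes)
```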

Real-World Example

# Industrial IoT sensor data processing
class IndustrialSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    MACHINE_ID = SchemaColumn("machine_id", np.int64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    PRESSURE = SchemaColumn("pressure", np.float64)
    STATUS = SchemaColumn("status", np.dtype('object'))

# Load and validate data (sensor_data is your raw input, e.g. a dict of column lists as in Basic Usage)
df = SchemaDataFrame(sensor_data, schema_class=IndustrialSchema, validate=True)

# Type-safe analysis using schema columns
avg_temp_by_machine = df.groupby(IndustrialSchema.MACHINE_ID.name)[
    IndustrialSchema.TEMPERATURE.name
].mean()

overheating = df[df[IndustrialSchema.TEMPERATURE] > 150]
efficiency = df[IndustrialSchema.PRESSURE] / df[IndustrialSchema.TEMPERATURE]

# Filter by status using schema column
running_machines = df[df[IndustrialSchema.STATUS] == 'RUNNING']

# Complex multi-column operations
subset = df.select_columns([IndustrialSchema.TEMPERATURE, IndustrialSchema.PRESSURE])
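
The example above leaves `sensor_data` undefined. To see what the analysis steps compute, here is a plain-pandas version on a few made-up readings (the values are invented for illustration, and string column names are used so the snippet runs without the library):

```python
import pandas as pd

# Made-up readings, only so that the operations have something to run on.
sensor_data = pd.DataFrame({
    "machine_id": [1, 1, 2],
    "temperature": [100.0, 200.0, 160.0],
    "pressure": [50.0, 80.0, 40.0],
    "status": ["RUNNING", "RUNNING", "STOPPED"],
})

# Mean temperature per machine: machine 1 -> 150.0, machine 2 -> 160.0
avg_temp_by_machine = sensor_data.groupby("machine_id")["temperature"].mean()

# Rows above the 150-degree threshold, and rows whose status is RUNNING
overheating = sensor_data[sensor_data["temperature"] > 150]
running_machines = sensor_data[sensor_data["status"] == "RUNNING"]

print(avg_temp_by_machine)
print(len(overheating), len(running_machines))  # 2 2
```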

Key Features Demonstrated in Tests

Column Resolution and Access

# The library handles both string and SchemaColumn access
temp1 = df['temperature']                    # Traditional string access
temp2 = df[SensorSchema.TEMPERATURE]         # Schema column access
assert temp1.equals(temp2)                   # Both work identically

# Multi-column selection with mixed types
subset = df[[SensorSchema.TEMPERATURE, 'humidity']]  # Mixed access works
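
One way such mixed access could work is to translate column objects into their string names before handing the key to pandas. The sketch below is an assumption about the general technique, not PandasSchemaster's source; `Col`, `_resolve`, and `ResolvingFrame` are hypothetical names.

```python
import pandas as pd


class Col:
    """Hypothetical column object that only carries the column's name."""

    def __init__(self, name):
        self.name = name


def _resolve(key):
    # Turn Col objects (alone or inside a list) into plain string names;
    # anything else, such as a boolean mask, is passed through unchanged.
    if isinstance(key, Col):
        return key.name
    if isinstance(key, list):
        return [k.name if isinstance(k, Col) else k for k in key]
    return key


class ResolvingFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return ResolvingFrame

    def __getitem__(self, key):
        return super().__getitem__(_resolve(key))

    def __setitem__(self, key, value):
        super().__setitem__(_resolve(key), value)


TEMPERATURE = Col("temperature")
df = ResolvingFrame({"temperature": [20.0, 30.0], "humidity": [40.0, 50.0]})

same = df[TEMPERATURE].equals(df["temperature"])  # both forms hit one column
mixed = df[[TEMPERATURE, "humidity"]]             # objects and strings together
hot = df[df[TEMPERATURE] > 25]                    # boolean masks still work
df[TEMPERATURE] = [21.0, 31.0]                    # assignment through the object

print(same, list(mixed.columns), len(hot))
```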

Schema Validation

# Validation catches missing required columns
class StrictSchema(BaseSchema):
    REQUIRED_COL = SchemaColumn("required", np.float64, nullable=False)

# validate_dataframe returns a list of error messages
# (incomplete_df is any DataFrame that lacks the "required" column)
errors = StrictSchema.validate_dataframe(incomplete_df)
print(errors)  # ['Required column required is missing']
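
The "collect messages in a list" style of validation can be sketched in a few lines of plain pandas. The function name `find_schema_errors` and its message wording are hypothetical and do not reproduce the library's own checks or messages.

```python
import pandas as pd


def find_schema_errors(frame, required_columns):
    """Hypothetical checker: returns a list of messages, empty if all is well."""
    errors = []
    for name in required_columns:
        if name not in frame.columns:
            errors.append(f"missing required column: {name}")
        elif frame[name].isna().any():
            errors.append(f"null values in required column: {name}")
    return errors


incomplete = pd.DataFrame({"other": [1.0, 2.0]})
with_nulls = pd.DataFrame({"required": [1.0, None]})
complete = pd.DataFrame({"required": [1.0, 2.0]})

print(find_schema_errors(incomplete, ["required"]))  # ['missing required column: required']
print(find_schema_errors(with_nulls, ["required"]))  # ['null values in required column: required']
print(find_schema_errors(complete, ["required"]))    # []
```

Returning a list instead of raising on the first problem lets a caller report every issue in one pass.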

Mathematical Operations

# All mathematical operations work with schema columns
celsius = df[SensorSchema.TEMPERATURE]
fahrenheit = celsius * 9/5 + 32
hot_mask = celsius > 25
comfort_index = celsius + df[SensorSchema.HUMIDITY] / 10

Core Components

SchemaColumn

Defines a typed column with validation and transformation capabilities.

# Basic column definition
temp_col = SchemaColumn("temperature", np.float64, nullable=True)

# Column with all options
advanced_col = SchemaColumn(
    name="pressure",
    dtype=np.float64,
    nullable=False,
    default=0.0,
    description="Atmospheric pressure in hPa"
)
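
To give a feel for what a column definition with these options might carry, here is a minimal sketch built on a frozen dataclass. `ColumnSpec` and its `cast` method are hypothetical; the real `SchemaColumn` may store and apply these options differently.

```python
from dataclasses import dataclass
from typing import Any

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical column descriptor; not the library's SchemaColumn."""
    name: str
    dtype: Any
    nullable: bool = True
    default: Any = None
    description: str = ""

    def cast(self, values: pd.Series) -> pd.Series:
        # Fill gaps with the default (if one is set), then convert the dtype.
        if self.default is not None:
            values = values.fillna(self.default)
        return values.astype(self.dtype)


pressure = ColumnSpec("pressure", np.float64, nullable=False, default=0.0)
filled = pressure.cast(pd.Series([1013, None, 998]))
print(filled.tolist())  # [1013.0, 0.0, 998.0]

sensor_id = ColumnSpec("sensor_id", np.int64, nullable=False)
ids = sensor_id.cast(pd.Series(["1001", "1002"]))
print(ids.tolist())  # [1001, 1002]
```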

BaseSchema

Abstract base class for defining DataFrame schemas with class methods for validation.

class MySchema(BaseSchema):
    COL1 = SchemaColumn("col1", np.float64)
    COL2 = SchemaColumn("col2", np.int64)

# Get schema information
columns = MySchema.get_columns()          # Dict of column definitions
names = MySchema.get_column_names()       # List of column names
errors = MySchema.validate_dataframe(df)  # Validation error list
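
Class-level helpers like these are commonly built by scanning the class attributes for column objects. The following pure-Python sketch shows that pattern under assumed names (`Column`, `SchemaBase`); it mirrors the method names above for readability but is not the library's implementation.

```python
class Column:
    """Hypothetical column marker holding only a name."""

    def __init__(self, name):
        self.name = name


class SchemaBase:
    """Hypothetical base class; not the library's BaseSchema."""

    @classmethod
    def get_columns(cls):
        # Walk the MRO from base to subclass so inherited columns come
        # first and a subclass can override an attribute of the same name.
        found = {}
        for klass in reversed(cls.__mro__):
            for attr, value in vars(klass).items():
                if isinstance(value, Column):
                    found[attr] = value
        return found

    @classmethod
    def get_column_names(cls):
        return [column.name for column in cls.get_columns().values()]


class MySchema(SchemaBase):
    COL1 = Column("col1")
    COL2 = Column("col2")


print(MySchema.get_column_names())  # ['col1', 'col2']
```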

SchemaDataFrame

pandas DataFrame subclass with schema validation and type-safe column access.

# All pandas DataFrame methods work
df = SchemaDataFrame(data, schema_class=MySchema)
print(df.shape)                    # Shape
print(df.head())                   # First rows
summary = df.describe()            # Statistics
filtered = df[df['col1'] > 5]      # Filtering

# Plus schema-specific features
subset = df.select_columns([MySchema.COL1])  # Schema-based selection
print(df.schema)                             # Access to schema class

Requirements

  • Python 3.8+
  • pandas >= 2.0.0
  • numpy >= 1.24.0

License

MIT License. See LICENSE for details.

Contributing

Contributions welcome! Please read our contributing guidelines and submit pull requests.

Support

  • 🐛 Issues: GitHub Issues
  • 💡 Questions: Use GitHub Discussions

Testing

The library includes comprehensive tests covering:

  • Basic SchemaColumn functionality and type casting
  • BaseSchema validation and column management
  • SchemaDataFrame operations and pandas compatibility
  • Mathematical operations and filtering with schema columns
  • Column access resolution and multi-column selection

Run tests with:

python -m pytest tests/

Use df[MySchema.COLUMN] for type-safe DataFrame operations! 🚀
