Type-safe DataFrame library with schema validation for pandas

Maintainers

gzocche

Project description

🐼 PandasSchemaster

Type-safe DataFrame operations with schema validation for pandas

Transform your pandas DataFrames from df['column'] to df[Schema.COLUMN] for bulletproof, IDE-friendly data operations!

🎯 Why PandasSchemaster?

Before: Error-prone string-based column access

df['temprature']  # Typo - KeyError at runtime! 😱
df['temperatuur'] = values  # Wrong name on assignment - silently creates a new column! 💥

After: Type-safe schema-based column access

df[SensorSchema.TEMPERATURE]  # IDE autocomplete + typos flagged by your editor or linter! ✨
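
Why the schema form fails earlier can be seen with a minimal sketch in plain Python. `Column` and `DemoSchema` are hypothetical stand-ins, not PandasSchemaster classes; the only point is that a misspelled attribute raises `AttributeError` at the point of access, and that editors and linters can flag it without running the code.

```python
class Column:
    """Hypothetical stand-in for a schema column: it only stores a name."""

    def __init__(self, name: str) -> None:
        self.name = name


class DemoSchema:
    """Hypothetical schema: one class attribute per column."""

    TEMPERATURE = Column("temperature")


# Correct attribute: resolves to the underlying column name.
resolved = DemoSchema.TEMPERATURE.name

# Misspelled attribute: Python raises AttributeError immediately,
# and static analysis tools can report it before the code runs.
try:
    DemoSchema.TEMPRATURE
    typo_detected = False
except AttributeError:
    typo_detected = True

print(resolved, typo_detected)
```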

✨ Key Features

🛡️ Type Safety: Schema-based column access prevents runtime errors
🔧 IDE Support: Full autocompletion and error detection for column names
✅ Validation: Automatic data validation based on schema definitions
🔄 Auto-casting: Seamless data type conversions
🐼 Full DataFrame Compatibility: Inherits from pandas.DataFrame - all methods work
📖 Self-documenting: Clear, readable code with schema column references

Quick Start

Installation

pip install pandasschemaster

Basic Usage

import pandas as pd
import numpy as np
from pandasschemaster import SchemaColumn, SchemaDataFrame, BaseSchema

# Define your schema
class SensorSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    HUMIDITY = SchemaColumn("humidity", np.float64)
    SENSOR_ID = SchemaColumn("sensor_id", np.int64, nullable=False)

# Create data
data = {
    'timestamp': [pd.Timestamp.now()],
    'temperature': [23.5],
    'humidity': [45.2],
    'sensor_id': [1001]
}

# Create validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema, validate=True, auto_cast=True)

# Use schema columns for type-safe operations
temperature = df[SensorSchema.TEMPERATURE]  # Instead of df['temperature']
fahrenheit = df[SensorSchema.TEMPERATURE] * 9/5 + 32
hot_readings = df[df[SensorSchema.TEMPERATURE] > 25]

# Multi-column selection
subset = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]]

# Assignment with automatic type casting
df[SensorSchema.TEMPERATURE] = [24.1]

Command-Line Schema Generator

PandasSchemaster includes a powerful CLI tool to automatically generate schema classes from your data files:

# Generate schema from CSV and print to console
python Scripts/generate_schema.py data.csv

# Save schema to file with custom class name
python Scripts/generate_schema.py data.csv -o my_schema.py -c CustomerSchema

# Sample large files for faster processing
python Scripts/generate_schema.py large_data.csv -s 1000 -v

# On Windows, you can also use the batch file
Scripts\generate_schema.bat data.csv -o schema.py -c MySchema

# On Unix/Linux, you can use the shell script
./Scripts/generate_schema.sh data.csv -o schema.py -c MySchema

Supported File Formats

CSV (.csv) - Comma-separated values
Excel (.xlsx, .xls) - Microsoft Excel files
JSON (.json) - JavaScript Object Notation
Parquet (.parquet) - Apache Parquet format
TSV/TXT (.tsv, .txt) - Tab-separated values
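
As a rough illustration of how these formats map onto pandas readers, the sketch below dispatches on the file extension. The `READERS` table and the `load_sample` helper are assumptions made for this example and are not taken from the generator script; Excel and Parquet additionally need their optional pandas dependencies installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Extension -> reader. This mapping is an assumption based on the list above.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".tsv": lambda path: pd.read_csv(path, sep="\t"),
    ".txt": lambda path: pd.read_csv(path, sep="\t"),
}


def load_sample(path, sample_size=None):
    """Load a data file by extension and optionally keep only the first rows."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix}")
    frame = READERS[suffix](path)
    return frame if sample_size is None else frame.head(sample_size)


# Usage: write a small TSV file and load its first row.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.tsv"
    demo_path.write_text("sensor_id\ttemperature\n1001\t23.5\n1002\t24.1\n")
    sample = load_sample(demo_path, sample_size=1)

print(sample)
```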

CLI Options

Option	Description	Example
`input_file`	Path to data file (required)	`data.csv`
`-o, --output`	Output file for schema	`-o schema.py`
`-c, --class-name`	Custom class name	`-c CustomerSchema`
`-s, --sample-size`	Number of rows to analyze	`-s 1000`
`--no-nullable`	Disable nullable inference	`--no-nullable`
`-v, --verbose`	Enable detailed logging	`-v`
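
The option table can be mirrored with `argparse`. This is a hypothetical reconstruction for illustration only; the flag names come from the table, but the help texts, types, and defaults are assumptions rather than the actual source of `generate_schema.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags listed in the CLI options table."""
    parser = argparse.ArgumentParser(
        description="Generate a schema class from a data file"
    )
    parser.add_argument("input_file", help="Path to data file")
    parser.add_argument("-o", "--output", help="Output file for schema")
    parser.add_argument("-c", "--class-name", help="Custom class name")
    parser.add_argument(
        "-s", "--sample-size", type=int, help="Number of rows to analyze"
    )
    parser.add_argument(
        "--no-nullable", action="store_true", help="Disable nullable inference"
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="Enable detailed logging"
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["data.csv", "-o", "schema.py", "-c", "CustomerSchema", "-s", "1000", "-v"]
)
print(args)
```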

The generator automatically detects data types (numeric, boolean, datetime, string) and creates properly typed schema classes. For detailed usage examples, see CLI_USAGE.md.
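
To make the idea of type detection concrete, here is a minimal sketch that inspects pandas dtypes and emits schema source text. The helpers `infer_numpy_type` and `generate_schema_source` are hypothetical names for this example, and the detection rules are an assumption; the real generator may behave differently.

```python
import re

import pandas as pd
from pandas.api import types as pdt


def infer_numpy_type(series: pd.Series) -> str:
    """Map a pandas Series to the dtype expression written into the schema."""
    if pdt.is_bool_dtype(series):
        return "np.bool_"
    if pdt.is_integer_dtype(series):
        return "np.int64"
    if pdt.is_float_dtype(series):
        return "np.float64"
    if pdt.is_datetime64_any_dtype(series):
        return "np.datetime64"
    return "np.dtype('object')"  # fallback for strings and mixed data


def generate_schema_source(df: pd.DataFrame, class_name: str = "GeneratedSchema") -> str:
    """Return Python source text for a schema class describing df."""
    lines = [f"class {class_name}(BaseSchema):"]
    for name in df.columns:
        # Attribute name: upper-case, with non-identifier characters replaced.
        attr = re.sub(r"\W+", "_", str(name)).upper()
        nullable = bool(df[name].isna().any())
        lines.append(
            f'    {attr} = SchemaColumn("{name}", {infer_numpy_type(df[name])}, '
            f"nullable={nullable})"
        )
    return "\n".join(lines)


sample = pd.DataFrame(
    {
        "sensor_id": [1001, 1002],
        "temperature": [23.5, None],
        "status": ["OK", "FAIL"],
    }
)
source = generate_schema_source(sample, "SensorSchema")
print(source)
```

Here a column is marked nullable only when the sample actually contains missing values, which is one plausible reading of "nullable inference".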

Schema Column Benefits

✅ Type-Safe Access

# Type-safe schema column access
temperature = df[SensorSchema.TEMPERATURE]

# vs traditional string access (error-prone)
temperature = df['temperature']  # Typos not caught until runtime

🔧 IDE Support

Autocompletion: Typing SensorSchema. lists the available columns
Error Detection: Invalid column names highlighted
Go-to-Definition: Jump to schema definition

🔄 Refactoring Safety

# Rename a schema attribute with your IDE's rename refactoring; every reference is updated in one step
class SensorSchema(BaseSchema):
    TEMP_CELSIUS = SchemaColumn("temperature_celsius", np.float64)  # Renamed
    # References now read df[SensorSchema.TEMP_CELSIUS]; the column string lives only here

🐼 Full DataFrame Compatibility

SchemaDataFrame inherits directly from pandas.DataFrame, so all DataFrame methods work seamlessly:

# Create schema-validated DataFrame
df = SchemaDataFrame(data, schema_class=SensorSchema)

# Use all pandas DataFrame methods directly
print(df.shape)  # (rows, columns)
print(df.head())  # First 5 rows
summary = df.describe()  # Statistical summary
grouped = df.groupby(SensorSchema.SENSOR_ID.name).mean()

# Mathematical operations
df_scaled = df[[SensorSchema.TEMPERATURE, SensorSchema.HUMIDITY]] * 2  # Scale numeric columns only
df_filtered = df[df[SensorSchema.TEMPERATURE] > 25]

# All pandas operations work while maintaining schema validation

Advanced Features

Schema Column Types and Validation

class AdvancedSchema(BaseSchema):
    # Basic column with nullable control
    PRESSURE = SchemaColumn("pressure", np.float64, nullable=False)
    
    # Column with default value
    STATUS = SchemaColumn("status", np.dtype('object'), 
                         default="UNKNOWN", nullable=True)
    
    # Column with description
    MACHINE_ID = SchemaColumn("machine_id", np.int64, 
                             description="Unique machine identifier")

Data Type Casting and Conversion

# Auto-casting handles string to numeric conversion
data = {
    'temperature': ["23.5", "24.1"],  # String values
    'sensor_id': ["1001", "1002"]     # String values  
}

df = SchemaDataFrame(data, schema_class=SensorSchema, 
                    validate=True, auto_cast=True)

# Values are automatically cast to schema types
print(df.dtypes)
# temperature    float64
# sensor_id      Int64

Real-World Example

# Industrial IoT sensor data processing
class IndustrialSchema(BaseSchema):
    TIMESTAMP = SchemaColumn("timestamp", np.datetime64, nullable=False)
    MACHINE_ID = SchemaColumn("machine_id", np.int64, nullable=False)
    TEMPERATURE = SchemaColumn("temperature", np.float64)
    PRESSURE = SchemaColumn("pressure", np.float64)
    STATUS = SchemaColumn("status", np.dtype('object'))

# Load and validate data (sensor_data is a placeholder for your raw readings)
df = SchemaDataFrame(sensor_data, schema_class=IndustrialSchema, validate=True)

# Type-safe analysis using schema columns
avg_temp_by_machine = df.groupby(IndustrialSchema.MACHINE_ID.name)[
    IndustrialSchema.TEMPERATURE.name
].mean()

overheating = df[df[IndustrialSchema.TEMPERATURE] > 150]
efficiency = df[IndustrialSchema.PRESSURE] / df[IndustrialSchema.TEMPERATURE]

# Filter by status using schema column
running_machines = df[df[IndustrialSchema.STATUS] == 'RUNNING']

# Complex multi-column operations
subset = df.select_columns([IndustrialSchema.TEMPERATURE, IndustrialSchema.PRESSURE])

Key Features Demonstrated in Tests

Column Resolution and Access

# The library handles both string and SchemaColumn access
temp1 = df['temperature']                    # Traditional string access
temp2 = df[SensorSchema.TEMPERATURE]         # Schema column access
assert temp1.equals(temp2)                   # Both work identically

# Multi-column selection with mixed types
subset = df[[SensorSchema.TEMPERATURE, 'humidity']]  # Mixed access works
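
One way such key resolution could work is a `pandas.DataFrame` subclass that translates column objects to their string names before delegating to pandas. The classes `Column` and `ResolvingFrame` below are hypothetical stand-ins written for this sketch; they are not the library's `SchemaColumn` or `SchemaDataFrame`.

```python
import pandas as pd


class Column:
    """Hypothetical stand-in for SchemaColumn: it only carries a column name."""

    def __init__(self, name: str) -> None:
        self.name = name


def _resolve(key):
    """Translate Column objects (also inside lists) to plain column names."""
    if isinstance(key, Column):
        return key.name
    if isinstance(key, list):
        return [_resolve(item) for item in key]
    return key  # strings, boolean masks, slices, ... pass through unchanged


class ResolvingFrame(pd.DataFrame):
    """DataFrame subclass that accepts Column objects as keys."""

    @property
    def _constructor(self):
        # Keep the subclass type when pandas builds new frames from this one.
        return ResolvingFrame

    def __getitem__(self, key):
        return super().__getitem__(_resolve(key))

    def __setitem__(self, key, value):
        super().__setitem__(_resolve(key), value)


TEMP = Column("temperature")
frame = ResolvingFrame({"temperature": [23.5, 26.0], "humidity": [45.2, 50.1]})

by_string = frame["temperature"]
by_column = frame[TEMP]
subset = frame[[TEMP, "humidity"]]  # mixed keys
hot = frame[frame[TEMP] > 25]       # boolean mask built from a Column key
frame[TEMP] = [24.1, 27.0]          # assignment through a Column key
```

Overriding `_constructor` is the pandas-documented hook for keeping a subclass type across operations such as filtering.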

Schema Validation

# Validation catches missing required columns
class StrictSchema(BaseSchema):
    REQUIRED_COL = SchemaColumn("required", np.float64, nullable=False)

# validate_dataframe returns a list of error messages (incomplete_df lacks the "required" column)
errors = StrictSchema.validate_dataframe(incomplete_df)
print(errors)  # ['Required column required is missing']
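
The kind of check behind such an error list can be sketched as follows. `ColumnSpec` and `validate` are hypothetical names, and the rules shown (missing required columns, nulls in non-nullable columns) are an assumption about what a validator of this kind typically covers; only the wording of the first message is taken from the output above.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical stand-in for SchemaColumn with only the fields needed here."""

    name: str
    nullable: bool = True


def validate(df: pd.DataFrame, specs) -> list:
    """Collect validation problems as messages instead of raising."""
    errors = []
    for spec in specs:
        if spec.name not in df.columns:
            if not spec.nullable:
                errors.append(f"Required column {spec.name} is missing")
            continue
        if not spec.nullable and df[spec.name].isna().any():
            errors.append(f"Column {spec.name} contains null values")
    return errors


specs = [ColumnSpec("required", nullable=False)]

missing_errors = validate(pd.DataFrame({"other": [1.0]}), specs)
null_errors = validate(pd.DataFrame({"required": [1.0, None]}), specs)
clean_errors = validate(pd.DataFrame({"required": [1.0, 2.0]}), specs)
print(missing_errors, null_errors, clean_errors)
```

Returning a list lets the caller decide whether to log, raise, or repair, which matches how `validate_dataframe` is used in the example above.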

Mathematical Operations

# All mathematical operations work with schema columns
celsius = df[SensorSchema.TEMPERATURE]
fahrenheit = celsius * 9/5 + 32
hot_mask = celsius > 25
comfort_index = celsius + df[SensorSchema.HUMIDITY] / 10

Core Components

SchemaColumn

Defines a typed column with validation and transformation capabilities.

# Basic column definition
temp_col = SchemaColumn("temperature", np.float64, nullable=True)

# Column with all options
advanced_col = SchemaColumn(
    name="pressure",
    dtype=np.float64,
    nullable=False,
    default=0.0,
    description="Atmospheric pressure in hPa"
)

BaseSchema

Abstract base class for defining DataFrame schemas with class methods for validation.

class MySchema(BaseSchema):
    COL1 = SchemaColumn("col1", np.float64)
    COL2 = SchemaColumn("col2", np.int64)

# Get schema information
columns = MySchema.get_columns()          # Dict of column definitions
names = MySchema.get_column_names()       # List of column names
errors = MySchema.validate_dataframe(df)  # Validation error list
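
A base class can discover its columns by scanning class attributes. The sketch below uses hypothetical `Column` and `Schema` stand-ins; keying the result by column name is an assumption, and the real `BaseSchema` may organise this differently.

```python
class Column:
    """Hypothetical stand-in for SchemaColumn."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype


class Schema:
    """Hypothetical stand-in for BaseSchema."""

    @classmethod
    def get_columns(cls):
        """Return {column name: Column} for every Column class attribute."""
        columns = {}
        # Walk the MRO from base to subclass so subclasses can override columns.
        for klass in reversed(cls.__mro__):
            for value in vars(klass).values():
                if isinstance(value, Column):
                    columns[value.name] = value
        return columns

    @classmethod
    def get_column_names(cls):
        """Return the column names in definition order."""
        return list(cls.get_columns())


class MySchema(Schema):
    COL1 = Column("col1", float)
    COL2 = Column("col2", int)


columns = MySchema.get_columns()
names = MySchema.get_column_names()
print(names)
```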

SchemaDataFrame

pandas.DataFrame subclass with schema validation and type-safe column access.

# All pandas DataFrame methods work
df = SchemaDataFrame(data, schema_class=MySchema)
print(df.shape)                    # Shape
print(df.head())                   # First rows
summary = df.describe()            # Statistics
filtered = df[df['col1'] > 5]      # Filtering

# Plus schema-specific features
subset = df.select_columns([MySchema.COL1])  # Schema-based selection
print(df.schema)                             # Access to schema class

📚 Documentation

Document	Description
🚀 Quick Start	Get started in 30 seconds
📖 API Reference	Complete API documentation
🔧 CLI Usage Guide	Command-line tool documentation
🎯 Examples & Tutorials	Real-world examples and patterns
🤝 Contributing	How to contribute
📋 Changelog	Version history

📞 Support & Community

🐛 Bug Reports: GitHub Issues
💡 Feature Requests: GitHub Issues
💬 Questions: GitHub Discussions
📧 Email: your.email@example.com

🏆 Why Choose PandasSchemaster?

Feature	PandasSchemaster	Regular Pandas
Type Safety	✅ Column references checked by IDE and linters	❌ Runtime string errors
IDE Support	✅ Full autocompletion	❌ No column suggestions
Refactoring	✅ Safe column renaming	❌ Manual find-replace
Validation	✅ Automatic data validation	❌ Manual validation required
Self-Documentation	✅ Schema as documentation	❌ Requires external docs
Auto-Generation	✅ Generate schemas from data	❌ Manual schema creation

Testing

The library includes comprehensive tests covering:

Basic SchemaColumn functionality and type casting
BaseSchema validation and column management
SchemaDataFrame operations and pandas compatibility
Mathematical operations and filtering with schema columns
Column access resolution and multi-column selection

Run tests with:

python -m pytest tests/

🔧 Requirements

Python: 3.8+ (3.9+ recommended)
pandas: >= 2.0.0
numpy: >= 1.24.0

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

Setting up the development environment
Code style guidelines
Testing requirements
Pull request process

🙏 Acknowledgments

Built on top of the amazing pandas library
Inspired by Entity Framework's code-first approach
Thanks to all contributors

⭐ Star this repo if PandasSchemaster helps you write better, safer pandas code!

🔗 Share with your data science team and help them discover type-safe DataFrames!

Use df[MySchema.COLUMN] for type-safe DataFrame operations! 🚀

Made with ❤️ by @gzocche

Release history

1.0.2 (Jul 1, 2025)
1.0.1 (Jun 30, 2025), this version
1.0.0 (Jun 17, 2025)

Download files

Source Distribution: pandasschemaster-1.0.1.tar.gz (55.0 kB), uploaded Jun 30, 2025
Built Distribution: pandasschemaster-1.0.1-py3-none-any.whl (16.6 kB), uploaded Jun 30, 2025, Python 3

File details

Details for the file pandasschemaster-1.0.1.tar.gz.

File metadata

Download URL: pandasschemaster-1.0.1.tar.gz
Upload date: Jun 30, 2025
Size: 55.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pandasschemaster-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`3e6b4ba94ca51e0ce4db0d23d78fc8f95adb4ff98cd73089d11aa19a08937ae3`
MD5	`7781fea8da184f07a727dacbb261e7d5`
BLAKE2b-256	`9929a6207b55347e2fe5405dbf8d1cb067e87cd3d592a6bd1ed178ee9521dd88`
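
A downloaded file can be checked against the SHA256 digest in the table with the standard library. The helpers `sha256_of` and `matches` are hypothetical names for this sketch; the demo runs on a throwaway file so it works without downloading anything.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.lower()


# Demo with a throwaway file. For the real sdist, pass the path of the downloaded
# pandasschemaster-1.0.1.tar.gz and the SHA256 value from the table above.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"example payload")
    demo_ok = matches(demo, hashlib.sha256(b"example payload").hexdigest())
    demo_bad = matches(demo, "0" * 64)

print(demo_ok, demo_bad)
```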


Provenance

The following attestation bundles were made for pandasschemaster-1.0.1.tar.gz:

Publisher: python-publish.yml on gzocche/PandasSchemaster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pandasschemaster-1.0.1.tar.gz
- Subject digest: 3e6b4ba94ca51e0ce4db0d23d78fc8f95adb4ff98cd73089d11aa19a08937ae3
- Sigstore transparency entry: 256369099
- Sigstore integration time: Jun 30, 2025
Source repository:
- Permalink: gzocche/PandasSchemaster@26d8a3ed7d18643a5a33cbac9ae31a69f3b471a3
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/gzocche
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@26d8a3ed7d18643a5a33cbac9ae31a69f3b471a3
- Trigger Event: release

File details

Details for the file pandasschemaster-1.0.1-py3-none-any.whl.

File metadata

Download URL: pandasschemaster-1.0.1-py3-none-any.whl
Upload date: Jun 30, 2025
Size: 16.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pandasschemaster-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`022ad01a51e1b9b4875bfd7a1da2accd608b78a633e817b17fa55edfff7ddfbc`
MD5	`e16c1a41eea61051f2e48b97f21b8951`
BLAKE2b-256	`c741d645f98b33a87d21d73d937688bd722f9206df0223c6d421ed6e40cc1a4a`


Provenance

The following attestation bundles were made for pandasschemaster-1.0.1-py3-none-any.whl:

Publisher: python-publish.yml on gzocche/PandasSchemaster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pandasschemaster-1.0.1-py3-none-any.whl
- Subject digest: 022ad01a51e1b9b4875bfd7a1da2accd608b78a633e817b17fa55edfff7ddfbc
- Sigstore transparency entry: 256369104
- Sigstore integration time: Jun 30, 2025
Source repository:
- Permalink: gzocche/PandasSchemaster@26d8a3ed7d18643a5a33cbac9ae31a69f3b471a3
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/gzocche
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@26d8a3ed7d18643a5a33cbac9ae31a69f3b471a3
- Trigger Event: release

pandasschemaster 1.0.1
