PdSchema

A Python library for schema validation and type inference of pandas DataFrames using PyArrow types.

Features

  • Schema validation for pandas DataFrames
  • Type inference from pandas Series to PyArrow types
  • Rich set of built-in validators
  • Support for both Python types and pandas dtypes
  • Nullability checks
  • Custom validator support

Installation

pip install pdschema

Quick Start

import pandas as pd
from pyschema import Schema, Column
from pyschema.validators import IsPositive, IsNonEmptyString, Range

# Define your schema
schema = Schema([
    Column("age", int, nullable=False, validators=[IsPositive(), Range(0, 120)]),
    Column("name", str, nullable=False, validators=[IsNonEmptyString()]),
    Column("score", float, validators=[Range(0.0, 100.0)]),
])

# Create a DataFrame
df = pd.DataFrame({
    "age": [25, 30, 35],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95.5, 88.0, 91.2],
})

# Validate the DataFrame
schema.validate(df)  # Returns True if valid, raises ValueError if invalid

Built-in Validators

PdSchema ships with the following built-in validators:

from pyschema.validators import (
    IsPositive,
    IsNonEmptyString,
    Max,
    Min,
    Range,
    GreaterThan,
    LessThan,
    Choice,
    Length,
)

# Examples
Column("age", int, validators=[IsPositive(), Max(120)])
Column("score", float, validators=[Range(0.0, 100.0)])
Column("status", str, validators=[Choice(["active", "inactive", "pending"])])
Column("description", str, validators=[Length(min_length=10, max_length=500)])

Type Support

PdSchema accepts both Python types and pandas dtypes and maps each to a PyArrow type:

Python Types

  • int → pa.int64()
  • float → pa.float64()
  • str → pa.string()
  • bool → pa.bool_()
  • datetime → pa.timestamp("us")
  • date → pa.date32()
  • time → pa.time64("us")
  • Decimal → pa.decimal128(38, 18)

Pandas Dtypes

  • Int64Dtype → pa.int64()
  • Float64Dtype → pa.float64()
  • StringDtype → pa.string()
  • BooleanDtype → pa.bool_()
  • DatetimeTZDtype → pa.timestamp("us")
  • CategoricalDtype → pa.dictionary(pa.int32(), pa.string())

API Reference

Schema

class Schema:
    def __init__(self, columns: list[Column]):
        """Initialize a schema with a list of columns."""
        pass

    def validate(self, df: pd.DataFrame) -> bool:
        """Validate a DataFrame against the schema.

        Returns:
            bool: True if valid

        Raises:
            ValueError: If validation fails, with detailed error messages
        """
        pass

Column

class Column:
    def __init__(
        self,
        name: str,
        dtype: type,
        nullable: bool = True,
        validators: list[Validator] | None = None,
    ):
        """Initialize a column definition.

        Args:
            name: Column name
            dtype: Python type or pandas dtype
            nullable: Whether the column can contain null values
            validators: List of validators to apply
        """
        pass
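
How nullable and validators interact is not spelled out here, so the following is only a sketch of one plausible reading: nulls are rejected when nullable is False, and otherwise skipped rather than passed to validators. The helper check_column, the is_positive function, and the error message format are all hypothetical and use plain Python values instead of a DataFrame.

```python
from typing import Callable, Iterable


def check_column(
    name: str,
    values: Iterable[object],
    nullable: bool,
    checks: Iterable[Callable[[object], bool]] = (),
) -> list[str]:
    """Collect error messages for one column; an empty list means it passed."""
    checks = list(checks)
    errors = []
    for row, value in enumerate(values):
        if value is None:
            if not nullable:
                errors.append(f"{name}[{row}]: null value in non-nullable column")
            continue  # assumption: nulls are never passed to validators
        for check in checks:
            if not check(value):
                errors.append(f"{name}[{row}]: {value!r} failed {check.__name__}")
    return errors


def is_positive(value) -> bool:
    """Hypothetical element-wise check, standing in for a validator."""
    return value > 0


errors = check_column("age", [25, None, -3], nullable=False, checks=[is_positive])
```

Collecting every failure before reporting, instead of stopping at the first one, is what would make a single ValueError with detailed messages possible.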

Validators

All validators inherit from the Validator abstract base class and implement the validate method:

from abc import ABC, abstractmethod


class Validator(ABC):
    @abstractmethod
    def validate(self, value) -> bool:
        """Return True if value is valid, else False."""
        pass
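
A custom validator then only needs to subclass Validator and implement validate. The sketch below redefines a stand-in Validator so that it runs without the library installed; IsEven is a hypothetical example, not one of the built-in validators.

```python
from abc import ABC, abstractmethod


class Validator(ABC):
    """Stand-in for the library's base class, copied from the interface above."""

    @abstractmethod
    def validate(self, value) -> bool:
        """Return True if value is valid, else False."""


class IsEven(Validator):
    """Hypothetical custom validator: accepts even integers only."""

    def validate(self, value) -> bool:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return False
        return value % 2 == 0


checker = IsEven()
results = [checker.validate(v) for v in (2, 3, "4")]
```

In real use the subclass would inherit from the library's own Validator and be passed in a column's validators list alongside the built-in ones.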

Development

  1. Install Poetry (if not already installed):

    curl -sSL https://install.python-poetry.org | python3 -
    
  2. Install dependencies:

    poetry install
    
  3. Install pre-commit hooks:

    poetry run pre-commit install
    
  4. Run tests:

    poetry run pytest
    

License

[Your chosen license]
