Type annotation system that allows you to specify and validate the schema of PySpark DataFrames using Python type hints for both function arguments and return values.

🚀 sparkenforce

sparkenforce is a type annotation system that lets you specify and validate PySpark DataFrame schemas using Python type hints. It validates both function arguments and return values, so schema mismatches surface at the function boundary instead of as harder-to-trace failures later in the pipeline.

Why sparkenforce?

Working with PySpark DataFrames can be error-prone when schemas don't match expectations. sparkenforce helps by:

  • Preventing runtime errors: Catch schema mismatches early with type validation
  • Improving code clarity: Function signatures show exactly what DataFrame structure is expected
  • Enforcing contracts: Ensure functions return DataFrames with the promised schema
  • Better debugging: Clear error messages when validations fail

Quick Start

Validating Input DataFrames

from sparkenforce import validate, Dataset
from pyspark.sql import functions as fn

@validate
def add_length(df: Dataset['firstname':str, ...]) -> Dataset['name':str, 'length':int]:
    return df.select(
        df.firstname.alias('name'),
        fn.length(df.firstname).alias('length')
    )

# If the input DataFrame has no 'firstname' column, validation fails.
# If the returned DataFrame doesn't match the declared schema, validation fails.
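
The 'firstname':str syntax inside the brackets is ordinary Python slice notation. The following is a minimal sketch of how such a subscript can be unpacked into a column-to-type mapping; SchemaSketch is a hypothetical stand-in written for illustration, not sparkenforce's actual implementation.

```python
class SchemaSketch:
    """Hypothetical stand-in for a Dataset-like annotation (not sparkenforce code)."""

    def __class_getitem__(cls, items):
        # A single item arrives bare; several items arrive as a tuple.
        if not isinstance(items, tuple):
            items = (items,)
        columns = {}
        allow_extra = False
        for item in items:
            if item is Ellipsis:
                # A literal ... marks the schema as open to extra columns.
                allow_extra = True
            elif isinstance(item, slice):
                # 'name':str is parsed by Python as slice('name', str, None).
                columns[item.start] = item.stop
            else:
                raise TypeError(f"unsupported schema item: {item!r}")
        return columns, allow_extra


spec = SchemaSketch['firstname':str, 'length':int, ...]
single = SchemaSketch['firstname':str]
```

Here spec holds the two declared columns plus a flag recording that ... was present, while single has the flag unset.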

Flexible Schemas with Ellipsis

Use ... to allow additional columns beyond the specified ones:

@validate
def filter_names(df: Dataset['firstname':str, 'lastname':str, ...]):
    """Requires firstname and lastname, but allows other columns too."""
    return df.filter(df.firstname != "")
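
As a rough illustration of the difference ... makes, the check can be thought of as a subset test on column names when the schema is open and an equality test when it is closed. The helper below is a hypothetical sketch that works on plain column-name collections; it ignores column types and is not taken from the library.

```python
def columns_ok(expected, actual, allow_extra):
    """Hypothetical helper: compare column names only, ignoring types."""
    expected, actual = set(expected), set(actual)
    if allow_extra:
        # Open schema (with ...): every declared column must be present.
        return expected <= actual
    # Closed schema (no ...): the column sets must be identical.
    return expected == actual


declared = ["firstname", "lastname"]
frame_columns = ["firstname", "lastname", "age"]

open_result = columns_ok(declared, frame_columns, allow_extra=True)
closed_result = columns_ok(declared, frame_columns, allow_extra=False)
```

With the extra age column, the open check passes and the closed check does not.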

Return Value Validation

sparkenforce validates that your function returns exactly what you promise:

@validate
def get_summary(df: Dataset['firstname':str, ...]) -> Dataset['firstname':str, 'summary':str]:
    return df.select(
        'firstname',
        fn.lit('processed').alias('summary'),
    )
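
To show the general shape of a decorator that checks both arguments and the return value, here is a simplified sketch. Everything in it is an assumption made for illustration: validate_sketch and FakeFrame are hypothetical names, annotations are plain frozensets of column names, arguments are treated as open schemas and return values as exact ones, and column types are not checked. It does not reproduce how sparkenforce's validate works internally.

```python
import functools
import inspect


class FakeFrame:
    """Stand-in for a DataFrame: exposes only a .columns list."""

    def __init__(self, columns):
        self.columns = list(columns)


def validate_sketch(func):
    """Hypothetical decorator: check column names on the way in and out."""
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = sig.parameters[name].annotation
            # Only parameters annotated with a frozenset are checked.
            if isinstance(expected, frozenset):
                missing = expected - set(value.columns)
                if missing:
                    raise TypeError(f"argument {name!r} is missing columns: {sorted(missing)}")
        result = func(*args, **kwargs)
        expected = sig.return_annotation
        # The return value must have exactly the declared columns.
        if isinstance(expected, frozenset) and set(result.columns) != expected:
            raise TypeError(
                f"return value columns mismatch: expected {sorted(expected)}, "
                f"got {sorted(result.columns)}"
            )
        return result

    return wrapper


@validate_sketch
def good(df: frozenset({"firstname"})) -> frozenset({"firstname", "summary"}):
    return FakeFrame(["firstname", "summary"])


@validate_sketch
def bad(df: frozenset({"firstname"})) -> frozenset({"firstname", "summary"}):
    return FakeFrame(["firstname"])  # 'summary' is missing
```

Calling good(FakeFrame(["firstname", "age"])) succeeds, while bad(...) with the same input raises TypeError because the returned columns differ from the declared ones.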

Error Handling

When validation fails, sparkenforce raises DatasetValidationError with a message that names the missing and unexpected columns:

# This will raise DatasetValidationError with detailed message:
# "return value columns mismatch. Expected exactly {'name', 'length'},
#  got {'lastname', 'firstname'}. missing columns: {'name', 'length'},
#  unexpected columns: {'lastname', 'firstname'}"

@validate
def bad_function(df: Dataset['firstname':str, ...]) -> Dataset['name':str, 'length':int]:
    return df.select('firstname', 'lastname')  # Wrong columns!
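
The missing and unexpected columns in a message like the one above are two set differences. The sketch below shows that arithmetic on the column names from this example; describe_mismatch is a hypothetical helper, and its wording and formatting are only loosely modeled on the library's message.

```python
def describe_mismatch(expected, actual):
    """Hypothetical helper: return None on a match, else a description."""
    expected, actual = set(expected), set(actual)
    if expected == actual:
        return None
    missing = sorted(expected - actual)      # declared but not returned
    unexpected = sorted(actual - expected)   # returned but not declared
    return (
        f"return value columns mismatch. Expected exactly {sorted(expected)}, "
        f"got {sorted(actual)}. missing columns: {missing}, "
        f"unexpected columns: {unexpected}"
    )


message = describe_mismatch({"name", "length"}, {"firstname", "lastname"})
```

Sorting the names keeps the message stable across runs, since Python sets have no guaranteed display order.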

Installation

Install sparkenforce using pip:

pip install sparkenforce

Or if you're using uv:

uv add sparkenforce

Development Setup

Step 1: Create virtual environment

uv venv

Step 2: Activate environment

# Linux/Mac
source .venv/bin/activate

# Windows
.venv\Scripts\activate

Step 3: Install dependencies

uv sync

CLI Commands

# Run tests
task tests

# Type checking
task type

# Linting
task lint

# Format code
task format

# Coverage report
task coverage

Inspiration

This project builds on dataenforce, extending it with additional validation capabilities for PySpark DataFrame workflows.

License

Apache Software License v2.0

Contact

Created by Agustín Recoba
