A PySpark transform registry with MLflow integration.

PySpark Transform Registry

A simplified library for registering and loading PySpark transform functions using MLflow's model registry.

Installation

pip install pyspark-transform-registry

uv add pyspark-transform-registry

Quick Start

Register a Function

from pyspark_transform_registry import register_transform
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def clean_data(df: DataFrame) -> DataFrame:
    """Remove invalid records and standardize data."""
    return df.filter(F.col("amount") > 0).withColumn("status", F.lit("clean"))

# Register the transform
logged_model = register_transform(
    func=clean_data,
    name="analytics.etl.clean_data",
    description="Data cleaning transformation"
)

Load and Use a Transform

from pyspark_transform_registry import load_transform, load_transform_uri

# Load the registered transform
clean_data_func = load_transform("analytics.etl.clean_data", version=1)

# Or
clean_data_func = load_transform_uri("transforms:/analytics.etl.clean_data/1")

# Use it on your data
result = clean_data_func(your_dataframe)
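
The two loaders above identify a transform either by name plus integer version or by a single `transforms:/<name>/<version>` URI. The sketch below shows how the two forms relate. `build_transform_uri` and `parse_transform_uri` are hypothetical helpers, not part of the library, and the URI layout is inferred only from the example above.

```python
from typing import Tuple

URI_PREFIX = "transforms:/"  # scheme as it appears in the example above


def build_transform_uri(name: str, version: int) -> str:
    """Hypothetical helper: combine a name and version into a URI string."""
    return f"{URI_PREFIX}{name}/{version}"


def parse_transform_uri(uri: str) -> Tuple[str, int]:
    """Hypothetical helper: split a URI back into (name, version)."""
    if not uri.startswith(URI_PREFIX):
        raise ValueError(f"expected a URI starting with {URI_PREFIX!r}: {uri!r}")
    # Split on the last "/" so dotted names stay intact.
    name, sep, version = uri[len(URI_PREFIX):].rpartition("/")
    if not sep or not name or not version.isdigit():
        raise ValueError(f"expected '{URI_PREFIX}<name>/<version>': {uri!r}")
    return name, int(version)


uri = build_transform_uri("analytics.etl.clean_data", 1)
name, version = parse_transform_uri(uri)
print(uri, name, version)
```

Keeping the version as an integer in the parsed form matches the `version=1` argument that `load_transform` takes.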

Features

Simple API: Just two main functions - register_transform() and load_transform()
Direct Registration: Register transforms directly from Python code
File-based Registration: Load and register transforms from Python files
Automatic Versioning: Integer-based versioning with automatic incrementing
MLflow Integration: Built on MLflow's model registry
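
To illustrate the automatic versioning listed above, here is a toy in-memory model in which each registration under the same name receives the next integer version. `ToyRegistry` is hypothetical and does not reflect how the library or MLflow stores versions. It assumes numbering starts at 1, as the `version=1` examples suggest.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ToyRegistry:
    """Hypothetical in-memory stand-in for a versioned registry."""

    def __init__(self) -> None:
        # Each name maps to a list of functions; list position + 1 is the version.
        self._versions: Dict[str, List[Callable]] = defaultdict(list)

    def register(self, name: str, func: Callable) -> int:
        """Store func under name and return its newly assigned version."""
        self._versions[name].append(func)
        return len(self._versions[name])

    def load(self, name: str, version: int) -> Callable:
        """Return the function stored as the given version of name."""
        versions = self._versions.get(name, [])
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]


registry = ToyRegistry()
v1 = registry.register("analytics.etl.clean_data", lambda rows: rows)
v2 = registry.register(
    "analytics.etl.clean_data", lambda rows: [r for r in rows if r > 0]
)
print(v1, v2)
```

Earlier versions remain loadable after a new one is registered, which is the property that makes pinning `version=1` in downstream code safe in this toy model.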

Usage Examples

Direct Transform Registration

from pyspark_transform_registry import register_transform
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def risk_scorer(df: DataFrame, threshold: float = 100.0) -> DataFrame:
    """Calculate risk scores based on amount."""
    return df.withColumn(
        "risk_score",
        F.when(F.col("amount") > threshold, "high").otherwise("low")
    )

# Register with metadata
register_transform(
    func=risk_scorer,
    name="finance.scoring.risk_scorer",
    description="Risk scoring transformation",
    extra_pip_requirements=["numpy>=1.20.0"],
    tags={"team": "finance", "category": "scoring"}
)

File-based Registration

# transforms/data_processors.py
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def feature_engineer(df: DataFrame) -> DataFrame:
    """Create engineered features."""
    return df.withColumn("feature_1", F.col("amount") * 2)

# Register from file (in a separate script or notebook)
from pyspark_transform_registry import register_transform
register_transform(
    file_path="transforms/data_processors.py",
    function_name="feature_engineer",
    name="ml.features.feature_engineer",
    description="Feature engineering pipeline"
)
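
File-based registration has to locate a function given a `file_path` and a `function_name`. One way to do that with the standard library is sketched below. `load_function_from_file` is a hypothetical helper and an assumption about the general approach, not the library's actual implementation. The demo uses a throwaway file with a plain function so it runs without Spark.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Callable


def load_function_from_file(file_path: str, function_name: str) -> Callable:
    """Hypothetical helper: import a module from a path and return one function."""
    spec = importlib.util.spec_from_file_location(Path(file_path).stem, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {file_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    func = getattr(module, function_name, None)
    if not callable(func):
        raise AttributeError(f"{function_name!r} is not a function in {file_path}")
    return func


# Demonstrate with a temporary file standing in for transforms/data_processors.py.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_processors.py"
    path.write_text("def double(x):\n    return x * 2\n")
    double = load_function_from_file(str(path), "double")
    doubled = double(21)

print(doubled)
```

Because `exec_module` executes the whole file, any top-level side effects in the transform file run at load time, so transform files are best kept to imports and function definitions.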

Source Code Inspection

from pyspark_transform_registry import load_transform

# Load a transform
transform = load_transform("retail.processing.process_orders", version=1)

# Get the original source code
source_code = transform.get_source()
print(source_code)  # Shows the original function definition

# Get the original function for inspection
original_func = transform.get_original_function()
print(f"Function name: {original_func.__name__}")
print(f"Docstring: {original_func.__doc__}")
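
If `get_original_function()` returns an ordinary Python function (an assumption here), the standard `inspect` module can report more than the name and docstring, for example which parameters have defaults. The sketch below uses a local stand-in for `risk_scorer` rather than a loaded transform, so it runs without Spark or a registry.

```python
import inspect


def risk_scorer(df, threshold: float = 100.0):
    """Calculate risk scores based on amount."""
    return df  # body omitted; only the signature matters here


signature = inspect.signature(risk_scorer)

# Collect only the parameters that declare a default value.
defaults = {
    name: param.default
    for name, param in signature.parameters.items()
    if param.default is not inspect.Parameter.empty
}
summary = inspect.getdoc(risk_scorer)

print(list(signature.parameters), defaults, summary)
```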

Managing Transform Dependencies

Install dependencies for registered transforms automatically:

from pyspark_transform_registry import install_transform_requirements, load_transform

# Install all dependencies for a transform
install_transform_requirements("transforms:/analytics.etl.clean_data/1")

# Then load the transform (dependencies are now available)
transform = load_transform("analytics.etl.clean_data", version=1)

You can also exclude certain packages (useful when running in environments like Databricks where some packages are pre-installed):

# Install dependencies but exclude packages already available in the environment
install_transform_requirements(
    "transforms:/analytics.etl.clean_data/1",
    exclude_packages=["pyspark", "mlflow", "pandas"]
)
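
The effect of `exclude_packages` can be pictured as filtering a requirements list by package name before installation. The sketch below is an illustration of that idea only: `requirement_name` and `filter_requirements` are hypothetical helpers, the sample requirement strings are made up, and the library's own matching rules may differ.

```python
import re
from typing import Iterable, List


def requirement_name(requirement: str) -> str:
    """Return the normalized package name at the start of a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"not a requirement string: {requirement!r}")
    # PEP 503 normalization: lowercase, runs of "-", "_", "." become one "-".
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def filter_requirements(
    requirements: Iterable[str], exclude_packages: Iterable[str]
) -> List[str]:
    """Hypothetical helper: drop requirements whose package name is excluded."""
    excluded = {requirement_name(p) for p in exclude_packages}
    return [r for r in requirements if requirement_name(r) not in excluded]


remaining = filter_requirements(
    ["pyspark>=3.0", "mlflow>=3.0", "numpy>=1.20.0"],  # made-up sample list
    exclude_packages=["pyspark", "mlflow", "pandas"],
)
print(remaining)
```

Comparing normalized names rather than raw strings means an exclusion such as `"PySpark"` would still match a `pyspark>=3.0` requirement in this sketch.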

Requirements

Python 3.9+
PySpark 3.0+
MLflow 3.0+

Development

# Install development dependencies
make install

# Run tests
make test

# Run linting and formatting
make check

License

MIT License

Project details

Release history

0.12.0 (this version): Sep 11, 2025
0.11.0: Sep 11, 2025
0.10.0: Sep 11, 2025
0.9.0: Sep 11, 2025
0.8.0: Sep 11, 2025
0.7.0: Sep 5, 2025
0.6.0: Sep 5, 2025
0.5.0: Aug 19, 2025
0.4.0: Jul 24, 2025
0.3.0: Jul 23, 2025
0.2.0: Jul 23, 2025
0.1.0: Jul 23, 2025
0.0.0: Sep 5, 2025

Download files

Source Distribution

pyspark_transform_registry-0.12.0.tar.gz (15.7 kB)

Uploaded Sep 11, 2025 Source

Built Distribution

pyspark_transform_registry-0.12.0-py3-none-any.whl (9.6 kB)

Uploaded Sep 11, 2025 Python 3

File details

Details for the file pyspark_transform_registry-0.12.0.tar.gz.

File metadata

Download URL: pyspark_transform_registry-0.12.0.tar.gz
Upload date: Sep 11, 2025
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.15

File hashes

Hashes for pyspark_transform_registry-0.12.0.tar.gz
Algorithm	Hash digest
SHA256	`28959d95c2977d27a0abd1ab4e21e9ee987f5c1ea1b2baef55a793661b9ccff3`
MD5	`267ad6b8cd6826bba55cfd08890349ac`
BLAKE2b-256	`97c08988dc7a7a50134768e627ca2efffc86ce2ef2564b0ca325893137c2a44f`

File details

Details for the file pyspark_transform_registry-0.12.0-py3-none-any.whl.

File metadata

Download URL: pyspark_transform_registry-0.12.0-py3-none-any.whl
Upload date: Sep 11, 2025
Size: 9.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.15

File hashes

Hashes for pyspark_transform_registry-0.12.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c106134aee4d88ace8e8f72e8caf681b282d964ced31f386ca8d6f2acd6f1e09`
MD5	`d4a15cb4ae58f631b5ce624c31b1c502`
BLAKE2b-256	`ea743e541cb901733e2c6268f3eebde5778e090d0dd75ca2e3488237b63bd67f`
