Skip to main content

A collection of PySpark User-Defined Table Functions (UDTFs)

Project description

PySpark UDTF Examples

PyPI uv Ruff

A collection of Python User-Defined Table Functions (UDTFs) for PySpark, demonstrating how to leverage UDTFs for complex data processing tasks.

Installation

You can quickly install the package using pip:

pip install pyspark-udtf

Usage

Fuzzy Matching (Quick Start)

This UDTF demonstrates how to use Python's standard library difflib to perform fuzzy string matching in PySpark. It takes a target string and a list of candidates, returning the best match and a similarity score.

from pyspark.sql import SparkSession
from pyspark_udtf.udtfs import FuzzyMatch

spark = SparkSession.builder.getOrCreate()

# Register the UDTF
spark.udtf.register("fuzzy_match", FuzzyMatch)

# Create a sample dataframe with typos
data = [
    ("aple", ["apple", "banana", "orange"]),
    ("bananna", ["apple", "banana", "orange"]),
    ("orange", ["apple", "banana", "orange"]),
    ("grape", ["apple", "banana", "orange"]) 
]
df = spark.createDataFrame(data, ["typo", "candidates"])

# Use the UDTF in SQL
df.createOrReplaceTempView("typos")

spark.sql("""
    SELECT * 
    FROM fuzzy_match(TABLE(SELECT typo, candidates FROM typos))
""").show()

Batch Inference Image Captioning

This UDTF demonstrates how to perform efficient batch inference against a model serving endpoint. It buffers rows and sends them in batches to reduce network overhead.

from pyspark.sql import SparkSession
from pyspark_udtf.udtfs import BatchInferenceImageCaption

spark = SparkSession.builder.getOrCreate()

# Register the UDTF
spark.udtf.register("batch_image_caption", BatchInferenceImageCaption)

# View UDTF definition and parameters
help(BatchInferenceImageCaption.func)

# Usage in SQL
# Assuming you have a table 'images' with a column 'url'
spark.sql("""
    SELECT * 
    FROM batch_image_caption(
        TABLE(SELECT url FROM images), 
        10,  -- batch_size
        'your-api-token', 
        'https://your-endpoint.com/score'
    )
""").show()

Requirements

  • Python >= 3.10
  • PySpark >= 4.0.0
  • requests
  • pandas
  • pyarrow

Documentation

For more detailed documentation, including design docs and guides for Unity Catalog integration, see the docs/ directory.

Development

We recommend using uv for extremely fast package management.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the package
uv add pyspark-udtf

Running Tests

To run the test suite:

# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/test_image_caption.py

Linting

This project uses Ruff for linting and formatting. Install dev dependencies, then run:

uv sync --extra dev   # install ruff
uv run ruff check .   # lint
uv run ruff format .  # format

Adding Dependencies

To add a new runtime dependency:

uv add package_name

To add a development dependency:

uv add --dev package_name

Bumping Version

You can bump the version automatically using uv (requires uv >= 0.7.0):

# Bump patch version (0.1.0 -> 0.1.1)
uv version --bump patch

# Bump minor version (0.1.0 -> 0.2.0)
uv version --bump minor

Alternatively, you can manually update pyproject.toml:

  1. Open pyproject.toml.
  2. Update the version field under [project]:
    [project]
    version = "0.1.1"  # Update this value
    

Publishing to PyPI

To build and publish the package to PyPI:

  1. Build the package:

    uv build
    

    This will create distributions in the dist/ directory.

  2. Publish to PyPI:

    uv publish
    

    Note: You will need to configure your PyPI credentials (API token) either via environment variables (UV_PUBLISH_TOKEN) or following uv's authentication documentation.

Cursor Skills

This repository includes Cursor skills to help with common development tasks. Skills are available in .cursor/skills/.

create-udtf

Use this skill when you want to create, write, or generate a new PySpark UDTF. It guides you through:

  1. Analyze requirements – Determine inputs, outputs, and external dependencies
  2. Design – Create a design doc in docs/design/<udtf_name>.md (required for all UDTFs)
  3. Implementation – Implement the UDTF in src/pyspark_udtf/udtfs/<udtf_name>.py
  4. Registration – Add the UDTF to src/pyspark_udtf/udtfs/__init__.py
  5. Testing – Add tests in tests/test_<udtf_name>.py

When to use: Ask Cursor to create a new UDTF, or say "use the create-udtf skill" when describing the UDTF you want to build.

Reference implementations:

  • Simple UDTF: src/pyspark_udtf/udtfs/fuzzy_match.py
  • Complex UDTF (buffering, external API): src/pyspark_udtf/udtfs/meta_capi.py

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyspark_udtf-0.1.3.tar.gz (72.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyspark_udtf-0.1.3-py3-none-any.whl (10.5 kB view details)

Uploaded Python 3

File details

Details for the file pyspark_udtf-0.1.3.tar.gz.

File metadata

  • Download URL: pyspark_udtf-0.1.3.tar.gz
  • Upload date:
  • Size: 72.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.28 {"installer":{"name":"uv","version":"0.9.28","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pyspark_udtf-0.1.3.tar.gz
Algorithm Hash digest
SHA256 7b8b884067829a9884a1b07e9133c9679deb1eef09b6e13d5e5d37157a5d7d16
MD5 b3a76b2b0f60c561ea3a5b63cb1dbb3a
BLAKE2b-256 6eaa17837722f8dce8c68af07dc9797b48930f818230d58aea045c1734561ff2

See more details on using hashes here.

File details

Details for the file pyspark_udtf-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: pyspark_udtf-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 10.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.28 {"installer":{"name":"uv","version":"0.9.28","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pyspark_udtf-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 032c11f25bb1200b8f305d6be0ca3513b930828a1394e54712dcb9fe008de0c7
MD5 fc475366f37d99eece9b4af388222f55
BLAKE2b-256 440f73e27fb8ecc3e4683acbeebed93c5f15af89f904fcbf0e9f430ff725f5fc

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page