A collection of PySpark User-Defined Table Functions (UDTFs)

PySpark UDTF Examples

A collection of Python User-Defined Table Functions (UDTFs) for PySpark, demonstrating how to leverage UDTFs for complex data processing tasks.

Installation

Install the package with pip:

pip install pyspark-udtf

Usage

Fuzzy Matching (Quick Start)

This UDTF demonstrates how to use Python's standard library difflib to perform fuzzy string matching in PySpark. It takes a target string and a list of candidates, returning the best match and a similarity score.

from pyspark.sql import SparkSession
from pyspark_udtf.udtfs import FuzzyMatch

spark = SparkSession.builder.getOrCreate()

# Register the UDTF
spark.udtf.register("fuzzy_match", FuzzyMatch)

# Create a sample dataframe with typos
data = [
    ("aple", ["apple", "banana", "orange"]),
    ("bananna", ["apple", "banana", "orange"]),
    ("orange", ["apple", "banana", "orange"]),
    ("grape", ["apple", "banana", "orange"]) 
]
df = spark.createDataFrame(data, ["typo", "candidates"])

# Use the UDTF in SQL
df.createOrReplaceTempView("typos")

spark.sql("""
    SELECT * 
    FROM fuzzy_match(TABLE(SELECT typo, candidates FROM typos))
""").show()

Batch Inference Image Captioning

This UDTF demonstrates how to perform efficient batch inference against a model serving endpoint. It buffers rows and sends them in batches to reduce network overhead.

from pyspark.sql import SparkSession
from pyspark_udtf.udtfs import BatchInferenceImageCaption

spark = SparkSession.builder.getOrCreate()

# Register the UDTF
spark.udtf.register("batch_image_caption", BatchInferenceImageCaption)

# View UDTF definition and parameters
help(BatchInferenceImageCaption.func)

# Usage in SQL
# Assuming you have a table 'images' with a column 'url'
spark.sql("""
    SELECT * 
    FROM batch_image_caption(
        TABLE(SELECT url FROM images), 
        10,  -- batch_size
        'your-api-token', 
        'https://your-endpoint.com/score'
    )
""").show()

Requirements

  • Python >= 3.10
  • PySpark >= 4.0.0
  • requests
  • pandas
  • pyarrow

Documentation

For more detailed documentation, including design docs and guides for Unity Catalog integration, see the docs/ directory.

Development

We recommend uv for fast package and environment management.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Add the package as a dependency of your uv project
uv add pyspark-udtf

Running Tests

To run the test suite:

# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/test_image_caption.py

Linting

This project uses Ruff for linting and formatting. Install dev dependencies, then run:

uv sync --extra dev   # install ruff
uv run ruff check .   # lint
uv run ruff format .  # format

Adding Dependencies

To add a new runtime dependency:

uv add package_name

To add a development dependency:

uv add --dev package_name

Bumping Version

You can bump the version automatically using uv (requires uv >= 0.7.0):

# Bump patch version (0.1.0 -> 0.1.1)
uv version --bump patch

# Bump minor version (0.1.0 -> 0.2.0)
uv version --bump minor

Alternatively, you can manually update pyproject.toml:

  1. Open pyproject.toml.
  2. Update the version field under [project]:
    [project]
    version = "0.1.1"  # Update this value
    

Publishing to PyPI

To build and publish the package to PyPI:

  1. Build the package:

    uv build
    

    This will create distributions in the dist/ directory.

  2. Publish to PyPI:

    uv publish
    

    Note: You will need to provide your PyPI credentials (an API token), either through the UV_PUBLISH_TOKEN environment variable or as described in uv's authentication documentation.

