
PySpark UDTF Examples


A collection of Python User-Defined Table Functions (UDTFs) for PySpark, demonstrating how to leverage UDTFs for complex data processing tasks.

Installation

Install the package with pip:

pip install pyspark-udtf

Usage

Fuzzy Matching (Quick Start)

This UDTF demonstrates how to use Python's standard library difflib to perform fuzzy string matching in PySpark. It takes a target string and a list of candidates, returning the best match and a similarity score.

from pyspark.sql import SparkSession
from pyspark_udtf.udtfs import FuzzyMatch

spark = SparkSession.builder.getOrCreate()

# Register the UDTF
spark.udtf.register("fuzzy_match", FuzzyMatch)

# Create a sample DataFrame with typos
data = [
    ("aple", ["apple", "banana", "orange"]),
    ("bananna", ["apple", "banana", "orange"]),
    ("orange", ["apple", "banana", "orange"]),
    ("grape", ["apple", "banana", "orange"]) 
]
df = spark.createDataFrame(data, ["typo", "candidates"])

# Use the UDTF in SQL
df.createOrReplaceTempView("typos")

spark.sql("""
    SELECT * 
    FROM fuzzy_match(TABLE(SELECT typo, candidates FROM typos))
""").show()
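For reference, the core of such a UDTF can be approximated in a few lines of plain Python with `difflib`. This is an illustrative sketch of the pattern, not the package's actual `FuzzyMatch` implementation; the class and field names are hypothetical:

```python
import difflib


class FuzzyMatchSketch:
    """Illustrative UDTF-style class: eval() yields (best_match, score) rows.

    In real PySpark this class would be decorated with @udtf(returnType=...)
    and registered via spark.udtf.register; here eval() is shown standalone
    so it can be run without a SparkSession.
    """

    def eval(self, typo: str, candidates: list):
        # SequenceMatcher.ratio() returns a similarity in [0.0, 1.0].
        scored = [
            (c, difflib.SequenceMatcher(None, typo, c).ratio())
            for c in candidates
        ]
        best, score = max(scored, key=lambda pair: pair[1])
        yield best, round(score, 3)


rows = list(FuzzyMatchSketch().eval("aple", ["apple", "banana", "orange"]))
# "aple" is closest to "apple"
```

Calling `eval()` directly like this is also a handy way to unit-test UDTF logic without spinning up Spark.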

Batch Inference Image Captioning

This UDTF demonstrates how to perform efficient batch inference against a model serving endpoint. It buffers rows and sends them in batches to reduce network overhead.

from pyspark.sql import SparkSession
from pyspark_udtf.udtfs import BatchInferenceImageCaption

spark = SparkSession.builder.getOrCreate()

# Register the UDTF
spark.udtf.register("batch_image_caption", BatchInferenceImageCaption)

# View UDTF definition and parameters
help(BatchInferenceImageCaption.func)

# Usage in SQL
# Assuming you have a table 'images' with a column 'url'
spark.sql("""
    SELECT * 
    FROM batch_image_caption(
        TABLE(SELECT url FROM images), 
        10,  -- batch_size
        'your-api-token', 
        'https://your-endpoint.com/score'
    )
""").show()
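The buffering pattern behind this UDTF can be sketched as follows. This is an illustrative skeleton, not the package's code: the endpoint call is replaced by a pluggable `send_batch` callable (a hypothetical stand-in for an HTTP POST via `requests`), so the flow of `eval()` buffering rows and `terminate()` flushing the remainder is visible:

```python
class BatchInferenceSketch:
    """Buffers rows in eval() and sends them downstream in batches.

    terminate() runs once per partition after the last eval() call and
    flushes any leftover rows, so nothing is lost mid-batch.
    """

    def __init__(self, batch_size: int, send_batch):
        self.batch_size = batch_size
        self.send_batch = send_batch  # e.g. a function that POSTs to an endpoint
        self.buffer = []

    def eval(self, url: str):
        self.buffer.append(url)
        if len(self.buffer) >= self.batch_size:
            yield from self._flush()

    def terminate(self):
        yield from self._flush()

    def _flush(self):
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        for url, caption in zip(batch, self.send_batch(batch)):
            yield url, caption


# Hypothetical stand-in for the model-serving call:
def fake_captioner(urls):
    return [f"caption for {u}" for u in urls]


udtf = BatchInferenceSketch(batch_size=2, send_batch=fake_captioner)
out = []
for u in ["a.png", "b.png", "c.png"]:
    out.extend(udtf.eval(u))
out.extend(udtf.terminate())  # flushes the final partial batch
```

With a batch size of 2 and three inputs, the first two rows are emitted on the second `eval()` call and the third only at `terminate()`, which is exactly why the flush-on-terminate step matters.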

Requirements

  • Python >= 3.10
  • PySpark >= 4.0.0
  • requests
  • pandas
  • pyarrow

Documentation

For more detailed documentation, including design docs and guides for Unity Catalog integration, see the docs/ directory.

Development

We recommend uv for fast package management.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the package
uv add pyspark-udtf

Running Tests

To run the test suite:

# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/test_image_caption.py

Adding Dependencies

To add a new runtime dependency:

uv add package_name

To add a development dependency:

uv add --dev package_name

Bumping Version

You can bump the version automatically using uv (requires uv >= 0.7.0):

# Bump patch version (0.1.0 -> 0.1.1)
uv version --bump patch

# Bump minor version (0.1.0 -> 0.2.0)
uv version --bump minor

Alternatively, you can manually update pyproject.toml:

  1. Open pyproject.toml.
  2. Update the version field under [project]:
    [project]
    version = "0.1.1"  # Update this value
    

Publishing to PyPI

To build and publish the package to PyPI:

  1. Build the package:

    uv build
    

    This will create distributions in the dist/ directory.

  2. Publish to PyPI:

    uv publish
    

    Note: You will need to configure your PyPI credentials (API token) either via the UV_PUBLISH_TOKEN environment variable or as described in uv's authentication documentation.
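
    For example, the token can be supplied via the environment (the token value below is a placeholder; use your real PyPI API token):

    ```shell
    # Placeholder token shown here; substitute your own.
    export UV_PUBLISH_TOKEN="pypi-XXXXXXXX"
    uv publish
    ```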
