A collection of PySpark User-Defined Table Functions (UDTFs)

PySpark UDTF Examples

A collection of Python User-Defined Table Functions (UDTFs) for PySpark, demonstrating how to leverage UDTFs for complex data processing tasks.

Requirements

  • Python >= 3.10
  • PySpark >= 4.0.0
  • requests
  • pandas
  • pyarrow

Installation

We recommend uv for fast package management.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the package
uv add pyspark-udtf

Usage

Batch Inference Image Captioning

This UDTF demonstrates how to perform efficient batch inference against a model serving endpoint. It buffers rows and sends them in batches to reduce network overhead.

from pyspark.sql import SparkSession
from pyspark_udtf.udtfs import BatchInferenceImageCaption

spark = SparkSession.builder.getOrCreate()

# Register the UDTF
spark.udtf.register("batch_image_caption", BatchInferenceImageCaption)

# View UDTF definition and parameters
help(BatchInferenceImageCaption.func)

# Usage in SQL
# Assumes a table 'images' with a column 'url'
spark.sql("""
    SELECT *
    FROM batch_image_caption(
        TABLE(SELECT url FROM images),
        10,  -- batch_size
        'your-api-token',  -- API token
        'https://your-endpoint.com/score'  -- model serving endpoint URL
    )
""").show()

Development

This project uses uv for dependency management and packaging.

Running Tests

To run the test suite:

# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/test_image_caption.py

Adding Dependencies

To add a new runtime dependency:

uv add package_name

To add a development dependency:

uv add --dev package_name

Bumping Version

Currently, versioning is managed manually in pyproject.toml.

  1. Open pyproject.toml.
  2. Update the version field under [project]:
    [project]
    version = "0.1.1"  # Update this value
    

Publishing to PyPI

To build and publish the package to PyPI:

  1. Build the package:

    uv build
    

    This will create distributions in the dist/ directory.

  2. Publish to PyPI:

    uv publish
    

    Note: You will need to configure your PyPI credentials (API token), either via the UV_PUBLISH_TOKEN environment variable or by following uv's authentication documentation.
