
spark-infer-ai

Distributed AI inference for PySpark DataFrames.


spark-infer-ai (imported as spark_ai) brings model-powered text processing directly into Spark transformations through a simple column API.

It is designed for portability and works anywhere Spark runs: local development, EMR, Dataproc, Kubernetes, or on-prem clusters.

Features

  • Spark-native sentiment and zero-shot text classification APIs
  • Vectorized execution with pandas_udf for better throughput than row-wise Python UDFs
  • Hugging Face Transformers backend
  • Null-safe text handling for production pipelines
  • Clean package structure for extension with additional AI tasks/backends
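To make the vectorized, null-safe execution model concrete, here is a minimal sketch of what a batched sentiment UDF body could look like. This is an illustration, not the library's actual internals: `score_texts` is a placeholder for the real Hugging Face pipeline call, and `sentiment_batch` shows only the null-handling and batching pattern.

```python
import pandas as pd

def score_texts(texts: list[str]) -> list[str]:
    # Placeholder scorer; the real backend would batch these texts
    # through a Hugging Face Transformers pipeline.
    return ["POSITIVE" if "love" in t.lower() else "NEGATIVE" for t in texts]

def sentiment_batch(col: pd.Series) -> pd.Series:
    # Track non-null positions so nulls pass through untouched
    # instead of crashing the model call.
    mask = col.notna()
    labels = pd.Series([None] * len(col), index=col.index, dtype=object)
    if mask.any():
        labels.loc[mask] = score_texts(col[mask].tolist())
    return labels

# In Spark this function would be wrapped as a vectorized UDF, e.g.:
#   from pyspark.sql.functions import pandas_udf
#   sentiment = pandas_udf(sentiment_batch, returnType="string")
```

Because the function receives a whole `pandas.Series` per batch, the model is invoked once per batch rather than once per row, which is where the throughput advantage over row-wise Python UDFs comes from.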

Installation

pip install spark-infer-ai

Requirements

  • Python 3.10+
  • Apache Spark 3.5+
  • Java runtime compatible with your Spark distribution

Core dependencies are installed automatically:

  • pyspark
  • pandas
  • pyarrow
  • transformers
  • torch

Quick Start

from pyspark.sql import SparkSession
from spark_ai import AI

spark = (
    SparkSession.builder
    .appName("spark-ai-demo")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

df = spark.createDataFrame(
    [
        ("I love this product!",),
        ("This is the worst experience ever.",),
    ],
    ["review"],
)

ai = AI()
result = df.withColumn("sentiment", ai.sentiment("review"))
result.show(truncate=False)

Expected sentiment labels are typically POSITIVE / NEGATIVE (model-dependent).

Classify custom labels with zero-shot inference:

topic_result = df.withColumn(
    "topic",
    ai.classify("review", labels=["urgent", "complaint", "praise"]),
)
topic_result.show(truncate=False)

API

AI

Primary interface for DataFrame AI transformations.

AI.sentiment(column_name: str)

Applies sentiment analysis to a text column and returns a Spark Column.

result = df.withColumn("sentiment", ai.sentiment("review"))

AI.classify(text_col: str, labels: list[str])

Categorizes free text into your custom labels using zero-shot classification.

result = df.withColumn(
    "topic",
    ai.classify("message", labels=["urgent", "spam", "normal"]),
)

Performance Notes

spark-ai uses a vectorized Pandas UDF and batched Hugging Face inference internally.

For best performance in production:

  • Enable Arrow:
    • spark.sql.execution.arrow.pyspark.enabled=true
  • Tune Spark partitions to match your cluster resources
  • Tune batch_size for your hardware, or enable auto_tune_batch_size=True
  • Run benchmarks on representative text lengths and data sizes
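The tuning steps above can be sketched together as a session setup. This is a hedged configuration example: `batch_size` and `auto_tune_batch_size` are taken from this README, `reviews.parquet` and the partition count of 64 are illustrative placeholders, and the Arrow setting is a standard Spark option.

```python
from pyspark.sql import SparkSession
from spark_ai import AI

# Enable Arrow so pandas UDF data transfer is columnar and fast.
spark = (
    SparkSession.builder
    .appName("spark-ai-tuned")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# Repartition to roughly match available executor cores
# before running inference (64 is a placeholder value).
df = spark.read.parquet("reviews.parquet").repartition(64)

ai = AI(batch_size=64)            # tune for your hardware, or:
# ai = AI(auto_tune_batch_size=True)

result = df.withColumn("sentiment", ai.sentiment("review"))
```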

Model-loading behavior:

  • Spark may run multiple Python workers per executor
  • Each Python worker keeps its own singleton model instance
  • That means model reuse is per worker process, not globally shared across all workers
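The per-worker reuse described above is commonly implemented as a lazy module-level singleton. The sketch below illustrates the pattern only; `load_model` stands in for the real Transformers loading code and is not the library's actual API.

```python
_MODEL = None  # one cache per Python worker process

def load_model():
    # Placeholder for the expensive Hugging Face model load.
    return object()

def get_model():
    global _MODEL
    if _MODEL is None:          # first call in this worker process
        _MODEL = load_model()   # load once, then reuse
    return _MODEL
```

Since Spark may fork several Python workers per executor, each process pays the load cost once; the model is never shared across process boundaries.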

You can use the included benchmark script:

python examples/benchmark_sentiment.py

Example benchmark output:

rows=20000
partitions=8
elapsed_seconds=6.383
rows_per_second=3133.1

Logging and Runtime Warnings

Common Spark startup warnings like:

  • NativeCodeLoader: Unable to load native-hadoop library...
  • JDK incubator module notices

are typically informational in local environments and do not indicate a failure.

Development

Clone and install in editable mode:

pip install -e ".[dev]"

Run tests:

pytest -q

Project Structure

src/spark_ai/
  ai.py                  # Public API
  config.py              # Central configuration
  udf/                   # Vectorized Spark UDF
  backends/              # Inference backend implementations
tests/unit/              # Unit tests
examples/                # Usage and benchmark scripts
