# spark-ai

Distributed AI inference for PySpark DataFrames.

spark-ai brings model-powered text processing directly into Spark transformations through a simple API. It is designed for portability and runs anywhere Spark does: local development, EMR, Dataproc, Kubernetes, or on-prem clusters.
## Features

- Spark-native sentiment analysis API
- Vectorized execution with `pandas_udf` for better throughput than row-wise Python UDFs
- Hugging Face Transformers backend
- Null-safe text handling for production pipelines
- Clean package structure for extension with additional AI tasks/backends
## Installation

```bash
pip install spark-ai
```
## Requirements

- Python 3.10+
- Apache Spark 3.5+
- A Java runtime compatible with your Spark distribution

Core dependencies are installed automatically: `pyspark`, `pandas`, `pyarrow`, `transformers`, `torch`.
## Quick Start

```python
from pyspark.sql import SparkSession

from spark_ai import AI

spark = (
    SparkSession.builder
    .appName("spark-ai-demo")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

df = spark.createDataFrame(
    [
        ("I love this product!",),
        ("This is the worst experience ever.",),
    ],
    ["review"],
)

ai = AI()
result = df.withColumn("sentiment", ai.sentiment("review"))
result.show(truncate=False)
```

Expected sentiment labels are typically `POSITIVE` / `NEGATIVE` (model-dependent).
## API

### `AI`

The primary interface for DataFrame AI transformations.

### `AI.sentiment(column_name: str)`

Applies sentiment analysis to a text column and returns a Spark `Column`.

```python
result = df.withColumn("sentiment", ai.sentiment("review"))
```
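The per-batch contract can be sketched in plain pandas. This is a toy stand-in for the real Transformers backend, assuming (per the null-safety feature above) that null inputs yield null labels; `sentiment_batch` and its keyword rule are hypothetical, not spark-ai internals:

```python
import pandas as pd


def sentiment_batch(texts: pd.Series) -> pd.Series:
    """Toy stand-in for batched model inference; nulls pass through as nulls."""
    def label(text):
        if text is None:
            return None
        return "POSITIVE" if "love" in text.lower() else "NEGATIVE"

    return texts.map(label)


batch = pd.Series(["I love this product!", None, "Terrible."])
print(sentiment_batch(batch).tolist())  # -> ['POSITIVE', None, 'NEGATIVE']
```

A vectorized UDF receives exactly this shape of input: a pandas Series per batch, returning a Series of the same length.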
## Performance Notes

spark-ai uses a vectorized pandas UDF and batched Hugging Face inference internally. For best performance in production:

- Enable Arrow: `spark.sql.execution.arrow.pyspark.enabled=true`
- Tune Spark partitions to match your cluster resources
- Tune `batch_size` for your hardware, or enable `auto_tune_batch_size=True`
- Run benchmarks on representative text lengths and data sizes
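Batching matters because it amortizes per-call model overhead across many rows. The chunking itself is simple; this illustrative sketch (not the library's actual code) shows how a `batch_size` splits a partition's texts:

```python
from typing import Iterator, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield fixed-size chunks of the input; the last chunk may be shorter."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]


print(list(batched(["a", "b", "c", "d", "e"], 2)))
# -> [['a', 'b'], ['c', 'd'], ['e']]
```

Larger batches improve GPU utilization up to the point where they exhaust memory, which is why the value is hardware-dependent.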
Model-loading behavior:
- Spark may run multiple Python workers per executor
- Each Python worker keeps its own singleton model instance
- That means model reuse is per worker process, not globally shared across all workers
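The per-worker reuse described above follows a standard lazy, module-level cache: the first call in each Python worker process pays the load cost, and later calls hit the cached instance. A minimal plain-Python sketch (illustrative only; `get_model` and the dict stand-in are not spark-ai internals):

```python
_MODEL = None  # one cached instance per Python worker process


def get_model():
    """Load the model once per process and reuse it on subsequent calls."""
    global _MODEL
    if _MODEL is None:
        _MODEL = {"name": "sentiment-model"}  # stand-in for an expensive load
    return _MODEL


assert get_model() is get_model()  # same object within one process
```

Because the cache lives in module state, separate worker processes each hold their own copy, which is exactly why reuse is per worker rather than cluster-wide.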
You can use the included benchmark script:

```bash
python examples/benchmark_sentiment.py
```

Example benchmark output:

```text
rows=20000
partitions=8
elapsed_seconds=6.383
rows_per_second=3133.1
```
## Logging and Runtime Warnings

Common Spark startup warnings, such as:

- `NativeCodeLoader: Unable to load native-hadoop library...`
- JDK incubator module notices

are typically informational in local environments and do not indicate a failure.
## Development

Clone the repository and install in editable mode:

```bash
pip install -e ".[dev]"
```

Run the tests:

```bash
pytest -q
```
## Project Structure

```text
src/spark_ai/
    ai.py                 # Public API
    config.py             # Central configuration
    udf/sentiment_udf.py  # Vectorized Spark UDF
    backends/             # Inference backend implementations
tests/unit/               # Unit tests
examples/                 # Usage and benchmark scripts
```
## File details

Details for the file `spark_infer_ai-0.1.1.tar.gz`.

- Download URL: spark_infer_ai-0.1.1.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0

| Algorithm | Hash digest |
|---|---|
| SHA256 | `8103481178e3b9242603564e6bce86bde43ea305f8a6f3431bfa44d5c20ca6f0` |
| MD5 | `24ba044c67ccabc81d75277366d796f3` |
| BLAKE2b-256 | `3664d9f62e3d1e67baeab6ba1c0e63a8873e8afb929642046b07a057ae47df53` |
Details for the file `spark_infer_ai-0.1.1-py3-none-any.whl`.

- Download URL: spark_infer_ai-0.1.1-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0

| Algorithm | Hash digest |
|---|---|
| SHA256 | `11b46e243d2767576714e9a8f43a514662b2d6d24c22894f3a66428da4688e45` |
| MD5 | `b54324204e5008d8c9ca64605946a218` |
| BLAKE2b-256 | `9c09576a0597ec6a22b1bb276a61baf2cc090443e82e13ce263c277d5c58b29c` |