# spark-ai

Distributed AI inference for PySpark DataFrames.

spark-ai brings model-powered text processing directly into Spark transformations through a simple API. It is designed for portability and runs anywhere Spark does: local development, EMR, Dataproc, Kubernetes, or on-prem clusters.
## Features

- Spark-native sentiment analysis API
- Vectorized execution with `pandas_udf` for better throughput than row-wise Python UDFs
- Hugging Face Transformers backend
- Null-safe text handling for production pipelines
- Clean package structure for extension with additional AI tasks/backends
## Installation

```bash
pip install spark-ai
```
## Requirements

- Python 3.10+
- Apache Spark 3.5+
- A Java runtime compatible with your Spark distribution

Core dependencies are installed automatically: `pyspark`, `pandas`, `pyarrow`, `transformers`, `torch`.
## Quick Start

```python
from pyspark.sql import SparkSession

from spark_ai import AI

spark = (
    SparkSession.builder
    .appName("spark-ai-demo")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

df = spark.createDataFrame(
    [
        ("I love this product!",),
        ("This is the worst experience ever.",),
    ],
    ["review"],
)

ai = AI()
result = df.withColumn("sentiment", ai.sentiment("review"))
result.show(truncate=False)
```

Expected sentiment labels are typically `POSITIVE` / `NEGATIVE` (model-dependent).
## API

### `AI`

The primary interface for DataFrame AI transformations.

### `AI.sentiment(column_name: str)`

Applies sentiment analysis to a text column and returns a Spark `Column`.

```python
result = df.withColumn("sentiment", ai.sentiment("review"))
```
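The per-batch contract can be sketched in plain pandas. This is a toy stand-in for the real Transformers backend, assuming (per the null-safety feature above) that null inputs yield null labels; `sentiment_batch` and its keyword rule are hypothetical, not spark-ai internals:

```python
import pandas as pd


def sentiment_batch(texts: pd.Series) -> pd.Series:
    """Toy stand-in for batched model inference; nulls pass through as nulls."""
    def label(text):
        if text is None:
            return None
        return "POSITIVE" if "love" in text.lower() else "NEGATIVE"

    return texts.map(label)


batch = pd.Series(["I love this product!", None, "Terrible."])
print(sentiment_batch(batch).tolist())  # -> ['POSITIVE', None, 'NEGATIVE']
```

A vectorized UDF receives exactly this shape of input: a pandas Series per batch, returning a Series of the same length.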
## Performance Notes

spark-ai uses a vectorized pandas UDF and batched Hugging Face inference internally. For best performance in production:

- Enable Arrow: `spark.sql.execution.arrow.pyspark.enabled=true`
- Tune Spark partitions to match your cluster resources
- Tune `batch_size` for your hardware, or enable `auto_tune_batch_size=True`
- Run benchmarks on representative text lengths and data sizes
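Batching matters because it amortizes per-call model overhead across many rows. The chunking itself is simple; this illustrative sketch (not the library's actual code) shows how a `batch_size` splits a partition's texts:

```python
from typing import Iterator, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield fixed-size chunks of the input; the last chunk may be shorter."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]


print(list(batched(["a", "b", "c", "d", "e"], 2)))
# -> [['a', 'b'], ['c', 'd'], ['e']]
```

Larger batches improve GPU utilization up to the point where they exhaust memory, which is why the value is hardware-dependent.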
Model-loading behavior:
- Spark may run multiple Python workers per executor
- Each Python worker keeps its own singleton model instance
- That means model reuse is per worker process, not globally shared across all workers
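The per-worker reuse described above follows a standard lazy, module-level cache: the first call in each Python worker process pays the load cost, and later calls hit the cached instance. A minimal plain-Python sketch (illustrative only; `get_model` and the dict stand-in are not spark-ai internals):

```python
_MODEL = None  # one cached instance per Python worker process


def get_model():
    """Load the model once per process and reuse it on subsequent calls."""
    global _MODEL
    if _MODEL is None:
        _MODEL = {"name": "sentiment-model"}  # stand-in for an expensive load
    return _MODEL


assert get_model() is get_model()  # same object within one process
```

Because the cache lives in module state, separate worker processes each hold their own copy, which is exactly why reuse is per worker rather than cluster-wide.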
You can use the included benchmark script:

```bash
python examples/benchmark_sentiment.py
```

Example benchmark output:

```text
rows=20000
partitions=8
elapsed_seconds=6.383
rows_per_second=3133.1
```
## Logging and Runtime Warnings

Common Spark startup warnings, such as:

- `NativeCodeLoader: Unable to load native-hadoop library...`
- JDK incubator module notices

are typically informational in local environments and do not indicate a failure.
## Development

Clone the repository and install in editable mode:

```bash
pip install -e ".[dev]"
```

Run the tests:

```bash
pytest -q
```
## Project Structure

```text
src/spark_ai/
    ai.py                 # Public API
    config.py             # Central configuration
    udf/sentiment_udf.py  # Vectorized Spark UDF
    backends/             # Inference backend implementations
tests/unit/               # Unit tests
examples/                 # Usage and benchmark scripts
```
## File details

Details for the file `spark_infer_ai-0.1.1.tar.gz`.

- Download URL: spark_infer_ai-0.1.1.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0

| Algorithm | Hash digest |
|---|---|
| SHA256 | `8103481178e3b9242603564e6bce86bde43ea305f8a6f3431bfa44d5c20ca6f0` |
| MD5 | `24ba044c67ccabc81d75277366d796f3` |
| BLAKE2b-256 | `3664d9f62e3d1e67baeab6ba1c0e63a8873e8afb929642046b07a057ae47df53` |
Details for the file `spark_infer_ai-0.1.1-py3-none-any.whl`.

- Download URL: spark_infer_ai-0.1.1-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0

| Algorithm | Hash digest |
|---|---|
| SHA256 | `11b46e243d2767576714e9a8f43a514662b2d6d24c22894f3a66428da4688e45` |
| MD5 | `b54324204e5008d8c9ca64605946a218` |
| BLAKE2b-256 | `9c09576a0597ec6a22b1bb276a61baf2cc090443e82e13ce263c277d5c58b29c` |