spark-llm-eval

Distributed LLM Evaluation Framework for Apache Spark

Why spark-llm-eval?

Most LLM evaluation tools are designed for single-machine execution, so they do not scale when you need to evaluate models against millions of examples such as customer support tickets, documents, or transactions.

spark-llm-eval is built Spark-native from the ground up:

Feature	Single-Machine Tools	spark-llm-eval
Scale	~10K examples	Millions of examples
Statistical Rigor	Point estimates	Confidence intervals, significance tests
Reproducibility	Manual tracking	Delta Lake versioning + MLflow
Cost	Naive API calls	Caching, batching, rate limiting
Enterprise Ready	Limited	Unity Catalog, governance, audit

Features

Distributed Inference: Spark-native with Pandas UDFs, scales linearly with executors
Multi-Provider Support: OpenAI, Anthropic Claude, Google Gemini
Statistical Rigor: Bootstrap CIs, paired t-tests, McNemar's test, Wilcoxon, effect sizes
Smart Rate Limiting: Token bucket algorithm with RPM and TPM limits
Comprehensive Metrics: Lexical, semantic (BERTScore, embeddings), LLM-as-judge
MLflow Integration: Full experiment tracking, artifact logging, model comparison
Delta Lake Native: Versioned datasets, time travel, ACID transactions
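
To make the rate-limiting feature concrete, here is a minimal sketch of a token bucket that enforces a requests-per-minute and a tokens-per-minute budget together. It is not the library's implementation: the class names, the `try_acquire` method, and the injectable `clock` are all hypothetical, chosen only for illustration.

```python
import time


class TokenBucket:
    """Bucket that refills continuously, up to `capacity` (hypothetical helper)."""

    def __init__(self, rate_per_minute, capacity=None, clock=time.monotonic):
        self.rate_per_minute = float(rate_per_minute)
        self.capacity = float(capacity if capacity is not None else rate_per_minute)
        self.tokens = self.capacity  # start with a full bucket
        self.clock = clock
        self.last_refill = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last_refill
        gained = elapsed * self.rate_per_minute / 60.0
        self.tokens = min(self.capacity, self.tokens + gained)
        self.last_refill = now

    def available(self, cost=1.0):
        self._refill()
        return self.tokens >= cost

    def consume(self, cost=1.0):
        self.tokens -= cost


class DualRateLimiter:
    """Admit a request only when both the RPM and TPM budgets allow it."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.request_bucket = TokenBucket(rpm, clock=clock)
        self.token_bucket = TokenBucket(tpm, clock=clock)

    def try_acquire(self, estimated_tokens):
        # Check both buckets first so a rejection never spends budget.
        if self.request_bucket.available(1) and self.token_bucket.available(estimated_tokens):
            self.request_bucket.consume(1)
            self.token_bucket.consume(estimated_tokens)
            return True
        return False
```

This sketch is non-blocking: a caller that gets `False` would typically sleep briefly and try again. A production limiter shared across Spark executors would also need to coordinate budgets between processes, which is outside the scope of this example.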

Installation

PyPI (Recommended)

pip install spark-llm-eval

Databricks

In a Databricks notebook:

%pip install spark-llm-eval

Alternatively, attach the library to your cluster through the Libraries UI, using the PyPI coordinates spark-llm-eval.

Development

git clone https://github.com/bassrehab/spark-llm-eval.git
cd spark-llm-eval
pip install -e ".[dev]"

Quick Start

from pyspark.sql import SparkSession
from spark_llm_eval.core.config import ModelConfig, ModelProvider, MetricConfig, StatisticsConfig
from spark_llm_eval.core.task import EvalTask
from spark_llm_eval.orchestrator import EvaluationRunner, RunnerConfig
from spark_llm_eval.datasets import load_dataset

# Initialize Spark
spark = SparkSession.builder.appName("llm-eval").getOrCreate()

# Load evaluation dataset
data = load_dataset(
    spark,
    table_path="/mnt/delta/datasets/qa_test",
    input_column="question",
    reference_column="answer",
)

# Configure model
model_config = ModelConfig(
    provider=ModelProvider.OPENAI,
    model_name="gpt-4o",
    api_key_secret="secrets/openai_key",
)

# Define evaluation task
task = EvalTask(
    task_id="qa-eval-001",
    name="QA Model Evaluation",
    dataset_path="/mnt/delta/datasets/qa_test",
    model_config=model_config,
    prompt_template="""Answer the following question concisely.

Question: {{ input }}

Answer:""",
    metrics=[
        MetricConfig(name="exact_match"),
        MetricConfig(name="f1"),
    ],
)

# Configure runner
runner_config = RunnerConfig(
    model_config=model_config,
    metrics=task.metrics,
    statistics_config=StatisticsConfig(confidence_level=0.95),
)

# Run evaluation
runner = EvaluationRunner(spark, runner_config)
result = runner.run(data, task)

# Results include confidence intervals
for name, metric in result.metrics.items():
    ci = metric.confidence_interval
    print(f"{name}: {metric.value:.4f} [{ci[0]:.4f}, {ci[1]:.4f}]")
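
The prompt template in the task above uses a `{{ input }}` placeholder. As a rough illustration of how such a placeholder could be filled from the dataset's input column, the sketch below does the substitution with a regular expression. The `render_prompt` function is hypothetical, and the library may well use a full template engine instead.

```python
import re

# Matches placeholders such as {{ input }} or {{input}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, **fields):
    """Replace each {{ name }} placeholder with the matching field value."""

    def substitute(match):
        name = match.group(1)
        if name not in fields:
            raise KeyError(f"missing template field: {name}")
        return str(fields[name])

    return PLACEHOLDER.sub(substitute, template)


template = """Answer the following question concisely.

Question: {{ input }}

Answer:"""

prompt = render_prompt(template, input="What is the capital of France?")
print(prompt)
```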

Supported Metrics

Lexical Metrics

exact_match: Exact string match (case-insensitive)
f1: Token-level F1 score
bleu: BLEU score (1- to 4-grams)
rouge_l: ROUGE-L score
contains: Substring containment check
length_ratio: Response length relative to reference
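
For readers unfamiliar with the first two metrics, here is a minimal sketch of case-insensitive exact match and token-level F1. The normalization (lowercasing and whitespace tokenization) is an assumption; the library's own tokenization and normalization may differ.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```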

Semantic Metrics

bertscore: BERTScore (precision, recall, F1) using contextual embeddings
embedding_similarity: Cosine similarity of sentence embeddings
semantic_similarity: Overall semantic similarity score
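
The embedding_similarity metric is described as the cosine similarity of sentence embeddings. The sketch below shows only the cosine step on plain Python lists; producing the embeddings requires a model and is left out. The function name is illustrative, not the library's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The cosine is undefined for a zero vector; this sketch returns 0.0.
        return 0.0
    return dot / (norm_a * norm_b)
```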

LLM-as-Judge

llm_judge: Customizable LLM-based evaluation with rubrics
pairwise_comparison: Compare two outputs head-to-head
pointwise_grading: Score individual outputs on defined criteria
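
As a rough picture of what pointwise grading against a rubric involves, the sketch below builds a grading prompt, sends it to a judge, and parses a numeric score from the reply. Everything here is hypothetical: the prompt wording, the 1-5 scale, the `Score:` reply format, and the `judge` callable (any function that maps a prompt string to a reply string) are assumptions, not the library's interface.

```python
import re


def build_grading_prompt(question, answer, criteria):
    """Assemble a rubric-style grading prompt (illustrative wording)."""
    rubric = "\n".join(f"- {c}" for c in criteria)
    return (
        "Grade the answer on a 1-5 scale using these criteria:\n"
        f"{rubric}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Reply with 'Score: <1-5>' followed by a one-sentence reason."
    )


def parse_score(judge_reply, low=1, high=5):
    """Extract the score, or return None if it is missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def grade(question, answer, criteria, judge):
    """Run one pointwise grading call through the supplied judge function."""
    return parse_score(judge(build_grading_prompt(question, answer, criteria)))


def fake_judge(prompt):
    # Stand-in for a real model call so the example runs offline.
    return "Score: 4. Correct but terse."


score = grade("What is 2 + 2?", "4", ["correctness", "clarity"], fake_judge)
```

Returning None for unparseable replies lets the caller decide whether to retry or drop the example rather than silently recording a wrong score.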

Statistical Features

All metrics include:

Confidence intervals (bootstrap or analytical)
Standard error
Sample size
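
To show what the bootstrap option computes, here is a minimal sketch of a percentile bootstrap confidence interval for a mean score, together with the standard error of the mean. The function names and the fixed seed are illustrative assumptions; the library may use a different bootstrap variant.

```python
import random
import statistics


def bootstrap_ci(scores, confidence_level=0.95, iterations=1000, seed=0):
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(iterations)
    )
    alpha = 1.0 - confidence_level
    lower_idx = int((alpha / 2) * iterations)
    upper_idx = min(iterations - 1, int((1 - alpha / 2) * iterations))
    return means[lower_idx], means[upper_idx]


def standard_error(scores):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(scores) / len(scores) ** 0.5


# Example: 70 correct and 30 incorrect exact-match outcomes.
scores = [1.0] * 70 + [0.0] * 30
lower, upper = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.2f} CI=[{lower:.2f}, {upper:.2f}]")
```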

Model comparisons include:

Paired t-test for continuous metrics
McNemar's test for binary outcomes
Wilcoxon signed-rank test for non-parametric comparison
Effect size (Cohen's d, Hedges' g)

from spark_llm_eval.statistics import paired_ttest, cohens_d

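# Compare two models on the same examples.
# model_a_scores and model_b_scores hold per-example scores in matching order.
result = paired_ttest(model_a_scores, model_b_scores)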
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.is_significant}")

# Compute effect size
effect = cohens_d(model_a_scores, model_b_scores)
print(f"Cohen's d: {effect.value:.3f} ({effect.interpretation})")
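
For binary outcomes such as exact match, McNemar's test looks only at the examples where the two models disagree. The sketch below implements the exact (binomial) form of the test using the standard library. The function name is hypothetical, and the library may use the chi-square approximation instead.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same examples")
    # Discordant pairs: exactly one of the two models is correct.
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree
    # Under the null hypothesis each discordant pair favors either model
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


model_a = [True] * 8 + [False] * 2
model_b = [False] * 8 + [True] * 2
p_value = mcnemar_exact(model_a, model_b)
print(f"p-value: {p_value:.4f}")
```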

Configuration

Inference Configuration

from spark_llm_eval.core.config import InferenceConfig

inference_config = InferenceConfig(
    batch_size=32,              # Examples per batch
    max_retries=3,              # Retry on failure
    timeout=60.0,               # Request timeout (seconds)
    rate_limit_rpm=10000,       # Requests per minute
    rate_limit_tpm=1000000,     # Tokens per minute
)
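
The batch_size and max_retries settings correspond to two common patterns: splitting examples into fixed-size batches, and retrying failed requests with exponential backoff. The sketch below illustrates both with the standard library. The function names, the backoff schedule, and the reading of max_retries as "retries after the first attempt" are assumptions rather than documented library behavior.

```python
import random
import time


def call_with_retries(request_fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying failures with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            # Real code would catch only transient errors (timeouts, HTTP 429).
            if attempt == max_retries:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


def batched(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting.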

Statistics Configuration

from spark_llm_eval.core.config import StatisticsConfig

stats_config = StatisticsConfig(
    confidence_level=0.95,        # 95% confidence intervals
    bootstrap_iterations=1000,    # Bootstrap samples
    significance_threshold=0.05,  # p-value threshold
    compute_effect_size=True,     # Include Cohen's d
)
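
As background for the effect-size option, the sketch below computes Cohen's d with a pooled standard deviation and applies the usual small-sample correction to obtain Hedges' g. The function names are hypothetical, and the pooled form is an assumption: for paired comparisons the library may instead standardize by the standard deviation of the per-example differences.

```python
import statistics


def cohens_d_pooled(a, b):
    """Mean difference divided by the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both samples are constant")
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5


def hedges_g(a, b):
    """Cohen's d scaled by the approximate small-sample bias correction."""
    correction = 1 - 3 / (4 * (len(a) + len(b)) - 9)
    return cohens_d_pooled(a, b) * correction


d = cohens_d_pooled([2, 4, 6], [1, 3, 5])
g = hedges_g([2, 4, 6], [1, 3, 5])
print(f"Cohen's d: {d:.3f}, Hedges' g: {g:.3f}")
```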

MLflow Integration

from spark_llm_eval.tracking import create_tracker

tracker = create_tracker(
    experiment_name="llm-evaluations",
    run_name="gpt4-baseline",
)

# runner, data, and task are the objects built in the Quick Start
with tracker.start_run():
    result = runner.run(data, task)
    tracker.log_results_summary({
        name: {"value": m.value, "ci_lower": m.confidence_interval[0], "ci_upper": m.confidence_interval[1]}
        for name, m in result.metrics.items()
    })

Development

# Clone repository
git clone https://github.com/bassrehab/spark-llm-eval.git
cd spark-llm-eval

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/unit -v

# Run linting
ruff check spark_llm_eval
black --check spark_llm_eval

Roadmap

Completed

Core framework with OpenAI support
Lexical metrics with statistical rigor
MLflow integration
Delta Lake dataset integration
Multi-provider support (Anthropic, Google Gemini)
Semantic metrics (BERTScore, embeddings)
LLM-as-judge
Agent evaluation (trajectories, tool use, multi-agent debate)

In Progress

RAG evaluation metrics (context relevance, faithfulness)
Databricks notebook examples
PyPI package release

Planned

vLLM/local model support
Response caching (Delta-backed)
Unity Catalog integration
Databricks Asset Bundle distribution

License

Apache License 2.0. See LICENSE for details.

Citation

If you use spark-llm-eval in your research, please cite:

@software{spark_llm_eval,
  author = {Mitra, Subhadip},
  title = {spark-llm-eval: Distributed LLM Evaluation Framework},
  year = {2025},
  url = {https://github.com/bassrehab/spark-llm-eval}
}

Acknowledgments

Inspired by lm-evaluation-harness, ragas, and deepeval
Statistical methods based on Berg-Kirkpatrick et al. (2012)
Built for the Databricks ecosystem
