Distributed LLM evaluation framework for Apache Spark

spark-llm-eval

Requirements: Python 3.10+, PySpark 3.5+. License: Apache 2.0.


Why spark-llm-eval?

Most current LLM evaluation tools are designed for single-machine execution. They do not scale when you need to evaluate models against millions of examples, such as customer support tickets, documents, or transactions.

spark-llm-eval is built Spark-native from the ground up:

Feature           | Single-Machine Tools | spark-llm-eval
Scale             | ~10K examples        | Millions of examples
Statistical Rigor | Point estimates      | Confidence intervals, significance tests
Reproducibility   | Manual tracking      | Delta Lake versioning + MLflow
Cost              | Naive API calls      | Caching, batching, rate limiting
Enterprise Ready  | Limited              | Unity Catalog (planned), governance, audit

Features

  • Distributed Inference: Spark-native with Pandas UDFs; throughput scales with the number of executors, subject to provider rate limits
  • Multi-Provider Support: OpenAI, Anthropic Claude, Google Gemini
  • Statistical Rigor: Bootstrap CIs, paired t-tests, McNemar's test, Wilcoxon, effect sizes
  • Smart Rate Limiting: Token bucket algorithm with RPM and TPM limits
  • Comprehensive Metrics: Lexical, semantic (BERTScore, embeddings), LLM-as-judge
  • Response Caching: Delta-backed content-addressable cache with TTL, replay mode for metrics iteration
  • MLflow Integration: Full experiment tracking, artifact logging, model comparison
  • Delta Lake Native: Versioned datasets, time travel, ACID transactions
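
To make the idea of a content-addressable cache with a TTL more concrete, here is a minimal sketch using only the Python standard library. The helper names `cache_key` and `is_fresh`, and the choice of which fields are hashed, are assumptions made for illustration; they are not taken from the spark-llm-eval source.

```python
import hashlib
import json
import time
from typing import Optional


def cache_key(provider: str, model_name: str, prompt: str, params: dict) -> str:
    """Derive a deterministic key from the request contents (hypothetical helper)."""
    payload = json.dumps(
        {"provider": provider, "model": model_name, "prompt": prompt, "params": params},
        sort_keys=True,  # stable key ordering so equal requests hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_fresh(created_at: float, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """Return True while a cached entry is younger than its TTL."""
    current = time.time() if now is None else now
    return (current - created_at) < ttl_seconds
```

Because the key depends only on the request contents, re-running an evaluation with identical prompts and parameters can reuse stored responses, which is the property a replay mode relies on.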

Installation

PyPI (Recommended)

pip install spark-llm-eval

Databricks

In a Databricks notebook:

%pip install spark-llm-eval

Alternatively, install it on your cluster through the Libraries UI using the PyPI package name: spark-llm-eval

Development

git clone https://github.com/bassrehab/spark-llm-eval.git
cd spark-llm-eval
pip install -e ".[dev]"

Quick Start

from pyspark.sql import SparkSession
from spark_llm_eval.core.config import ModelConfig, ModelProvider, MetricConfig, StatisticsConfig
from spark_llm_eval.core.task import EvalTask
from spark_llm_eval.orchestrator import EvaluationRunner, RunnerConfig
from spark_llm_eval.datasets import load_dataset

# Initialize Spark
spark = SparkSession.builder.appName("llm-eval").getOrCreate()

# Load evaluation dataset
data = load_dataset(
    spark,
    table_path="/mnt/delta/datasets/qa_test",
    input_column="question",
    reference_column="answer",
)

# Configure model
model_config = ModelConfig(
    provider=ModelProvider.OPENAI,
    model_name="gpt-4o",
    api_key_secret="secrets/openai_key",
)

# Define evaluation task
task = EvalTask(
    task_id="qa-eval-001",
    name="QA Model Evaluation",
    dataset_path="/mnt/delta/datasets/qa_test",
    model_config=model_config,
    prompt_template="""Answer the following question concisely.

Question: {{ input }}

Answer:""",
    metrics=[
        MetricConfig(name="exact_match"),
        MetricConfig(name="f1"),
    ],
)

# Configure runner
runner_config = RunnerConfig(
    model_config=model_config,
    metrics=task.metrics,
    statistics_config=StatisticsConfig(confidence_level=0.95),
)

# Run evaluation
runner = EvaluationRunner(spark, runner_config)
result = runner.run(data, task)

# Results include confidence intervals
for name, metric in result.metrics.items():
    ci = metric.confidence_interval
    print(f"{name}: {metric.value:.4f} [{ci[0]:.4f}, {ci[1]:.4f}]")

Supported Metrics

Lexical Metrics

  • exact_match: Exact string match (case-insensitive)
  • f1: Token-level F1 score
  • bleu: BLEU score (1-4 grams)
  • rouge_l: ROUGE-L score
  • contains: Substring containment check
  • length_ratio: Response length relative to reference
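
For readers unfamiliar with these metrics, the sketch below shows one common way to compute case-insensitive exact match and token-level F1. The whitespace tokenization and the function names `exact_match` and `token_f1` are assumptions for illustration; the library's own normalization may differ.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "the cat sat" against the reference "the cat" has precision 2/3 and recall 1, giving an F1 of 0.8.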

Semantic Metrics

  • bertscore: BERTScore (precision, recall, F1) using contextual embeddings
  • embedding_similarity: Cosine similarity of sentence embeddings
  • semantic_similarity: Overall semantic similarity score
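
The cosine similarity underlying embedding-based metrics can be sketched in a few lines of plain Python. The function name `cosine_similarity` and the zero-vector convention below are illustrative assumptions; producing the embeddings themselves is out of scope here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```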

LLM-as-Judge

  • llm_judge: Customizable LLM-based evaluation with rubrics
  • pairwise_comparison: Compare two outputs head-to-head
  • pointwise_grading: Score individual outputs on defined criteria

RAG Metrics

  • context_relevance: Is retrieved context relevant to the query?
  • faithfulness: Is the answer grounded in context (no hallucinations)?
  • answer_relevance: Does the answer address the query?
  • context_precision: Ranking quality of retrieved chunks
  • context_recall: Coverage of ground truth in retrieved context
  • context_relevance_embedding: Embedding-based relevance (faster, cheaper)
  • answer_relevance_embedding: Embedding-based relevance (faster, cheaper)
  • faithfulness_nli: NLI-based faithfulness (faster, cheaper)
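
As an illustration of what "ranking quality of retrieved chunks" can mean, the sketch below averages precision@k over the ranks at which a relevant chunk appears. This definition and the function name `context_precision` are assumptions; the library may compute the metric differently, and deciding which chunks are relevant (for example with an LLM judge) is not shown.

```python
from typing import Sequence


def context_precision(relevance: Sequence[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` chunks
    return precision_sum / hits if hits else 0.0
```

Under this definition, relevant chunks ranked early score higher than the same chunks ranked late: a relevance pattern of relevant, irrelevant, relevant yields (1/1 + 2/3) / 2 = 5/6.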

Statistical Features

All metrics include:

  • Confidence intervals (bootstrap or analytical)
  • Standard error
  • Sample size
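
A percentile bootstrap is one standard way to obtain such a confidence interval for a mean score. The sketch below is a single-machine illustration under that assumption; the function name `bootstrap_ci` and its parameters are hypothetical and do not describe how the library distributes the computation.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    confidence_level: float = 0.95,
    iterations: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(iterations)
    )
    alpha = 1.0 - confidence_level
    lower_idx = int((alpha / 2) * iterations)
    upper_idx = min(iterations - 1, int((1 - alpha / 2) * iterations))
    return means[lower_idx], means[upper_idx]
```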

Model comparisons include:

  • Paired t-test for continuous metrics
  • McNemar's test for binary outcomes
  • Wilcoxon signed-rank for non-parametric comparison
  • Effect size (Cohen's d, Hedges' g)

from spark_llm_eval.statistics import paired_ttest, cohens_d

# Compare two models
result = paired_ttest(model_a_scores, model_b_scores)
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.is_significant}")

# Compute effect size
effect = cohens_d(model_a_scores, model_b_scores)
print(f"Cohen's d: {effect.value:.3f} ({effect.interpretation})")
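
For binary outcomes such as exact match, McNemar's test looks only at the examples on which the two models disagree. The following is a minimal sketch of the exact (binomial) form of the test using the standard library; the function name `mcnemar_exact` is hypothetical and is not claimed to match the library's API.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(
    model_a_correct: Sequence[bool], model_b_correct: Sequence[bool]
) -> float:
    """Two-sided exact McNemar p-value from paired binary outcomes."""
    if len(model_a_correct) != len(model_b_correct):
        raise ValueError("outcome lists must be paired and equal in length")
    pairs = list(zip(model_a_correct, model_b_correct))
    a_only = sum(1 for a, b in pairs if a and not b)  # A right, B wrong
    b_only = sum(1 for a, b in pairs if b and not a)  # B right, A wrong
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    k = min(a_only, b_only)
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With 8 discordant pairs that all favor one model, the p-value is 2 * (1/256) = 0.0078125.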

Configuration

Inference Configuration

from spark_llm_eval.core.config import InferenceConfig

inference_config = InferenceConfig(
    batch_size=32,              # Examples per batch
    max_retries=3,              # Retry on failure
    timeout=60.0,               # Request timeout (seconds)
    rate_limit_rpm=10000,       # Requests per minute
    rate_limit_tpm=1000000,     # Tokens per minute
)
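
To show how a token bucket can enforce both limits at once, here is a minimal single-process sketch. The classes `TokenBucket` and `DualRateLimiter` and the injectable `clock` argument are assumptions for illustration; the library's own limiter, and how it coordinates limits across Spark executors, may differ.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Bucket that refills continuously at `rate_per_minute`, capped at `capacity`."""

    def __init__(
        self,
        rate_per_minute: float,
        capacity: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = rate_per_minute if capacity is None else capacity
        self.tokens = self.capacity  # start full
        self._clock = clock
        self._last = clock()

    def try_acquire(self, amount: float = 1.0) -> bool:
        """Refill for the elapsed time, then take `amount` tokens if available."""
        now = self._clock()
        self.tokens = min(
            self.capacity, self.tokens + (now - self._last) * self.rate_per_second
        )
        self._last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False


class DualRateLimiter:
    """Admit a request only if both the RPM and the TPM bucket have room."""

    def __init__(
        self, rpm: float, tpm: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.requests = TokenBucket(rpm, clock=clock)
        self.tokens = TokenBucket(tpm, clock=clock)

    def try_acquire(self, estimated_tokens: float) -> bool:
        if not self.requests.try_acquire(1.0):
            return False
        if not self.tokens.try_acquire(estimated_tokens):
            self.requests.tokens += 1.0  # refund the request slot taken above
            return False
        return True
```

A caller would typically sleep briefly and retry when `try_acquire` returns False, passing an estimate of the prompt plus completion tokens for the request.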

Statistics Configuration

from spark_llm_eval.core.config import StatisticsConfig

stats_config = StatisticsConfig(
    confidence_level=0.95,        # 95% confidence intervals
    bootstrap_iterations=1000,    # Bootstrap samples
    significance_threshold=0.05,  # p-value threshold
    compute_effect_size=True,     # Include Cohen's d
)

MLflow Integration

from spark_llm_eval.tracking import create_tracker

tracker = create_tracker(
    experiment_name="llm-evaluations",
    run_name="gpt4-baseline",
)

with tracker.start_run():
    result = runner.run(data, task)
    tracker.log_results_summary({
        name: {"value": m.value, "ci_lower": m.confidence_interval[0], "ci_upper": m.confidence_interval[1]}
        for name, m in result.metrics.items()
    })

Development

# Clone repository
git clone https://github.com/bassrehab/spark-llm-eval.git
cd spark-llm-eval

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/unit -v

# Run linting
ruff check spark_llm_eval
black --check spark_llm_eval

Roadmap

Completed

  • Core framework with OpenAI support
  • Lexical metrics with statistical rigor
  • MLflow integration
  • Delta Lake dataset integration
  • Multi-provider support (Anthropic, Google Gemini)
  • Semantic metrics (BERTScore, embeddings)
  • LLM-as-judge
  • Agent evaluation (trajectories, tool use, multi-agent debate)
  • RAG evaluation metrics (context relevance, faithfulness, answer relevance, precision, recall)
  • Databricks notebook examples
  • PyPI package release
  • Response caching (Delta-backed) with replay mode

Planned

  • vLLM/local model support
  • Unity Catalog integration
  • Databricks Asset Bundle distribution

License

Apache License 2.0. See LICENSE for details.

Citation

If you use spark-llm-eval in your research, please cite:

@software{spark_llm_eval,
  author = {Mitra, Subhadip},
  title = {spark-llm-eval: Distributed LLM Evaluation Framework},
  year = {2025},
  url = {https://github.com/bassrehab/spark-llm-eval}
}
