
🔬 LLMBench

LLM evaluation harness built from scratch.

LLMBench runs structured, reproducible evaluations across four metric families (lexical, semantic, LLM-as-judge, and calibration), stores every run, and diffs any two runs to surface regressions. Plug in any task type or metric in under 10 lines.


Why LLMBench?

Most teams swap models or update prompts and then eyeball outputs to decide if things got better. LLMBench gives you:

  • Four metric families: not just ROUGE, but semantic similarity, LLM-as-judge scoring, and calibration (ECE)
  • Regression tracking: diff any two run IDs, get a structured report of what regressed
  • Full run history: every run stored with config, scores, and per-sample outputs
  • Plugin architecture: register custom tasks and metrics in 10 lines
  • Any provider: OpenAI, Groq, Anthropic, vLLM, Ollama via one unified interface
  • FastAPI + CLI + Dashboard: use it however you want

Quickstart

pip install llm-benchkit   # or, from a source checkout: pip install -e .

# Set your provider API key
export GROQ_API_KEY=your_key_here

# Run an eval
llmbench run tasks/sample_qa.json \
    --task open_qa \
    --model groq/llama-3.3-70b-versatile \
    --metrics exact_match,f1,rouge_l

# List all stored runs
llmbench list

# Compare two runs (regression check)
llmbench compare <baseline_run_id> <candidate_run_id>

# Launch dashboard
streamlit run dashboard/app.py
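The dataset file is plain JSON. A minimal sketch of what a file like tasks/sample_qa.json could contain (the records are illustrative; the question/answer keys rely on the field aliases described under Supported Datasets below):

[
  {"question": "What is the capital of France?", "answer": "Paris"},
  {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}
]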

Python API

from llmbench import build_runner, loader, ModelConfig

# Load a dataset (JSON, CSV, HF Hub, or callable)
dataset = loader.load("squad", task_type="open_qa", preset="squad", max_samples=100)

# Build a runner
runner = build_runner(
    ModelConfig(provider="groq", model_id="llama-3.3-70b-versatile"),
    judge_config=ModelConfig(provider="groq", model_id="llama-3.3-70b-versatile"),
)

# Run the eval
result = runner.run(dataset, metrics=["exact_match", "f1", "bertscore", "llm_relevance"])
print(result.aggregate_scores)

# Save and compare
from llmbench.store.db import store, tracker
store.save(result)

# Later: compare to a previous run
report = tracker.compare(baseline_run_id, result.run_id, threshold=0.02)
print(report["regressions"])   # metrics that dropped > 2%
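Since report["regressions"] lists the metrics that dropped, gating a pipeline on it takes a few lines (a sketch assuming the report structure shown above):

if report["regressions"]:
    print(f"Regressed metrics: {report['regressions']}")
    raise SystemExit(1)  # non-zero exit so the regression blocks the change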

Metric Reference

Metric                     Family       Notes
exact_match                Lexical      Case/punctuation normalised
f1                         Lexical      Token-level F1
rouge_1, rouge_2, rouge_l  Lexical      ROUGE variants
bleu                       Lexical      Corpus BLEU
bertscore                  Semantic     BERTScore F1
cosine_similarity          Semantic     all-MiniLM-L6-v2
llm_faithfulness           LLM-judge    Requires context
llm_relevance              LLM-judge    Question vs output
llm_coherence              LLM-judge    Fluency + structure
llm_code_quality           LLM-judge    Code correctness
ece                        Calibration  Requires confidence scores
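The ece row needs per-sample confidence scores. For intuition, expected calibration error bins predictions by stated confidence and compares each bin's average confidence against its accuracy; a generic sketch of the standard formula (not necessarily how LLMBench's ece metric is implemented):

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |bin accuracy - bin confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        # Bins are (lo, hi]; the first bin also catches confidence == 0.0
        mask = (confidences > edges[i]) & (confidences <= edges[i + 1])
        if i == 0:
            mask |= confidences == 0.0
        if mask.any():
            # mask.mean() is the bin's sample fraction, i.e. its weight n_b / N
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Two correct answers at high confidence, one wrong answer at 0.6
print(expected_calibration_error([0.9, 0.8, 0.6], [1, 1, 0]))  # ≈ 0.3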

Register a custom metric

from llmbench.core.registry import registry

def my_length_metric(results, **_):
    avg_len = sum(len(r.generated_output.split()) for r in results) / len(results)
    return {"avg_output_length": avg_len}

registry.register_metric(
    "avg_output_length",
    "Average word count of generated outputs",
    fn=my_length_metric,
    requires_expected=False,
)
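Once registered, the metric is addressable by name alongside the built-ins. A usage sketch, continuing from the Python API section (it assumes the returned dict's keys surface in result.aggregate_scores):

result = runner.run(dataset, metrics=["exact_match", "avg_output_length"])
print(result.aggregate_scores["avg_output_length"])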

Supported Datasets

Source                Example
Local JSON/JSONL      loader.load("data.json", task_type="open_qa")
Local CSV/TSV         loader.load("data.csv", task_type="open_qa")
HF Hub (preset)       loader.load("squad", task_type="open_qa", preset="squad")
HF Hub (custom cols)  loader.load("my/repo", task_type="open_qa", input_col="q", output_col="a")
Python callable       loader.load(my_generator_fn, task_type="open_qa")

Field aliases are resolved automatically: question/query/prompt → input, answer/label/target → expected_output.
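A Python callable source can be any function that returns or yields such records; a minimal sketch (the generator contract is an assumption based on the alias rules above):

def my_generator_fn():
    # "question"/"answer" resolve to input/expected_output via the aliases above
    yield {"question": "What is 2 + 2?", "answer": "4"}
    yield {"question": "Capital of Japan?", "answer": "Tokyo"}

dataset = loader.load(my_generator_fn, task_type="open_qa")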


Supported Providers

Provider   Slug format                          Notes
Groq       groq/llama-3.3-70b-versatile         Recommended (fast + free tier)
OpenAI     openai/gpt-4o-mini
Anthropic  anthropic/claude-3-5-haiku-20241022
vLLM       vllm/my-model                        Set base_url in extra_params
Ollama     ollama/llama3                        Same as vLLM
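For the self-hosted backends, the server address goes through extra_params; a sketch (the base_url key comes from the table above, while the URL and model name are placeholders for your own deployment):

from llmbench import ModelConfig

vllm_config = ModelConfig(
    provider="vllm",
    model_id="my-model",
    extra_params={"base_url": "http://localhost:8000/v1"},  # your vLLM/Ollama endpoint
)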

CLI Reference

llmbench run      <dataset> --task --model [--judge] [--metrics] [--max-samples] [--tag key=val]
llmbench list     [--dataset] [--model] [--task] [--limit]
llmbench compare  <baseline_id> <candidate_id> [--threshold]
llmbench show     <run_id> [--samples] [--top]
llmbench metrics  # list all registered metrics
llmbench tasks    # list all registered task types
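A typical regression-gating workflow chains run and compare (run IDs below are placeholders; the 0.02 threshold mirrors the Python API example above):

# Evaluate a candidate and tag it for later lookup
llmbench run tasks/sample_qa.json --task open_qa \
    --model groq/llama-3.3-70b-versatile \
    --metrics exact_match,f1 --tag experiment=prompt_v2

# Report metrics that dropped more than 2% vs the baseline
llmbench compare <baseline_run_id> <candidate_run_id> --threshold 0.02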

Project Structure

llmbench/
├── llmbench/
│   ├── core/
│   │   ├── schema.py       # EvalSample, EvalDataset, RunResult, ModelConfig
│   │   ├── registry.py     # Task + metric plugin registry
│   │   └── runner.py       # Async batch eval runner
│   ├── loaders/            # JSON, CSV, HF Hub, callable loaders
│   ├── providers/          # OpenAI, Groq, vLLM provider abstraction
│   ├── metrics/            # Lexical, semantic, LLM-judge, calibration
│   ├── store/              # SQLAlchemy results store + regression tracker
│   └── api/                # FastAPI REST + Typer CLI
├── dashboard/              # Streamlit dashboard
├── tasks/                  # Built-in task YAML configs + sample datasets
└── tests/                  # Pytest unit tests

Running Tests

pip install pytest
pytest tests/ -v

All tests run without API keys; providers are mocked at the metric layer.


Roadmap

  • Confidence score extraction from log-probs (for ECE on open-source models)
  • HTML report export (Jinja2 template)
  • GitHub Actions workflow for CI eval gating
  • Multi-turn conversation eval support
  • Async FastAPI REST endpoint
