LLM evaluation harness with custom metrics, LLM-as-judge, and regression tracking
🔬 LLMBench
LLM evaluation harness built from scratch.
LLMBench runs structured, reproducible evaluations across four metric families (lexical, semantic, LLM-as-judge, and calibration), stores every run, and diffs any two runs to surface regressions. Plug in any task type or metric in under 10 lines.
Why LLMBench?
Most teams swap models or update prompts and then eyeball outputs to decide if things got better. LLMBench gives you:
- Four metric families: not just ROUGE, but also semantic similarity, LLM-as-judge scoring, and calibration (ECE)
- Regression tracking: diff any two run IDs and get a structured report of what regressed
- Full run history: every run stored with config, scores, and per-sample outputs
- Plugin architecture: register custom tasks and metrics in 10 lines
- Any provider: OpenAI, Groq, Anthropic, vLLM, Ollama via one unified interface
- FastAPI + CLI + Dashboard: use it however you want
Quickstart
```bash
pip install -e .

# Set your provider API key
export GROQ_API_KEY=your_key_here

# Run an eval
llmbench run tasks/sample_qa.json \
  --task open_qa \
  --model groq/llama-3.3-70b-versatile \
  --metrics exact_match,f1,rouge_l

# List all stored runs
llmbench list

# Compare two runs (regression check)
llmbench compare <baseline_run_id> <candidate_run_id>

# Launch dashboard
streamlit run dashboard/app.py
```
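The quickstart points `llmbench run` at a local JSON dataset. As a rough sketch, such a file could be produced like this; the `question`/`answer` field names lean on the alias rules described under Supported Datasets below, and this is not the bundled `tasks/sample_qa.json`.

```python
import json

# Hypothetical two-sample QA dataset. "question"/"answer" are aliased field
# names (resolved to input/expected_output by the loaders); illustrative only.
samples = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]

with open("my_qa.json", "w") as f:
    json.dump(samples, f, indent=2)
```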
Python API
```python
from llmbench import build_runner, loader, ModelConfig

# Load a dataset (JSON, CSV, HF Hub, or callable)
dataset = loader.load("squad", task_type="open_qa", max_samples=100)

# Build a runner
runner = build_runner(
    ModelConfig(provider="groq", model_id="llama-3.3-70b-versatile"),
    judge_config=ModelConfig(provider="groq", model_id="llama-3.3-70b-versatile"),
)

# Run the eval
result = runner.run(dataset, metrics=["exact_match", "f1", "bertscore", "llm_relevance"])
print(result.aggregate_scores)

# Save and compare
from llmbench.store.db import store, tracker

store.save(result)

# Later: compare to a previous run
report = tracker.compare(baseline_run_id, result.run_id, threshold=0.02)
print(report["regressions"])  # metrics that dropped > 2%
```
Metric Reference
| Metric | Family | Notes |
|---|---|---|
| `exact_match` | Lexical | Case/punct normalised |
| `f1` | Lexical | Token-level F1 |
| `rouge_1`, `rouge_2`, `rouge_l` | Lexical | ROUGE variants |
| `bleu` | Lexical | Corpus BLEU |
| `bertscore` | Semantic | BERTScore F1 |
| `cosine_similarity` | Semantic | all-MiniLM-L6-v2 |
| `llm_faithfulness` | LLM-judge | Requires context |
| `llm_relevance` | LLM-judge | Question vs output |
| `llm_coherence` | LLM-judge | Fluency + structure |
| `llm_code_quality` | LLM-judge | Code correctness |
| `ece` | Calibration | Requires confidence scores |
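For reference, here is a minimal, standalone sketch of how expected calibration error is commonly computed: samples are bucketed by confidence and the gap between mean confidence and accuracy is averaged, weighted by bucket size. The bin scheme and function below are illustrative, not LLMBench's internal `ece` implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average |mean confidence - accuracy| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's sample weight
    return ece

print(expected_calibration_error([0.9, 0.6, 0.8, 0.3], [1, 0, 1, 0]))
```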
Register a custom metric
```python
from llmbench.core.registry import registry

def my_length_metric(results, **_):
    avg_len = sum(len(r.generated_output.split()) for r in results) / len(results)
    return {"avg_output_length": avg_len}

registry.register_metric(
    "avg_output_length",
    "Average word count of generated outputs",
    fn=my_length_metric,
    requires_expected=False,
)
```
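Once registered, the metric can be requested by name like any built-in one. A usage sketch, assuming the `runner` and `dataset` from the Python API example above and that `aggregate_scores` is keyed by metric name:

```python
# Custom and built-in metrics mix freely in the same run (illustrative).
result = runner.run(dataset, metrics=["exact_match", "avg_output_length"])
print(result.aggregate_scores)  # includes "avg_output_length"
```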
Supported Datasets
| Source | Example |
|---|---|
| Local JSON/JSONL | loader.load("data.json", task_type="open_qa") |
| Local CSV/TSV | loader.load("data.csv", task_type="open_qa") |
| HF Hub (preset) | loader.load("squad", task_type="open_qa", preset="squad") |
| HF Hub (custom cols) | loader.load("my/repo", task_type="open_qa", input_col="q", output_col="a") |
| Python callable | loader.load(my_generator_fn, task_type="open_qa") |
Field aliases are resolved automatically: question/query/prompt → input, answer/label/target → expected_output.
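As a sketch of the callable source from the table above, a plain generator yielding dicts is enough; the field names rely on the alias resolution just described, and the exact dict shape the loader accepts is an assumption here.

```python
from llmbench import loader

def my_generator_fn():
    # "question"/"answer" are aliased to input/expected_output by the loader.
    yield {"question": "What is 2 + 2?", "answer": "4"}
    yield {"question": "Name the largest planet.", "answer": "Jupiter"}

dataset = loader.load(my_generator_fn, task_type="open_qa")
```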
Supported Providers
| Provider | Slug format | Notes |
|---|---|---|
| Groq | `groq/llama-3.3-70b-versatile` | Recommended (fast + free tier) |
| OpenAI | `openai/gpt-4o-mini` | |
| Anthropic | `anthropic/claude-3-5-haiku-20241022` | |
| vLLM | `vllm/my-model` | Set `base_url` in `extra_params` |
| Ollama | `ollama/llama3` | Same as vLLM |
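For self-hosted backends, the table notes that `base_url` goes in `extra_params`. A hedged sketch of what that might look like; the `base_url` key and the local URL below are assumptions based on that note, not verified defaults.

```python
from llmbench import ModelConfig, build_runner

# Point the vLLM provider at a local OpenAI-compatible server (illustrative).
local_model = ModelConfig(
    provider="vllm",
    model_id="my-model",
    extra_params={"base_url": "http://localhost:8000/v1"},
)
runner = build_runner(local_model)
```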
CLI Reference
```text
llmbench run <dataset> --task --model [--judge] [--metrics] [--max-samples] [--tag key=val]
llmbench list [--dataset] [--model] [--task] [--limit]
llmbench compare <baseline_id> <candidate_id> [--threshold]
llmbench show <run_id> [--samples] [--top]
llmbench metrics    # List all registered metrics
llmbench tasks      # List all registered task types
```
Project Structure
```text
llmbench/
├── llmbench/
│   ├── core/
│   │   ├── schema.py      # EvalSample, EvalDataset, RunResult, ModelConfig
│   │   ├── registry.py    # Task + metric plugin registry
│   │   └── runner.py      # Async batch eval runner
│   ├── loaders/           # JSON, CSV, HF Hub, callable loaders
│   ├── providers/         # OpenAI, Groq, vLLM provider abstraction
│   ├── metrics/           # Lexical, semantic, LLM-judge, calibration
│   ├── store/             # SQLAlchemy results store + regression tracker
│   └── api/               # FastAPI REST + Typer CLI
├── dashboard/             # Streamlit dashboard
├── tasks/                 # Built-in task YAML configs + sample datasets
└── tests/                 # Pytest unit tests
```
Running Tests
```bash
pip install pytest
pytest tests/ -v
```
All tests run without API keys; providers are mocked at the metric layer.
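A sketch of the kind of provider-free test this enables: because metrics operate on result objects, a test can construct lightweight stand-ins directly instead of calling any API. The stand-in objects and this test are illustrative rather than part of the shipped suite; the metric is the one from "Register a custom metric" above, copied here so the test is self-contained.

```python
from types import SimpleNamespace

def my_length_metric(results, **_):
    avg_len = sum(len(r.generated_output.split()) for r in results) / len(results)
    return {"avg_output_length": avg_len}

def test_avg_output_length_needs_no_api_key():
    results = [
        SimpleNamespace(generated_output="Paris"),                             # 1 word
        SimpleNamespace(generated_output="William Shakespeare wrote Hamlet"),  # 4 words
    ]
    assert my_length_metric(results) == {"avg_output_length": 2.5}
```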
Roadmap
- Confidence score extraction from log-probs (for ECE on open-source models)
- HTML report export (Jinja2 template)
- GitHub Actions workflow for CI eval gating
- Multi-turn conversation eval support
- Async FastAPI REST endpoint