LLM evaluation harness with custom metrics, LLM-as-judge, and regression tracking
🔬 LLMBench
LLM evaluation harness built from scratch.
LLMBench runs structured, reproducible evaluations across four metric families (lexical, semantic, LLM-as-judge, and calibration), stores every run, and diffs any two runs to surface regressions. Plug in any task type or metric in under 10 lines.
Why LLMBench?
Most teams swap models or update prompts and then eyeball outputs to decide if things got better. LLMBench gives you:
- Four metric families: not just ROUGE, but also semantic similarity, LLM-as-judge scoring, and calibration (ECE)
- Regression tracking: diff any two run IDs and get a structured report of what regressed
- Full run history: every run stored with config, scores, and per-sample outputs
- Plugin architecture: register custom tasks and metrics in 10 lines
- Any provider: OpenAI, Groq, Anthropic, vLLM, Ollama via one unified interface
- FastAPI + CLI + Dashboard: use it however you want
Quickstart
```bash
pip install -e .

# Set your provider API key
export GROQ_API_KEY=your_key_here

# Run an eval
llmbench run tasks/sample_qa.json \
  --task open_qa \
  --model groq/llama-3.3-70b-versatile \
  --metrics exact_match,f1,rouge_l

# List all stored runs
llmbench list

# Compare two runs (regression check)
llmbench compare <baseline_run_id> <candidate_run_id>

# Launch dashboard
streamlit run dashboard/app.py
```
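The quickstart points `llmbench run` at a local JSON dataset. As a rough sketch, such a file could be produced like this; the `question`/`answer` field names lean on the alias rules described under Supported Datasets below, and this is not the bundled `tasks/sample_qa.json`.

```python
import json

# Hypothetical two-sample QA dataset. "question"/"answer" are aliased field
# names (resolved to input/expected_output by the loaders); illustrative only.
samples = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]

with open("my_qa.json", "w") as f:
    json.dump(samples, f, indent=2)
```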
Python API
```python
from llmbench import build_runner, loader, ModelConfig

# Load a dataset (JSON, CSV, HF Hub, or callable)
dataset = loader.load("squad", task_type="open_qa", max_samples=100)

# Build a runner
runner = build_runner(
    ModelConfig(provider="groq", model_id="llama-3.3-70b-versatile"),
    judge_config=ModelConfig(provider="groq", model_id="llama-3.3-70b-versatile"),
)

# Run the eval
result = runner.run(dataset, metrics=["exact_match", "f1", "bertscore", "llm_relevance"])
print(result.aggregate_scores)

# Save and compare
from llmbench.store.db import store, tracker

store.save(result)

# Later: compare to a previous run
report = tracker.compare(baseline_run_id, result.run_id, threshold=0.02)
print(report["regressions"])  # metrics that dropped > 2%
```
Metric Reference
| Metric | Family | Notes |
|---|---|---|
| `exact_match` | Lexical | Case/punct normalised |
| `f1` | Lexical | Token-level F1 |
| `rouge_1`, `rouge_2`, `rouge_l` | Lexical | ROUGE variants |
| `bleu` | Lexical | Corpus BLEU |
| `bertscore` | Semantic | BERTScore F1 |
| `cosine_similarity` | Semantic | all-MiniLM-L6-v2 |
| `llm_faithfulness` | LLM-judge | Requires context |
| `llm_relevance` | LLM-judge | Question vs output |
| `llm_coherence` | LLM-judge | Fluency + structure |
| `llm_code_quality` | LLM-judge | Code correctness |
| `ece` | Calibration | Requires confidence scores |
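For reference, here is a minimal, standalone sketch of how expected calibration error is commonly computed: samples are bucketed by confidence and the gap between mean confidence and accuracy is averaged, weighted by bucket size. The bin scheme and function below are illustrative, not LLMBench's internal `ece` implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average |mean confidence - accuracy| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's sample weight
    return ece

print(expected_calibration_error([0.9, 0.6, 0.8, 0.3], [1, 0, 1, 0]))
```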
Register a custom metric
```python
from llmbench.core.registry import registry

def my_length_metric(results, **_):
    avg_len = sum(len(r.generated_output.split()) for r in results) / len(results)
    return {"avg_output_length": avg_len}

registry.register_metric(
    "avg_output_length",
    "Average word count of generated outputs",
    fn=my_length_metric,
    requires_expected=False,
)
```
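Once registered, the metric can be requested by name like any built-in one. A usage sketch, assuming the `runner` and `dataset` from the Python API example above and that `aggregate_scores` is keyed by metric name:

```python
# Custom and built-in metrics mix freely in the same run (illustrative).
result = runner.run(dataset, metrics=["exact_match", "avg_output_length"])
print(result.aggregate_scores)  # includes "avg_output_length"
```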
Supported Datasets
| Source | Example |
|---|---|
| Local JSON/JSONL | loader.load("data.json", task_type="open_qa") |
| Local CSV/TSV | loader.load("data.csv", task_type="open_qa") |
| HF Hub (preset) | loader.load("squad", task_type="open_qa", preset="squad") |
| HF Hub (custom cols) | loader.load("my/repo", task_type="open_qa", input_col="q", output_col="a") |
| Python callable | loader.load(my_generator_fn, task_type="open_qa") |
Field aliases are resolved automatically: question/query/prompt → input, answer/label/target → expected_output.
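As a sketch of the callable source from the table above, a plain generator yielding dicts is enough; the field names rely on the alias resolution just described, and the exact dict shape the loader accepts is an assumption here.

```python
from llmbench import loader

def my_generator_fn():
    # "question"/"answer" are aliased to input/expected_output by the loader.
    yield {"question": "What is 2 + 2?", "answer": "4"}
    yield {"question": "Name the largest planet.", "answer": "Jupiter"}

dataset = loader.load(my_generator_fn, task_type="open_qa")
```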
Supported Providers
| Provider | Slug format | Notes |
|---|---|---|
| Groq | `groq/llama-3.3-70b-versatile` | Recommended (fast + free tier) |
| OpenAI | `openai/gpt-4o-mini` | |
| Anthropic | `anthropic/claude-3-5-haiku-20241022` | |
| vLLM | `vllm/my-model` | Set `base_url` in `extra_params` |
| Ollama | `ollama/llama3` | Same as vLLM |
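For self-hosted backends, the table notes that `base_url` goes in `extra_params`. A hedged sketch of what that might look like; the `base_url` key and the local URL below are assumptions based on that note, not verified defaults.

```python
from llmbench import ModelConfig, build_runner

# Point the vLLM provider at a local OpenAI-compatible server (illustrative).
local_model = ModelConfig(
    provider="vllm",
    model_id="my-model",
    extra_params={"base_url": "http://localhost:8000/v1"},
)
runner = build_runner(local_model)
```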
CLI Reference
```text
llmbench run <dataset> --task --model [--judge] [--metrics] [--max-samples] [--tag key=val]
llmbench list [--dataset] [--model] [--task] [--limit]
llmbench compare <baseline_id> <candidate_id> [--threshold]
llmbench show <run_id> [--samples] [--top]
llmbench metrics    # List all registered metrics
llmbench tasks      # List all registered task types
```
Project Structure
```text
llmbench/
├── llmbench/
│   ├── core/
│   │   ├── schema.py      # EvalSample, EvalDataset, RunResult, ModelConfig
│   │   ├── registry.py    # Task + metric plugin registry
│   │   └── runner.py      # Async batch eval runner
│   ├── loaders/           # JSON, CSV, HF Hub, callable loaders
│   ├── providers/         # OpenAI, Groq, vLLM provider abstraction
│   ├── metrics/           # Lexical, semantic, LLM-judge, calibration
│   ├── store/             # SQLAlchemy results store + regression tracker
│   └── api/               # FastAPI REST + Typer CLI
├── dashboard/             # Streamlit dashboard
├── tasks/                 # Built-in task YAML configs + sample datasets
└── tests/                 # Pytest unit tests
```
Running Tests
```bash
pip install pytest
pytest tests/ -v
```
All tests run without API keys; providers are mocked at the metric layer.
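A sketch of the kind of provider-free test this enables: because metrics operate on result objects, a test can construct lightweight stand-ins directly instead of calling any API. The stand-in objects and this test are illustrative rather than part of the shipped suite; the metric is the one from "Register a custom metric" above, copied here so the test is self-contained.

```python
from types import SimpleNamespace

def my_length_metric(results, **_):
    avg_len = sum(len(r.generated_output.split()) for r in results) / len(results)
    return {"avg_output_length": avg_len}

def test_avg_output_length_needs_no_api_key():
    results = [
        SimpleNamespace(generated_output="Paris"),                             # 1 word
        SimpleNamespace(generated_output="William Shakespeare wrote Hamlet"),  # 4 words
    ]
    assert my_length_metric(results) == {"avg_output_length": 2.5}
```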
Roadmap
- Confidence score extraction from log-probs (for ECE on open-source models)
- HTML report export (Jinja2 template)
- GitHub Actions workflow for CI eval gating
- Multi-turn conversation eval support
- Async FastAPI REST endpoint