aevyra-verdict
Benchmark any LLM against your data. Pick the best model, then make it better.
verdict runs your prompts across any combination of models, scores the responses with pluggable metrics, and gives you a side-by-side comparison — so you can choose the right model for your task, then track whether your prompt engineering or fine-tuning is actually moving the needle.
Use cases
Choosing the right model. Instead of guessing, run your actual prompts across GPT-5.4-mini, Claude Sonnet, Gemini, Llama — and pick the one that scores highest on your specific task.
Measuring improvement. Establish a baseline score, tweak your system prompt or fine-tune your model, re-run verdict. If the number goes up, your change helped. If it doesn't, you know to try something else.
Benchmarking open-source vs closed models. Measure how a local model stacks up against SOTA closed models on your workload — and identify exactly where the gap is.
Install
pip install aevyra-verdict
Provider SDKs are optional extras — install only what you need:
pip install aevyra-verdict[openai] # OpenAI + OpenRouter + local (Ollama/vLLM)
pip install aevyra-verdict[anthropic] # Anthropic
pip install aevyra-verdict[google] # Google Gemini
pip install aevyra-verdict[mistral] # Mistral
pip install aevyra-verdict[cohere] # Cohere
pip install aevyra-verdict[all] # everything
You only need API keys for the providers you actually use.
Quick start
# 1. Check which API keys are configured
aevyra-verdict providers
# 2. Compare models on a dataset and save results
aevyra-verdict run examples/sample_data.jsonl \
  -m openai/gpt-5.4-nano \
  -m openrouter/qwen/qwen3.5-9b \
  -o results.json
# 3. Compare two local Ollama models (no API key needed)
aevyra-verdict run examples/sample_data.jsonl \
-m local/llama3.1:8b \
-m local/mistral \
--base-url http://localhost:11434/v1 \
-o results.json
Or use the Python API directly:
from aevyra_verdict import Dataset, EvalRunner, RougeScore, LLMJudge
from aevyra_verdict.providers import get_provider
dataset = Dataset.from_jsonl("examples/sample_data.jsonl")
runner = EvalRunner()
runner.add_provider("openai", "gpt-5.4-nano")
runner.add_provider("openrouter", "qwen/qwen3.5-9b")
runner.add_metric(RougeScore())
runner.add_metric(LLMJudge(judge_provider=get_provider("openai", "gpt-5.4")))
results = runner.run(dataset)
print(results.compare())
Set your API keys as environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY,
GOOGLE_API_KEY, MISTRAL_API_KEY, COHERE_API_KEY) or pass them directly when
adding providers.
How it works
The framework has four layers that compose together:
Dataset reads JSONL files where each line has a messages array (OpenAI chat
format), an optional ideal reference answer, and optional metadata for filtering.
Providers wrap each LLM API behind a common interface. The OpenAI message format
is the canonical input — each provider translates it to whatever the underlying SDK
expects (Anthropic's separate system parameter, Gemini's contents format, etc.) and
normalizes the response back into a CompletionResult with text, usage stats, and
latency.
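As a rough illustration of that translation step, the sketch below splits an OpenAI-style message list into a separate system string plus the remaining turns, which is the shape an API with a dedicated system parameter expects. The helper name split_system_messages is hypothetical and not part of aevyra-verdict, and joining several system messages with blank lines is an assumption; the real provider code may behave differently.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def split_system_messages(messages: List[Message]) -> Tuple[str, List[Message]]:
    """Separate system prompts from conversational turns.

    Hypothetical helper for illustration only; not the library's code.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Assumption: multiple system messages are joined with blank lines.
    return "\n\n".join(system_parts), turns


system, turns = split_system_messages([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(system)  # You are a helpful assistant.
print(turns)
```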
Metrics score each response. Three families are supported:
- Reference-based (exact match, BLEU, ROUGE) — compare output against a known-good answer
- LLM-as-judge — use a separate model to evaluate quality on configurable criteria
- Custom — pass any Python function that returns a score
Runner ties it together: models and samples are dispatched concurrently via
thread pools. Rate-limit errors (HTTP 429) trigger exponential backoff with jitter
before retrying; fatal errors (auth failures, bad requests) are surfaced immediately
without burning retry budget. Results land in EvalResults.
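The two-level fan-out described for the runner can be sketched with the standard library alone. Everything here is an illustration under stated assumptions: FAKE_MODELS, run_model, and run_all are hypothetical names, the "models" are plain string functions rather than API calls, and the worker counts simply mirror the max_workers and max_model_workers settings documented under Configuration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Stand-in "models": plain functions mapping a prompt to a response.
# A real provider would call an LLM API here.
FAKE_MODELS: Dict[str, Callable[[str], str]] = {
    "upper": str.upper,
    "reverse": lambda prompt: prompt[::-1],
}


def run_model(complete: Callable[[str], str], prompts: List[str],
              max_workers: int) -> List[str]:
    """Send all prompts to one model concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(complete, prompts))


def run_all(prompts: List[str], max_workers: int = 10,
            max_model_workers: int = 4) -> Dict[str, List[str]]:
    """Evaluate every model concurrently; each model fans out over prompts."""
    with ThreadPoolExecutor(max_workers=max_model_workers) as pool:
        futures = {
            name: pool.submit(run_model, fn, prompts, max_workers)
            for name, fn in FAKE_MODELS.items()
        }
        return {name: future.result() for name, future in futures.items()}


responses = run_all(["hello", "abc"])
print(responses)
```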
flowchart LR
DS[Dataset]:::data
R[EvalRunner]:::model
M[Metrics]:::metric
OUT[Results]:::output
DS --> R --> M --> OUT
classDef data fill:#6E3FF3,color:#fff,stroke:none
classDef model fill:#9B6BFF,color:#fff,stroke:none
classDef metric fill:#3FBFFF,color:#fff,stroke:none
classDef output fill:#2ECC71,color:#fff,stroke:none
Usage
Dataset format
Four formats are supported: CSV, plus three JSONL layouts (OpenAI, ShareGPT, and Alpaca).
CSV — simplest format for tabular data. Column names default to input and ideal:
dataset = Dataset.from_csv("data.csv") # input + ideal columns
dataset = Dataset.from_csv("data.csv", input_field="article", output_field="summary") # custom columns
dataset = Dataset.from_csv("data.csv", output_field=None) # label-free
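To make the column mapping concrete, here is a minimal standard-library sketch of how CSV rows could become chat-style samples. The function rows_to_samples is hypothetical, and treating each input cell as a single user message is an assumption about what Dataset.from_csv does internally.

```python
import csv
import io
from typing import Dict, List, Optional


def rows_to_samples(csv_text: str, input_field: str = "input",
                    output_field: Optional[str] = "ideal") -> List[Dict]:
    """Turn CSV rows into chat-style samples (illustrative sketch only)."""
    samples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: the input cell becomes one user message.
        sample = {"messages": [{"role": "user", "content": row[input_field]}]}
        # With output_field=None the sample is label-free: no "ideal" key.
        if output_field is not None:
            sample["ideal"] = row[output_field]
        samples.append(sample)
    return samples


csv_text = "article,summary\nLong text here,Short text\n"
labeled = rows_to_samples(csv_text, input_field="article", output_field="summary")
label_free = rows_to_samples(csv_text, input_field="article", output_field=None)
print(labeled)
print(label_free)
```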
For JSONL, the format is auto-detected from the first record.
OpenAI (native):
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"ideal": "The capital of France is Paris.",
"metadata": {"category": "factual", "difficulty": "easy"}
}
ShareGPT (common HuggingFace fine-tuning format):
{
"conversations": [
{"from": "human", "value": "What is the capital of France?"},
{"from": "gpt", "value": "The capital of France is Paris."}
]
}
Alpaca (instruction-following datasets):
{
"instruction": "Translate to French.",
"input": "Hello, how are you?",
"output": "Bonjour, comment allez-vous?"
}
One of messages, conversations, or instruction is required, depending on the format.
ideal and metadata are optional (for ShareGPT and Alpaca they are extracted
automatically). Pass format= explicitly to override auto-detection:
dataset = Dataset.from_jsonl("sharegpt_data.jsonl", format="sharegpt")
dataset = Dataset.from_jsonl("alpaca_data.jsonl", format="alpaca")
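Auto-detection and normalization could work roughly as in the sketch below, which keys off the top-level field names shown in the three examples. The names detect_format, normalize, and ROLE_MAP are hypothetical; treating a trailing "gpt" turn as the reference answer and joining Alpaca's instruction and input with a blank line are assumptions, not confirmed library behavior.

```python
from typing import Any, Dict

ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def detect_format(record: Dict[str, Any]) -> str:
    """Guess the JSONL layout from the keys of a record."""
    if "messages" in record:
        return "openai"
    if "conversations" in record:
        return "sharegpt"
    if "instruction" in record:
        return "alpaca"
    raise ValueError("record needs messages, conversations, or instruction")


def normalize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a record to {"messages": [...], "ideal": ...} (sketch)."""
    fmt = detect_format(record)
    if fmt == "openai":
        return {"messages": record["messages"], "ideal": record.get("ideal")}
    if fmt == "sharegpt":
        turns = [{"role": ROLE_MAP.get(t["from"], t["from"]), "content": t["value"]}
                 for t in record["conversations"]]
        # Assumption: a trailing assistant turn is the reference answer.
        ideal = None
        if turns and turns[-1]["role"] == "assistant":
            ideal = turns.pop()["content"]
        return {"messages": turns, "ideal": ideal}
    # Alpaca. Assumption: instruction and input are joined by a blank line.
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n\n" + record["input"]
    return {"messages": [{"role": "user", "content": prompt}],
            "ideal": record.get("output")}


print(normalize({"instruction": "Translate to French.",
                 "input": "Hello, how are you?",
                 "output": "Bonjour, comment allez-vous?"}))
```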
You can also create datasets inline:
dataset = Dataset.from_list([
{"messages": [{"role": "user", "content": "Hello"}], "ideal": "Hi there"},
])
Filter by metadata fields:
hard_questions = dataset.filter(difficulty="hard", category="reasoning")
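The filter call above presumably keeps only samples whose metadata matches every keyword given. A small sketch of that behavior on plain dictionaries follows; filter_samples is a hypothetical stand-in, and the all-criteria-must-match (AND) semantics are an assumption.

```python
from typing import Any, Dict, List


def filter_samples(samples: List[Dict[str, Any]], **criteria: Any) -> List[Dict[str, Any]]:
    """Keep samples whose metadata matches every keyword argument."""
    return [
        s for s in samples
        if all(s.get("metadata", {}).get(k) == v for k, v in criteria.items())
    ]


samples = [
    {"id": 1, "metadata": {"difficulty": "hard", "category": "reasoning"}},
    {"id": 2, "metadata": {"difficulty": "easy", "category": "reasoning"}},
    {"id": 3},  # no metadata at all
]
hard = filter_samples(samples, difficulty="hard", category="reasoning")
print([s["id"] for s in hard])  # [1]
```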
Providers
Five providers are built in:
from aevyra_verdict.providers import get_provider, list_providers
print(list_providers())
# ['anthropic', 'cohere', 'google', 'mistral', 'openai']
# Each provider takes a model name and optional api_key / base_url
provider = get_provider("openai", "gpt-5.4-nano", api_key="sk-...")
result = provider.complete([{"role": "user", "content": "Hello"}])
print(result.text, result.latency_ms, result.usage)
The OpenAI provider works with any OpenAI-compatible API (Azure, Together, vLLM,
etc.) by passing a base_url.
To add a custom provider, subclass Provider and register it:
from aevyra_verdict.providers import Provider, register_provider
class MyProvider(Provider):
    name = "my_provider"

    def complete(self, messages, temperature=0.0, max_tokens=1024, **kwargs):
        # your implementation
        ...

register_provider("my_provider", MyProvider)
Metrics
Reference-based (requires ideal answers in the dataset):
from aevyra_verdict import ExactMatch, BleuScore, RougeScore
ExactMatch() # case-insensitive by default
ExactMatch(case_sensitive=True)
BleuScore(max_ngram=4)
RougeScore(variant="rougeL") # also "rouge1", "rouge2"
Using these on a dataset without ideal answers raises a ValueError upfront — see Label-free evaluation below.
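For intuition about what reference-based metrics compute, here is a simplified standalone sketch of exact match and a ROUGE-L style F1 based on the longest common subsequence of tokens. It is not the library's implementation: the function names are hypothetical, and the whitespace trimming, lowercase whitespace tokenization, and lack of stemming are simplifying assumptions.

```python
from typing import List


def exact_match(response: str, ideal: str, case_sensitive: bool = False) -> float:
    """1.0 if the strings match after trimming whitespace, else 0.0."""
    a, b = response.strip(), ideal.strip()
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    return 1.0 if a == b else 0.0


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(response: str, ideal: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (simplified)."""
    cand, ref = response.lower().split(), ideal.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("paris", "Paris"))  # 1.0
print(round(rouge_l_f1("Paris is the capital of France",
                       "The capital of France is Paris."), 3))  # 0.667
```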
LLM-as-judge (works with or without ideal):
from aevyra_verdict import LLMJudge
from aevyra_verdict.providers import get_provider
judge = get_provider("openai", "gpt-5.4")
LLMJudge(judge_provider=judge)
LLMJudge(judge_provider=judge, criteria="Focus only on factual accuracy.")
The judge scores on a 1–5 scale (normalized to 0.0–1.0) and returns its reasoning in ScoreResult.reasoning.
Score across multiple dimensions in a single API call:
LLMJudge(
judge_provider=judge,
dimensions=["clarity", "accuracy", "conciseness"],
)
# result.score → mean across all dimensions (0.0–1.0)
# result.sub_scores → {"clarity": 0.8, "accuracy": 0.6, "conciseness": 1.0}
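The overall score in the comment above is the plain mean of the per-dimension scores. As a quick check of that arithmetic, using a hypothetical helper named overall_score:

```python
from statistics import mean
from typing import Dict


def overall_score(sub_scores: Dict[str, float]) -> float:
    """Average the per-dimension scores into a single score."""
    if not sub_scores:
        raise ValueError("at least one dimension is required")
    return mean(sub_scores.values())


sub_scores = {"clarity": 0.8, "accuracy": 0.6, "conciseness": 1.0}
print(round(overall_score(sub_scores), 3))  # 0.8
```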
Custom metrics:
from aevyra_verdict import CustomMetric
def word_count_score(response, ideal=None, **kwargs):
    # Fraction of a 100-word target, capped at 1.0
    return min(len(response.split()) / 100, 1.0)
CustomMetric("word_count", word_count_score)
Custom functions return either a float or a dict with at least a "score" key
(optionally "reasoning" and any other details).
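A dict-returning metric might look like the sketch below. The function length_budget_score and its 50-word budget are hypothetical and chosen only to show the return shape: a "score" key, an optional "reasoning" string, and one extra detail.

```python
from typing import Any, Dict, Optional


def length_budget_score(response: str, ideal: Optional[str] = None,
                        **kwargs: Any) -> Dict[str, Any]:
    """Score 1.0 for responses within a word budget, decaying beyond it.

    Hypothetical metric for illustration; the 50-word budget is arbitrary.
    """
    budget = 50
    words = len(response.split())
    score = 1.0 if words <= budget else budget / words
    return {
        "score": score,
        "reasoning": f"{words} words against a budget of {budget}",
        "word_count": words,  # extra detail, allowed alongside "score"
    }


print(length_budget_score("Paris is the capital of France."))
```

Following the constructor form shown above, it would be wrapped as CustomMetric("length_budget", length_budget_score).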
Label-free evaluation
When you have no reference answers, use LLMJudge (or a CustomMetric) instead of reference-based metrics. The runner validates this upfront and gives a clear error if you accidentally pair a label-free dataset with ExactMatch, BleuScore, or RougeScore.
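from aevyra_verdict import Dataset, EvalRunner, LLMJudge
from aevyra_verdict.providers import get_provider
# Dataset with no ideal answers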
dataset = Dataset.from_jsonl("questions.jsonl")
print(dataset.has_ideals()) # False
judge = get_provider("openai", "gpt-5.4")
runner = EvalRunner()
runner.add_provider("openai", "gpt-5.4-nano")
runner.add_metric(LLMJudge(judge_provider=judge))
results = runner.run(dataset) # works fine — no labels needed
See examples/label_free_eval.py for a complete working example.
CLI
Once the package is installed (pip install aevyra-verdict, or pip install -e . from a source checkout), the aevyra-verdict command is available.
Inspect a dataset
Preview a dataset before running — shows sample count, whether ideals are present, and the first sample. No API calls made.
aevyra-verdict inspect examples/sample_data.jsonl
Check configured providers
List all available providers and whether their API keys are set:
aevyra-verdict providers
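The same check can be approximated in a few lines of Python when you want it inside a script. This is only a sketch: the environment variable names are the ones listed in the Quick start, while KEY_VARS and configured_providers are hypothetical names, and the real command may inspect more than the environment.

```python
import os
from typing import Dict, Mapping

# Environment variable names as listed in this README.
KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "cohere": "COHERE_API_KEY",
}


def configured_providers(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which providers have a non-empty API key in the environment."""
    return {name: bool(env.get(var)) for name, var in KEY_VARS.items()}


for name, ready in configured_providers().items():
    print(f"{name:<10} {'set' if ready else 'missing'}")
```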
Specifying models
Pass --model (or -m) once per model, in provider/model format:
aevyra-verdict run examples/sample_data.jsonl \
  -m openai/gpt-5.4-nano \
  -m openrouter/qwen/qwen3.5-9b \
  -m google/gemini-2.0-flash
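Because some model names contain slashes themselves, a provider/model string is most naturally split on the first slash only. The sketch below shows that idea; parse_model_spec is a hypothetical helper, and whether the CLI parses its arguments exactly this way is an assumption.

```python
from typing import Tuple


def parse_model_spec(spec: str) -> Tuple[str, str]:
    """Split "provider/model" on the first slash only."""
    provider, sep, model = spec.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected provider/model, got {spec!r}")
    return provider, model


print(parse_model_spec("openai/gpt-5.4-nano"))         # ('openai', 'gpt-5.4-nano')
print(parse_model_spec("local/llama3.1:8b"))           # ('local', 'llama3.1:8b')
print(parse_model_spec("openrouter/qwen/qwen3.5-9b"))  # ('openrouter', 'qwen/qwen3.5-9b')
```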
For more than a couple of models, or when you want to reuse a configuration, use a config file instead:
aevyra-verdict run examples/sample_data.jsonl --config models.yaml
The config file supports JSON, YAML, and TOML. Each model entry takes provider and model, with optional label, api_key, and base_url:
# models.yaml
models:
  - provider: openai
    model: gpt-5.4-nano
    label: gpt-5.4-nano
  - provider: openrouter
    model: qwen/qwen3.5-9b
    label: qwen3.5-9b
  # Local vLLM instance (uses the OpenAI-compatible API)
  - provider: openai
    model: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://localhost:8000/v1
    api_key: "none"
    label: llama-local
Start a local vLLM server with: vllm serve meta-llama/Llama-3.1-8B-Instruct
Specifying metrics
Use --metric for built-in options (rouge, bleu, exact) and repeat for multiple:
aevyra-verdict run examples/sample_data.jsonl -m openai/gpt-5.4-nano --metric rouge --metric bleu
Add an LLM-as-judge with --judge:
aevyra-verdict run examples/sample_data.jsonl -m openai/gpt-5.4-nano --judge openai/gpt-5.4
To customise the judge's evaluation criteria, pass a prompt template file. The recommended format is .md since judge prompts tend to have structure. Use {criteria}, {conversation}, {response}, and {ideal_section} as placeholders:
aevyra-verdict run examples/sample_data.jsonl -m openai/gpt-5.4-nano \
--judge openai/gpt-5.4 \
--judge-prompt examples/judge_prompt.md
examples/judge_prompt.md is a copy of the default template — a good starting point.
To use a custom Python scoring function, point at a file and name the function:
aevyra-verdict run examples/sample_data.jsonl -m openai/gpt-5.4-nano \
--custom-metric examples/custom_metrics.py:brevity_score \
--custom-metric examples/custom_metrics.py:contains_code
The function receives (response, ideal=None, messages=None) and returns either a float (0.0–1.0) or a dict with a "score" key and optional "reasoning". See examples/custom_metrics.py for three working examples.
Save results to JSON with -o:
aevyra-verdict run examples/sample_data.jsonl --config models.yaml -o results.json
Results
results = runner.run(dataset)
# Formatted comparison table
print(results.compare("rouge_rougeL"))
# Summary dict
results.summary()
# Pandas DataFrame
df = results.to_dataframe()
# Export to JSON
results.to_json("eval_results.json")
Configuration
from aevyra_verdict.runner import RunConfig
config = RunConfig(
    temperature=0.0,        # deterministic by default
    max_tokens=1024,
    # Concurrency
    max_workers=10,         # concurrent requests per model
    max_model_workers=4,    # models evaluated concurrently
    # Retries and rate-limit handling
    num_retries=4,          # attempts after the first failure
    retry_base_delay=1.0,   # seconds before the first retry (doubles each attempt)
    retry_max_delay=60.0,   # backoff cap in seconds
    retry_jitter=0.25,      # ±25% random jitter to avoid thundering-herd retries
)
runner = EvalRunner(config=config)
Rate-limit errors (HTTP 429 / RateLimitError) always sleep through the backoff
before retrying. Auth and bad-request errors are surfaced immediately — no point
retrying a 401. If you're consistently hitting rate limits, the first thing to try
is lowering max_workers.
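To see how the retry settings interact, the sketch below computes the sleep before each retry from a base delay, a cap, and a jitter fraction. The function backoff_delay is hypothetical, and applying jitter after the cap is an assumption; the runner's actual formula may differ in detail.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0,
                  jitter: float = 0.25, rng: Optional[random.Random] = None) -> float:
    """Seconds to sleep before retry number `attempt` (0 = first retry)."""
    rng = rng or random
    # Double the delay on each attempt, but never exceed the cap.
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Assumption: jitter scales the capped delay by a factor in [1 - jitter, 1 + jitter].
    return delay * rng.uniform(1.0 - jitter, 1.0 + jitter)


# Nominal schedule (jitter disabled) for four retries with the defaults above.
schedule = [backoff_delay(i, jitter=0.0) for i in range(4)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0]
```

With these defaults, four retries add roughly 15 seconds of waiting per request in the worst case, before jitter.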
Contributing
Bug reports and PRs are welcome. Open an issue first for anything larger than a bug fix.
Adding a provider — subclass Provider in src/aevyra_verdict/providers/, implement
complete(), and register it with register_provider(). See openai_provider.py as the
reference implementation.
Adding a metric — subclass Metric in src/aevyra_verdict/metrics/, implement
score(), and add it to the exports in metrics/__init__.py. If your metric requires
a reference answer, set requires_ideal = True on the class — the runner will then
raise a clear error when it's used on a label-free dataset. See reference.py for
reference-based metrics and judge.py for LLM-as-judge.
License
Apache 2.0