A framework for evaluating and comparing LLM outputs across models and providers

aevyra-verdict

A Python framework for evaluating and comparing LLM outputs across models and providers. Given a dataset of prompts (OpenAI, ShareGPT, or Alpaca format), it runs completions against any combination of models, scores the responses with pluggable metrics, and gives you structured results for comparison.

Use cases

Finding the best model for your use case. Instead of manually testing models one by one, run your actual prompts across all of them at once and get an objective, scored comparison.

Benchmarking an open-source model against closed ones. If you have a target OSS model in mind, measure how it performs against SOTA closed models on your specific workload — and identify exactly where the gap is so you know what to improve.

Install

pip install aevyra-verdict

Or, from a checkout of the source repository:

pip install -e .

This pulls in the SDKs for OpenAI, Anthropic, Google (Gemini), Mistral, and Cohere. You only need API keys for the providers you actually use.

Quick start

# 1. Check which API keys are configured
aevyra-verdict providers

# 2. Compare models on a dataset and save results
aevyra-verdict run dataset.jsonl \
  -m openai/gpt-4o \
  -m anthropic/claude-sonnet-4-20250514 \
  -o results.json

Or use the Python API directly:

from aevyra_verdict import Dataset, EvalRunner, RougeScore, LLMJudge
from aevyra_verdict.providers import get_provider

dataset = Dataset.from_jsonl("examples/sample_data.jsonl")

runner = EvalRunner()
runner.add_provider("openai", "gpt-4o")
runner.add_provider("anthropic", "claude-sonnet-4-20250514")
runner.add_metric(RougeScore())
runner.add_metric(LLMJudge(judge_provider=get_provider("openai", "gpt-4o-mini")))

results = runner.run(dataset)
print(results.compare())

Set your API keys as environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, MISTRAL_API_KEY, COHERE_API_KEY) or pass them directly when adding providers.

How it works

The framework has four layers that compose together:

Dataset reads JSONL files where each line has a messages array (OpenAI chat format), an optional ideal reference answer, and optional metadata for filtering.

Providers wrap each LLM API behind a common interface. The OpenAI message format is the canonical input — each provider translates it to whatever the underlying SDK expects (Anthropic's separate system parameter, Gemini's contents format, etc.) and normalizes the response back into a CompletionResult with text, usage stats, and latency.
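To make the translation step concrete, here is a minimal sketch of one such conversion: pulling system messages out of an OpenAI-style message list so they can be passed as a separate system string, which is the shape Anthropic's API expects. The helper name split_system is hypothetical, and this is not the library's actual implementation.

```python
def split_system(messages):
    """Separate system prompts from the conversational turns.

    Returns (system_text, turns); system_text is None when the list
    contains no system message. Hypothetical helper, for illustration.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    system_text = "\n\n".join(system_parts) if system_parts else None
    return system_text, turns


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
system_text, turns = split_system(messages)
print(system_text)  # You are a helpful assistant.
print(len(turns))   # 1
```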

Metrics score each response. Three families are supported:

Reference-based (exact match, BLEU, ROUGE) — compare output against a known-good answer
LLM-as-judge — use a separate model to evaluate quality on configurable criteria
Custom — pass any Python function that returns a score
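
As a rough illustration of what the reference-based family computes, the sketch below implements a case-insensitive exact match and a ROUGE-L F1 score over whitespace tokens. It is a simplified stand-in written for this example (no stemming, no punctuation handling) and may differ from the scores the built-in metrics produce.

```python
def exact_match(response, ideal, case_sensitive=False):
    """Return 1.0 when the stripped strings are equal, else 0.0."""
    a, b = response.strip(), ideal.strip()
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    return 1.0 if a == b else 0.0


def lcs_length(xs, ys):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(ys) + 1)
    for x in xs:
        curr = [0]
        for j, y in enumerate(ys, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(response, ideal):
    """ROUGE-L F1 over lowercased whitespace tokens."""
    r, i = response.lower().split(), ideal.lower().split()
    lcs = lcs_length(r, i)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(r), lcs / len(i)
    return 2 * precision * recall / (precision + recall)


print(exact_match("paris", "Paris"))  # 1.0
print(rouge_l_f1("Paris is the capital of France",
                 "The capital of France is Paris."))
```

In the second call the longest common subsequence is "the capital of france" (4 of 6 tokens on each side), so precision and recall are both 4/6.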

Runner ties it together: models and samples are dispatched concurrently via thread pools. Rate-limit errors (HTTP 429) trigger exponential backoff with jitter before retrying; fatal errors (auth failures, bad requests) are surfaced immediately without burning retry budget. Results land in EvalResults.

flowchart LR
    DS[(Dataset\nJSONL / ShareGPT / Alpaca)]:::data

    subgraph runner["⚡ EvalRunner (concurrent)"]
        direction TB
        MA["Model A"]:::model
        MB["Model B"]:::model
        MC["Model C"]:::model
    end

    subgraph metrics["📊 Metrics"]
        direction TB
        R["ROUGE / BLEU"]:::metric
        J["LLM Judge"]:::metric
        C["Custom Python fn"]:::metric
    end

    OUT["Results\ncomparison table · JSON export"]:::output

    DS --> runner
    runner --> metrics
    metrics --> OUT

    classDef data    fill:#6E3FF3,color:#fff,stroke:none
    classDef model   fill:#9B6BFF,color:#fff,stroke:none
    classDef metric  fill:#3FBFFF,color:#fff,stroke:none
    classDef output  fill:#2ECC71,color:#fff,stroke:none
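
The two levels of concurrency described above (across models, and across samples within a model) can be sketched with nested thread pools from the standard library. This is only an illustration under assumed names: fake_complete stands in for a real provider call, and run_model and run_all are hypothetical helpers, not part of the package.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_complete(model, prompt):
    """Stand-in for a provider call; a real one would hit an API."""
    return f"{model} answered: {prompt}"


def run_model(model, prompts, max_workers=10):
    """Run all prompts for one model concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: fake_complete(model, p), prompts))


def run_all(models, prompts, max_model_workers=4):
    """Evaluate several models concurrently; returns {model: [responses]}."""
    with ThreadPoolExecutor(max_workers=max_model_workers) as pool:
        futures = {m: pool.submit(run_model, m, prompts) for m in models}
        return {m: f.result() for m, f in futures.items()}


results = run_all(["model-a", "model-b"], ["Hello", "Bye"])
print(results["model-a"])
```

Threads suit this workload because the time is spent waiting on network responses rather than on CPU work.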

Usage

Dataset format

Three formats are supported. The format is auto-detected from the first record.

OpenAI (native):

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "ideal": "The capital of France is Paris.",
  "metadata": {"category": "factual", "difficulty": "easy"}
}

ShareGPT (common HuggingFace fine-tuning format):

{
  "conversations": [
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "The capital of France is Paris."}
  ]
}

Alpaca (instruction-following datasets):

{
  "instruction": "Translate to French.",
  "input": "Hello, how are you?",
  "output": "Bonjour, comment allez-vous?"
}

Exactly one of messages, conversations, or instruction is required, depending on the format. ideal and metadata are optional; for ShareGPT and Alpaca records they are extracted automatically. Pass format= explicitly to override auto-detection:

dataset = Dataset.from_jsonl("sharegpt_data.jsonl", format="sharegpt")
dataset = Dataset.from_jsonl("alpaca_data.jsonl", format="alpaca")
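
For intuition, auto-detection of this kind can be done by looking at which top-level key a record carries. The sketch below is an assumption about how such logic could look, not the package's code; detect_format and alpaca_to_messages are hypothetical names, and joining instruction and input with a blank line is a choice made for this example.

```python
def detect_format(record):
    """Guess the dataset format from the keys of one record."""
    if "messages" in record:
        return "openai"
    if "conversations" in record:
        return "sharegpt"
    if "instruction" in record:
        return "alpaca"
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")


def alpaca_to_messages(record):
    """Convert an Alpaca record into (messages, ideal)."""
    prompt = record["instruction"]
    if record.get("input"):
        # Assumption: instruction and input are joined by a blank line.
        prompt = f"{prompt}\n\n{record['input']}"
    return [{"role": "user", "content": prompt}], record.get("output")


alpaca = {
    "instruction": "Translate to French.",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous?",
}
print(detect_format(alpaca))  # alpaca
messages, ideal = alpaca_to_messages(alpaca)
print(messages[0]["content"])
print(ideal)
```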

You can also create datasets inline:

dataset = Dataset.from_list([
    {"messages": [{"role": "user", "content": "Hello"}], "ideal": "Hi there"},
])

Filter by metadata fields:

hard_questions = dataset.filter(difficulty="hard", category="reasoning")

Providers

Five providers are built in:

from aevyra_verdict.providers import get_provider, list_providers

print(list_providers())
# ['anthropic', 'cohere', 'google', 'mistral', 'openai']

# Each provider takes a model name and optional api_key / base_url
provider = get_provider("openai", "gpt-4o", api_key="sk-...")
result = provider.complete([{"role": "user", "content": "Hello"}])
print(result.text, result.latency_ms, result.usage)

The OpenAI provider works with any OpenAI-compatible API (Azure, Together, vLLM, etc.) by passing a base_url.

To add a custom provider, subclass Provider and register it:

from aevyra_verdict.providers import Provider, register_provider

class MyProvider(Provider):
    name = "my_provider"
    def complete(self, messages, temperature=0.0, max_tokens=1024, **kwargs):
        # your implementation
        ...

register_provider("my_provider", MyProvider)

Metrics

Reference-based (requires ideal in the dataset):

from aevyra_verdict import ExactMatch, BleuScore, RougeScore

ExactMatch()                        # case-insensitive by default
ExactMatch(case_sensitive=True)
BleuScore(max_ngram=4)
RougeScore(variant="rougeL")        # also "rouge1", "rouge2"

LLM-as-judge (works with or without ideal):

from aevyra_verdict import LLMJudge
from aevyra_verdict.providers import get_provider

judge = get_provider("anthropic", "claude-sonnet-4-20250514")
LLMJudge(judge_provider=judge)
LLMJudge(judge_provider=judge, criteria="Focus only on factual accuracy.")

The judge scores on a 1–5 scale (normalized to 0.0–1.0) and returns its reasoning.
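
One normalization consistent with that description is a linear map that sends 1 to 0.0 and 5 to 1.0. The sketch below assumes this mapping; the exact formula the library uses is not stated here, and normalize_judge_score is a hypothetical name.

```python
def normalize_judge_score(raw, low=1, high=5):
    """Linearly map a rating in [low, high] onto [0.0, 1.0]."""
    if not low <= raw <= high:
        raise ValueError(f"rating {raw} is outside {low}-{high}")
    return (raw - low) / (high - low)


print([normalize_judge_score(s) for s in range(1, 6)])
# [0.0, 0.25, 0.5, 0.75, 1.0]
```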

Custom metrics:

from aevyra_verdict import CustomMetric

def word_count_score(response, ideal=None, **kwargs):
    return min(len(response.split()) / 100, 1.0)

CustomMetric("word_count", word_count_score)

Custom functions return either a float or a dict with at least a "score" key (optionally "reasoning" and any other details).
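
A dict-returning metric might look like the following sketch. The function follows the (response, ideal=None, **kwargs) signature shown above; the scoring rule itself and the normalize_result helper are invented for this example.

```python
def length_ratio_score(response, ideal=None, **kwargs):
    """Score how close the response length is to the reference length."""
    if not ideal:
        return {"score": 0.0, "reasoning": "no reference answer to compare against"}
    r_len, i_len = len(response.split()), len(ideal.split())
    ratio = min(r_len, i_len) / max(r_len, i_len, 1)
    return {
        "score": ratio,
        "reasoning": f"{r_len} words vs {i_len} in the reference",
        "response_words": r_len,  # extra detail carried alongside the score
    }


def normalize_result(value):
    """Accept a bare float or a dict with a "score" key; return a dict."""
    if isinstance(value, dict):
        if "score" not in value:
            raise ValueError('metric dict must contain a "score" key')
        return value
    return {"score": float(value)}


result = length_ratio_score("Paris", ideal="The capital of France is Paris.")
print(result["reasoning"])    # 1 words vs 6 in the reference
print(normalize_result(0.5))  # {'score': 0.5}
```

A function like length_ratio_score could then be wrapped the same way as word_count_score above, via CustomMetric("length_ratio", length_ratio_score).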

CLI

Once the package is installed, the aevyra-verdict command is available.

Inspect a dataset

Preview a dataset before running. The command shows the sample count, whether ideal answers are present, and the first sample; it makes no API calls.

aevyra-verdict inspect dataset.jsonl

Check configured providers

List all available providers and whether their API keys are set:

aevyra-verdict providers

Specifying models

Pass --model (or -m) once per model, in provider/model format:

aevyra-verdict run dataset.jsonl \
  -m openai/gpt-4o \
  -m anthropic/claude-sonnet-4-20250514 \
  -m google/gemini-2.0-flash
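
A provider/model string can be split with a few lines of Python. The sketch below assumes the split happens on the first slash, so that model names which themselves contain a slash stay intact; how the CLI actually parses the value is not documented here, and parse_model_spec is a hypothetical name.

```python
def parse_model_spec(spec):
    """Split "provider/model" on the first slash."""
    provider, sep, model = spec.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected provider/model, got {spec!r}")
    return provider, model


print(parse_model_spec("openai/gpt-4o"))
# ('openai', 'gpt-4o')
print(parse_model_spec("openai/meta-llama/Llama-3.1-8B-Instruct"))
# ('openai', 'meta-llama/Llama-3.1-8B-Instruct')
```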

For more than a couple of models, or when you want to reuse a configuration, use a config file instead:

aevyra-verdict run dataset.jsonl --config models.yaml

The config file may be JSON, YAML, or TOML. Each model entry takes provider and model, with optional label, api_key, and base_url:

# models.yaml
models:
  - provider: openai
    model: gpt-4o
    label: gpt-4o

  - provider: anthropic
    model: claude-sonnet-4-20250514
    label: claude-sonnet

  # Local vLLM instance — uses the OpenAI-compatible API
  - provider: openai
    model: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://localhost:8000/v1
    api_key: "none"
    label: llama-local

Start a local vLLM server with: vllm serve meta-llama/Llama-3.1-8B-Instruct
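
Since JSON is also accepted, the same kind of configuration can be written as JSON. The sketch below assumes the JSON layout mirrors the YAML one (a top-level models list with the same keys) and shows a small standalone check of each entry; load_models is a hypothetical helper, not the package's loader.

```python
import json

CONFIG_JSON = """
{
  "models": [
    {"provider": "openai", "model": "gpt-4o", "label": "gpt-4o"},
    {"provider": "openai", "model": "meta-llama/Llama-3.1-8B-Instruct",
     "base_url": "http://localhost:8000/v1", "api_key": "none",
     "label": "llama-local"}
  ]
}
"""

REQUIRED = {"provider", "model"}
OPTIONAL = {"label", "api_key", "base_url"}


def load_models(text):
    """Parse a JSON config and check the keys of each model entry."""
    entries = json.loads(text)["models"]
    for entry in entries:
        missing = REQUIRED - set(entry)
        unknown = set(entry) - REQUIRED - OPTIONAL
        if missing or unknown:
            raise ValueError(
                f"bad entry {entry}: missing={sorted(missing)}, unknown={sorted(unknown)}"
            )
    return entries


models = load_models(CONFIG_JSON)
print([m["label"] for m in models])  # ['gpt-4o', 'llama-local']
```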

Specifying metrics

Use --metric for built-in options (rouge, bleu, exact) and repeat for multiple:

aevyra-verdict run dataset.jsonl -m openai/gpt-4o --metric rouge --metric bleu

Add an LLM-as-judge with --judge:

aevyra-verdict run dataset.jsonl -m openai/gpt-4o --judge openai/gpt-4o-mini

To customise the judge's evaluation criteria, pass a prompt template file. The recommended format is .md since judge prompts tend to have structure. Use {criteria}, {conversation}, {response}, and {ideal_section} as placeholders:

aevyra-verdict run dataset.jsonl -m openai/gpt-4o \
  --judge openai/gpt-4o-mini \
  --judge-prompt examples/judge_prompt.md

examples/judge_prompt.md is a copy of the default template — a good starting point.
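
To show how the four placeholders fit together, here is a sketch that fills a template with Python's str.format. The template text is made up for this example and is not the default template, and whether the library uses str.format internally is an assumption; render_judge_prompt is a hypothetical helper.

```python
TEMPLATE = """\
Evaluate the assistant response.

Criteria: {criteria}

Conversation:
{conversation}

Response:
{response}
{ideal_section}
Reply with a rating from 1 to 5 and a short justification.
"""


def render_judge_prompt(template, messages, response, criteria, ideal=None):
    """Fill the placeholders; the ideal section is empty without a reference."""
    conversation = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    ideal_section = f"\nReference answer:\n{ideal}\n" if ideal else ""
    return template.format(
        criteria=criteria,
        conversation=conversation,
        response=response,
        ideal_section=ideal_section,
    )


messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = render_judge_prompt(
    TEMPLATE,
    messages,
    response="Paris.",
    criteria="Focus only on factual accuracy.",
    ideal="The capital of France is Paris.",
)
print(prompt)
```

With str.format, any literal braces in a template file would need to be doubled ({{ and }}) to avoid being read as placeholders.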

To use a custom Python scoring function, point at a file and name the function:

aevyra-verdict run dataset.jsonl -m openai/gpt-4o \
  --custom-metric examples/custom_metrics.py:brevity_score \
  --custom-metric examples/custom_metrics.py:contains_code

The function receives (response, ideal=None, messages=None) and returns either a float (0.0–1.0) or a dict with a "score" key and optional "reasoning". See examples/custom_metrics.py for three working examples.
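
A file.py:function reference of this kind can be resolved with importlib from the standard library. The sketch below is one possible way to do it and is not taken from the package; load_function and the short_answer_score demo metric are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_function(spec):
    """Load `name` from a spec of the form `path/to/file.py:name`."""
    path, sep, func_name = spec.rpartition(":")
    if not sep or not path or not func_name:
        raise ValueError(f"expected path.py:function, got {spec!r}")
    module_spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, func_name)


# Demo: write a tiny metrics file to a temp directory and load from it.
SOURCE = (
    "def short_answer_score(response, ideal=None, messages=None):\n"
    "    return 1.0 if len(response.split()) <= 20 else 0.0\n"
)
with tempfile.TemporaryDirectory() as tmp:
    metrics_file = Path(tmp) / "demo_metrics.py"
    metrics_file.write_text(SOURCE)
    short_answer = load_function(f"{metrics_file}:short_answer_score")

print(short_answer("Paris."))  # 1.0
```

Because loading executes the file, only point this kind of loader at code you trust.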

Save results to JSON with -o:

aevyra-verdict run dataset.jsonl --config models.yaml -o results.json

Results

results = runner.run(dataset)

# Formatted comparison table
print(results.compare("rouge_rougeL"))

# Summary dict
results.summary()

# Pandas DataFrame
df = results.to_dataframe()

# Export to JSON
results.to_json("eval_results.json")

Configuration

from aevyra_verdict import EvalRunner
from aevyra_verdict.runner import RunConfig

config = RunConfig(
    temperature=0.0,       # deterministic by default
    max_tokens=1024,

    # Concurrency
    max_workers=10,        # concurrent requests per model
    max_model_workers=4,   # models evaluated concurrently

    # Retries and rate-limit handling
    num_retries=4,         # attempts after the first failure
    retry_base_delay=1.0,  # seconds before the first retry (doubles each attempt)
    retry_max_delay=60.0,  # backoff cap in seconds
    retry_jitter=0.25,     # ±25% random jitter to avoid thundering-herd retries
)
runner = EvalRunner(config=config)

Rate-limit errors (HTTP 429 / RateLimitError) always sleep through the backoff before retrying. Auth and bad-request errors are surfaced immediately — no point retrying a 401. If you're consistently hitting rate limits, the first thing to try is lowering max_workers.
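
The retry behaviour can be sketched as follows. Parameter names mirror the RunConfig fields above, but the code is a standalone illustration: the local RateLimitError class stands in for a provider's 429 error, and details such as applying jitter after the cap are assumptions.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for a provider's HTTP 429 error."""


def backoff_delay(attempt, base_delay=1.0, max_delay=60.0, jitter=0.25):
    """Seconds to wait before retry number `attempt` (0-based)."""
    delay = min(base_delay * (2 ** attempt), max_delay)  # double, then cap
    return delay * (1 + random.uniform(-jitter, jitter))  # +/- jitter fraction


def call_with_retries(fn, num_retries=4, sleep=time.sleep, **backoff):
    """Call fn, retrying only on RateLimitError; other errors propagate."""
    for attempt in range(num_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == num_retries:
                raise  # retry budget exhausted
            sleep(backoff_delay(attempt, **backoff))


calls = {"n": 0}


def flaky():
    """Fails with a rate-limit error twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return "ok"


delays = []  # record delays instead of actually sleeping
print(call_with_retries(flaky, sleep=delays.append, jitter=0.0))  # ok
print(delays)  # [1.0, 2.0]
```

Any exception other than RateLimitError leaves call_with_retries on the first attempt, which matches the rule that auth and bad-request errors are not retried.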

Contributing

This project is in early development. Bug reports and PRs for new providers or metrics are welcome — open an issue first for anything larger than a bug fix.

License

Apache 2.0

Release history

0.2.0 (Apr 14, 2026)
0.1.0 (Mar 30, 2026), this version
