Multi-trial experiment runner for zrb chat

zrb-llm-evaluator

Multi-trial experiment runner for testing LLMs against structured test cases via zrb chat.

Given a set of models and test cases, it runs every combination across N trials, collects structured results, and generates reports. Supports concurrent execution, timeout handling, resume from partial runs, and pluggable validators.

Quick Start

# Install
poetry install

# Create a test case directory (a validator.py is also required; see Writing a Validator)
mkdir -p my-cases/hello-world
cat > my-cases/hello-world/instruction.txt << 'EOF'
Write a Python function that returns "hello, world".
EOF

# Run an experiment (requires zrb installed and configured)
zrb-llm-evaluator run \
  --models openai:gpt-4o \
  --test-cases ./my-cases/hello-world/ \
  --trials 1

Features

Runs N models × M test cases × T trials as a flat experiment grid
Concurrent trial execution with configurable parallelism (asyncio + semaphore)
Per-trial timeout kills subprocess and preserves partial output
Resume support: re-run with the same output directory to skip completed cells
Pluggable validators: each test case provides a validator.py implementing a typed protocol
Atomic results.json writes (temp file + os.rename)
Cost summary line parsing from zrb output
Custom CLI binary name for white-labeled zrb forks
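
As a rough illustration of the flat grid and the resume behavior listed above, the sketch below enumerates every model × test case × trial cell and filters out cells already completed. The function names `build_grid` and `pending_cells` and the tuple representation of a cell are assumptions made for this example, not the package's actual API.

```python
from itertools import product


def build_grid(models, test_cases, trials):
    """Return every (model, test_case, trial) cell, with trials numbered from 1."""
    return list(product(models, test_cases, range(1, trials + 1)))


def pending_cells(grid, completed):
    """Drop cells that a previous run already finished (the resume step)."""
    return [cell for cell in grid if cell not in completed]


models = ["openai:gpt-4o", "google-gla:gemini-2.5-flash"]
test_cases = ["bug-fix", "copywriting"]
grid = build_grid(models, test_cases, trials=3)

# Pretend one cell was completed by an earlier, interrupted run.
completed = {("openai:gpt-4o", "bug-fix", 1)}
todo = pending_cells(grid, completed)
print(len(grid), len(todo))  # 12 11
```

With 2 models, 2 test cases, and 3 trials the grid has 12 cells; resuming with one completed cell leaves 11 to run.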

Installation

Prerequisites: Python ≥ 3.11, Poetry, and an installed zrb CLI with API keys configured.

git clone <repo>
cd zrb-llm-evaluator
poetry install

Verify the CLI works:

poetry run zrb-llm-evaluator --help

Usage

Test Case Format

Each test case is a directory containing:

my-cases/<test-name>/
├── instruction.txt    # Required — the prompt sent to the LLM
├── validator.py       # Required — validation logic (see below)
└── workdir/           # Optional — files copied into the trial's working directory
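
The layout above can be checked before running anything. The helper below is a minimal sketch, not the package's loader: `missing_files` is a hypothetical name, and it only tests for the two required files.

```python
import tempfile
from pathlib import Path

# Required entries taken from the layout above; workdir/ is optional.
REQUIRED_FILES = ("instruction.txt", "validator.py")


def missing_files(case_dir: Path) -> list[str]:
    """Return the required files that are absent from a test case directory."""
    return [name for name in REQUIRED_FILES if not (case_dir / name).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    case_dir = Path(tmp) / "hello-world"
    case_dir.mkdir()
    (case_dir / "instruction.txt").write_text(
        'Write a Python function that returns "hello, world".\n'
    )
    problems = missing_files(case_dir)
    print(problems)  # ['validator.py']
```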

Writing a Validator

validator.py must expose a top-level validator object that implements ValidatorProtocol:

# my-cases/hello-world/validator.py
from pathlib import Path
from zrb_llm_evaluator.models import ValidationResult, ValidationCheck
from zrb_llm_evaluator.protocols import ValidatorProtocol

class HelloValidator:
    def validate(self, output_dir: Path, log_content: str) -> ValidationResult:
        passed = "hello, world" in log_content.lower()
        return ValidationResult(
            status="PASS" if passed else "FAIL",
            score=1.0 if passed else 0.0,
            details=[
                ValidationCheck(
                    name="contains_greeting",
                    passed=passed,
                    message="Output contains 'hello, world'" if passed
                            else "Missing expected greeting",
                )
            ],
        )

validator = HelloValidator()

The framework validates protocol conformance at load time. A validator.py that does not implement ValidatorProtocol is rejected before any trial runs.
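
One way a load-time conformance check like this could work is with a `typing.Protocol` marked `runtime_checkable`. The protocol defined below is a stand-in written for this example; the real one lives in `zrb_llm_evaluator.protocols` and may be checked differently.

```python
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ValidatorProtocol(Protocol):
    # Hypothetical stand-in for zrb_llm_evaluator.protocols.ValidatorProtocol.
    def validate(self, output_dir: Path, log_content: str) -> Any: ...


class GoodValidator:
    def validate(self, output_dir: Path, log_content: str) -> dict:
        return {"status": "PASS"}


class BadValidator:
    # No validate() method, so this does not satisfy the protocol.
    def check(self, log_content: str) -> bool:
        return True


def conforms(candidate: object) -> bool:
    # isinstance() against a runtime_checkable Protocol only confirms that the
    # method exists; it does not verify the signature or return type.
    return isinstance(candidate, ValidatorProtocol)


print(conforms(GoodValidator()), conforms(BadValidator()))  # True False
```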

ValidationResult

| Field | Type | Description |
| --- | --- | --- |
| `status` | `"EXCELLENT" \| "PASS" \| "FAIL"` | Overall outcome |
| `score` | `float` (0.0–1.0) | Normalized score |
| `details` | `list[ValidationCheck]` | Per-check breakdown |
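
To make the field constraints concrete, here is a sketch that mirrors the table with plain dataclasses. It is not the definition shipped in `zrb_llm_evaluator.models`, which may use a different base class or different validation. The `ValidationCheck` fields are taken from the validator example earlier.

```python
from dataclasses import dataclass, field
from typing import Literal

ALLOWED_STATUSES = ("EXCELLENT", "PASS", "FAIL")


@dataclass
class ValidationCheck:
    name: str
    passed: bool
    message: str = ""


@dataclass
class ValidationResult:
    status: Literal["EXCELLENT", "PASS", "FAIL"]
    score: float
    details: list[ValidationCheck] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the constraints from the table at construction time.
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be within 0.0-1.0, got {self.score}")


result = ValidationResult(
    status="PASS",
    score=1.0,
    details=[ValidationCheck(name="contains_greeting", passed=True)],
)
print(result.status, result.score, len(result.details))  # PASS 1.0 1
```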

CLI Reference

`run`

Run a full experiment.

zrb-llm-evaluator run \
  --models openai:gpt-4o,google-gla:gemini-2.5-flash \
  --test-cases ./cases/bug-fix,./cases/copywriting \
  --trials 3 \
  --parallelism 4 \
  --timeout 300 \
  --cli-name zrb \
  --output-dir ./out

| Option | Default | Description |
| --- | --- | --- |
| `--models` | required | Comma-separated list in `provider:name` format |
| `--test-cases` | required | Comma-separated list of test case directory paths |
| `--trials` | `3` | Trials per model × test case cell |
| `--parallelism` | `4` | Max concurrent subprocesses |
| `--timeout` | `300` | Per-trial timeout in seconds |
| `--cli-name` | `zrb` | CLI binary to invoke |
| `--output-dir` | `./out` | Output directory for results |
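
The `--models` value is a comma-separated list of `provider:name` entries. The sketch below shows one plausible way to split such a string; `parse_models` is a hypothetical helper, and the CLI's own parsing and error messages may differ.

```python
def parse_models(raw: str) -> list[tuple[str, str]]:
    """Split 'provider:name,provider:name' into (provider, name) pairs."""
    pairs = []
    for item in raw.split(","):
        item = item.strip()
        # Split on the first colon only, so a name may itself contain colons.
        provider, sep, name = item.partition(":")
        if not sep or not provider or not name:
            raise ValueError(f"expected provider:name, got {item!r}")
        pairs.append((provider, name))
    return pairs


print(parse_models("openai:gpt-4o,google-gla:gemini-2.5-flash"))
# [('openai', 'gpt-4o'), ('google-gla', 'gemini-2.5-flash')]
```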

Output: results.json (structured) + report.md (human-readable).
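
The Features list mentions atomic results.json writes via a temp file and a rename. The following is a minimal sketch of that pattern, not the package's reporter code. It uses `os.replace` rather than `os.rename` so that overwriting an existing file also works on Windows, and the record fields `model` and `trial` are placeholders, not the real TrialResult schema.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: object) -> None:
    """Write JSON to a temp file in the target directory, then swap it in."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Leave no stray temp file behind if writing or renaming fails.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as out_dir:
    target = Path(out_dir) / "results.json"
    write_json_atomic(target, [{"model": "openai:gpt-4o", "trial": 1}])
    loaded = json.loads(target.read_text(encoding="utf-8"))
    leftovers = [p.name for p in Path(out_dir).iterdir()]
    print(loaded, leftovers)
```

A reader of results.json sees either the previous complete file or the new complete file, never a half-written one.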

`list`

List completed trials from a previous experiment.

zrb-llm-evaluator list --dir ./out

`report`

Regenerate the Markdown report from an existing results.json.

zrb-llm-evaluator report --dir ./out

Architecture

The runner has four layers:

CLI (cli.py) — Typer entry point, parses args, validates config
Loader (loader.py) — Discovers test cases from directories, imports validators, checks protocol conformance
Runner (runner.py) — Async subprocess orchestration with asyncio.Semaphore, asyncio.wait_for, and ResumeManager for idempotent resumption. Each trial creates an isolated directory {output}/{model_safe}/{test_case}/trial-{N}/ (colons in the model name are sanitized) with its own history directory
Reporter (reporter.py) — Generates Markdown and JSON output with atomic file writes
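
The Runner layer's combination of asyncio.Semaphore and asyncio.wait_for can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in runner.py: `run_trial` and its return shape are made up for the example, the commands are throwaway Python one-liners instead of zrb invocations, and streaming output to a log file is just one way to keep partial output after a kill.

```python
import asyncio
import sys
import tempfile
from pathlib import Path


async def run_trial(cmd: list[str], log_path: Path, timeout: float,
                    limit: asyncio.Semaphore) -> dict:
    """Run one subprocess under the shared semaphore with a hard timeout."""
    async with limit:  # caps how many trials run at the same time
        with open(log_path, "wb") as log_file:
            # Writing straight to a file keeps whatever was printed before a kill.
            proc = await asyncio.create_subprocess_exec(
                *cmd, stdout=log_file, stderr=asyncio.subprocess.STDOUT
            )
            try:
                await asyncio.wait_for(proc.wait(), timeout=timeout)
                timed_out = False
            except asyncio.TimeoutError:
                proc.kill()
                await proc.wait()  # reap the killed process
                timed_out = True
    return {"timed_out": timed_out, "log": log_path.read_text()}


async def main() -> list[dict]:
    limit = asyncio.Semaphore(2)
    fast = [sys.executable, "-c", "print('done')"]
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    with tempfile.TemporaryDirectory() as tmp:
        return await asyncio.gather(
            run_trial(fast, Path(tmp) / "fast.log", timeout=20, limit=limit),
            run_trial(slow, Path(tmp) / "slow.log", timeout=1, limit=limit),
        )


results = asyncio.run(main())
print([r["timed_out"] for r in results])  # [False, True]
```

The semaphore bounds concurrency independently of how many trials are scheduled, so the whole grid can be handed to `asyncio.gather` at once.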

Key design decisions are documented in docs/adr/.

Output Structure

./out/
├── results.json          # Structured results (list of TrialResult)
├── report.md             # Human-readable report
├── openai_gpt-4o/
│   └── bug-fix/
│       ├── trial-1/
│       │   ├── stdout.log            # Raw subprocess stdout/stderr
│       │   └── history/              # ZRB_LLM_HISTORY_DIR
│       │       └── <session>.json
│       ├── trial-2/
│       └── trial-3/
└── google-gla_gemini-2.5-flash/
    └── ...
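
The per-trial paths in the tree above can be derived along these lines. The sketch assumes that sanitizing a model name means replacing colons with underscores, as the directory names suggest; the package may sanitize other characters too. `trial_dir` and `trial_env` are hypothetical helper names.

```python
import os
from pathlib import Path


def trial_dir(output_dir: str, model: str, test_case: str, trial: int) -> Path:
    """Build {output}/{model_safe}/{test_case}/trial-{N} for one trial."""
    model_safe = model.replace(":", "_")  # assumption: only colons are replaced
    return Path(output_dir) / model_safe / test_case / f"trial-{trial}"


def trial_env(directory: Path) -> dict[str, str]:
    """Copy the current environment and point zrb's history at this trial."""
    env = dict(os.environ)
    env["ZRB_LLM_HISTORY_DIR"] = str(directory / "history")
    return env


path = trial_dir("./out", "openai:gpt-4o", "bug-fix", 1)
env = trial_env(path)
print(path.as_posix())  # out/openai_gpt-4o/bug-fix/trial-1
```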

Development

poetry install --with dev
poetry run pytest tests/experiment-runner/
poetry run ruff check
poetry run mypy src/

License

MIT.

Release history

0.1.6: May 22, 2026
0.1.5: May 22, 2026
0.1.4: May 20, 2026
0.1.3: May 20, 2026
0.1.2: May 20, 2026 (this version)
0.1.1: May 20, 2026
0.1.0: May 19, 2026

Download files

Source Distribution

zrb_llm_evaluator-0.1.2.tar.gz (16.8 kB)

Uploaded May 20, 2026 Source

Built Distribution

zrb_llm_evaluator-0.1.2-py3-none-any.whl (19.8 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file zrb_llm_evaluator-0.1.2.tar.gz.

File metadata

Download URL: zrb_llm_evaluator-0.1.2.tar.gz
Upload date: May 20, 2026
Size: 16.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.3.1 CPython/3.13.0 Darwin/25.5.0

File hashes

Hashes for zrb_llm_evaluator-0.1.2.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `b0bffd6b1beac649b710ff4dd1f0d4092a3ca3e3ce1aa53047ec74e4052d7aae` |
| MD5 | `561ea1fd67641129f56494dffef3a794` |
| BLAKE2b-256 | `0a02f3bc5c7fdb022a9da00de5b059ccbc5d71e1fafa9117f21e506c58ca0dde` |

File details

Details for the file zrb_llm_evaluator-0.1.2-py3-none-any.whl.

File metadata

Download URL: zrb_llm_evaluator-0.1.2-py3-none-any.whl
Upload date: May 20, 2026
Size: 19.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.3.1 CPython/3.13.0 Darwin/25.5.0

File hashes

Hashes for zrb_llm_evaluator-0.1.2-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `ca828267c407292c3ea4d45f88cc3be241f08b18dea083256d77ac6fae024b1b` |
| MD5 | `cee61059c1ef75979d7c4c32f09c88a9` |
| BLAKE2b-256 | `d9df444a17dcae50118e9ed1b8718059fdf3a08f8726f415c3a1d1d89bf9b98b` |
