zrb-llm-evaluator

Multi-trial experiment runner for testing LLMs against structured test cases via zrb chat.

Given a set of models and test cases, it runs every combination across N trials, collects structured results, and generates reports. Supports concurrent execution, timeout handling, resume from partial runs, and pluggable validators.
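
As a rough illustration of the experiment grid described above, the sketch below enumerates every model × test case × trial cell. It is not the package's own code: `build_grid` and the tuple layout are hypothetical names chosen for this example.

```python
from itertools import product


def build_grid(
    models: list[str], test_cases: list[str], trials: int
) -> list[tuple[str, str, int]]:
    """Return every (model, test_case, trial_number) cell; trials count from 1."""
    return list(product(models, test_cases, range(1, trials + 1)))


grid = build_grid(
    ["openai:gpt-4o", "google-gla:gemini-2.5-flash"],
    ["bug-fix", "copywriting"],
    trials=3,
)
print(len(grid))  # 2 models x 2 test cases x 3 trials = 12
```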

Quick Start

# Install
poetry install

# Create a test case directory
mkdir -p my-cases/hello-world
cat > my-cases/hello-world/instruction.txt << 'EOF'
Write a Python function that returns "hello, world".
EOF

# Run an experiment (requires zrb installed and configured, plus a validator.py in the test case directory)
zrb-llm-evaluator run \
  --models openai:gpt-4o \
  --test-cases ./my-cases/hello-world/ \
  --trials 1

Features

  • Runs N models × M test cases × T trials as a flat experiment grid
  • Concurrent trial execution with configurable parallelism (asyncio + semaphore)
  • Per-trial timeout kills subprocess and preserves partial output
  • Resume support: re-run with the same output directory to skip completed cells
  • Pluggable validators: each test case provides a validator.py implementing a typed protocol
  • Atomic results.json writes (temp file + os.rename)
  • Cost summary line parsing from zrb output
  • Custom CLI binary name for white-labeled zrb forks
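
The atomic-write bullet above can be sketched roughly as follows. This is an assumption-laden illustration rather than the package's implementation: `write_json_atomically` is a hypothetical helper, and it uses `os.replace` in place of `os.rename` because `os.replace` also overwrites an existing target on Windows.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: object) -> None:
    """Write payload to a temp file in the target directory, then swap it in."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the final swap stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a half-written file.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Writing the temp file into the same directory matters: a rename across filesystems is not atomic.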

Installation

Prerequisites: Python ≥ 3.11, Poetry, and an installed zrb CLI with API keys configured.

git clone <repo>
cd zrb-llm-evaluator
poetry install

Verify the CLI works:

poetry run zrb-llm-evaluator --help

Usage

Test Case Format

Each test case is a directory containing:

my-cases/<test-name>/
├── instruction.txt    # Required — the prompt sent to the LLM
├── validator.py       # Required — validation logic (see below)
└── workdir/           # Optional — files copied into the trial's working directory

Writing a Validator

validator.py must expose a top-level validator object that implements ValidatorProtocol:

# my-cases/hello-world/validator.py
from pathlib import Path
from zrb_llm_evaluator.models import ValidationResult, ValidationCheck
from zrb_llm_evaluator.protocols import ValidatorProtocol

class HelloValidator:
    def validate(self, output_dir: Path, log_content: str) -> ValidationResult:
        passed = "hello, world" in log_content.lower()
        return ValidationResult(
            status="PASS" if passed else "FAIL",
            score=1.0 if passed else 0.0,
            details=[
                ValidationCheck(
                    name="contains_greeting",
                    passed=passed,
                    message="Output contains 'hello, world'" if passed
                            else "Missing expected greeting",
                )
            ],
        )

validator = HelloValidator()

The framework validates protocol conformance at load time. A validator.py that doesn't implement ValidatorProtocol is rejected before any trial runs.
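
One way such a load-time check could work is sketched below. It is a sketch under assumptions, not the loader's actual code: `ValidatorLike` is a hypothetical stand-in for ValidatorProtocol, and `load_validator` is an illustrative name.

```python
import importlib.util
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ValidatorLike(Protocol):
    """Hypothetical stand-in for ValidatorProtocol."""

    def validate(self, output_dir: Path, log_content: str) -> Any: ...


def load_validator(validator_path: Path) -> ValidatorLike:
    """Import validator.py from a path and check its top-level `validator` object."""
    spec = importlib.util.spec_from_file_location("case_validator", validator_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot import {validator_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    candidate = getattr(module, "validator", None)
    # isinstance() against a runtime_checkable Protocol only checks that the
    # method exists; it does not verify the signature or return type.
    if not isinstance(candidate, ValidatorLike):
        raise TypeError(f"{validator_path} does not expose a conforming 'validator' object")
    return candidate
```

Because the runtime check is structural, a validator class does not need to inherit from the protocol; defining a `validate` method is enough to pass it.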

ValidationResult

Field     Type                             Description
status    "EXCELLENT" | "PASS" | "FAIL"    Overall outcome
score     float (0.0–1.0)                  Normalized score
details   list[ValidationCheck]            Per-check breakdown

CLI Reference

run

Run a full experiment.

zrb-llm-evaluator run \
  --models openai:gpt-4o,google-gla:gemini-2.5-flash \
  --test-cases ./cases/bug-fix,./cases/copywriting \
  --trials 3 \
  --parallelism 4 \
  --timeout 300 \
  --cli-name zrb \
  --output-dir ./out

Option          Default    Description
--models        required   Comma-separated list in provider:name format
--test-cases    required   Comma-separated list of test case directory paths
--trials        3          Trials per model × test case cell
--parallelism   4          Max concurrent subprocesses
--timeout       300        Per-trial timeout in seconds
--cli-name      zrb        CLI binary to invoke
--output-dir    ./out      Output directory for results

Output: results.json (structured results) and report.md (human-readable report), both written to the output directory.
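
For illustration, a comma-separated --models value in provider:name format could be split as in the sketch below. `parse_models` is a hypothetical helper, not part of the documented CLI, and the error handling is an assumption.

```python
def parse_models(raw: str) -> list[tuple[str, str]]:
    """Split 'provider:name,provider:name' into (provider, name) pairs."""
    models: list[tuple[str, str]] = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        # Split on the first colon only, so any later colons stay in the name.
        provider, sep, name = item.partition(":")
        if not sep or not provider or not name:
            raise ValueError(f"Expected provider:name, got {item!r}")
        models.append((provider, name))
    return models


print(parse_models("openai:gpt-4o,google-gla:gemini-2.5-flash"))
```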

list

List completed trials from a previous experiment.

zrb-llm-evaluator list --dir ./out
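
Completed trials are also what makes resuming possible: cells already present in results.json can be skipped. The sketch below shows the idea only. The record field names `model`, `test_case`, and `trial` are assumptions about the TrialResult layout, and both helper functions are hypothetical.

```python
import json
from pathlib import Path

Cell = tuple[str, str, int]


def completed_cells(results_path: Path) -> set[Cell]:
    """Read results.json and return the cells that already have a result."""
    if not results_path.exists():
        return set()
    records = json.loads(results_path.read_text(encoding="utf-8"))
    # Assumed field names; adjust to the real TrialResult schema.
    return {(r["model"], r["test_case"], r["trial"]) for r in records}


def pending_cells(grid: list[Cell], done: set[Cell]) -> list[Cell]:
    """Keep only the cells that still need to run, preserving grid order."""
    return [cell for cell in grid if cell not in done]
```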

report

Regenerate the Markdown report from an existing results.json.

zrb-llm-evaluator report --dir ./out

Architecture

The runner has four layers:

  1. CLI (cli.py) — Typer entry point, parses args, validates config
  2. Loader (loader.py) — Discovers test cases from directories, imports validators, checks protocol conformance
  3. Runner (runner.py) — Async subprocess orchestration with asyncio.Semaphore, asyncio.wait_for, and ResumeManager for idempotent resumption. Each trial creates an isolated directory {output}/{model_safe}/{test_case}/trial-{N}/ (colons in the model name are sanitized) with its own history directory
  4. Reporter (reporter.py) — Generates Markdown and JSON output with atomic file writes
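
The Runner layer's combination of asyncio.Semaphore and asyncio.wait_for might look roughly like the sketch below. It is a simplified illustration, not the code in runner.py: `run_trial`, `run_all`, the flat `trial-{i}.log` file names, and the "OK" / "ERROR" / "TIMEOUT" status strings are all hypothetical. Output is streamed straight to a log file so that whatever was written before a timeout remains on disk.

```python
import asyncio
from pathlib import Path


async def run_trial(
    cmd: list[str], log_path: Path, timeout: float, limiter: asyncio.Semaphore
) -> str:
    """Run one subprocess under the concurrency limit; kill it on timeout."""
    async with limiter:
        with log_path.open("wb") as log_file:
            proc = await asyncio.create_subprocess_exec(
                *cmd, stdout=log_file, stderr=asyncio.subprocess.STDOUT
            )
            try:
                await asyncio.wait_for(proc.wait(), timeout=timeout)
            except asyncio.TimeoutError:
                proc.kill()
                await proc.wait()  # reap the process; partial output stays in the log
                return "TIMEOUT"
        return "OK" if proc.returncode == 0 else "ERROR"


async def run_all(
    commands: list[list[str]], out_dir: Path, parallelism: int = 4, timeout: float = 300.0
) -> list[str]:
    """Run every command with at most `parallelism` subprocesses alive at once."""
    limiter = asyncio.Semaphore(parallelism)
    tasks = [
        run_trial(cmd, out_dir / f"trial-{i}.log", timeout, limiter)
        for i, cmd in enumerate(commands, start=1)
    ]
    return await asyncio.gather(*tasks)
```

Calling `asyncio.run(run_all(commands, Path("out")))` returns one status per command, in input order.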

Key design decisions are documented in docs/adr/.

Output Structure

./out/
├── results.json          # Structured results (list of TrialResult)
├── report.md             # Human-readable report
├── openai_gpt-4o/
│   └── bug-fix/
│       ├── trial-1/
│       │   ├── stdout.log            # Raw subprocess stdout/stderr
│       │   └── history/              # ZRB_LLM_HISTORY_DIR
│       │       └── <session>.json
│       ├── trial-2/
│       └── trial-3/
└── google-gla_gemini-2.5-flash/
    └── ...
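
The directory layout above can be derived as in the following sketch. The colon-to-underscore replacement matches the directory names shown in the tree; the `trial_dir` and `trial_env` helpers are hypothetical, and passing ZRB_LLM_HISTORY_DIR through the subprocess environment is an assumption about how the history directory is wired up.

```python
import os
from pathlib import Path


def trial_dir(output_dir: Path, model: str, test_case: str, trial: int) -> Path:
    """Build {output}/{model_safe}/{test_case}/trial-{N} for one trial."""
    model_safe = model.replace(":", "_")  # e.g. openai:gpt-4o -> openai_gpt-4o
    return output_dir / model_safe / test_case / f"trial-{trial}"


def trial_env(trial_path: Path) -> dict[str, str]:
    """Copy the current environment and point the history dir at this trial."""
    env = dict(os.environ)
    env["ZRB_LLM_HISTORY_DIR"] = str(trial_path / "history")
    return env


path = trial_dir(Path("out"), "openai:gpt-4o", "bug-fix", 1)
print(path.as_posix())  # out/openai_gpt-4o/bug-fix/trial-1
```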

Development

poetry install --with dev
poetry run pytest tests/experiment-runner/
poetry run ruff check
poetry run mypy src/

License

MIT.
