
eval-harness

A boring, config-driven harness for evaluating AI systems.


Run your AI system against a dataset. Capture traces. Score with evaluators. Compare runs. Ship.

eval-harness is a config-driven evaluation framework for AI systems — agents, RAG pipelines, code-modifying tools, multi-turn assistants, raw LLM endpoints. You describe the run in one YAML file. The harness dispatches the matrix of cases × variants, captures structured traces, runs evaluators, persists results, and produces a comparable summary.

It is not a benchmark suite. It is the harness that runs your benchmarks.

📦 On PyPI: https://pypi.org/project/eval-harness/
📚 Source: https://github.com/regokan/evalh


Install

pip install eval-harness                            # core
pip install 'eval-harness[anthropic]'               # + Claude LLM-judge backend
pip install 'eval-harness[anthropic,langfuse,otel]' # + observability mirrors
pip install 'eval-harness[all]'                     # everything

Python 3.11+. Core install pulls only Pydantic v2, httpx, click, rich, jsonpath-ng, pyyaml, python-dotenv, jsonschema, fsspec. LLM SDKs and platform clients ship as optional extras so you only pay for what you use.


60-second tour

# 1. Install
pip install 'eval-harness[anthropic]'

# 2. Drop your Anthropic key into the smoke fixture
echo "ANTHROPIC_API_KEY=sk-ant-..." > examples/tiny_demo/.env

# 3. Run
evalh run examples/tiny_demo/eval.yaml

# 4. Inspect a case
evalh inspect runs/<run_id> --case tiny_demo_001

# 5. Compare two runs
evalh compare runs/<run_a> runs/<run_b>

That run produces a runs/<run_id>/ directory with:

config.yaml         # exact config used, secrets masked
traces.jsonl        # one Trace per (case, variant)
results.jsonl       # one EvaluationResult per (case, variant, evaluator)
summary.yaml        # per-variant pass-rates + baseline comparison
report.md           # human-readable summary

These files are the durable surface. Everything else (drift reports, inspect output, webhook posts) is derived from them.


What it does

  • System under test: plug in any HTTP service, Python function, CLI subprocess, branch checkout, Docker image, multi-turn user simulator, or replay-from-historical-trace.
  • Evaluators: 13 built-ins covering text checks, tool-call assertions, LLM-as-judge (nl-assertions + rubric), schema validation, latency/cost gates, thinking-token rules, semantic similarity, git-diff checks, and command exit codes. Plus a clean extension API.
  • Trace storage: local JSONL (default), SQLite, Postgres, OTel collector, Langfuse, Phoenix, Arize, Braintrust, Slack / Discord / Linear webhooks.
  • Dataset sources: YAML, JSONL, plus production-traffic pulls from Langfuse, Phoenix, Arize, Helicone, Braintrust (with embed_full_trace for replay-style evaluation).
  • Workspace isolation: tempdir snapshot (default), git worktree, Docker volume (sandboxed; can't read host $HOME).
  • Variants: one run dispatches the full matrix of cases × system configurations. Use it for A/B testing, fleet evals, branch comparison, and stochastic sampling.
  • Distribution: local async (default), Ray, Modal, Celery, Kubernetes Jobs. Same code, different executor.
  • Drift detection: promote a run as the baseline; evalh drift surfaces regressions against it. Wire it to Slack via webhook on a daily cron.
  • Reports: Markdown summary, baseline ComparisonReport, per-evaluator rollup, and case-by-case regressions/improvements.

CLI

evalh run <eval.yaml>                              # execute an eval
evalh run --retry-only-failed <run_dir>            # re-run cells that errored
evalh re-evaluate <run_dir> [--add <evaluator>]    # re-score existing traces offline
evalh inspect <run_dir> [--case <id>] [--failed]   # view a case + its results + filesystem artifacts
evalh compare <run_a> <run_b>                      # diff two runs (regressions / improvements)
evalh promote <run_dir>                            # mark a run as the eval's baseline
evalh drift <run_dir> [--exit-nonzero-on-regression]  # compare against baseline; CI gate

eval.yaml in 30 lines

eval:
  name: listing_price

dataset:
  type: yaml
  path: cases.yaml

systems:                                   # one entry per variant
  - name: agent_main
    adapter: http
    endpoint: http://localhost:8000/chat
    response_mapping:
      final_answer: $.answer
      tool_calls: $.tool_calls
  - name: agent_experimental
    adapter: http
    endpoint: http://localhost:8000/chat
    query_params: { variant: experimental }
    response_mapping: { final_answer: $.answer, tool_calls: $.tool_calls }

evaluators:
  - name: must_call_listing_tool
    type: tool_called
    config: { tool_name: get_listing_details }
  - name: answer_quality
    type: llm_judge
    config:
      model: claude-4-7
      nl_assertions:
        - "The answer mentions the listing's suburb."
        - "The answer compares the listing price to the suburb average."
      pass_when: all

run:
  max_concurrency: 4
  baseline_variant: agent_main
  cost_limit_usd: 5.00

output:
  - { type: local_files, path: runs/ }

See docs/ConfigSchema.md for every field.
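
The eval.yaml above points at a cases.yaml dataset that isn't shown here. A minimal sketch of the shape such a file could take, with hypothetical field names (id, input, metadata are illustrative assumptions, not the canonical schema; docs/ConfigSchema.md has the real cases.yaml reference):

# cases.yaml (hypothetical shape; field names are assumptions, see docs/ConfigSchema.md)
cases:
  - id: listing_price_001
    input: "How does the asking price of listing 4821 compare to its suburb average?"
    metadata: { suburb: Richmond }
  - id: listing_price_002
    input: "Is listing 9034 overpriced for Fitzroy?"
    metadata: { suburb: Fitzroy }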


Examples

Four runnable references live under examples/:

  • tiny_demo/ — self-contained smoke test against Claude. Needs only ANTHROPIC_API_KEY. Finishes in under a minute.
  • listing_price/ — realistic-shape eval: HTTP agent service, two variants, LLM judge. Plug your service in.
  • online_eval/ — replay-style evaluation. The fixture adapter ships embedded historical traces; the replay SystemAdapter scores them. Swap the fixture for Langfuse / Phoenix / Arize to score production traffic (a hedged config sketch follows this list).
  • coding_agent/ — workspace-mutating agent. Claude patches a fixture repo; the command evaluator runs pytest in the artifact directory.
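
To score production traffic instead of the embedded fixture in online_eval/, the dataset block swaps in a platform pull. A hedged sketch assuming a Langfuse source: apart from embed_full_trace (named in the feature list above), the keys below are illustrative rather than documented options; see docs/ConfigSchema.md and docs/Adapters.md for the real ones.

dataset:
  type: langfuse                       # platform pull (illustrative; exact keys may differ)
  public_key: ${LANGFUSE_PUBLIC_KEY}
  secret_key: ${LANGFUSE_SECRET_KEY}
  embed_full_trace: true               # keep full historical traces for replay-style scoring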

Distributed runs

The default LocalExecutor uses asyncio.gather + a semaphore — perfect for thousands of cases on one box. For larger fleets, plug in another executor:

run:
  executor:
    type: ray            # or modal, celery, kubernetes
    address: auto        # or your cluster address
    object_store_memory: 2_147_483_648

The cell (one case × variant pairing) is the unit of distribution. Workers rebuild adapters and evaluators from your eval.yaml and the entry-point registry — config travels, code doesn't — so your custom evaluators work on Ray workers without pickling pitfalls. See docs/Executors.md.


Custom evaluators

When the built-ins don't cover your domain (e.g., "the SQL the agent generated returns the same rowset as the reference SQL"), write your own and register it via Python entry-points — no fork of eval-harness required:

# your-package/pyproject.toml
[project.entry-points."eval_harness.evaluators"]
sql_equivalent = "your_package.evaluators.sql_equivalent:SqlEquivalentEvaluator"

# your eval.yaml
evaluators:
  - name: query_correctness
    type: sql_equivalent
    config: { reference_sql: "SELECT id FROM listings WHERE suburb='Richmond'" }

The same extension pattern works for system adapters, dataset adapters, trace stores, workspace adapters, embedder backends, and LLM-judge backends. See docs/Evaluators.md and docs/Adapters.md.


Observability integrations

eval-harness coexists with your existing observability stack — it doesn't replace it. The local runs/<run_id>/ directory stays canonical; remote sinks are mirrors. Failed mirror writes don't abort the run; they land in summary.yaml > sink_errors.

output:
  - { type: local_files, path: runs/ }                                        # canonical
  - { type: otel,        endpoint: "https://api.honeycomb.io" }               # mirror to Honeycomb
  - { type: langfuse,    api_key: "${LANGFUSE_API_KEY}", host: "..." }        # mirror to Langfuse UI
  - { type: webhook,     platform: slack, url: "${SLACK_WEBHOOK_URL}" }       # daily summary post

Backends shipped: OTel (Honeycomb / Datadog / Tempo / Grafana / Phoenix-OTLP / self-hosted Langfuse), Langfuse, Phoenix, Arize, Braintrust, Helicone (dataset only), Slack / Discord / Linear (webhook). See docs/Observability.md.


CI integrations

Two reference workflows live under templates/:

  • templates/eval.yml — on every PR, run the eval against the PR head, compare with main's baseline, and post a markdown summary back to the PR as a comment.
  • templates/eval-daily.yml — on a schedule (or workflow_dispatch), run the eval, compute drift vs. the saved baseline, and post regressions to a webhook channel.

Walkthrough in docs/CI.md.
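
A minimal sketch of what a scheduled drift job can look like on GitHub Actions, assuming the eval config lives at evals/eval.yaml and a baseline has already been promoted; the shipped templates/eval-daily.yml is the reference, and the paths, secrets, and steps below are illustrative:

# .github/workflows/eval-daily.yml (illustrative sketch, not the shipped template)
name: eval-daily
on:
  schedule:
    - cron: "0 6 * * *"        # daily
  workflow_dispatch: {}

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install 'eval-harness[anthropic]'
      - name: Run the eval
        run: evalh run evals/eval.yaml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Gate on drift vs. the promoted baseline
        run: evalh drift "$(ls -dt runs/* | head -n 1)" --exit-nonzero-on-regression

The PR workflow (templates/eval.yml) follows the same shape, swapping the schedule trigger for pull_request and posting the markdown summary back to the PR.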


Documentation

  • Why the project exists: docs/PRD.md
  • End-to-end design: docs/Architecture.md
  • Trace / Case / Result / Summary models: docs/DataModel.md
  • eval.yaml and cases.yaml field reference: docs/ConfigSchema.md
  • System / Dataset / TraceStore / Workspace / Enricher contracts: docs/Adapters.md
  • Built-in evaluators + writing your own: docs/Evaluators.md
  • The variant matrix (A/B, branch, fleet, sampling): docs/Variants.md
  • Filesystem artifacts + sandboxed workspaces: docs/Filesystem.md
  • Concurrency model + executor abstraction: docs/Concurrency.md
  • Distributed executors (Ray, Modal, Celery, K8s): docs/Executors.md
  • Observability platform integrations: docs/Observability.md
  • Drift detection + CLI surface: docs/CLI.md
  • GitHub Actions recipes: docs/CI.md
  • Project layout + plugin packaging: docs/RepositoryStructure.md
  • Milestone-by-milestone history: CHANGELOG.md, docs/Roadmap.md

Status

All planned milestones are shipped — v0 through v2. The project covers what the roadmap set out to do and nothing beyond it (hosted SaaS, web dashboard, auth, and built-in dataset libraries are explicitly out of scope; see docs/Roadmap.md > Forever-maybe).

Snapshot: 132 source files · 657+ tests · ruff & mypy --strict clean · 6 adapter families · 5 executor backends · 8 observability platform integrations.


Contributing

Issues and PRs welcome. See CONTRIBUTING.md for setup, testing, and submission guidelines. The architectural rails the project is built against live under .claude/rules/ — read those before substantive PRs.


License

MIT. Copyright © 2026 eval-harness contributors.
