

RunLedger


CI for tool-using agents. Deterministic eval suites with record/replay tool calls, hard assertions, budgets, and PR regression gates.

RunLedger is a CI harness, not an "eval metrics framework":

  • DeepEval-style tools help you score behavior.
  • RunLedger helps you ship safely by making agent tests deterministic and merge-gated.

Why RunLedger (in one minute)

Agents regress silently:

  • prompt changes
  • tool signature drift
  • model updates
  • external APIs / web volatility

RunLedger stops that by:

  • recording tool calls once and replaying in CI
  • enforcing contracts (schema, tool allowlist/order, budgets)
  • gating PRs via baselines + diffs (exit codes + JUnit + HTML report)

Quickstart (5 minutes)

pipx install runledger
runledger init
runledger run ./evals/demo --mode record
runledger baseline promote --from <RUN_DIR> --to baselines/demo.json
runledger run ./evals/demo --mode replay --baseline baselines/demo.json

Artifacts are written to runledger_out/<suite>/<run_id>/:

  • report.html, summary.json, junit.xml, run.jsonl

FAQ: Why not just use DeepEval?

DeepEval is excellent for evaluation metrics and benchmarking. RunLedger is focused on CI determinism for tool-using agents:

  • Record tool calls once -> replay in CI (stable, fast, no flaky external dependencies)
  • Merge gates based on hard contracts (schema/tool/budgets), not "LLM-judge vibes"
  • Baselines-as-code (diffs + promotions) that fit cleanly into PR workflows

If you already use DeepEval, keep it; RunLedger can be the harness that makes agent tests CI-grade.


When to use RunLedger vs DeepEval

Use RunLedger when you need:

  • deterministic CI for tool-using agents (web/APIs/DBs)
  • hard pass/fail contracts (schema/tool order/budgets)
  • baselines + PR regression gates

Use DeepEval when you need:

  • rich evaluation metrics / model-graded scoring / benchmarking workflows

Use both if you want DeepEval scoring inside a deterministic CI harness.


How it works

  1. You define a suite (suite.yaml) and cases (cases/*.yaml).

  2. The runner launches your agent under test as a subprocess (any language).

  3. The agent requests tools via a stdio JSON protocol.

  4. The runner either:

    • records tool results to a cassette (local/dev), or
    • replays tool results from a cassette (CI/deterministic).

  5. The agent emits a final JSON output.


Agent-under-test protocol (language-agnostic)

Transport: newline-delimited JSON messages over stdin/stdout.

Hard rule: agent must write protocol JSON only to stdout. Any human logs must go to stderr (stdout must stay parseable).

Runner -> Agent

task_start

{ "type": "task_start", "task_id": "t1", "input": { "...": "..." } }

tool_result

{ "type": "tool_result", "call_id": "c1", "ok": true, "result": { "...": "..." } }

Agent -> Runner

tool_call

{ "type": "tool_call", "name": "search_docs", "call_id": "c1", "args": { "q": "..." } }

final_output (must be JSON)

{ "type": "final_output", "output": { "category": "billing", "reply": "..." } }

Optional:

  • log (structured debug)
  • task_error (explicit failure)
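The message shapes above can be exercised with a minimal agent-under-test sketch. This is illustrative only: `run_agent` and `send` are hypothetical helper names, not part of RunLedger's API, and the hard-coded `search_docs` call and reply text are placeholders.

```python
import json
import sys

def send(msg, out=sys.stdout):
    # Protocol rule: one JSON object per line, written to stdout only.
    out.write(json.dumps(msg) + "\n")
    out.flush()

def run_agent(inp=sys.stdin, out=sys.stdout):
    # Human-readable logs must go to stderr so stdout stays parseable.
    print("agent ready", file=sys.stderr)
    for line in inp:
        msg = json.loads(line)
        if msg["type"] == "task_start":
            # Ask the runner to execute a tool before answering.
            send({"type": "tool_call", "name": "search_docs",
                  "call_id": "c1", "args": {"q": msg["input"]["ticket"]}}, out)
        elif msg["type"] == "tool_result" and msg["call_id"] == "c1":
            send({"type": "final_output",
                  "output": {"category": "billing", "reply": "Try a password reset."}}, out)
            return
```

A real agent would loop over multiple tool calls and pick tools dynamically; the point here is the strict stdout/stderr split and the one-JSON-object-per-line framing.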

Eval suite format

evals/<suite>/suite.yaml (example)

suite_name: support-triage
agent_command: ["python", "agent.py"]
mode: replay            # replay | record | live
cases_path: cases
tool_registry:
  - search_docs
  - create_issue

assertions:
  - type: json_schema
    schema_path: schema.json

budgets:
  max_wall_ms: 20000
  max_tool_calls: 10
  max_tool_errors: 0

baseline_path: baselines/support-triage.json

evals/<suite>/cases/t1.yaml (example)

id: t1
description: "triage a login ticket"
input:
  ticket: "User cannot login"
  context:
    plan: "pro"
cassette: cassettes/t1.jsonl

assertions:
  - type: required_fields
    fields: ["category", "reply"]

budgets:
  max_wall_ms: 5000

Record/replay tool calls (cassettes)

Why record/replay?

Agents often depend on external tools (search, DB, HTTP). Live calls in CI are:

  • slow
  • flaky
  • non-deterministic
  • expensive

Instead:

  • record once (live tools) -> write cassette
  • replay in CI (deterministic) -> stable and fast

Cassette format (JSONL example)

Each line is one tool invocation:

{"tool":"search_docs","args":{"q":"reset password"},"ok":true,"result":{"hits":[...]}}
{"tool":"create_issue","args":{"title":"Login issue","priority":"p2"},"ok":true,"result":{"id":"ISSUE-123"}}

Replay matching (MVP):

  • exact match on tool + canonicalized args
  • if not found: the case fails with a clear "cassette mismatch" error
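The matching rule above (exact match on tool + canonicalized args) can be sketched as follows. `canonical_key`, `load_cassette`, and `replay` are hypothetical names for illustration, not RunLedger internals:

```python
import json

def canonical_key(tool, args):
    # Canonicalize args with stable key ordering so {"a":1,"b":2}
    # and {"b":2,"a":1} resolve to the same cassette entry.
    return (tool, json.dumps(args, sort_keys=True, separators=(",", ":")))

def load_cassette(lines):
    # Each JSONL line is one recorded tool invocation.
    index = {}
    for line in lines:
        entry = json.loads(line)
        index[canonical_key(entry["tool"], entry["args"])] = entry
    return index

def replay(index, tool, args):
    key = canonical_key(tool, args)
    if key not in index:
        # Unmatched calls fail fast with a clear cassette-mismatch error.
        raise KeyError(f"cassette mismatch: no recording for {tool} {args}")
    return index[key]
```

This is also why the determinism guide says to avoid volatile fields in tool args: any field that changes between runs changes the canonical key and breaks the match.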

Assertions (deterministic)

MVP assertions:

  • json_schema (validate final output with JSON Schema)

  • required_fields (keys exist / basic typing)

  • regex / contains (for specific fields)

  • tool_contract:

    • must call tool X
    • must not call tool Y
    • X before Y (ordering)
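The three tool_contract rules reduce to simple checks over the ordered list of observed tool calls. A minimal sketch (the function name and signature are illustrative, not RunLedger's API):

```python
def check_tool_contract(calls, must_call=(), must_not_call=(), before=()):
    """calls: ordered list of tool names observed during a run.
    before: (x, y) pairs meaning x must be called before y."""
    errors = []
    for tool in must_call:
        if tool not in calls:
            errors.append(f"missing required tool call: {tool}")
    for tool in must_not_call:
        if tool in calls:
            errors.append(f"forbidden tool called: {tool}")
    for x, y in before:
        if x in calls and y in calls and calls.index(x) > calls.index(y):
            errors.append(f"ordering violated: {x} must come before {y}")
    return errors
```

Because these checks are pure functions of the recorded call sequence, they are fully deterministic merge gates, unlike model-graded scoring.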

Not default-gating (optional later):

  • LLM-judge scoring
  • semantic similarity scoring

Budgets (merge gates)

MVP budgets:

  • max_wall_ms
  • max_tool_calls
  • max_tool_errors

Optional budgets when agents report metrics:

  • max_tokens_out
  • max_cost_usd
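Budget enforcement is a straight comparison of observed metrics against configured ceilings. A sketch, assuming metrics are keyed without the `max_` prefix (the helper name and metric dict shape are assumptions, not RunLedger's API):

```python
def check_budgets(metrics, budgets):
    # metrics: observed values, e.g. {"wall_ms": 1800, "tool_calls": 3, "tool_errors": 0}
    # budgets: configured ceilings, e.g. {"max_wall_ms": 5000, "max_tool_calls": 10}
    violations = []
    for name, limit in budgets.items():
        observed = metrics.get(name.removeprefix("max_"), 0)
        if observed > limit:
            violations.append(f"{name}: {observed} > {limit}")
    return violations
```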

Baselines + regression checks

On each run, you can compare against a baseline and fail CI if:

  • success rate drops below threshold
  • costs spike beyond allowed delta
  • latency p95 increases beyond allowed delta
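The three gate conditions can be sketched as a comparison of summary dicts. The function name, field names, and default thresholds here are illustrative assumptions, not RunLedger's actual baseline format:

```python
def regression_failures(baseline, current,
                        max_success_drop=0.0,
                        max_latency_delta=0.10,
                        max_cost_delta=0.10):
    # baseline / current: run summaries, e.g.
    # {"success_rate": 0.95, "latency_p95_ms": 1200, "cost_usd": 0.04}
    failures = []
    if current["success_rate"] < baseline["success_rate"] - max_success_drop:
        failures.append("success_rate dropped below threshold")
    if current["latency_p95_ms"] > baseline["latency_p95_ms"] * (1 + max_latency_delta):
        failures.append("latency_p95 regressed beyond allowed delta")
    if current["cost_usd"] > baseline["cost_usd"] * (1 + max_cost_delta):
        failures.append("cost spiked beyond allowed delta")
    return failures  # non-empty -> CI exits non-zero
```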

Typical workflow:

  • record cassettes locally
  • establish a baseline from a known-good run
  • run replay mode on every PR and gate merges on regressions

runledger baseline promote --from runledger_out/<suite>/<run_id> --to baselines/<suite>.json
runledger run ./evals/<suite> --mode replay --baseline baselines/<suite>.json

Output artifacts

A run produces:

  • run.jsonl -- append-only event log (steps, tool calls/results, outputs)
  • summary.json -- suite + case metrics, pass/fail, regression summary
  • junit.xml -- CI-native pass/fail (each case maps to a test)
  • report.html -- static shareable report (no server required)

These files are intentionally stable so they can be:

  • diffed in PRs
  • uploaded as CI artifacts
  • ingested by a future hosted add-on

GitHub Actions (example)

name: agent-evals
on:
  pull_request:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run deterministic evals (replay)
        uses: runledger/Runledger@v0.1.0
        with:
          path: ./evals/demo
          mode: replay

      - name: Upload eval artifacts
        uses: actions/upload-artifact@v4
        with:
          name: agent-eval-artifacts
          path: runledger_out/**

Determinism guide (rules of the road)

  • Prefer --mode replay in CI.
  • Ensure agent writes only JSONL protocol messages to stdout; logs go to stderr.
  • Canonicalize tool call args (stable key ordering, avoid volatile fields).
  • Avoid timestamps/randomness in final output; if needed, exclude or normalize before assertions.
  • Keep cassettes safe to commit: redact secrets by default.
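Normalizing volatile fields before assertions (fourth rule above) can be as simple as recursively dropping known-volatile keys. A sketch, where the helper name and the default volatile-key list are assumptions for illustration:

```python
def normalize(output, volatile=("timestamp", "request_id")):
    # Recursively strip volatile keys so two otherwise-identical
    # final outputs compare equal across runs.
    if isinstance(output, dict):
        return {k: normalize(v, volatile)
                for k, v in output.items() if k not in volatile}
    if isinstance(output, list):
        return [normalize(v, volatile) for v in output]
    return output
```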

Docs

  • docs/quickstart.md -- install + first run
  • docs/assertions.md -- assertions, tool contracts, budgets
  • docs/baselines.md -- baselines, diffing, promotion
  • docs/ci.md -- CI setup with GitHub Actions
  • docs/contracts.md -- public contracts (YAML, protocol, artifacts)
  • docs/troubleshooting.md -- common errors and fixes

Roadmap

  • init templates for Python (OpenAI SDK), LangGraph/LangChain, and Node/TS
  • richer budgets (tokens/cost) via optional task_metrics
  • PR comments bot for regression summaries
  • plugin system for custom assertions
  • HTML report trace viewer improvements

Commercial Support

RunLedger is MIT-licensed and free to self-host. For teams that want done-for-you implementation or ongoing maintenance, we offer:

  • Hardening Sprint (fixed-scope implementation to get deterministic CI running fast)
  • Assurance (monthly retainer for cassette and case updates, budget tuning, and incident response)

Contact: runledger.io/community.html#contact


Contributing

  • Issues and PRs welcome.
  • See CONTRIBUTING.md for development setup, style, and testing.
  • Security issues: see SECURITY.md.

License

MIT
