# RunLedger

CI for tool-using agents: deterministic eval suites with record/replay tool calls, hard assertions, budgets, and PR regression gates.
RunLedger is a CI harness, not an "eval metrics framework":
- DeepEval-style tools help you score behavior.
- RunLedger helps you ship safely by making agent tests deterministic and merge-gated.
## Why RunLedger (in one minute)
Agents regress silently:
- prompt changes
- tool signature drift
- model updates
- external APIs / web volatility
RunLedger stops that by:
- recording tool calls once and replaying in CI
- enforcing contracts (schema, tool allowlist/order, budgets)
- gating PRs via baselines + diffs (exit codes + JUnit + HTML report)
## Quickstart (5 minutes)

```bash
pipx install runledger
runledger init
runledger run ./evals/demo --mode record
runledger baseline promote --from <RUN_DIR> --to baselines/demo.json
runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```

Artifacts are written to `runledger_out/<suite>/<run_id>/`:

- `report.html`
- `summary.json`
- `junit.xml`
- `run.jsonl`
## FAQ: Why not just use DeepEval?
DeepEval is excellent for evaluation metrics and benchmarking. RunLedger is focused on CI determinism for tool-using agents:
- Record tool calls once -> replay in CI (stable, fast, no flaky external dependencies)
- Merge gates based on hard contracts (schema/tool/budgets), not "LLM-judge vibes"
- Baselines-as-code (diffs + promotions) that fit cleanly into PR workflows
If you already use DeepEval, keep it - RunLedger can be the harness that makes agent tests CI-grade.
## When to use RunLedger vs DeepEval
Use RunLedger when you need:
- deterministic CI for tool-using agents (web/APIs/DBs)
- hard pass/fail contracts (schema/tool order/budgets)
- baselines + PR regression gates
Use DeepEval when you need:
- rich evaluation metrics / model-graded scoring / benchmarking workflows
Use both if you want DeepEval scoring inside a deterministic CI harness.
## How it works

1. You define a suite (`suite.yaml`) and cases (`cases/*.yaml`).
2. The runner launches your agent under test as a subprocess (any language).
3. The agent requests tools via a stdio JSON protocol.
4. The runner either:
   - records tool results to a cassette (local/dev), or
   - replays tool results from a cassette (CI/deterministic).
5. The agent emits a final JSON output.
6. The runner applies assertions + budgets, compares to a baseline, writes artifacts, and exits non-zero on regressions.
## Agent-under-test protocol (language-agnostic)

Transport: newline-delimited JSON messages over stdin/stdout.

Hard rule: the agent must write protocol JSON only to stdout. Any human-readable logs must go to stderr (stdout must stay parseable).

### Runner -> Agent

`task_start`

```json
{ "type": "task_start", "task_id": "t1", "input": { "...": "..." } }
```

`tool_result`

```json
{ "type": "tool_result", "call_id": "c1", "ok": true, "result": { "...": "..." } }
```

### Agent -> Runner

`tool_call`

```json
{ "type": "tool_call", "name": "search_docs", "call_id": "c1", "args": { "q": "..." } }
```

`final_output` (must be JSON)

```json
{ "type": "final_output", "output": { "category": "billing", "reply": "..." } }
```

Optional:

- `log` (structured debug)
- `task_error` (explicit failure)
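To make the message flow concrete, here is a minimal sketch of an agent under test in Python. The loop shape follows the protocol above; the decision logic (calling `search_docs` once, then answering) is purely illustrative, not part of RunLedger:

```python
import json
import sys

def send(msg):
    # Protocol rule: JSON messages go to stdout only, one object per line.
    sys.stdout.write(json.dumps(msg) + "\n")
    sys.stdout.flush()

def log(text):
    # Human-readable logs go to stderr so stdout stays parseable.
    print(text, file=sys.stderr)

def handle(msg, state):
    """Return the protocol messages to emit in response to one runner message."""
    if msg["type"] == "task_start":
        state["task_id"] = msg["task_id"]
        # Illustrative policy: look up the ticket text before answering.
        return [{"type": "tool_call", "name": "search_docs", "call_id": "c1",
                 "args": {"q": msg["input"].get("ticket", "")}}]
    if msg["type"] == "tool_result":
        hits = msg["result"].get("hits", []) if msg.get("ok") else []
        return [{"type": "final_output",
                 "output": {"category": "billing" if hits else "unknown",
                            "reply": "..."}}]
    return []

def main():
    # Invoke main() when running as the agent subprocess.
    state = {}
    for line in sys.stdin:
        if line.strip():
            for out in handle(json.loads(line), state):
                send(out)
```

Because the agent is just a stdin/stdout loop, the same pattern ports directly to Node, Go, or any other language.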
## Eval suite format

`evals/<suite>/suite.yaml` (example):

```yaml
suite_name: support-triage
agent_command: ["python", "agent.py"]
mode: replay  # replay | record | live
cases_path: cases
tool_registry:
  - search_docs
  - create_issue
assertions:
  - type: json_schema
    schema_path: schema.json
budgets:
  max_wall_ms: 20000
  max_tool_calls: 10
  max_tool_errors: 0
baseline_path: baselines/support-triage.json
```
`evals/<suite>/cases/t1.yaml` (example):

```yaml
id: t1
description: "triage a login ticket"
input:
  ticket: "User cannot login"
  context:
    plan: "pro"
cassette: cassettes/t1.jsonl
assertions:
  - type: required_fields
    fields: ["category", "reply"]
budgets:
  max_wall_ms: 5000
```
## Record/replay tool calls (cassettes)

### Why record/replay?

Agents often depend on external tools (search, DB, HTTP). Live calls in CI are:

- slow
- flaky
- non-deterministic
- expensive

Instead:

- record once (live tools) -> write cassette
- replay in CI (deterministic) -> stable and fast

### Cassette format (JSONL example)

Each line is one tool invocation:

```jsonl
{"tool":"search_docs","args":{"q":"reset password"},"ok":true,"result":{"hits":[...]}}
{"tool":"create_issue","args":{"title":"Login issue","priority":"p2"},"ok":true,"result":{"id":"ISSUE-123"}}
```
Replay matching (MVP):

- exact match on `tool` + canonicalized `args`
- if no match is found, the case fails with a clear "cassette mismatch" error
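Exact-match replay can be sketched as a dictionary keyed on the tool name plus canonicalized args. The helper names below are hypothetical, not the real RunLedger API, but the canonicalization idea (sorted keys, compact separators) is the standard one:

```python
import json

def canonical_key(tool, args):
    # Canonicalize args with sorted keys and compact separators so that
    # semantically identical calls map to the same key regardless of key order.
    return (tool, json.dumps(args, sort_keys=True, separators=(",", ":")))

def load_cassette(lines):
    # Index each recorded invocation (one JSON object per line) by its key.
    index = {}
    for line in lines:
        entry = json.loads(line)
        index[canonical_key(entry["tool"], entry["args"])] = entry
    return index

def replay(index, tool, args):
    # On a miss, fail loudly instead of silently falling back to a live call.
    key = canonical_key(tool, args)
    if key not in index:
        raise KeyError(f"cassette mismatch: no recording for {tool} {args}")
    return index[key]
```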
## Assertions (deterministic)

MVP assertions:

- `json_schema` (validate final output with JSON Schema)
- `required_fields` (keys exist / basic typing)
- `regex` / `contains` (for specific fields)
- `tool_contract`:
  - must call tool X
  - must not call tool Y
  - X before Y (ordering)
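A tool contract reduces to a deterministic check over the ordered list of tool calls a run made. A minimal sketch (the `check_tool_contract` helper and its parameters are illustrative, not RunLedger's actual config surface):

```python
def check_tool_contract(calls, must_call=(), must_not_call=(), order=()):
    """Check an ordered list of tool names against a contract.

    `order` is a sequence of (before, after) pairs. Returns a list of
    human-readable failures; an empty list means the contract passed.
    """
    failures = []
    for tool in must_call:
        if tool not in calls:
            failures.append(f"expected call to {tool}")
    for tool in must_not_call:
        if tool in calls:
            failures.append(f"forbidden call to {tool}")
    for before, after in order:
        # Ordering only applies when both tools were actually called.
        if before in calls and after in calls:
            if calls.index(before) > calls.index(after):
                failures.append(f"{before} must precede {after}")
    return failures
```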
Not default-gating (optional later):
- LLM-judge scoring
- semantic similarity scoring
## Budgets (merge gates)

MVP budgets:

- `max_wall_ms`
- `max_tool_calls`
- `max_tool_errors`

Optional budgets when agents report metrics:

- `max_tokens_out`
- `max_cost_usd`
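Conceptually, budget enforcement is a comparison of observed run metrics against configured ceilings, where any over-budget metric is a hard failure. A minimal sketch, assuming a naming convention where the budget `max_wall_ms` corresponds to a measured metric `wall_ms` (that convention is an assumption here, not documented RunLedger behavior):

```python
def check_budgets(metrics, budgets):
    # Assumed convention: budget key "max_<metric>" maps to metric "<metric>".
    violations = []
    for key, limit in budgets.items():
        observed = metrics.get(key[len("max_"):], 0)
        if observed > limit:
            violations.append(f"{key}: {observed} > {limit}")
    return violations
```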
## Baselines + regression checks
On each run, you can compare against a baseline and fail CI if:
- success rate drops below threshold
- costs spike beyond allowed delta
- latency p95 increases beyond allowed delta
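A minimal sketch of such a gate, assuming summary metrics named `success_rate`, `cost_usd`, and `latency_p95_ms` with illustrative default thresholds (both the names and the defaults are assumptions, not RunLedger's real schema):

```python
def regression_gate(current, baseline,
                    max_success_drop=0.0,    # no drop tolerated by default
                    max_cost_delta=0.10,     # +10% cost allowed
                    max_p95_delta=0.20):     # +20% p95 latency allowed
    """Compare a run's summary metrics to a baseline; return gate failures."""
    failures = []
    if current["success_rate"] < baseline["success_rate"] - max_success_drop:
        failures.append("success rate dropped below threshold")
    if current["cost_usd"] > baseline["cost_usd"] * (1 + max_cost_delta):
        failures.append("cost spiked beyond allowed delta")
    if current["latency_p95_ms"] > baseline["latency_p95_ms"] * (1 + max_p95_delta):
        failures.append("latency p95 increased beyond allowed delta")
    return failures
```

A CI wrapper would exit non-zero when the returned list is non-empty.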
Typical workflow:
- record cassettes locally
- establish a baseline from a known-good run
- run replay mode on every PR and gate merges on regressions
```bash
runledger baseline promote --from runledger_out/<suite>/<run_id> --to baselines/<suite>.json
runledger run ./evals/<suite> --mode replay --baseline baselines/<suite>.json
```
## Output artifacts

A run produces:

- `run.jsonl` -- append-only event log (steps, tool calls/results, outputs)
- `summary.json` -- suite + case metrics, pass/fail, regression summary
- `junit.xml` -- CI-native pass/fail (each case maps to a test)
- `report.html` -- static shareable report (no server required)
These files are intentionally stable so they can be:
- diffed in PRs
- uploaded as CI artifacts
- ingested by a future hosted add-on
## GitHub Actions (example)

```yaml
name: agent-evals
on:
  pull_request:
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run deterministic evals (replay)
        uses: runledger/Runledger@v0.1.0
        with:
          path: ./evals/demo
          mode: replay
      - name: Upload eval artifacts
        uses: actions/upload-artifact@v4
        with:
          name: agent-eval-artifacts
          path: runledger_out/**
```
## Determinism guide (rules of the road)

- Prefer `--mode replay` in CI.
- Ensure the agent writes only JSONL protocol messages to stdout; logs go to stderr.
- Canonicalize tool call args (stable key ordering, avoid volatile fields).
- Avoid timestamps/randomness in final output; if needed, exclude or normalize before assertions.
- Keep cassettes safe to commit: redact secrets by default.
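For the timestamps/randomness rule, one approach is to scrub volatile fields from the final output before assertions run. The field names below are illustrative; pick whatever is volatile in your outputs:

```python
import copy

VOLATILE_FIELDS = {"timestamp", "request_id"}  # illustrative field names

def normalize(output):
    # Recursively drop volatile fields so replayed runs compare byte-stable.
    out = copy.deepcopy(output)

    def scrub(node):
        if isinstance(node, dict):
            for key in list(node):
                if key in VOLATILE_FIELDS:
                    del node[key]
                else:
                    scrub(node[key])
        elif isinstance(node, list):
            for item in node:
                scrub(item)

    scrub(out)
    return out
```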
## Docs

- `docs/quickstart.md` -- install + first run
- `docs/assertions.md` -- assertions, tool contracts, budgets
- `docs/baselines.md` -- baselines, diffing, promotion
- `docs/ci.md` -- CI setup with GitHub Actions
- `docs/contracts.md` -- public contracts (YAML, protocol, artifacts)
- `docs/troubleshooting.md` -- common errors and fixes
## Roadmap

- `init` templates for Python (OpenAI SDK), LangGraph/LangChain, and Node/TS
- richer budgets (tokens/cost) via optional `task_metrics`
- PR comments bot for regression summaries
- plugin system for custom assertions
- HTML report trace viewer improvements
## Commercial Support
RunLedger is MIT-licensed and free to self-host. For teams that want done-for-you implementation or ongoing maintenance, we offer:
- Hardening Sprint (fixed-scope implementation to get deterministic CI running fast)
- Assurance (monthly retainer for cassette and case updates, budget tuning, and incident response)
Contact: runleder.io/community.html#contact
## Contributing

- Issues and PRs welcome.
- See `CONTRIBUTING.md` for development setup, style, and testing.
- Security issues: see `SECURITY.md`.
## License
MIT