
RunLedger


Website: https://runledger.io

CI for tool-using agents. Deterministic eval suites with record/replay tool calls, hard assertions, budgets, and PR regression gates.

RunLedger is a CI harness, not an "eval metrics framework":

  • DeepEval-style tools help you score behavior.
  • RunLedger helps you ship safely by making agent tests deterministic and merge-gated.

Why RunLedger (in one minute)

The problem

Agents regress silently. Standard unit tests can't catch:

  • prompt drift that breaks tool schemas
  • hallucinated tool arguments
  • latency spikes or budget overruns
  • flaky external APIs causing false negatives in CI

The solution

RunLedger stops regressions by shifting from "vibes-based" evaluation to deterministic contracts:

  • Record & replay: record tool outputs once; replay them in CI for instant, stable tests.
  • Strict contracts: enforce tool schemas, calling order, and allowlists.
  • Budget gating: fail PRs automatically if execution time, steps, or costs exceed defined limits.

RunLedger vs evaluation frameworks

Feature            | DeepEval / Ragas / TruLens                   | RunLedger
Primary goal       | Scoring quality ("Is this answer helpful?")  | Shipping safely ("Did we break the build?")
Tool execution     | Often live or mocked manually                | Record/replay cassettes (automatic mocking)
Pass/fail criteria | LLM-graded scores (0.0 - 1.0)                | Hard contracts (schema, budgets, order)
Best for           | Improving prompt quality                     | Preventing regressions in production

Note: you can use both. Use DeepEval to calculate scores, and wrap your agent in RunLedger to ensure it runs deterministically in CI.


Quickstart (5 minutes)

pipx install runledger
runledger init
runledger run ./evals/demo --mode record
runledger baseline promote --from <RUN_DIR> --to baselines/demo.json
runledger run ./evals/demo --mode replay --baseline baselines/demo.json

What it looks like in a PR

PR blocked (checks fail): [screenshot: PR blocked]

Why it failed (RunLedger output): [screenshot: failure reason]

Artifacts are written to runledger_out/<suite>/<run_id>/:

  • report.html, summary.json, junit.xml, run.jsonl

FAQ: Why not just use DeepEval?

DeepEval is excellent for evaluation metrics and benchmarking. RunLedger is focused on CI determinism for tool-using agents:

  • Record tool calls once -> replay in CI (stable, fast, no flaky external dependencies)
  • Merge gates based on hard contracts (schema/tool/budgets), not "LLM-judge vibes"
  • Baselines-as-code (diffs + promotions) that fit cleanly into PR workflows

If you already use DeepEval, keep it - RunLedger can be the harness that makes agent tests CI-grade.


When to use RunLedger vs DeepEval

Use RunLedger when you need:

  • deterministic CI for tool-using agents (web/APIs/DBs)
  • hard pass/fail contracts (schema/tool order/budgets)
  • baselines + PR regression gates

Use DeepEval when you need:

  • rich evaluation metrics / model-graded scoring / benchmarking workflows

Use both if you want DeepEval scoring inside a deterministic CI harness.


How it works

  1. You define a suite (suite.yaml) and cases (cases/*.yaml).

  2. The runner launches your agent under test as a subprocess (any language).

  3. The agent requests tools via a stdio JSON protocol.

  4. The runner either:

    • records tool results to a cassette (local/dev), or
    • replays tool results from a cassette (CI/deterministic).
  5. The agent emits a final JSON output.

  6. The runner applies assertions + budgets, compares to a baseline, writes artifacts, and exits non-zero on regressions.


Agent-under-test protocol (language-agnostic)

Transport: newline-delimited JSON messages over stdin/stdout.

Hard rule: the agent must write protocol JSON only to stdout. Any human-readable logs must go to stderr (stdout must stay parseable).

Runner -> Agent

task_start

{ "type": "task_start", "task_id": "t1", "input": { "...": "..." } }

tool_result

{ "type": "tool_result", "call_id": "c1", "ok": true, "result": { "...": "..." } }

Agent -> Runner

tool_call

{ "type": "tool_call", "name": "search_docs", "call_id": "c1", "args": { "q": "..." } }

final_output (must be JSON)

{ "type": "final_output", "output": { "category": "billing", "reply": "..." } }

Optional:

  • log (structured debug)
  • task_error (explicit failure)
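
To make the protocol concrete, here is a minimal Python agent under test. It is a sketch, not a reference implementation: it handles one task, issues a single search_docs call, and the triage logic is invented for the example; only the message shapes come from the protocol above.

# minimal_agent.py - sketch of an agent under test speaking the stdio JSON protocol.
# Protocol messages go to stdout only; human-readable logs go to stderr.
import json
import sys


def send(msg: dict) -> None:
    # One JSON message per line on stdout, flushed immediately.
    sys.stdout.write(json.dumps(msg) + "\n")
    sys.stdout.flush()


def read_message() -> dict:
    line = sys.stdin.readline()
    if not line:
        raise EOFError("runner closed stdin")
    return json.loads(line)


def main() -> None:
    task = read_message()                                   # expect {"type": "task_start", ...}
    assert task["type"] == "task_start"
    print("received task", task["task_id"], file=sys.stderr)  # logs -> stderr

    # Ask the runner for a tool result (recorded live or replayed from a cassette).
    send({"type": "tool_call", "name": "search_docs",
          "call_id": "c1", "args": {"q": task["input"]["ticket"]}})
    result = read_message()                                 # expect {"type": "tool_result", "call_id": "c1", ...}

    # Hypothetical triage logic, just for the example.
    category = "billing" if result.get("ok") else "unknown"
    send({"type": "final_output",
          "output": {"category": category, "reply": "We are looking into it."}})


if __name__ == "__main__":
    main()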

Eval suite format

evals/<suite>/suite.yaml (example)

suite_name: support-triage
agent_command: ["python", "agent.py"]
mode: replay            # replay | record | live
cases_path: cases
tool_registry:
  - search_docs
  - create_issue

assertions:
  - type: json_schema
    schema_path: schema.json

budgets:
  max_wall_ms: 20000
  max_tool_calls: 10
  max_tool_errors: 0

baseline_path: baselines/support-triage.json

evals/<suite>/cases/t1.yaml (example)

id: t1
description: "triage a login ticket"
input:
  ticket: "User cannot login"
  context:
    plan: "pro"
cassette: cassettes/t1.jsonl

assertions:
  - type: required_fields
    fields: ["category", "reply"]

budgets:
  max_wall_ms: 5000
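
For reference, a minimal schema.json that the json_schema assertion above could point at might look like this (the fields mirror the required_fields in the case; the exact schema is up to you):

{
  "type": "object",
  "required": ["category", "reply"],
  "properties": {
    "category": { "type": "string" },
    "reply": { "type": "string" }
  },
  "additionalProperties": true
}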

Record/replay tool calls (cassettes)

Why record/replay?

Agents often depend on external tools (search, DB, HTTP). Live calls in CI are:

  • slow
  • flaky
  • non-deterministic
  • expensive

Instead:

  • record once (live tools) -> write cassette
  • replay in CI (deterministic) -> stable and fast

Cassette format (JSONL example)

Each line is one tool invocation:

{"tool":"search_docs","args":{"q":"reset password"},"ok":true,"result":{"hits":[...]}}
{"tool":"create_issue","args":{"title":"Login issue","priority":"p2"},"ok":true,"result":{"id":"ISSUE-123"}}

Replay matching (MVP):

  • exact match on tool + canonicalized args
  • if not found: the case fails with a clear "cassette mismatch" error
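
To illustrate the matching rule, here is a sketch of exact-match replay over a JSONL cassette. Canonicalization here is just stable key ordering via json.dumps(sort_keys=True); the helper names are illustrative, not RunLedger internals.

# replay_lookup.py - illustrative cassette lookup for replay mode (not RunLedger internals).
import json
from pathlib import Path


def canonical(args: dict) -> str:
    # Stable key ordering so identical args with different insertion order
    # map to the same cassette entry.
    return json.dumps(args, sort_keys=True, separators=(",", ":"))


def load_cassette(path: str) -> dict:
    entries = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        entries[(rec["tool"], canonical(rec["args"]))] = rec
    return entries


def replay(cassette: dict, tool: str, args: dict) -> dict:
    key = (tool, canonical(args))
    if key not in cassette:
        raise KeyError(f"cassette mismatch: no recording for {tool} with args {args}")
    return cassette[key]


# Usage sketch:
# cassette = load_cassette("cassettes/t1.jsonl")
# result = replay(cassette, "search_docs", {"q": "reset password"})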

Assertions (deterministic)

MVP assertions:

  • json_schema (validate final output with JSON Schema)

  • required_fields (keys exist / basic typing)

  • regex / contains (for specific fields)

  • tool_contract:

    • must call tool X
    • must not call tool Y
    • X before Y (ordering)

Not default-gating (optional later):

  • LLM-judge scoring
  • semantic similarity scoring
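
For intuition, a tool_contract check reduces to simple list operations over the ordered tool calls observed in a run. The function below is an illustrative sketch, not RunLedger's implementation:

# Sketch: evaluating a tool_contract assertion against the ordered list of
# tool calls observed during a run. Names and structure are illustrative only.
def check_tool_contract(calls: list[str], must_call=(), must_not_call=(), order=()) -> list[str]:
    failures = []
    for tool in must_call:
        if tool not in calls:
            failures.append(f"expected a call to {tool}")
    for tool in must_not_call:
        if tool in calls:
            failures.append(f"forbidden tool {tool} was called")
    for before, after in order:
        if before in calls and after in calls and calls.index(before) > calls.index(after):
            failures.append(f"{before} must be called before {after}")
    return failures


# check_tool_contract(["search_docs", "create_issue"],
#                     must_call=["search_docs"],
#                     order=[("search_docs", "create_issue")])  -> []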

Budgets (merge gates)

MVP budgets:

  • max_wall_ms
  • max_tool_calls
  • max_tool_errors

Optional budgets when agents report metrics:

  • max_tokens_out
  • max_cost_usd
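
Budget gating is a plain numeric comparison against measured run metrics. A minimal sketch (the metric key names are assumptions for illustration):

# Sketch: turning budget limits into pass/fail results. Metric names are
# assumptions for illustration, not RunLedger's exact keys.
def check_budgets(metrics: dict, budgets: dict) -> list[str]:
    failures = []
    for key, limit in budgets.items():        # e.g. {"max_wall_ms": 20000}
        metric = key.removeprefix("max_")     # -> "wall_ms"
        value = metrics.get(metric)
        if value is not None and value > limit:
            failures.append(f"{metric}={value} exceeds budget {limit}")
    return failures


# check_budgets({"wall_ms": 25000, "tool_calls": 3},
#               {"max_wall_ms": 20000, "max_tool_calls": 10})
# -> ["wall_ms=25000 exceeds budget 20000"]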

Baselines + regression checks

On each run, you can compare against a baseline and fail CI if:

  • success rate drops below threshold
  • costs spike beyond allowed delta
  • latency p95 increases beyond allowed delta

Typical workflow:

  • record cassettes locally
  • establish a baseline from a known-good run
  • run replay mode on every PR and gate merges on regressions

runledger baseline promote --from runledger_out/<suite>/<run_id> --to baselines/<suite>.json
runledger run ./evals/<suite> --mode replay --baseline baselines/<suite>.json
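
Conceptually, the regression check compares the current run's summary against the promoted baseline. The sketch below assumes illustrative field names and thresholds; RunLedger's actual summary and baseline schema may differ:

# Sketch: comparing a current run summary against a promoted baseline.
# Field names and thresholds are illustrative, not RunLedger's actual schema.
import json
from pathlib import Path


def regressions(baseline_path: str, summary_path: str,
                max_success_drop: float = 0.0,
                max_latency_increase_ms: float = 0.0) -> list[str]:
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(summary_path).read_text())
    problems = []
    if current["success_rate"] < baseline["success_rate"] - max_success_drop:
        problems.append("success rate dropped below baseline")
    if current["latency_p95_ms"] > baseline["latency_p95_ms"] + max_latency_increase_ms:
        problems.append("latency p95 increased beyond allowed delta")
    return problems


# In CI you would exit non-zero if regressions(...) returns a non-empty list.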

Output artifacts

A run produces:

  • run.jsonl -- append-only event log (steps, tool calls/results, outputs)
  • summary.json -- suite + case metrics, pass/fail, regression summary
  • junit.xml -- CI-native pass/fail (each case maps to a test)
  • report.html -- static shareable report (no server required)

These files are intentionally stable so they can be:

  • diffed in PRs
  • uploaded as CI artifacts
  • ingested by a future hosted add-on
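
Because run.jsonl is newline-delimited JSON, quick ad-hoc analysis is easy. The snippet below assumes each line carries a "type" field mirroring the protocol messages, which is an assumption for illustration:

# Sketch: summarizing run.jsonl, assuming each line is a JSON event with a
# "type" field (an assumption for illustration, not a spec).
import json
from collections import Counter
from pathlib import Path


def event_counts(run_jsonl: str) -> Counter:
    counts = Counter()
    for line in Path(run_jsonl).read_text().splitlines():
        if line.strip():
            counts[json.loads(line).get("type", "unknown")] += 1
    return counts


# event_counts("runledger_out/demo/<run_id>/run.jsonl")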

GitHub Actions (example)

name: agent-evals
on:
  pull_request:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run deterministic evals (replay)
        uses: runledger/Runledger@v0.1
        with:
          path: ./evals/demo
          mode: replay

      - name: Upload eval artifacts
        uses: actions/upload-artifact@v4
        with:
          name: agent-eval-artifacts
          path: runledger_out/**
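
If you prefer not to use the action, the same gate can be expressed by calling the CLI directly with the quickstart commands (demo suite and baseline paths assumed):

      # Alternative: run the CLI directly instead of the action.
      - name: Install RunLedger
        run: pipx install runledger

      - name: Run deterministic evals (replay)
        run: runledger run ./evals/demo --mode replay --baseline baselines/demo.json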

Determinism guide (rules of the road)

  • Prefer --mode replay in CI.
  • Ensure agent writes only JSONL protocol messages to stdout; logs go to stderr.
  • Canonicalize tool call args (stable key ordering, avoid volatile fields).
  • Avoid timestamps/randomness in final output; if needed, exclude or normalize before assertions.
  • Keep cassettes safe to commit: redact secrets by default.
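
For the last points, a small normalization pass before asserting keeps volatile fields out of comparisons; the field names here are illustrative:

# Sketch: normalizing volatile fields out of an agent's final output before
# asserting on it. The field names are illustrative.
VOLATILE_FIELDS = {"timestamp", "request_id", "trace_id"}


def normalize(output: dict) -> dict:
    return {k: v for k, v in output.items() if k not in VOLATILE_FIELDS}


# assert normalize(final_output) == normalize(expected_output)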

Docs

  • docs/quickstart.md -- install + first run
  • docs/assertions.md -- assertions, tool contracts, budgets
  • docs/baselines.md -- baselines, diffing, promotion
  • docs/ci.md -- CI setup with GitHub Actions
  • docs/contracts.md -- public contracts (YAML, protocol, artifacts)
  • docs/troubleshooting.md -- common errors and fixes

Roadmap

  • init templates for Python (OpenAI SDK), LangGraph/LangChain, and Node/TS
  • richer budgets (tokens/cost) via optional task_metrics
  • PR comments bot for regression summaries
  • plugin system for custom assertions
  • HTML report trace viewer improvements

Commercial Support

RunLedger is MIT-licensed and free to self-host. For teams that want done-for-you implementation or ongoing maintenance, we offer:

  • Hardening Sprint (fixed-scope implementation to get deterministic CI running fast)
  • Assurance (monthly retainer for cassette and case updates, budget tuning, and incident response)

Contact: runledger.io/community.html#contact


Contributing

  • Issues and PRs welcome.
  • See CONTRIBUTING.md for development setup, style, and testing.
  • Security issues: see SECURITY.md.

License

MIT
