Trace-to-Eval Compiler

Project description

traceval: Trace-to-Eval Compiler

"Your traces already know how your agent fails. traceval turns them into the test suite you never wrote."

Teams running LLM agents in production have observability traces, but only a fraction maintain evals. The raw material for good tests, thousands of real traces full of edge cases and errors, sits unused because turning it into a regression suite is manual and tedious.

traceval ingests agent traces from standard sources, normalizes them into a canonical Pydantic model, labels outcomes, clusters task shapes, and compiles the result into a human-editable eval suite: YAML cases, a pytest harness, and judge rubric scaffolds.

Figure: failure-cluster coverage report generated by traceval analyze.

Quickstart

pip install traceval
traceval demo
open traceval-demo/analysis/report.html   # xdg-open on Linux

traceval demo runs the entire loop against a built-in demo agent: it generates 200 synthetic traces, ingests them, clusters the failures, compiles an eval suite, and then proves the headline claim by running that suite twice, once against the healthy agent (which passes) and once against a buggy one (which fails):

=== Demo complete: healthy agent PASSED, buggy agent FAILED ===
Failure-cluster report: traceval-demo/analysis/report.html
Run report: traceval-demo/evals/runs/run_20260702T072029851802Z.json
Run report: traceval-demo/evals/runs/run_20260702T072030171406Z.json

Re-run any stage manually:
  traceval ingest traceval-demo/synthetic_traces.jsonl -o traceval-demo/traces.db
  traceval analyze traceval-demo/traces.db -o traceval-demo/analysis
  traceval generate traceval-demo/traces.db -o traceval-demo/evals --include-failures
  traceval run traceval-demo/evals --target traceval.demo.agent:invoke_agent --judge fake
  traceval calibrate traceval-demo/evals/runs/run_20260702T072030171406Z.json

How it works

graph LR
    A[OTel / Langfuse / LangSmith traces] --> B[Canonical trace DB]
    B --> C[Label and cluster]
    C --> D[YAML cases + pytest + rubrics]
    D --> E[Run, diff, calibrate]

Features

- Ingests OpenTelemetry GenAI, Langfuse, and LangSmith exports, plus generic JSONL. Malformed lines are logged as warnings instead of crashing the run (tested against corrupt fixtures in tests/fixtures/).
- Labels every trace with a rule-based outcome taxonomy (success, tool_error, validation_error, loop, timeout, bad_output) that you can extend with your own Python rules via --rules.
- Clusters task shapes with Jaccard shingle similarity, fully offline: no embeddings, no API calls. Numeric tokens are normalized, so "order 57978" and "order 12345" land in the same cluster.
- Deterministic generation: regenerating a suite from the same database is byte-identical, so evals diff cleanly in git.
- Regression cases are inverted: a failure trace asserts the failure does not recur (forbidden error signatures, tool-loop bounds, non-empty output), never that the agent reproduces it.
- Redacts emails, credit cards, phone numbers, and API tokens before case inputs are written (add your own scrubber with --redact-hook).
- traceval run exits nonzero on any failing case and diffs against a previous report with --compare, so CI can gate deploys on it.
- traceval calibrate measures judge-vs-human agreement per cluster and flags rubrics the automated judge scores unreliably.
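
To make the clustering feature concrete, here is a minimal sketch of Jaccard similarity over token shingles with numeric normalization. It is an illustration, not traceval's implementation: the shingle size of 2, the `<num>` placeholder, the 0.5 threshold, and the greedy single-pass grouping are all assumptions chosen for the example.

```python
import re


def shingles(text: str, k: int = 2) -> set[tuple[str, ...]]:
    """Return the set of k-token shingles, replacing numeric tokens with <num>."""
    tokens = [
        "<num>" if tok.isdigit() else tok
        for tok in re.findall(r"[a-z0-9_]+", text.lower())
    ]
    if len(tokens) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(texts: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy grouping (an assumption): join the first cluster whose
    representative shingle set is at least `threshold` similar."""
    clusters: list[tuple[set, list[str]]] = []
    for text in texts:
        sh = shingles(text)
        for representative, members in clusters:
            if jaccard(sh, representative) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((sh, [text]))
    return [members for _, members in clusters]
```

Because digits collapse to a single placeholder before shingling, "order 57978" and "order 12345" produce identical shingle sets and therefore a similarity of 1.0, with no embeddings or network calls involved.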

Walkthrough on your own traces

The command outputs below are real, captured from a run over the demo trace set (regenerate them with scripts/readme-outputs.sh).

1. Ingest

traceval ingest traces.jsonl -o traces.db   # --format auto|otel|langfuse|langsmith|generic

Ingested 200 traces (209 spans).

Malformed spans do not abort the ingest; warnings are written to <traces.db>.log.
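
The skip-and-warn behavior can be pictured with a small sketch. This is not traceval's ingest code; the function name `read_jsonl_tolerant` and the logger name are hypothetical, and the only technique shown is "log a warning and continue" for lines that are not valid JSON objects.

```python
import json
import logging
from typing import Iterable, Iterator

logger = logging.getLogger("ingest_sketch")  # hypothetical logger name


def read_jsonl_tolerant(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per valid JSONL line; warn about and skip the rest."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are not an error
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            logger.warning("line %d: skipping malformed JSON (%s)", lineno, exc.msg)
            continue
        if not isinstance(record, dict):
            logger.warning(
                "line %d: expected a JSON object, got %s",
                lineno,
                type(record).__name__,
            )
            continue
        yield record
```

Passing an open file handle works as well as a list of strings, since both are iterables of lines.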

2. Analyze

traceval analyze traces.db -o analysis

Outcomes: success 60% · tool_error 15% · loop 10% · timeout 8% · validation_error 8%
Clusters: 8 task clusters found.
Top failure cluster: "refund stripe -> stripe_lookup -> (tool_error)" (30 traces)
Report written to analysis/report.html

analysis/report.html is the single-file page shown in the screenshot above. Pass --evals evals/ to overlay eval coverage per cluster, and --rules my_rules.py to add your own labeling rules. To view it over HTTP instead of file://, traceval serve analysis starts a stdlib localhost server and prints the report URL.
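
For intuition about what rule-based labeling means, the sketch below applies an ordered list of rules to a trace and falls back to success when none match. Everything here is hypothetical: the dict-shaped trace, the field names (`timed_out`, `tool_calls`, `output`), the rule signatures, and the repeat limit are assumptions for illustration and do not describe the interface traceval expects in a --rules file. Only the outcome names come from the taxonomy listed above.

```python
from typing import Callable, Optional

Trace = dict  # hypothetical trace shape, not traceval's canonical model
Rule = Callable[[Trace], Optional[str]]


def timeout_rule(trace: Trace) -> Optional[str]:
    return "timeout" if trace.get("timed_out") else None


def tool_error_rule(trace: Trace) -> Optional[str]:
    calls = trace.get("tool_calls", [])
    return "tool_error" if any(call.get("error") for call in calls) else None


def loop_rule(trace: Trace, max_repeats: int = 3) -> Optional[str]:
    """Flag more than `max_repeats` consecutive calls to the same tool."""
    names = [call.get("name") for call in trace.get("tool_calls", [])]
    run = 1
    for prev, cur in zip(names, names[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return "loop"
    return None


def bad_output_rule(trace: Trace) -> Optional[str]:
    return "bad_output" if not str(trace.get("output") or "").strip() else None


DEFAULT_RULES: list[Rule] = [timeout_rule, tool_error_rule, loop_rule, bad_output_rule]


def label(trace: Trace, rules: list[Rule] = DEFAULT_RULES) -> str:
    """Return the outcome of the first matching rule, else 'success'."""
    for rule in rules:
        outcome = rule(trace)
        if outcome is not None:
            return outcome
    return "success"
```

In this sketch a custom rule is just another function placed earlier in the list, so it takes precedence over the defaults.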

3. Generate

traceval generate traces.db -o evals --include-failures

Wrote 8 eval cases across 8 clusters → evals/cases/*.yaml
Wrote judge rubrics → evals/rubrics/*.md
Wrote pytest harness → evals/test_generated.py, evals/conftest.py

Every case is a reviewable YAML file. Golden cases assert the recorded successful behavior. Regression cases, generated from failure traces, assert the failure does not recur: forbidden error tokens (word-boundary matched, filtered against tokens that success traces also use), tool-loop bounds, and non-empty output. A regression case passes for any agent that avoids that failure mode, so it guards only against that specific recurrence; general bug detection is the job of the golden cases.
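
The three inverted assertions can be sketched as a single check function. This is an illustrative stand-in rather than the generated harness: the function name, its parameters, the per-tool call limit, and the failure message wording are assumptions, and matching is case-sensitive here only to keep the example short.

```python
import re
from collections import Counter


def check_regression(
    output: str,
    tool_calls: list[str],
    forbidden_tokens: list[str],
    max_tool_calls: int = 3,
) -> list[str]:
    """Return a list of failure messages; an empty list means the case passes."""
    failures = []

    # 1. Non-empty output.
    if not output.strip():
        failures.append("empty output")

    # 2. Forbidden error tokens, matched on word boundaries so that a token
    #    embedded inside a longer word does not trigger a false failure.
    for token in forbidden_tokens:
        if re.search(rf"\b{re.escape(token)}\b", output):
            failures.append(f"forbidden token: {token}")

    # 3. Tool-loop bound: no single tool may be called more than the limit.
    for name, count in sorted(Counter(tool_calls).items()):
        if count > max_tool_calls:
            failures.append(
                f"tool loop: {name} called {count} times (limit {max_tool_calls})"
            )

    return failures
```

Note that the check never compares against the recorded failing output itself; it only asserts that the failure signature is absent, which is why any agent that avoids the failure mode passes.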

4. Run

traceval run evals --target myapp.agent:invoke_agent --judge fake

traceval Run Summary
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Case ID              ┃ Cluster    ┃ Outcome ┃ Latency (ms) ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ c_0c422a7a__case_001 │ c_0c422a7a │  PASS   │          0.0 │
│ c_1e5d0942__case_002 │ c_1e5d0942 │  PASS   │          0.0 │
│ c_2c881177__case_003 │ c_2c881177 │  PASS   │          0.0 │
│ c_361535b0__case_004 │ c_361535b0 │  PASS   │          0.0 │
│ c_9a8a4644__case_005 │ c_9a8a4644 │  PASS   │          0.0 │
│ c_d30af83a__case_006 │ c_d30af83a │  PASS   │          0.0 │
│ c_d3f3b631__case_007 │ c_d3f3b631 │  PASS   │          0.0 │
│ c_e834c13c__case_008 │ c_e834c13c │  PASS   │          0.0 │
└──────────────────────┴────────────┴─────────┴──────────────┘
Total: 8 | Passed: 8 | Failed: 0 | Errored: 0

The target is an HTTP URL or a module:function callable. Checks cover exact, contains_any, not_contains, regex, json_schema, tool_sequence, no_tool_loop, and judge. Run reports land in <evals_dir>/runs/ (override with --runs-dir); pass --compare <previous report> to print regressions and improvements between runs. The exit code is nonzero when any case fails.
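
A module:function target such as myapp.agent:invoke_agent follows a common Python convention for naming a callable. The sketch below shows how such a string is typically resolved with importlib; it is not traceval's loader, the helper name `resolve_target` is hypothetical, and it says nothing about which arguments traceval passes to the callable or how HTTP URL targets are handled.

```python
import importlib
from typing import Any, Callable


def resolve_target(spec: str) -> Callable[..., Any]:
    """Resolve a 'package.module:attr' string to the callable it names."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:function', got {spec!r}")

    obj: Any = importlib.import_module(module_name)
    # Allow dotted attribute paths such as 'module:Class.method'.
    for part in attr_path.split("."):
        obj = getattr(obj, part)

    if not callable(obj):
        raise TypeError(f"{spec!r} does not refer to a callable")
    return obj


# Example with a standard-library function:
dumps = resolve_target("json:dumps")
```

The module must be importable from the directory where the command runs, so a target that works locally can still fail in CI if the package is not installed or not on the import path.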

5. Calibrate the judge

An LLM judge is only as trustworthy as its agreement with human judgment. calibrate samples judge-scored results from a run report and presents each agent output for blind pass/fail labeling in the terminal; judge verdicts stay hidden until the end so they cannot anchor you.

traceval calibrate evals/runs/run_<timestamp>.json --sample 8

Judge Calibration Summary
┏━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┓
┃ Cluster    ┃ Labeled ┃ Agreement ┃
┡━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━┩
│ c_0c422a7a │       1 │      100% │
│ c_1e5d0942 │       1 │      100% │
│ c_2c881177 │       1 │      100% │
│ c_361535b0 │       1 │      100% │
│ c_9a8a4644 │       1 │        0% │
│ c_d30af83a │       1 │      100% │
│ c_d3f3b631 │       1 │      100% │
│ c_e834c13c │       1 │      100% │
└────────────┴─────────┴───────────┘
Overall agreement: 88% on 8 case(s) | false-pass (judge OK, human not): 1 | false-fail: 0
WARNING: Judge unreliable (< 80% agreement) for clusters: c_9a8a4644. Review their rubrics before trusting automated scores.

False-pass counts (judge approved, human rejected) are called out because that is the dangerous direction: a lenient judge waves bad outputs into production. Clusters below --min-agreement (default 80%) are flagged for rubric review, and the full labels plus stats are written to calibration.json.
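
The arithmetic behind the summary is simple enough to sketch. The function below is an illustration under assumptions (the tuple-based input and the returned dict keys are hypothetical, not the layout of calibration.json); it computes per-cluster agreement, overall agreement, false-pass and false-fail counts, and the clusters that fall below the threshold.

```python
from collections import defaultdict
from typing import Iterable


def calibration_summary(
    labels: Iterable[tuple[str, bool, bool]],
    min_agreement: float = 0.8,
) -> dict:
    """labels: (cluster, judge_passed, human_passed) triples."""
    per_cluster = defaultdict(lambda: [0, 0])  # cluster -> [agreed, total]
    false_pass = false_fail = 0

    for cluster, judge_passed, human_passed in labels:
        stats = per_cluster[cluster]
        stats[1] += 1
        if judge_passed == human_passed:
            stats[0] += 1
        elif judge_passed:
            false_pass += 1  # judge approved, human rejected
        else:
            false_fail += 1  # judge rejected, human approved

    agreement = {c: agreed / total for c, (agreed, total) in per_cluster.items()}
    total = sum(t for _, t in per_cluster.values())
    agreed = sum(a for a, _ in per_cluster.values())
    return {
        "per_cluster": agreement,
        "overall": agreed / total if total else 0.0,
        "false_pass": false_pass,
        "false_fail": false_fail,
        "flagged": sorted(c for c, rate in agreement.items() if rate < min_agreement),
    }
```

Fed the eight labels from the table above (seven agreements and one judge-approved, human-rejected case in c_9a8a4644), this yields an overall agreement of 0.875, which the summary displays rounded as 88%, one false-pass, and c_9a8a4644 as the only flagged cluster. With a single label per cluster the per-cluster rate can only be 0% or 100%, so a larger --sample gives a more informative signal.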

Scripting with --json

ingest, analyze, generate, and run accept --json: human-readable output is suppressed and a single JSON object is printed to stdout. run still exits nonzero on failures.

traceval analyze traces.db --json | python -m json.tool
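
From Python, the same pattern might look like the sketch below. The helper `run_json` is hypothetical, and the example deliberately makes no assumption about which keys the JSON object contains; it only relies on the two behaviors stated above: a single JSON object on stdout, and a nonzero exit code from run when cases fail.

```python
import json
import subprocess


def run_json(cmd: list[str]) -> tuple[int, dict]:
    """Run a command that prints one JSON object to stdout.

    Returns (exit_code, parsed_payload). check=True is intentionally not
    used, because a nonzero exit can still come with a valid JSON report.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    payload = json.loads(proc.stdout) if proc.stdout.strip() else {}
    return proc.returncode, payload


# Usage sketch (requires traceval on PATH; shown as comments so the block
# stays importable without it):
#
#   code, report = run_json([
#       "traceval", "run", "evals",
#       "--target", "myapp.agent:invoke_agent",
#       "--judge", "fake", "--json",
#   ])
#   if code != 0:
#       print("eval suite failed; inspect the report object for details")
```

Keeping the exit code and the payload separate lets a script decide whether to fail fast or to inspect the report first.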

GitHub Action

Gate deploys on your generated eval suite. The action installs traceval, runs the suite, and fails the job on any regression:

jobs:
  agent-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: theramkm/traceval@v0.2.2
        with:
          evals-dir: evals/
          target: myapp.agent:invoke_agent   # or an HTTP URL
          judge: fake                        # offline; 'openai' needs an API key

Inputs: evals-dir and target (required); judge, compare, only, runs-dir, traceval-version, python-version (optional). For a real LLM judge, set judge: openai and pass OPENAI_API_KEY (or GEMINI_API_KEY) via env: from your repository secrets.

Development

See CONTRIBUTING.md for setup. Run the test suite with make test and the full gate set with make lint.

Honest Limitations

- Side-Effect Free: traceval assertions evaluate input/output matches. It does not attempt to replay side effects (e.g., updating database records) on mock tools.
- Text Telemetry: The canonical model is optimized for text logs. Image or multimodal payloads in traces are logged as references.
- Static Visualization: The coverage report is a portable, single-file HTML page. There is no hosted web service.

Project details

Release history

- 0.2.6: Jul 2, 2026
- 0.2.5: Jul 2, 2026
- 0.2.4: Jul 2, 2026
- 0.2.3: Jul 2, 2026
- 0.2.2: Jul 2, 2026 (this version)
- 0.2.1: Jul 2, 2026
- 0.2.0: Jul 2, 2026
- 0.1.1: Jul 2, 2026

Download files

Source Distribution

traceval-0.2.2.tar.gz (219.9 kB)

Uploaded Jul 2, 2026 Source

Built Distribution

traceval-0.2.2-py3-none-any.whl (55.8 kB)

Uploaded Jul 2, 2026 Python 3

File details

Details for the file traceval-0.2.2.tar.gz.

File metadata

Download URL: traceval-0.2.2.tar.gz
Upload date: Jul 2, 2026
Size: 219.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for traceval-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`c57d6b04fc8f2e10c1fdcd9cd782df75ccf1ba8eb828b0112533461a82753439`
MD5	`a6cf338b4bd7a3631bad94b86e021ac6`
BLAKE2b-256	`4b9ca52f50f0745ca231f01206ad223a0414631491991fe59664e7b42329cd13`

Provenance

The following attestation bundles were made for traceval-0.2.2.tar.gz:

Publisher: ci.yml on theramkm/traceval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: traceval-0.2.2.tar.gz
- Subject digest: c57d6b04fc8f2e10c1fdcd9cd782df75ccf1ba8eb828b0112533461a82753439
- Sigstore transparency entry: 2045241554
- Sigstore integration time: Jul 2, 2026
Source repository:
- Permalink: theramkm/traceval@717c9703b97dd2e38e21f910d877473fd1ffb900
- Branch / Tag: refs/heads/main
- Owner: https://github.com/theramkm
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@717c9703b97dd2e38e21f910d877473fd1ffb900
- Trigger Event: push

File details

Details for the file traceval-0.2.2-py3-none-any.whl.

File metadata

Download URL: traceval-0.2.2-py3-none-any.whl
Upload date: Jul 2, 2026
Size: 55.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for traceval-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a8f4ce73d48c0bed91dde58746cbd1867049a17dd495355ffaa584c1ffb000f6`
MD5	`ffa6c10931eb5f70be4e67e393ccf994`
BLAKE2b-256	`643f20085d08665026e208bdc9d92a3e4aa2cebd376f777a051243f40503ce6e`

Provenance

The following attestation bundles were made for traceval-0.2.2-py3-none-any.whl:

Publisher: ci.yml on theramkm/traceval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: traceval-0.2.2-py3-none-any.whl
- Subject digest: a8f4ce73d48c0bed91dde58746cbd1867049a17dd495355ffaa584c1ffb000f6
- Sigstore transparency entry: 2045241887
- Sigstore integration time: Jul 2, 2026
Source repository:
- Permalink: theramkm/traceval@717c9703b97dd2e38e21f910d877473fd1ffb900
- Branch / Tag: refs/heads/main
- Owner: https://github.com/theramkm
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@717c9703b97dd2e38e21f910d877473fd1ffb900
- Trigger Event: push

traceval 0.2.2
