🚀 traceval: Trace-to-Eval Compiler
"Your traces already know how your agent fails. traceval turns them into the test suite you never wrote."
Teams running LLM agents in production have observability traces, but only a fraction maintain robust evals. The raw material for great tests — thousands of real production traces, including edge cases and errors — sits unused because converting them into regression suites is manual and tedious.
traceval automates this in four steps: it ingests agent traces from standard sources, normalizes them into a canonical Pydantic model, labels outcomes and clusters task signatures, and compiles the result into a human-editable eval suite: pytest files + YAML datasets + judge rubric scaffolds.
🎨 Architectural Pipeline
graph LR
classDef source fill:#2c3e50,stroke:#34495e,stroke-width:2px,color:#fff;
classDef normalize fill:#16a085,stroke:#1abc9c,stroke-width:2px,color:#fff;
classDef analyze fill:#2980b9,stroke:#3498db,stroke-width:2px,color:#fff;
classDef compile fill:#8e44ad,stroke:#9b59b6,stroke-width:2px,color:#fff;
classDef run fill:#d35400,stroke:#e67e22,stroke-width:2px,color:#fff;
A[OTel / Langfuse / LangSmith] --> B(Canonical Trace DB)
B --> C(Outcome Labeler & Jaccard Clusterer)
C --> D(YAML cases + Pytest + Rubrics)
D --> E(HTTP / Callable Runner & Diff Reports)
class A source;
class B normalize;
class C analyze;
class D compile;
class E run;
✨ Key Features
- 🔌 Zero-Configuration Ingest: Direct compatibility with OpenTelemetry GenAI semantic conventions, Langfuse observations, LangSmith runs, or generic JSONL exports.
- 🧠 Smart Outcome Taxonomy: Automatic categorization of trace outcomes (success, tool_error, validation_error, loop, timeout, bad_output) using rule-based heuristics that you can extend with Python modules.
- 📊 Embedding-Free Clustering: Fast, local Jaccard-similarity shingle grouping that runs 100% offline, keeping your development cycle private and deterministic.
- 📝 Clean Code Generation: Compiles cases into editable YAML files, LLM-as-a-judge rubrics into Markdown checklist scaffolds, and pytest test runs into clean templates.
- ⚡ PII Redaction Safeguards: Automatically scrubs emails, credit cards, phone numbers, and API tokens before writing test inputs.
- 🛡️ CI/CD Regression Diff: Compares execution summaries and scores between runs and reports regressions through the process exit code, so CI can catch agent failures before deployment.
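To make the rule-based outcome taxonomy concrete, here is a minimal sketch of how an ordered chain of heuristics could assign one of the labels listed above. It is not traceval's actual rule engine: the trace shape (a dict with `spans`, `duration_s`, and `output`), the 30-second timeout threshold, and every function name are assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical trace shape: {"spans": [...], "duration_s": float, "output": str}.
Rule = Callable[[dict], Optional[str]]


def rule_timeout(trace: dict) -> Optional[str]:
    # Assumed threshold of 30 seconds; tune per deployment.
    return "timeout" if trace.get("duration_s", 0.0) > 30.0 else None


def rule_tool_error(trace: dict) -> Optional[str]:
    # Any tool span that reports an error status marks the whole trace.
    for span in trace.get("spans", []):
        if span.get("kind") == "tool" and span.get("status") == "error":
            return "tool_error"
    return None


def rule_loop(trace: dict) -> Optional[str]:
    # Three or more consecutive identical tool calls count as a loop.
    calls = [
        (s.get("name"), str(s.get("args")))
        for s in trace.get("spans", [])
        if s.get("kind") == "tool"
    ]
    run = 1
    for prev, cur in zip(calls, calls[1:]):
        run = run + 1 if prev == cur else 1
        if run >= 3:
            return "loop"
    return None


def rule_bad_output(trace: dict) -> Optional[str]:
    # An empty or whitespace-only final output is treated as bad output.
    return "bad_output" if not str(trace.get("output") or "").strip() else None


DEFAULT_RULES = (rule_timeout, rule_tool_error, rule_loop, rule_bad_output)


def label_outcome(trace: dict, rules=DEFAULT_RULES) -> str:
    """Return the first label any rule produces, otherwise 'success'."""
    for rule in rules:
        label = rule(trace)
        if label is not None:
            return label
    return "success"
```

Because the rules are plain functions evaluated in order, a project-specific rule can be added by putting another function ahead of the defaults.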
⏱️ 90-Second E2E Quickstart
Experience traceval regression testing out of the box using our interactive demo script:
# From a clone of the repository, run the demo
chmod +x examples/demo.sh
./examples/demo.sh
Manual Walkthrough
1. Ingest Observability Logs
# Seed 200 synthetic telemetry traces containing successes and failure edge cases
python3 examples/make_traces.py
# Ingest into SQLite database
traceval ingest examples/synthetic_traces.jsonl -o traces.db
2. Label & Analyze Traffic Gaps
traceval analyze traces.db -o analysis/
Prints outcome statistics and writes analysis/report.html, which maps traffic clusters:
Outcomes: success 60% · tool_error 15% · loop 10% · timeout 8% · validation_error 8%
Clusters: 37 task clusters found.
Top failure cluster: "500 refund stripe -> stripe_lookup -> (tool_error)" (30 traces)
Report written to analysis/report.html
3. Compile Cases and Pytest Harness
traceval generate traces.db -o evals/ --include-failures
Generates test cases in evals/cases/ and rubric Markdown checklists in evals/rubrics/.
4. Run Evaluations & Detect Regressions
# Run against the healthy agent (100% Pass)
traceval run evals/ --target examples.demo_agent.agent:invoke_agent --judge fake
# Run against the buggy agent (Detects regressions and exits with status 1)
BUGGY=true traceval run evals/ --target examples.demo_agent.agent:invoke_agent --judge fake
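The regression check in this step boils down to comparing two run summaries and turning the difference into an exit code. The sketch below shows that idea only. The summary layout (`{"cases": {case_id: passed}}`) and the function names are hypothetical and do not describe traceval's actual run file format.

```python
def find_regressions(previous: dict, current: dict) -> list:
    """Return sorted IDs of cases that passed before but do not pass now.

    A case that is missing from the current run is treated as a regression.
    """
    prev_cases = previous.get("cases", {})
    cur_cases = current.get("cases", {})
    return sorted(
        case_id
        for case_id, passed in prev_cases.items()
        if passed and not cur_cases.get(case_id, False)
    )


def exit_code(previous: dict, current: dict) -> int:
    """Map the comparison to a CI-friendly exit code: 0 is clean, 1 is regressed."""
    regressions = find_regressions(previous, current)
    for case_id in regressions:
        print(f"REGRESSION: {case_id}")
    return 1 if regressions else 0
```

A CI job would pass the result of `exit_code(...)` to `sys.exit`, so a single regressed case fails the pipeline.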
🛠️ CLI Command Reference
[!NOTE] All CLI commands support --json to emit machine-readable output on stdout for scripting.
Ingestion
traceval ingest <path> --format [auto|otel|langfuse|langsmith|generic] -o <traces.db>
Ingests telemetry log dumps losslessly. Malformed spans are skipped, and a warning for each is written to <traces.db>.log.
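As a rough illustration of lossless ingestion with warnings for malformed records, the sketch below loads JSONL lines into SQLite, stores each record's full payload, and collects one warning per bad line. The table schema, the `trace_id` field, and the function name are assumptions for this example, not traceval's real storage layout.

```python
import json
import sqlite3


def ingest_jsonl(lines, conn: sqlite3.Connection):
    """Insert one row per valid JSON line; return (ingested_count, warnings)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS traces ("
        "trace_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    ingested, warnings = 0, []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are not an error
        try:
            record = json.loads(line)
            trace_id = str(record["trace_id"])  # assumed required field
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            warnings.append(
                f"line {lineno}: skipped malformed record ({type(exc).__name__})"
            )
            continue
        # Keep the whole record so no field is lost during normalization.
        conn.execute(
            "INSERT OR REPLACE INTO traces VALUES (?, ?)",
            (trace_id, json.dumps(record)),
        )
        ingested += 1
    conn.commit()
    return ingested, warnings
```

Keeping the original payload next to the key means later normalization steps can be rerun without re-exporting from the observability backend.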
Analysis
traceval analyze <traces.db> [--rules custom_rules.py] [--evals evals/] -o <analysis_dir/>
Runs rule pipelines and Jaccard shingle similarity groupings.
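For readers unfamiliar with Jaccard shingle grouping, here is a minimal sketch of the general technique: split each task signature into overlapping word k-grams (shingles), measure set overlap with the Jaccard index, and greedily attach each signature to the first cluster whose representative is similar enough. The greedy single-pass strategy, the 0.5 threshold, and the shingle size are assumptions; the source does not specify how traceval implements clustering.

```python
def shingles(text: str, k: int = 2) -> frozenset:
    """Return the set of overlapping k-word shingles for a signature."""
    tokens = text.lower().split()
    if not tokens:
        return frozenset()
    if len(tokens) < k:
        return frozenset([tuple(tokens)])
    return frozenset(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(signatures, threshold: float = 0.5, k: int = 2):
    """Greedy single-pass clustering; returns lists of signature indices.

    The first member of each cluster acts as its representative, so the
    result is deterministic for a fixed input order.
    """
    representatives = []
    clusters = []
    for idx, signature in enumerate(signatures):
        current = shingles(signature, k)
        for c, rep in enumerate(representatives):
            if jaccard(current, rep) >= threshold:
                clusters[c].append(idx)
                break
        else:
            representatives.append(current)
            clusters.append([idx])
    return clusters
```

Everything here is set arithmetic on local strings, which is why this style of clustering needs no embedding model or network access.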
Generation
traceval generate <traces.db> -o <evals_dir/> [--per-cluster 3] [--include-failures] [--redact-hook module:fn]
Creates regression cases, Markdown LLM-judge checklists, and conftest runners.
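The --redact-hook option takes a module:fn reference. The source does not document the hook's signature, so the sketch below assumes the simplest plausible contract: a function that receives a string and returns the redacted string. The regular expressions, the placeholder tokens, and the API token prefixes are illustrative assumptions and will not catch every real-world format.

```python
import re

# Order matters: card numbers are replaced before phone numbers so that a
# long digit run is not partially consumed by the phone pattern.
_PATTERNS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    # Assumed token prefixes (sk, pk, ghp); extend for your providers.
    ("[API_TOKEN]", re.compile(r"\b(?:sk|pk|ghp)[-_][A-Za-z0-9_-]{16,}")),
    # 13 to 16 digits, optionally separated by spaces or hyphens.
    ("[CARD]", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
    # Optional country code, then a 3-3-4 digit grouping.
    ("[PHONE]", re.compile(
        r"(?:\+\d{1,3}[ -]?)?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4}\b"
    )),
]


def redact(text: str) -> str:
    """Replace emails, API tokens, card numbers, and phone numbers."""
    for placeholder, pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

If this function lived in a hypothetical module named my_redaction, it would be passed as `--redact-hook my_redaction:redact`.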
Runner
traceval run <evals_dir/> --target <url|module:function> [--judge fake|openai] [--compare runs/prev.json]
Executes tests, scores outputs against the configured constraints (exact, contains, regex, json_schema, tool_sequence, judge), and writes results to the project-level runs/ directory.
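To illustrate what scoring against constraints can look like, the sketch below implements four of the listed constraint kinds (exact, contains, regex, tool_sequence) with the standard library. The json_schema and judge kinds are left out because they need external dependencies. The constraint representation as (kind, expected) pairs and both function names are assumptions, not traceval's internal API.

```python
import re


def check(kind: str, expected, output: str, tool_calls: list) -> bool:
    """Evaluate a single constraint against an agent's output and tool calls."""
    if kind == "exact":
        return output == expected
    if kind == "contains":
        return expected in output
    if kind == "regex":
        return re.search(expected, output) is not None
    if kind == "tool_sequence":
        # The agent must call exactly these tools, in this order.
        return list(tool_calls) == list(expected)
    raise ValueError(f"unsupported constraint kind: {kind}")


def score_case(constraints, output: str, tool_calls: list) -> float:
    """Return the fraction of constraints that pass (1.0 if there are none)."""
    if not constraints:
        return 1.0
    passed = sum(
        check(kind, expected, output, tool_calls)
        for kind, expected in constraints
    )
    return passed / len(constraints)
```

Only the final output and the recorded tool names are inspected here, which matches the limitation noted below that side effects are not replayed.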
💡 Honest Limitations
- Side-Effect Free: traceval assertions compare inputs and outputs only. It does not attempt to replay side effects (e.g., updating database records) on mock tools.
- Text Telemetry: The canonical model is optimized for text logs. Image or multimodal payloads in traces are logged as references.
- Static Visualization: The coverage report is a portable, single-file HTML page. There is no hosted web service.