TraceLens / 迹镜
TraceLens (迹镜) is a friendly evaluation and regression-testing framework for AI agents. It turns agent runs into inspectable traces, graded outcomes, baseline comparisons, and CI-ready reliability signals.
Overview
TraceLens provides a unified evaluation methodology for AI agent projects. It supports both subjective evaluations (LLM-as-judge for quality assessment) and objective evaluations (deterministic metrics like schema validity, tool-use constraints, latency, budget, or domain-specific scores).
Architecture
src/tracelens/
├── core/                  # Abstract interfaces
│   ├── task.py            # Task, TaskLoader, EvalSet
│   ├── trial.py           # Trial, TrialBatch execution model
│   ├── grader.py          # Grader ABCs (CodeGrader, LLMGrader, CompositeGrader)
│   ├── transcript.py      # Agent execution logging
│   ├── decision_spec.py   # Reproducibility fingerprinting
│   └── outcome.py         # Grading results
├── execution/             # Trial runner
│   ├── runner.py          # EvaluationRunner - parallel/concurrent execution
│   ├── agent_adapter.py   # AgentAdapter ABC, SimpleAdapter
│   └── registry.py        # Plugin loading via dotted import paths
├── statistics/            # Non-determinism handling
│   ├── pass_at_k.py       # Capability ceiling (pass@k)
│   ├── consistency.py     # Reliability (pass^k)
│   └── inference.py       # Bootstrap CI, significance testing
├── baselines/             # Regression detection
│   ├── manager.py         # Baseline storage, promotion semantics
│   └── comparison.py      # RegressionDetector, severity levels
├── reporting/             # Output
│   └── generator.py       # ReportGenerator (markdown, CI summary, HTML)
└── cli/                   # Command-line interface
    └── main.py            # tracelens run / tracelens report
Planned modules: human_eval/ (sample selection, LLM-human reconciliation) is designed but not yet implemented.
Core Concepts
Task
A Task defines a single evaluation test case:
from tracelens import Task

task = Task(
    name="Portfolio website decomposition",
    input_data={
        "goal": "Build a personal portfolio website",
        "user_context": {"experience": "beginner", "hours_per_week": 15},
    },
    category="programming",
    tags=["web", "beginner"],
)
Grader
Graders evaluate agent outputs. There are two main types:
CodeGrader - For deterministic metrics:
from tracelens import CodeGrader

class SharpeGrader(CodeGrader):
    def compute_metrics(self, transcript, task):
        returns = transcript.final_output["returns"]
        # calculate_sharpe_ratio is a user-supplied helper, not part of TraceLens
        return {"sharpe_ratio": calculate_sharpe_ratio(returns)}

    def determine_pass(self, metrics, task):
        passed = metrics["sharpe_ratio"] >= 1.0
        # Normalize to [0, 1]: a Sharpe ratio of 2.0 or more maps to 1.0
        score = min(metrics["sharpe_ratio"] / 2.0, 1.0)
        return passed, score
LLMGrader - For subjective quality (planning, summarisation, helpfulness):
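import json

from tracelens import LLMGrader

class SpecificityGrader(LLMGrader):
    def build_grading_prompt(self, transcript, task):
        return f"""Evaluate specificity of this decomposition:
{transcript.final_output}
Score 1-10 on: concrete actions, quantifiable targets, named resources
Respond as JSON: {{"score": <1-10>, "feedback": "<text>"}}
"""

    def parse_llm_response(self, response, task):
        # Assumes `response` is the raw JSON string requested in the prompt above;
        # the 7/10 pass threshold is illustrative, not a TraceLens default.
        data = json.loads(response)
        raw_score = float(data["score"])
        passed, score = raw_score >= 7, raw_score / 10.0
        metrics, feedback = {"specificity": raw_score}, data.get("feedback", "")
        return passed, score, metrics, feedback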
Trial
A Trial represents a single execution of a Task:
from tracelens import Trial, TrialStatus

trial = Trial(
    task_id=task.task_id,
    run_index=0,
    total_runs=5,  # For pass@k
    status=TrialStatus.COMPLETED,
    transcript=transcript,
    outcomes=[outcome1, outcome2],
)
Non-Determinism Handling
pass@k - Probability of at least one success in k attempts:
- Use for capability evaluation (can the agent solve this at all?)
- Higher k = higher pass@k (more chances to succeed)
pass^k - Probability of all k attempts succeeding:
- Use for reliability evaluation (is the agent consistent?)
- Higher k = lower pass^k (harder to pass every time)
from tracelens.statistics import pass_at_k, pass_to_k
# Capability: can it succeed at least once in 5 tries?
capability = pass_at_k(n=10, c=7, k=5) # 0.99+
# Reliability: will it succeed every time?
reliability = pass_to_k(results=[True, True, False, True, True], k=3) # 0.33
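To make the two definitions concrete, here is a minimal standalone sketch of common estimators for both quantities. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and a simple plug-in estimate for pass^k (observed success rate raised to the power k). The function names are hypothetical, and TraceLens's own `pass_at_k` and `pass_to_k` may use different estimators, so their outputs need not match these.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n trials with c successes: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k_estimate(results: list[bool], k: int) -> float:
    """Plug-in pass^k: observed success rate raised to the power k."""
    if not results:
        raise ValueError("results must be non-empty")
    rate = sum(results) / len(results)
    return rate ** k


capability = pass_at_k_estimate(n=10, c=7, k=5)
reliability = pass_pow_k_estimate([True, True, False, True, True], k=3)
print(f"pass@5 = {capability:.3f}, pass^3 = {reliability:.3f}")
```

With only 3 failures in 10 trials, every possible draw of 5 contains at least one success, which is why the capability estimate saturates at 1.0 while the reliability estimate stays well below it.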
Reproducibility with DecisionSpec
DecisionSpec captures all parameters affecting agent behavior for reproducibility. The fingerprint is a SHA-256 hash of the entire configuration.
from tracelens import Transcript
from tracelens.core.decision_spec import DecisionSpec, ModelConfig, AgentSpec

# Capture agent configuration
decision_spec = DecisionSpec(
    model=ModelConfig(
        model_id="gpt-4-turbo",
        temperature=0.7,
        max_tokens=4096,
    ),
    agent=AgentSpec(
        agent_id="goal-decomposer-v2",
        version="1.2.3",
        git_commit="abc123",
    ),
    global_seed=42,
)

# Get fingerprint for reproducibility tracking
print(f"Fingerprint: {decision_spec.fingerprint[:16]}...")

# Attach to transcript for full reproducibility
transcript = Transcript(
    task_id="task-1",
    final_output={"result": "..."},
    decision_spec=decision_spec,
)
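The idea behind a configuration fingerprint can be sketched without TraceLens at all. The example below is an illustration under stated assumptions, not the library's implementation: it serializes a plain dict to canonical JSON (sorted keys, fixed separators) and hashes it with SHA-256. The `config_fingerprint` helper and the dict layout are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a JSON-serializable config; key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


config_a = {
    "model": {"model_id": "gpt-4-turbo", "temperature": 0.7, "max_tokens": 4096},
    "agent": {"agent_id": "goal-decomposer-v2", "version": "1.2.3"},
    "global_seed": 42,
}
# Same content, different key order
config_b = {
    "global_seed": 42,
    "agent": {"version": "1.2.3", "agent_id": "goal-decomposer-v2"},
    "model": {"max_tokens": 4096, "temperature": 0.7, "model_id": "gpt-4-turbo"},
}
# A single changed parameter yields a different fingerprint
config_c = {**config_a, "global_seed": 43}

print(config_fingerprint(config_a)[:16])
print(config_fingerprint(config_a) == config_fingerprint(config_b))
print(config_fingerprint(config_a) == config_fingerprint(config_c))
```

Canonicalizing before hashing is the important step: without sorted keys, two logically identical configurations could produce different fingerprints.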
Grader Roles (Must-Pass vs Score-Contributor)
Graders can have two roles in composite evaluation:
- MUST_PASS: Safety/constraint graders. Any failure = trial fails.
- SCORE_CONTRIBUTOR: Quality graders. Contribute to weighted average.
from tracelens import CompositeGrader, GraderRole, GraderConfig

# Safety grader - must pass or entire trial fails
# (FormatValidationGrader stands for any user-defined grader)
safety_config = GraderConfig(role=GraderRole.MUST_PASS)
safety_grader = FormatValidationGrader("format", config=safety_config)

# Quality grader - contributes to score average
quality_config = GraderConfig(role=GraderRole.SCORE_CONTRIBUTOR)
quality_grader = SpecificityGrader("specificity", config=quality_config)

# Composite: safety failure = trial failure, quality affects score
composite = CompositeGrader(
    grader_id="combined",
    graders=[
        (safety_grader, 0.2),   # Weight still affects score
        (quality_grader, 0.8),  # Higher weight for quality
    ],
)

# grade() is awaited, so call it inside an async function or via asyncio.run(...)
outcome = await composite.grade(transcript, task)
# outcome.passed = False if safety_grader fails, regardless of quality score
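The combination rule described above can be sketched in a few lines of plain Python. This is a hedged illustration of the semantics, not the `CompositeGrader` source: `GraderResult`, `combine`, and the 0.5 pass threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderResult:
    passed: bool
    score: float      # normalized to [0, 1]
    weight: float
    must_pass: bool   # True for MUST_PASS graders, False for SCORE_CONTRIBUTOR


def combine(results: list[GraderResult], pass_threshold: float = 0.5) -> tuple[bool, float]:
    """Weighted-average score, gated by every must-pass grader."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    # Every grader's weight counts toward the score, including must-pass graders
    score = sum(r.score * r.weight for r in results) / total_weight
    gate_ok = all(r.passed for r in results if r.must_pass)
    # pass_threshold is an assumption made for this sketch
    return gate_ok and score >= pass_threshold, score


quality = GraderResult(passed=True, score=0.9, weight=0.8, must_pass=False)
safety_fail = GraderResult(passed=False, score=0.0, weight=0.2, must_pass=True)
safety_ok = GraderResult(passed=True, score=1.0, weight=0.2, must_pass=True)

print(combine([safety_fail, quality]))  # fails despite a high quality score
print(combine([safety_ok, quality]))
```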
Baseline Regression Detection
import sys

# RegressionSeverity is assumed to be exported alongside RegressionDetector
from tracelens.baselines import BaselineManager, RegressionDetector, RegressionSeverity

manager = BaselineManager("baselines/baselines.json")
baseline = manager.get_baseline("btc_backtest")

detector = RegressionDetector(significance_level=0.05)
# current_results: the results of the evaluation run being checked
report = detector.compare(baseline, current_results)

if report.should_block_ci(threshold=RegressionSeverity.MODERATE):
    sys.exit(1)  # Block the PR
Baseline Promotion (Canary vs Capability)
Baselines can be protected or auto-promoted based on their type:
- CANARY: Protected baselines that never auto-update. Manual promotion only.
- CAPABILITY: Track improvements over time. Auto-promote when criteria met.
- EXPERIMENTAL: For testing. No restrictions.
from tracelens.baselines import BaselineManager, BaselineType, PromotionPolicy

manager = BaselineManager("baselines/baselines.json")

# Create a canary baseline (protected, manual promotion only)
canary = manager.create_canary_baseline(
    task_id="critical_safety_check",
    metrics={"safety_score": 0.95},
)

# Create capability baseline with auto-promotion policy
policy = PromotionPolicy(
    allow_auto_promotion=True,
    min_improvement_relative=0.05,  # 5% improvement required
    min_samples=10,
    required_confidence=0.95,
)
capability = manager.create_capability_baseline(
    task_id="quality_benchmark",
    metrics={"quality_score": 0.75},
    policy=policy,
)

# Try auto-promotion (returns True if promoted)
promoted = manager.try_promote(
    task_id="quality_benchmark",
    new_metrics={"quality_score": 0.82},
    sample_count=15,
)
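As a rough illustration of how such a policy might gate promotion, the sketch below checks the sample count and the relative improvement only. It is an assumption-laden simplification: `PolicySketch` and `should_promote` are hypothetical names, and the `required_confidence` check is deliberately left out because the source does not say how it is evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySketch:
    allow_auto_promotion: bool = True
    min_improvement_relative: float = 0.05
    min_samples: int = 10


def should_promote(old: float, new: float, sample_count: int, policy: PolicySketch) -> bool:
    """Return True when the new metric clears every gate in the policy."""
    if not policy.allow_auto_promotion:
        return False
    if sample_count < policy.min_samples:
        return False
    if old <= 0:
        # Relative improvement is undefined here; fall back to a plain comparison
        return new > old
    return (new - old) / old >= policy.min_improvement_relative


policy = PolicySketch()
# (0.82 - 0.75) / 0.75 is about 9.3%, above the 5% minimum, with 15 >= 10 samples
print(should_promote(0.75, 0.82, sample_count=15, policy=policy))
```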
Statistical Inference (Bootstrap CI)
Compare metric samples against a baseline using bootstrap confidence intervals, an effect size, and an optional p-value:
from tracelens.statistics.inference import (
    compare_metrics,
    compare_to_baseline_summary,
    estimate_metric,
)

# Compare current run against baseline with bootstrap CI
baseline_values = [0.72, 0.75, 0.71, 0.74, 0.73]
current_values = [0.78, 0.81, 0.79, 0.82, 0.80]

result = compare_metrics(
    baseline_values,
    current_values,
    confidence=0.95,
    compute_p_value=True,
)
print(f"Baseline: {result.baseline.mean:.3f} ± {result.baseline.std:.3f}")
print(f"Current: {result.current.mean:.3f} ± {result.current.std:.3f}")
print(f"Difference: {result.difference:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"Effect size (Cohen's d): {result.effect_size:.2f}")
print(f"Significant improvement: {result.significant_improvement}")

# Get summary for CI reporting
summary = compare_to_baseline_summary(
    baseline_values,
    current_values,
    metric_name="quality_score",
)
# Returns a one-line summary in this format (the CI, d, and p values shown are illustrative):
# "quality_score: 0.800 vs baseline 0.730 (Δ=+0.070, 95% CI [0.045, 0.095], d=1.23, p<0.05)"
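For readers who want to see what a bootstrap comparison does under the hood, here is a self-contained sketch using only the standard library. It assumes a percentile bootstrap of the difference in means and Cohen's d with a pooled standard deviation; TraceLens's `compare_metrics` may differ in both respects, and the function names here are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_diff_ci(baseline, current, confidence=0.95, n_resamples=5000, seed=0):
    """Percentile bootstrap CI for mean(current) - mean(baseline)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))  # resample with replacement
        c = rng.choices(current, k=len(current))
        diffs.append(mean(c) - mean(b))
    diffs.sort()
    alpha = 1.0 - confidence
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def cohens_d(baseline, current):
    """Cohen's d using the pooled sample standard deviation."""
    nb, nc = len(baseline), len(current)
    pooled_var = ((nb - 1) * stdev(baseline) ** 2 + (nc - 1) * stdev(current) ** 2) / (nb + nc - 2)
    return (mean(current) - mean(baseline)) / pooled_var ** 0.5


baseline_values = [0.72, 0.75, 0.71, 0.74, 0.73]
current_values = [0.78, 0.81, 0.79, 0.82, 0.80]

ci_lower, ci_upper = bootstrap_diff_ci(baseline_values, current_values)
difference = mean(current_values) - mean(baseline_values)
effect = cohens_d(baseline_values, current_values)
print(f"diff={difference:.3f}, CI=[{ci_lower:.3f}, {ci_upper:.3f}], d={effect:.2f}")
```

Because the interval excludes zero for this data, a detector built on it would treat the change as a real improvement rather than run-to-run noise.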
CI/CD Integration
GitHub Actions Workflow
- name: Run Evaluation
  run: |
    tracelens run \
      --eval-set eval/suite.json \
      --graders quality,personalization \
      --num-runs 5 \
      --baseline-check \
      --fail-on-regression moderate
- name: Comment on PR
  run: tracelens report --format github-pr
Regression Thresholds
Configure in baselines/thresholds.py:
THRESHOLDS = {
    "sharpe_ratio": {
        "direction": "higher_is_better",
        "absolute_threshold": -0.2,  # Block if drops by 0.2
        "relative_threshold": 0.10,  # Block if drops by 10%
    },
    "max_drawdown": {
        "direction": "closer_to_zero_is_better",
        "absolute_threshold": -0.05,
    },
}
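One plausible reading of these thresholds is sketched below. The semantics are an assumption, not documented behaviour: "change" is measured in the good direction (so a negative value is a regression), the absolute threshold blocks when the change is at or below it, and the relative threshold blocks when the drop is at least that fraction of the baseline. `is_regression` is a hypothetical helper.

```python
THRESHOLDS = {
    "sharpe_ratio": {
        "direction": "higher_is_better",
        "absolute_threshold": -0.2,
        "relative_threshold": 0.10,
    },
    "max_drawdown": {
        "direction": "closer_to_zero_is_better",
        "absolute_threshold": -0.05,
    },
}


def is_regression(name: str, baseline: float, current: float, thresholds: dict) -> bool:
    """Return True when the metric moved in the bad direction past a threshold."""
    rule = thresholds[name]
    if rule["direction"] == "higher_is_better":
        change = current - baseline
    elif rule["direction"] == "closer_to_zero_is_better":
        change = abs(baseline) - abs(current)
    else:
        raise ValueError(f"unknown direction: {rule['direction']}")

    abs_limit = rule.get("absolute_threshold")
    if abs_limit is not None and change <= abs_limit:
        return True

    rel_limit = rule.get("relative_threshold")
    if rel_limit is not None and baseline != 0:
        # Relative drop, expressed as a positive fraction of the baseline
        if -change / abs(baseline) >= rel_limit:
            return True
    return False


print(is_regression("sharpe_ratio", baseline=1.5, current=1.2, thresholds=THRESHOLDS))
print(is_regression("max_drawdown", baseline=-0.10, current=-0.16, thresholds=THRESHOLDS))
```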
Human Evaluation Calibration (Planned)
The human_eval/ module is planned but not yet implemented. The recommended workflow is a weekly process to calibrate LLM graders:
- Sample Selection: Select 20 diverse samples from recent eval runs
- Human Rating: Rate on 1-10 scale per dimension
- Correlation Analysis: Compare LLM vs human scores
- Grader Tuning: Adjust prompts if correlation < 0.7
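Until the planned module exists, the correlation step in the list above can be done by hand. The sketch below assumes Pearson correlation (the source does not say which coefficient is intended) and uses hypothetical helper names and made-up example ratings.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant samples")
    return cov / (sx * sy)


def needs_tuning(llm_scores, human_scores, min_correlation: float = 0.7) -> bool:
    """Flag the grader prompt for revision when agreement falls below the cutoff."""
    return pearson(llm_scores, human_scores) < min_correlation


# Made-up 1-10 ratings for four samples
llm_scores = [7, 8, 6, 9]
human_scores = [7, 9, 6, 8]
print(f"r = {pearson(llm_scores, human_scores):.2f}")
print("tune prompt" if needs_tuning(llm_scores, human_scores) else "grader agrees with humans")
```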
See docs/accuracy.md for calibration best practices.
Installation
Until the first PyPI release is published, install directly from GitHub:
# Using uv (recommended)
uv pip install git+https://github.com/ssf0409/tracelens.git
# With LLM support
uv pip install "tracelens[llm] @ git+https://github.com/ssf0409/tracelens.git"
# Or add to pyproject.toml
# dependencies = [
# "tracelens @ git+https://github.com/ssf0409/tracelens.git",
# ]
After the first PyPI release:
uv pip install tracelens
Development Setup
# Clone and install
git clone https://github.com/ssf0409/tracelens.git
cd tracelens
uv pip install -e ".[dev]"
# Run tests
uv run pytest tests/ -v
# Run with Docker
docker compose run --rm test
Quick Start
import asyncio

from tracelens import (
    Task, EvalSet, SimpleAdapter, CodeGrader,
    EvaluationRunner, RunnerConfig, Transcript,
)
from tracelens.reporting.generator import ReportGenerator

# 1. Define tasks
tasks = [
    Task(name="Add 2+3", input_data={"a": 2, "b": 3}, metadata={"expected": 5}),
    Task(name="Add 10+20", input_data={"a": 10, "b": 20}, metadata={"expected": 30}),
]
eval_set = EvalSet(name="Math Suite", tasks=tasks)

# 2. Write a simple agent
async def math_agent(input_data: dict) -> dict:
    return {"answer": input_data["a"] + input_data["b"]}

adapter = SimpleAdapter(math_agent)

# 3. Write a grader
class MathGrader(CodeGrader):
    def compute_metrics(self, transcript: Transcript, task: Task) -> dict[str, float]:
        expected = task.metadata["expected"]
        actual = transcript.final_output["answer"]
        return {"correct": float(actual == expected)}

    def determine_pass(self, metrics: dict[str, float], task: Task) -> tuple[bool, float]:
        return metrics["correct"] == 1.0, metrics["correct"]

# 4. Run evaluation
runner = EvaluationRunner(adapter, [MathGrader("math")], RunnerConfig(num_runs=3))
batch = asyncio.run(runner.run(eval_set))

# 5. Generate report
gen = ReportGenerator()
report = gen.build_report(batch)
print(gen.render_markdown(report))
Five-minute version: examples/hello_world.py. Walkthrough: docs/getting-started.md.
Documentation
- Getting Started — Run your first eval in five minutes; the example ladder.
- Quickstart — Build a custom grader, JSON task loader, and CLI workflow.
- Supported Scenarios — Which agent-evaluation problems TraceLens is designed for.
- User Guide — Comprehensive framework guide.
- Evaluation Levels — Function, task, and system-level evaluation architecture.
- Accuracy Best Practices — LLM-judge calibration and grader drift.
- CI/CD Integration — GitHub Actions with regression gating.
- Examples — Four working scripts: hello_world.py → contract_eval.py → http_agent_eval.py → noise_aware_regression.py.
- Releasing — Maintainer guide for tag-driven PyPI releases.
Contributing
TraceLens is MIT licensed and open to contributions. Start with CONTRIBUTING.md, run the local verification gate, and open a focused PR:
uv run --frozen pytest -q
uv run --frozen ruff check src/ tests/ examples/ benchmarks/high-stakes-autonomous
uv run --frozen --extra dev mypy src/tracelens/
Security issues should be reported privately using SECURITY.md.
References
Key Design Principles
From Anthropic's evaluation guide:
- Grade outcomes, not execution paths - Focus on what the agent produced
- Handle non-determinism with pass@k and pass^k - Different metrics for capability vs reliability
- Start with 20-50 real failure cases - Build from actual issues
- Read transcripts regularly - Catch false signals and grader bugs
- Calibrate with human evaluation - LLM graders drift without calibration