Friendly evaluation and regression-testing framework for AI agents: inspectable traces, graded outcomes, baseline comparisons, and CI-ready reliability signals.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

ssf0409

These details have not been verified by PyPI

Project description

TraceLens / 迹镜

TraceLens is a friendly evaluation and regression-testing framework for AI agents. It turns agent runs into inspectable traces, graded outcomes, baseline comparisons, and CI-ready reliability signals.

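TraceLens (迹镜) is an evaluation and regression-detection framework for AI agents. It turns every agent run into observable traces, scorable results, comparable baselines, and reliability signals that can be used in CI.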

Overview

TraceLens provides a unified evaluation methodology for AI agent projects. It supports both subjective evaluations (LLM-as-judge for quality assessment) and objective evaluations (deterministic metrics like schema validity, tool-use constraints, latency, budget, or domain-specific scores).

Architecture

src/tracelens/
├── core/                    # Abstract interfaces
│   ├── task.py              # Task, TaskLoader, EvalSet
│   ├── trial.py             # Trial, TrialBatch execution model
│   ├── grader.py            # Grader ABCs (CodeGrader, LLMGrader, CompositeGrader)
│   ├── transcript.py        # Agent execution logging
│   ├── decision_spec.py     # Reproducibility fingerprinting
│   └── outcome.py           # Grading results
├── execution/               # Trial runner
│   ├── runner.py            # EvaluationRunner - parallel/concurrent execution
│   ├── agent_adapter.py     # AgentAdapter ABC, SimpleAdapter
│   └── registry.py          # Plugin loading via dotted import paths
├── statistics/              # Non-determinism handling
│   ├── pass_at_k.py         # Capability ceiling (pass@k)
│   ├── consistency.py       # Reliability (pass^k)
│   └── inference.py         # Bootstrap CI, significance testing
├── baselines/               # Regression detection
│   ├── manager.py           # Baseline storage, promotion semantics
│   └── comparison.py        # RegressionDetector, severity levels
├── reporting/               # Output
│   └── generator.py         # ReportGenerator (markdown, CI summary, HTML)
└── cli/                     # Command-line interface
    └── main.py              # tracelens run / tracelens report

Planned modules: human_eval/ (sample selection, LLM-human reconciliation) is designed but not yet implemented.

Core Concepts

Task

A Task defines a single evaluation test case:

from tracelens import Task

task = Task(
    name="Portfolio website decomposition",
    input_data={
        "goal": "Build a personal portfolio website",
        "user_context": {"experience": "beginner", "hours_per_week": 15}
    },
    category="programming",
    tags=["web", "beginner"],
)

Grader

Graders evaluate agent outputs. There are two main types:

CodeGrader - For deterministic metrics:

from tracelens import CodeGrader

class SharpeGrader(CodeGrader):
    def compute_metrics(self, transcript, task):
        returns = transcript.final_output["returns"]
        return {"sharpe_ratio": calculate_sharpe_ratio(returns)}

    def determine_pass(self, metrics, task):
        passed = metrics["sharpe_ratio"] >= 1.0
        score = min(metrics["sharpe_ratio"] / 2.0, 1.0)  # Normalize
        return passed, score
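The SharpeGrader above calls calculate_sharpe_ratio, which the example leaves undefined. A minimal sketch of such a helper follows, assuming `returns` is a list of per-period returns. The function name comes from the example; the `risk_free_rate` and `periods_per_year` parameters and the 252-period annualisation are assumptions for illustration, not part of TraceLens.

```python
import math
import statistics


def calculate_sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualised Sharpe ratio of per-period returns (illustrative helper)."""
    if len(returns) < 2:
        return 0.0  # Not enough data to estimate volatility
    excess = [r - risk_free_rate for r in returns]
    volatility = statistics.stdev(excess)  # Sample standard deviation
    if volatility == 0:
        return 0.0  # Flat returns: the ratio is undefined, so report 0
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)
```

Returning 0.0 for degenerate inputs keeps the grader from raising on short or flat return series; a stricter grader might prefer to fail the trial instead.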

LLMGrader - For subjective quality (planning, summarisation, helpfulness):

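import json

from tracelens import LLMGrader

class SpecificityGrader(LLMGrader):
    def build_grading_prompt(self, transcript, task):
        return f"""Evaluate specificity of this decomposition:
        {transcript.final_output}

        Score 1-10 on: concrete actions, quantifiable targets, named resources.
        Reply with JSON: {{"score": <1-10>, "feedback": "<text>"}}
        """

    def parse_llm_response(self, response, task):
        # Parse the LLM's JSON reply (assumes `response` is the raw JSON string
        # in the format requested by the prompt above)
        data = json.loads(response)
        raw_score = float(data["score"])
        score = raw_score / 10.0   # Normalize to 0-1
        passed = raw_score >= 7    # Illustrative pass threshold
        metrics = {"specificity": raw_score}
        feedback = data.get("feedback", "")
        return passed, score, metrics, feedback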

Trial

A Trial represents a single execution of a Task:

from tracelens import Trial, TrialStatus

trial = Trial(
    task_id=task.task_id,
    run_index=0,
    total_runs=5,  # For pass@k
    status=TrialStatus.COMPLETED,
    transcript=transcript,
    outcomes=[outcome1, outcome2],
)

Non-Determinism Handling

pass@k - Probability of at least one success in k attempts:

- Use for capability evaluation (can the agent solve this at all?)
- Higher k = higher pass@k (more chances to succeed)

pass^k - Probability of all k attempts succeeding:

- Use for reliability evaluation (is the agent consistent?)
- Higher k = lower pass^k (harder to pass every time)

from tracelens.statistics import pass_at_k, pass_to_k

# Capability: can it succeed at least once in 5 tries?
capability = pass_at_k(n=10, c=7, k=5)  # 0.99+

# Reliability: will it succeed every time?
# (with a 4/5 observed pass rate, the plug-in estimate is 0.8 ** 3, about 0.51)
reliability = pass_to_k(results=[True, True, False, True, True], k=3)
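Both quantities can also be worked out by hand. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and the simple plug-in estimate (c/n)^k for pass^k. The function names are illustrative, and the formulas are an assumption about how these metrics are usually defined rather than a description of what tracelens.statistics does internally.

```python
from math import comb


def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one success in k draws) from n trials with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when there are fewer than k failures
    return 1.0 - comb(n - c, k) / comb(n, k)


def estimate_pass_hat_k(results: list[bool], k: int) -> float:
    """P(all k attempts succeed), assuming independent attempts."""
    if not results:
        raise ValueError("results must be non-empty")
    pass_rate = sum(results) / len(results)
    return pass_rate ** k


print(estimate_pass_at_k(n=10, c=7, k=5))
print(estimate_pass_hat_k([True, True, False, True, True], k=3))
```

With only three failures among ten trials, any five samples must include a success, so the first call returns exactly 1.0. The second call raises the observed pass rate of 0.8 to the third power.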

Reproducibility with DecisionSpec

DecisionSpec captures all parameters affecting agent behavior for reproducibility. The fingerprint is a SHA-256 hash of the entire configuration.

from tracelens import Transcript
from tracelens.core.decision_spec import DecisionSpec, ModelConfig, AgentSpec

# Capture agent configuration
decision_spec = DecisionSpec(
    model=ModelConfig(
        model_id="gpt-4-turbo",
        temperature=0.7,
        max_tokens=4096,
    ),
    agent=AgentSpec(
        agent_id="goal-decomposer-v2",
        version="1.2.3",
        git_commit="abc123",
    ),
    global_seed=42,
)

# Get fingerprint for reproducibility tracking
print(f"Fingerprint: {decision_spec.fingerprint[:16]}...")

# Attach to transcript for full reproducibility
transcript = Transcript(
    task_id="task-1",
    final_output={"result": "..."},
    decision_spec=decision_spec,
)
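To illustrate what a SHA-256 configuration fingerprint involves, here is a minimal sketch that hashes a canonical JSON encoding of a plain dictionary. The `config_fingerprint` function and the dictionary layout are hypothetical; how DecisionSpec serialises its fields before hashing is not shown in this README.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of `config`."""
    # sort_keys and fixed separators make the encoding independent of dict order
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec_a = {"model": {"model_id": "gpt-4-turbo", "temperature": 0.7}, "global_seed": 42}
spec_b = {"global_seed": 42, "model": {"temperature": 0.7, "model_id": "gpt-4-turbo"}}
spec_c = {"model": {"model_id": "gpt-4-turbo", "temperature": 0.2}, "global_seed": 42}

print(config_fingerprint(spec_a) == config_fingerprint(spec_b))  # True: same content
print(config_fingerprint(spec_a) == config_fingerprint(spec_c))  # False: temperature differs
```

Canonicalising before hashing is what makes the fingerprint usable for comparison: two runs share a fingerprint only if every captured parameter matches.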

Grader Roles (Must-Pass vs Score-Contributor)

Graders can have two roles in composite evaluation:

- MUST_PASS: Safety/constraint graders. Any failure = trial fails.
- SCORE_CONTRIBUTOR: Quality graders. Contribute to weighted average.

from tracelens import CompositeGrader, GraderRole, GraderConfig

# Safety grader - must pass or entire trial fails
# (FormatValidationGrader stands for a grader you define yourself; it is not shown here)
safety_config = GraderConfig(role=GraderRole.MUST_PASS)
safety_grader = FormatValidationGrader("format", config=safety_config)

# Quality grader - contributes to score average
quality_config = GraderConfig(role=GraderRole.SCORE_CONTRIBUTOR)
quality_grader = SpecificityGrader("specificity", config=quality_config)

# Composite: safety failure = trial failure, quality affects score
composite = CompositeGrader(
    grader_id="combined",
    graders=[
        (safety_grader, 0.2),   # Weight still affects score
        (quality_grader, 0.8),  # Higher weight for quality
    ],
)

# grade() is awaited, so call it from inside an async function
outcome = await composite.grade(transcript, task)
# outcome.passed is False if safety_grader fails, regardless of the quality score
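The gating behaviour described above can be sketched in a few lines of plain Python. `GradeResult`, `combine`, and the `pass_threshold` default are hypothetical names introduced for illustration; the rule that a trial with all gates open passes when its weighted score reaches a threshold is an assumption, not documented CompositeGrader behaviour.

```python
from dataclasses import dataclass


@dataclass
class GradeResult:
    """Hypothetical per-grader result used only in this sketch."""
    passed: bool
    score: float      # Normalised to 0..1
    weight: float
    must_pass: bool


def combine(results: list[GradeResult], pass_threshold: float = 0.5) -> tuple[bool, float]:
    """Gate on must-pass graders, then take the weighted average score."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(r.score * r.weight for r in results) / total_weight
    gate_ok = all(r.passed for r in results if r.must_pass)
    return gate_ok and score >= pass_threshold, score
```

A failed safety grader weighted 0.2 alongside a quality grader scoring 0.9 at weight 0.8 still yields a weighted score of 0.72, but the combined result is a failure because the gate is closed.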

Baseline Regression Detection

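import sys

from tracelens.baselines import BaselineManager, RegressionDetector
# Assumption: the severity levels live in comparison.py next to RegressionDetector
from tracelens.baselines.comparison import RegressionSeverity

manager = BaselineManager("baselines/baselines.json")
baseline = manager.get_baseline("btc_backtest")

# current_results: the results of the latest evaluation run (not shown here)
detector = RegressionDetector(significance_level=0.05)
report = detector.compare(baseline, current_results)

if report.should_block_ci(threshold=RegressionSeverity.MODERATE):
    sys.exit(1)  # Block the PR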

Baseline Promotion (Canary vs Capability)

Baselines can be protected or auto-promoted based on their type:

- CANARY: Protected baselines that never auto-update. Manual promotion only.
- CAPABILITY: Track improvements over time. Auto-promote when criteria met.
- EXPERIMENTAL: For testing. No restrictions.

from tracelens.baselines import BaselineManager, BaselineType, PromotionPolicy

manager = BaselineManager("baselines/baselines.json")

# Create a canary baseline (protected, manual promotion only)
canary = manager.create_canary_baseline(
    task_id="critical_safety_check",
    metrics={"safety_score": 0.95},
)

# Create capability baseline with auto-promotion policy
policy = PromotionPolicy(
    allow_auto_promotion=True,
    min_improvement_relative=0.05,  # 5% improvement required
    min_samples=10,
    required_confidence=0.95,
)
capability = manager.create_capability_baseline(
    task_id="quality_benchmark",
    metrics={"quality_score": 0.75},
    policy=policy,
)

# Try auto-promotion (returns True if promoted)
promoted = manager.try_promote(
    task_id="quality_benchmark",
    new_metrics={"quality_score": 0.82},
    sample_count=15,
)
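The promotion decision can be pictured as a short sequence of checks. The sketch below is a hypothetical stand-in: `Policy` mirrors some of the PromotionPolicy fields shown above, `should_promote` is an invented name, the metric is assumed to be higher-is-better, and the `required_confidence` check is left out because its statistical test is not described here.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical stand-in mirroring the PromotionPolicy fields shown above."""
    allow_auto_promotion: bool = True
    min_improvement_relative: float = 0.05
    min_samples: int = 10


def should_promote(baseline: float, candidate: float, samples: int,
                   policy: Policy, is_canary: bool = False) -> bool:
    """Return True if the candidate metric may replace the baseline."""
    if is_canary or not policy.allow_auto_promotion:
        return False  # Canary baselines are promoted manually only
    if samples < policy.min_samples or baseline <= 0:
        return False  # Too few samples, or no meaningful relative gain
    relative_gain = (candidate - baseline) / baseline
    return relative_gain >= policy.min_improvement_relative
```

For the numbers in the example above, moving from 0.75 to 0.82 is a relative gain of roughly 9.3% over 15 samples, which clears both the 5% and the 10-sample requirements in this sketch.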

Statistical Inference (Bootstrap CI)

Research-grade statistical comparison with confidence intervals:

from tracelens.statistics.inference import (
    compare_metrics,
    compare_to_baseline_summary,
    estimate_metric,
)

# Compare current run against baseline with bootstrap CI
baseline_values = [0.72, 0.75, 0.71, 0.74, 0.73]
current_values = [0.78, 0.81, 0.79, 0.82, 0.80]

result = compare_metrics(
    baseline_values,
    current_values,
    confidence=0.95,
    compute_p_value=True,
)

print(f"Baseline: {result.baseline.mean:.3f} ± {result.baseline.std:.3f}")
print(f"Current:  {result.current.mean:.3f} ± {result.current.std:.3f}")
print(f"Difference: {result.difference:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"Effect size (Cohen's d): {result.effect_size:.2f}")
print(f"Significant improvement: {result.significant_improvement}")

# Get summary for CI reporting
summary = compare_to_baseline_summary(
    baseline_values,
    current_values,
    metric_name="quality_score",
)
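# Returns a one-line summary of this form (the CI, d, and p values shown are illustrative, not computed from the data above):
# "quality_score: 0.800 vs baseline 0.730 (Δ=+0.070, 95% CI [0.045, 0.095], d=1.23, p<0.05)"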
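For readers unfamiliar with the underlying statistics, here is a minimal standard-library sketch of a percentile bootstrap interval for the difference in means, plus Cohen's d with a pooled standard deviation. The function names, the resample count, and the fixed seed are assumptions for illustration; compare_metrics may use a different bootstrap variant or effect-size definition.

```python
import random
import statistics


def bootstrap_diff_ci(baseline, current, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap CI for mean(current) - mean(baseline)."""
    rng = random.Random(seed)  # Fixed seed keeps the interval reproducible
    diffs = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))  # Resample with replacement
        c = rng.choices(current, k=len(current))
        diffs.append(statistics.mean(c) - statistics.mean(b))
    diffs.sort()
    alpha = 1.0 - confidence
    lower = diffs[int(alpha / 2 * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def cohens_d(baseline, current):
    """Standardised mean difference using the pooled sample standard deviation."""
    n_b, n_c = len(baseline), len(current)
    pooled_var = ((n_b - 1) * statistics.variance(baseline)
                  + (n_c - 1) * statistics.variance(current)) / (n_b + n_c - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both samples are constant")
    return (statistics.mean(current) - statistics.mean(baseline)) / pooled_var ** 0.5
```

For the sample values above, the means are 0.73 and 0.80 and the pooled standard deviation is about 0.0158, so this definition gives d of roughly 4.4. An interval that excludes zero is the usual signal that the change is not just run-to-run noise.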

CI/CD Integration

GitHub Actions Workflow

- name: Run Evaluation
  run: |
    tracelens run \
      --eval-set eval/suite.json \
      --graders quality,personalization \
      --num-runs 5 \
      --baseline-check \
      --fail-on-regression moderate

- name: Comment on PR
  run: tracelens report --format github-pr

Regression Thresholds

Configure in baselines/thresholds.py:

THRESHOLDS = {
    "sharpe_ratio": {
        "direction": "higher_is_better",
        "absolute_threshold": -0.2,  # Block if drops by 0.2
        "relative_threshold": 0.10,   # Block if drops by 10%
    },
    "max_drawdown": {
        "direction": "closer_to_zero_is_better",
        "absolute_threshold": -0.05,
    },
}
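One way to read these rules is as a per-metric check against the baseline. The sketch below is an interpretation, not the TraceLens implementation: `is_regression` is a hypothetical name, a negative `absolute_threshold` is read as the largest tolerated worsening, and `relative_threshold` as the largest tolerated fractional worsening.

```python
def is_regression(baseline: float, current: float, rule: dict) -> bool:
    """Return True if `current` breaches any threshold in `rule`."""
    if rule["direction"] == "higher_is_better":
        change = current - baseline            # Negative means worse
    elif rule["direction"] == "closer_to_zero_is_better":
        change = abs(baseline) - abs(current)  # Negative means further from zero
    else:
        raise ValueError(f"unknown direction: {rule['direction']}")
    reference = abs(baseline)

    absolute = rule.get("absolute_threshold")
    if absolute is not None and change <= absolute:
        return True
    relative = rule.get("relative_threshold")
    if relative is not None and reference > 0 and -change / reference >= relative:
        return True
    return False
```

Under this reading, a Sharpe ratio falling from 1.5 to 1.2 breaches the absolute rule, while a fall from 1.0 to 0.85 stays within the absolute limit but breaches the 10% relative rule.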

Human Evaluation Calibration (Planned)

The human_eval/ module is planned but not yet implemented. Until it is available, the recommended approach is a weekly manual process for calibrating LLM graders:

1. Sample Selection: Select 20 diverse samples from recent eval runs.
2. Human Rating: Rate each sample on a 1-10 scale per dimension.
3. Correlation Analysis: Compare the LLM scores with the human scores.
4. Grader Tuning: Adjust the grader prompts if the correlation falls below 0.7.

See docs/accuracy.md for calibration best practices.
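Since the human_eval/ module does not exist yet, the correlation step has to be done by hand. A minimal sketch follows, assuming Pearson correlation is the agreement measure (the README does not say which coefficient is intended). `pearson_correlation` and `needs_tuning` are hypothetical helper names.

```python
import math


def pearson_correlation(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def needs_tuning(llm_scores, human_scores, min_correlation=0.7) -> bool:
    """Flag the grader for prompt tuning when agreement falls below the threshold."""
    return pearson_correlation(llm_scores, human_scores) < min_correlation
```

A rank-based coefficient such as Spearman's may suit 1-10 ratings better when only the ordering of samples matters; the 0.7 threshold applies the same way.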

Installation

To install the latest development version directly from GitHub:

# Using uv (recommended)
uv pip install git+https://github.com/ssf0409/tracelens.git

# With LLM support
uv pip install "tracelens[llm] @ git+https://github.com/ssf0409/tracelens.git"

# Or add to pyproject.toml
# dependencies = [
#     "tracelens @ git+https://github.com/ssf0409/tracelens.git",
# ]

To install the released package from PyPI:

uv pip install tracelens

Development Setup

# Clone and install
git clone https://github.com/ssf0409/tracelens.git
cd tracelens
uv pip install -e ".[dev]"

# Run tests
uv run pytest tests/ -v

# Run with Docker
docker compose run --rm test

Quick Start

import asyncio
from tracelens import (
    Task, EvalSet, SimpleAdapter, CodeGrader,
    EvaluationRunner, RunnerConfig, Transcript,
)
from tracelens.reporting.generator import ReportGenerator

# 1. Define tasks
tasks = [
    Task(name="Add 2+3", input_data={"a": 2, "b": 3}, metadata={"expected": 5}),
    Task(name="Add 10+20", input_data={"a": 10, "b": 20}, metadata={"expected": 30}),
]
eval_set = EvalSet(name="Math Suite", tasks=tasks)

# 2. Write a simple agent
async def math_agent(input_data: dict) -> dict:
    return {"answer": input_data["a"] + input_data["b"]}

adapter = SimpleAdapter(math_agent)

# 3. Write a grader
class MathGrader(CodeGrader):
    def compute_metrics(self, transcript: Transcript, task: Task) -> dict[str, float]:
        expected = task.metadata["expected"]
        actual = transcript.final_output["answer"]
        return {"correct": float(actual == expected)}

    def determine_pass(self, metrics: dict[str, float], task: Task) -> tuple[bool, float]:
        return metrics["correct"] == 1.0, metrics["correct"]

# 4. Run evaluation
runner = EvaluationRunner(adapter, [MathGrader("math")], RunnerConfig(num_runs=3))
batch = asyncio.run(runner.run(eval_set))

# 5. Generate report
gen = ReportGenerator()
report = gen.build_report(batch)
print(gen.render_markdown(report))

Five-minute version: examples/hello_world.py. Walkthrough: docs/getting-started.md.

Documentation

- Getting Started: Run your first eval in five minutes; the example ladder.
- Quickstart: Build a custom grader, JSON task loader, and CLI workflow.
- Supported Scenarios: Which agent-evaluation problems TraceLens is designed for.
- User Guide: Comprehensive framework guide.
- Evaluation Levels: Function, task, and system-level evaluation architecture.
- Accuracy Best Practices: LLM-judge calibration and grader drift.
- CI/CD Integration: GitHub Actions with regression gating.
- Examples: Four working scripts: hello_world.py → contract_eval.py → http_agent_eval.py → noise_aware_regression.py.
- Releasing: Maintainer guide for tag-driven PyPI releases.

Contributing

TraceLens is MIT licensed and open to contributions. Start with CONTRIBUTING.md, run the local verification gate, and open a focused PR:

uv run --frozen pytest -q
uv run --frozen ruff check src/ tests/ examples/ benchmarks/high-stakes-autonomous
uv run --frozen --extra dev mypy src/tracelens/

Security issues should be reported privately using SECURITY.md.

References

Anthropic: Demystifying Evals for AI Agents

Key Design Principles

From Anthropic's evaluation guide:

- Grade outcomes, not execution paths: Focus on what the agent produced.
- Handle non-determinism with pass@k and pass^k: Different metrics for capability vs reliability.
- Start with 20-50 real failure cases: Build from actual issues.
- Read transcripts regularly: Catch false signals and grader bugs.
- Calibrate with human evaluation: LLM graders drift without calibration.

Release history

0.1.1

May 20, 2026

This version

0.1.0

May 20, 2026

Download files

Source Distribution

tracelens-0.1.0.tar.gz (253.3 kB)

Uploaded May 20, 2026 Source

Built Distribution

tracelens-0.1.0-py3-none-any.whl (88.5 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file tracelens-0.1.0.tar.gz.

File metadata

Download URL: tracelens-0.1.0.tar.gz
Upload date: May 20, 2026
Size: 253.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracelens-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`cd7322e0b8e2414f9968ba06b7b75657792a6c7413c35fdc724c91e7af4118b8`
MD5	`50909a12e2739cf2db151843ad86d909`
BLAKE2b-256	`169c79108e6d4d9e5c9597033c2b7c8d0bed70e760f4a8d2a5c160a207d96cfe`

Provenance

The following attestation bundles were made for tracelens-0.1.0.tar.gz:

Publisher: release.yml on ssf0409/tracelens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracelens-0.1.0.tar.gz
- Subject digest: cd7322e0b8e2414f9968ba06b7b75657792a6c7413c35fdc724c91e7af4118b8
- Sigstore transparency entry: 1576202418
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: ssf0409/tracelens@57e44874a7ac779766be52498c598f2b48a26248
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/ssf0409
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@57e44874a7ac779766be52498c598f2b48a26248
- Trigger Event: push

File details

Details for the file tracelens-0.1.0-py3-none-any.whl.

File metadata

Download URL: tracelens-0.1.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 88.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tracelens-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b3835a771c1f0d68efa74e02292b8a144a7036c1ed3bcc0c626fba6c3b6f5c4f`
MD5	`8dae331f3c7ad7338e1fa3e1c26ce94a`
BLAKE2b-256	`96a8aeae2c4088979f923654a00b4601d07fb13326e60d932ba82b4188a1f669`

Provenance

The following attestation bundles were made for tracelens-0.1.0-py3-none-any.whl:

Publisher: release.yml on ssf0409/tracelens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tracelens-0.1.0-py3-none-any.whl
- Subject digest: b3835a771c1f0d68efa74e02292b8a144a7036c1ed3bcc0c626fba6c3b6f5c4f
- Sigstore transparency entry: 1576202470
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: ssf0409/tracelens@57e44874a7ac779766be52498c598f2b48a26248
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/ssf0409
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@57e44874a7ac779766be52498c598f2b48a26248
- Trigger Event: push

tracelens 0.1.0
