# TraceRazor

Token efficiency auditing and adaptive sampling for production AI agents.
TraceRazor does two things:

- **Audit** your agent's traces to find wasted tokens, detect tool misfires and reasoning loops, generate fix patches, and estimate cost savings.
- **Sample** more reliably by running K parallel LLM candidates per step and picking the consensus winner, improving task pass rates without changing your agent's logic.

The two features are independent: use one, the other, or both.
## Install

```shell
pip install tracerazor
```

Install with optional dependencies as needed:

```shell
pip install "tracerazor[openai]"     # OpenAI adapter
pip install "tracerazor[anthropic]"  # Anthropic adapter
pip install "tracerazor[langgraph]"  # LangGraph integration
pip install "tracerazor[http]"       # HTTP mode for remote server
pip install "tracerazor[all]"        # Everything
```
## Audit quickstart

Record steps manually with `Tracer`, then call `analyse()` to get a report:

```python
from tracerazor import Tracer

with Tracer(agent_name="support-agent", framework="openai") as t:
    response = llm.invoke(prompt)
    t.reasoning(response.text, tokens=response.usage.total_tokens)

    result = lookup_order(order_id="ORD-123")
    t.tool("lookup_order", params={"order_id": "ORD-123"},
           output=str(result), success=True, tokens=80)

report = t.analyse()
print(report.summary())
# TAS 81.4/100 [Good] | 2 steps, 900 tokens | Saved 140 tokens (16%)

report.assert_passes()  # raises AssertionError in CI if TAS < 70
```
The `Tracer` submits the trace to the local `tracerazor` binary (CLI mode) or to a running `tracerazor-server` (HTTP mode). Build the binary with:

```shell
cargo build --release
```

Or point to an existing binary:

```shell
export TRACERAZOR_BIN=/path/to/tracerazor
```
## Sampling quickstart

`AdaptiveKNode` is a drop-in replacement for a LangGraph ReAct node. It samples K parallel LLM candidates at each step and picks the consensus winner.

```python
from tracerazor import AdaptiveKNode, openai_llm
from openai import AsyncOpenAI
from langchain_core.messages import HumanMessage
from langgraph.graph import StateGraph

llm = openai_llm(AsyncOpenAI(), model="gpt-4.1")
node = AdaptiveKNode(llm=llm, tools=my_tools, k_max=5, k_min=2)

graph = StateGraph(AgentState)
graph.add_node("agent", node)
# ... add edges and compile as usual ...

result = await graph.ainvoke({"messages": [HumanMessage(content="...")]})
print(result["consensus_report"].summary())
```
K adapts automatically: it shrinks toward `k_min` when all candidates agree (saving tokens), and resets to `k_max` after a divergent vote or a state-mutating tool call (e.g. booking a flight, cancelling an order).
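The shrink/reset policy can be sketched in plain Python. This is an illustration of the rule described above, not TraceRazor's internal implementation; the class name and the one-step shrink schedule are assumptions.

```python
class AdaptiveK:
    """Illustrative controller: shrink K on agreement, reset on divergence or mutation."""

    def __init__(self, k_min: int = 2, k_max: int = 5):
        self.k_min, self.k_max = k_min, k_max
        self.k = k_max  # start wide

    def update(self, votes: list[str], tool_mutates: bool) -> int:
        """Return the K to use for the next step, given this step's outcome."""
        unanimous = len(set(votes)) == 1
        if tool_mutates or not unanimous:
            # Divergent vote or state-mutating tool: go back to full width.
            self.k = self.k_max
        else:
            # All candidates agreed: shrink one step toward k_min to save tokens.
            self.k = max(self.k_min, self.k - 1)
        return self.k
```

With `k_min=2, k_max=5`, two unanimous steps bring K from 5 down to 3; a single split vote or mutating tool call snaps it back to 5.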
## Baselines

Use `NaiveKEnsemble` and `SelfConsistencyBaseline` to benchmark your setup:

```python
from tracerazor import NaiveKEnsemble, SelfConsistencyBaseline
```

`NaiveKEnsemble` runs K independent full-task agents and picks the majority result. `SelfConsistencyBaseline` uses a single deterministic tool-calling pass, then re-samples the final response K times.
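At the core of both baselines is a majority vote over K candidate outputs. A minimal exact-match version looks like this (illustrative only; the library's actual comparison and normalization rules are not shown here):

```python
from collections import Counter

def majority_vote(candidates: list[str]) -> str:
    """Return the most common candidate; ties break toward the earliest-seen one."""
    counts = Counter(candidates)
    top = max(counts.values())
    # Counter preserves first-insertion order, so this scan is a stable tie-break.
    return next(c for c in counts if counts[c] == top)
```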
On the tau-bench airline benchmark (50 tasks, gpt-4o):
| Strategy | pass^1 | mean tokens | vs baseline |
|---|---|---|---|
| K=1 baseline | 38% | 63k | 1.0x |
| NaiveKEnsemble (K=5) | 40% | 282k | 4.5x |
| AdaptiveKNode (K=5) | 46% | 246k | 3.9x |
| SelfConsistency (K=5) | 48% | 137k | 2.2x |
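The "vs baseline" column is simply mean tokens relative to the K=1 run:

```python
baseline = 63  # mean tokens (thousands) for the K=1 run
for name, toks in [("NaiveKEnsemble", 282),
                   ("AdaptiveKNode", 246),
                   ("SelfConsistency", 137)]:
    print(f"{name}: {toks / baseline:.1f}x")
# NaiveKEnsemble: 4.5x
# AdaptiveKNode: 3.9x
# SelfConsistency: 2.2x
```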
## Audit API

| Name | Description |
|---|---|
| `Tracer` | Context manager for recording steps and submitting for analysis |
| `TraceRazorClient` | Lower-level client for submitting trace dicts directly |
| `TraceRazorReport` | Parsed audit result with TAS score, metrics, fixes, and savings |
| `TraceStep` | Data class for a single recorded step |
## Sampling API

| Name | Description |
|---|---|
| `AdaptiveKNode` | LangGraph node with per-step adaptive parallel sampling |
| `ExactMatchConsensus` | Aggregates K branch proposals by exact-match comparison |
| `MutationMetadata` | Classifies tools as mutating vs read-only |
| `NaiveKEnsemble` | K independent full-task agents, majority vote |
| `SelfConsistencyBaseline` | K re-samples of the final response only |
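`MutationMetadata`'s job, deciding which tools can change external state, can be illustrated with a simple name-based heuristic. This is a hypothetical sketch, not the library's actual classification rules:

```python
# Prefixes we treat as obviously read-only; everything else is assumed mutating.
READ_ONLY_PREFIXES = ("get_", "list_", "lookup_", "search_", "read_")

def is_mutating(tool_name: str) -> bool:
    """Safe default: anything that is not an obvious read counts as state-mutating."""
    return not tool_name.startswith(READ_ONLY_PREFIXES)
```

Defaulting unknown tools to "mutating" is the conservative choice here: a false positive only resets K to `k_max`, while a false negative could let a divergent branch book a flight.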
## LLM adapters

| Name | Description |
|---|---|
| `openai_llm` | Adapter factory for `AsyncOpenAI` |
| `anthropic_llm` | Adapter factory for `AsyncAnthropic` |
| `mock_llm` | Deterministic mock for tests and offline demos |
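For offline tests, a deterministic stub in the spirit of `mock_llm` can be as simple as an async callable that replays canned responses (illustrative only; the real adapter's signature is not documented here):

```python
def make_mock_llm(responses: list[str]):
    """Return an async callable that replays canned responses in order."""
    it = iter(responses)

    async def mock_llm(prompt: str) -> str:
        # Ignore the prompt and return the next canned response.
        return next(it)

    return mock_llm
```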
## License

MIT. Copyright (c) 2024 Zulfaqar Hafez.