tracegauge

Three-axis efficiency scoring for Claude Code sessions — token economy, trajectory quality, deterministic waste. Runs entirely on your machine. Nothing leaves.

Quick start

pip install tracegauge

# Background watcher + localhost dashboard (http://127.0.0.1:4747/)
tes serve

# Score a single session
tes score ~/.claude/projects/<project-id>/<session-id>.jsonl

# Score all sessions in a project directory
tes score ~/.claude/projects/<project-id>/

# Machine-readable output
tes score <path> --json

# Version
tes --version

tes serve starts two things: a background scan loop that auto-scores finished Claude Code sessions (token economy + deterministic waste, judge OFF by default), and a web dashboard on http://127.0.0.1:4747/ where scores accumulate.

Scope & Limitations

Read this before installing. These are not caveats to hide — they're the honest picture of what the tool measures and where the calibration comes from.

Corpus caveat (token baselines). The token economy baselines are derived from one developer's 75 quality-gated Claude Code (CC) sessions, skewed toward high-intensity infrastructure and ML-ops work (GCP, Cloud Run, training pipelines). The B5 generalization validation, covering 172 independent developers (1,053 SWE-chat CC sessions), found a repeated-failed-retry rate of ~1.4%, versus 6.6% in the calibration pool, which is a high-waste infra outlier. A developer doing ordinary coding work may score below-band on the token axis without being inefficient; the baseline encodes "efficient under expert prompting on heavy infra work," not a universal reference.

No human accuracy validation. The trajectory judge (Qwen3-30B) is coherence-validated against a reference LLM (Spearman ρ ≈ 0.79), not calibrated to human expert labels. Positive verdicts (MUCH_BETTER/BETTER) are cross-model corroborated at 84–96%. Negative verdicts (WORSE/MUCH_WORSE) are model-dependent — treat them as a signal to review, not a ground truth.
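As a rough illustration of what a rank-correlation check between two judges involves, here is a minimal sketch in plain Python. The ordinal encoding in `VERDICT_RANK`, the helper names, and the sample verdicts are all assumptions made for this example; they are not taken from the tracegauge code or from the validation data.

```python
from statistics import mean

# Ordinal encoding of the five verdict labels (an illustrative choice).
VERDICT_RANK = {"MUCH_WORSE": 0, "WORSE": 1, "SIMILAR": 2, "BETTER": 3, "MUCH_BETTER": 4}


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical verdicts from two judges on the same six sessions.
judge = ["BETTER", "SIMILAR", "WORSE", "MUCH_BETTER", "SIMILAR", "MUCH_WORSE"]
reference = ["BETTER", "BETTER", "WORSE", "MUCH_BETTER", "SIMILAR", "WORSE"]
rho = spearman_rho(
    [VERDICT_RANK[v] for v in judge],
    [VERDICT_RANK[v] for v in reference],
)
print(f"Spearman rho = {rho:.2f}")
```

A high ρ says the two judges order sessions similarly. It says nothing about whether either ordering matches human judgment, which is the limitation stated above.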

Tiered judge. Token economy and deterministic waste run locally with no GPU and no network, so these axes are always available. The trajectory quality axis requires a local Ollama judge (~18 GB VRAM for Qwen3-30B). Without it, the trajectory axis prints UNAVAILABLE. For most users this is the expected, complete result, not an error.

What waste detection covers. The two waste detectors catch observable-invariant patterns only: exact-match retry loops with no state change, and redundant file reads where the content was unchanged. Judgment-of-progress waste (was this cycle productive? was this approach the right one?) is not covered — that requires human labeling and is out of scope.

The moat is the product. All scoring is local. Your session logs never leave your machine. No telemetry, no phone-home, and no network calls other than to the optional local Ollama endpoint. The localhost bind is enforced by construction, not configuration.

The three axes

No composite score. Three independent labeled signals, each with its own domain of validity.

Token economy

Compares the session's real token count (AI turns only; cache-read inflation removed) against the p25–p75 band for the same task type (ml-eval, debug-fix, infra-deploy, research-recon, feature-build). Verdicts: above_p75, within_band, below_p25, unavailable.

The verdict is unavailable when the session falls below the per-type p10 turn floor (the scope gate): the session is too short, relative to the reference distribution, to produce a meaningful comparison. This is not an error.

Domain of validity: calibrated to a high-waste infra/ML-ops corpus (one developer, 75 sessions). Interpret alongside the trajectory verdict.
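The band comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: `TaskBaseline`, `band_verdict`, the inclusive treatment of the band edges, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskBaseline:
    """Hypothetical per-task-type reference values (not the bundled format)."""
    p25_tokens: int
    p75_tokens: int
    p10_turn_floor: int


def band_verdict(tokens: int, turns: int, baseline: TaskBaseline) -> str:
    """Classify a session's token count against the p25-p75 band."""
    # Scope gate: too few turns to compare meaningfully.
    if turns < baseline.p10_turn_floor:
        return "unavailable"
    if tokens > baseline.p75_tokens:
        return "above_p75"
    if tokens < baseline.p25_tokens:
        return "below_p25"
    return "within_band"


# Illustrative numbers only; real baselines ship with the package.
debug_fix = TaskBaseline(p25_tokens=40_000, p75_tokens=120_000, p10_turn_floor=8)
print(band_verdict(tokens=95_000, turns=30, baseline=debug_fix))   # within_band
print(band_verdict(tokens=150_000, turns=30, baseline=debug_fix))  # above_p75
print(band_verdict(tokens=95_000, turns=3, baseline=debug_fix))    # unavailable
```

Checking the scope gate first means a very short session is never labeled below_p25 merely because it is short.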

Trajectory quality

A local Qwen3-30B judge scores the session's trajectory on purposefulness: MUCH_BETTER / BETTER / SIMILAR / WORSE / MUCH_WORSE.

Requires a local GPU (~18 GB VRAM). Without the judge, this axis is UNAVAILABLE — token and waste axes still run fully.

Domain of validity: positive signal cross-model corroborated (B3 report); negative signal is model-dependent. No human gold labels.

To enable:

# Install Ollama: https://ollama.ai
ollama pull qwen3:30b-a3b   # ~18 GB
tes score <path>             # judge auto-detected

Deterministic waste

Two observable-invariant detectors with proof turns attached to every event:

REPEATED-FAILED-RETRY — same shell command + same error output + no state change between retries. Validated across 172 developers (SWE-chat CC). ~1.4% of ordinary CC sessions; ~6.6% in our calibration pool (a high-intensity infra outlier).
REDUNDANT-READ — same file content read twice with no edit between reads (PATH-A: CC's own "File unchanged" verdict; PATH-B: content-match, gap ≤ 5 turns). Dual-format regex handles both pre- and post-v2.1.38 CC output.
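To make the REPEATED-FAILED-RETRY invariant concrete, here is a minimal sketch of the three conditions (same command, same error, no state change). The `ShellTurn` record and the `find_repeated_failed_retries` function are hypothetical simplifications for this example. The real detector works on Claude Code session digests whose structure is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShellTurn:
    """Hypothetical, simplified view of one shell tool call."""
    index: int
    command: str
    error: Optional[str]   # None when the command succeeded
    state_changed: bool    # True if an edit or write happened since the previous shell turn


def find_repeated_failed_retries(turns: list[ShellTurn]) -> list[tuple[int, int]]:
    """Return (first_turn, retry_turn) index pairs as proof for each event."""
    events = []
    previous = None
    for turn in turns:
        if (
            previous is not None
            and turn.error is not None
            and not turn.state_changed
            and turn.command == previous.command
            and turn.error == previous.error
        ):
            events.append((previous.index, turn.index))
        previous = turn
    return events


turns = [
    ShellTurn(0, "pytest -q", "ImportError: foo", False),
    ShellTurn(1, "pytest -q", "ImportError: foo", False),  # identical retry: event
    ShellTurn(2, "pytest -q", "ImportError: foo", True),   # an edit happened first: no event
    ShellTurn(3, "pytest -q", None, False),                # success: no event
]
print(find_repeated_failed_retries(turns))  # [(0, 1)]
```

Returning the turn indices alongside each event mirrors the "proof turns" idea: every flagged retry can be traced back to the exact turns that triggered it.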

Domain of validity: observable-invariant only. Fires conservatively — misses judgment-of-progress waste by design.
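The content-match path for redundant reads (PATH-B) might look something like the sketch below. Only the 5-turn gap comes from the description above. The `FileEvent` record, the use of SHA-256 digests, and keying reads by file path are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass

MAX_GAP_TURNS = 5  # gap threshold stated above


@dataclass(frozen=True)
class FileEvent:
    """Hypothetical, simplified file event extracted from a session."""
    turn: int
    kind: str          # "read" or "edit"
    path: str
    content: str = ""  # file content returned by a read


def find_redundant_reads(events: list[FileEvent]) -> list[tuple[int, int, str]]:
    """Return (first_turn, repeat_turn, path) for each content-matched re-read."""
    last_read: dict[str, tuple[int, str]] = {}  # path -> (turn, content digest)
    found = []
    for event in events:
        if event.kind == "edit":
            # An edit invalidates the earlier read, so a later read is not redundant.
            last_read.pop(event.path, None)
            continue
        digest = hashlib.sha256(event.content.encode("utf-8")).hexdigest()
        earlier = last_read.get(event.path)
        if earlier is not None:
            earlier_turn, earlier_digest = earlier
            if digest == earlier_digest and event.turn - earlier_turn <= MAX_GAP_TURNS:
                found.append((earlier_turn, event.turn, event.path))
        last_read[event.path] = (event.turn, digest)
    return found


events = [
    FileEvent(1, "read", "app.py", "print('hi')"),
    FileEvent(3, "read", "app.py", "print('hi')"),   # same content, gap 2: redundant
    FileEvent(4, "edit", "app.py"),
    FileEvent(5, "read", "app.py", "print('hi')"),   # edit in between: not redundant
    FileEvent(20, "read", "app.py", "print('hi')"),  # gap 15 exceeds the threshold
]
print(find_redundant_reads(events))  # [(1, 3, 'app.py')]
```

Bounding the gap keeps the detector conservative: a re-read many turns later may be a legitimate refresh of context rather than waste.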

`tes serve` — always-available local service

tes serve [--port PORT] [--scan-interval SECONDS] [--stability-window SECONDS] \
          [--cc-path PATH] [--db-path PATH] [--background-judge]

Watcher: scans ~/.claude/projects every 2 minutes (configurable), scores any session file stable for 5+ minutes (token + waste; judge OFF by default).
Dashboard: http://127.0.0.1:4747/ — session list, per-session three-axis detail with domain-of-validity notes inline, trend views.
Store: SQLite at ~/.tes/tes.db (WAL mode, so the watcher's writes and the dashboard's reads proceed concurrently without blocking each other).
Manual scores share the dashboard: tes score <path> results also write to the store.
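The watcher's stability check can be approximated as "the file has not been modified for at least the stability window." The sketch below assumes modification time is the signal and that session files end in `.jsonl`; the function name and the demo directory layout are hypothetical, and the actual watcher may use other criteria.

```python
import os
import tempfile
import time
from pathlib import Path

STABILITY_WINDOW_SECONDS = 5 * 60  # default stated above


def stable_session_files(
    root: Path, now: float, window: float = STABILITY_WINDOW_SECONDS
) -> list[Path]:
    """Return *.jsonl files under root not modified within the last `window` seconds."""
    return sorted(
        path
        for path in root.rglob("*.jsonl")
        if now - path.stat().st_mtime >= window
    )


# Demo on a throwaway directory rather than the real ~/.claude/projects.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    finished = root / "project-a" / "old.jsonl"
    active = root / "project-a" / "new.jsonl"
    finished.parent.mkdir()
    finished.write_text("{}\n")
    active.write_text("{}\n")
    now = time.time()
    os.utime(finished, (now - 600, now - 600))  # last touched 10 minutes ago
    stable_names = [p.name for p in stable_session_files(root, now)]
    print(stable_names)  # ['old.jsonl']
```

Waiting for a quiet period avoids scoring a session that is still being written, at the cost of a delay before new scores appear on the dashboard.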

Moat properties: binds 127.0.0.1 only (never exposed to external interfaces), no data leaves the machine, redaction on by default at ingestion.
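One way to read "enforced by construction" is that the bind address is a constant in the code rather than a user-settable option. The following standard-library sketch illustrates that idea only; `DashboardHandler`, `make_server`, and `LOOPBACK_HOST` are hypothetical names and do not describe how tracegauge's dashboard is actually built.

```python
import ipaddress
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

LOOPBACK_HOST = "127.0.0.1"  # fixed in code; deliberately not a parameter


class DashboardHandler(BaseHTTPRequestHandler):
    """Placeholder handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int) -> ThreadingHTTPServer:
    """Build a server whose bind address is a constant, so only the port varies."""
    return ThreadingHTTPServer((LOOPBACK_HOST, port), DashboardHandler)


server = make_server(0)  # port 0 lets the OS choose a free port for this demo
bound_host, bound_port = server.server_address[:2]
assert ipaddress.ip_address(bound_host).is_loopback
print(f"listening on http://{bound_host}:{bound_port}/")
server.server_close()
```

Because callers can choose only the port, no flag or config value can cause the server to listen on an external interface.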

To enable the trajectory judge in the background watcher:

tes serve --background-judge
# WARNING: runs qwen3:30b-a3b (~18 GB VRAM) on your GPU for every new session continuously.

What this does NOT do

No composite efficiency score. The three axes are independent by design — a single number would hide the axis-specific domain limitations.
No claim to catch all inefficiency. The waste detectors fire on observable-invariant patterns only.
No accuracy guarantee on the trajectory axis. It's an LLM judge, coherence-validated, not human-calibrated.
No data contribution / cloud scoring. The tool is local-only. A voluntary corpus contribution mechanism is on the roadmap (opt-in, redacted digests only) but not built.
No cross-agent support yet. The CC adapter is Claude Code–specific; OpenCode/Codex/Aider would need their own adapters and re-validation.

SDK usage

from tes import load_baselines, score_session, JudgeConfig
from tes.adapt import adapt_session
from tes.baselines import BUNDLED_BASELINES_PATH
from tes.waste import detect_repeated_failed_retry, detect_redundant_read, build_waste_entry

baselines = load_baselines(BUNDLED_BASELINES_PATH)
record = adapt_session("path/to/session.jsonl")   # secrets redacted at ingestion

session_id = record["session_id"]
turns = record["digest"]["turns"]
waste_entry = build_waste_entry(session_id, turns)

# Optional: trajectory judge (returns None → UNAVAILABLE when no local judge)
from tes.judge import score_trajectory
judge_entry = score_trajectory(record)

result = score_session(record, baselines, judge_entry=judge_entry, waste_entry=waste_entry)
print(result.band_verdict)        # "within_band" | "above_p75" | "below_p25" | "unavailable"
print(result.judge_verdict)       # "BETTER" | None
print(result.waste_event_count)   # int
print(result.token_domain_of_validity)   # caveat string, always populated

Validation

The scoring components were validated through a five-phase credibility arc (B1–B5) before packaging. Key results:

Token baselines (B2): 75 quality-gated CC sessions, 5 task types, scope gates at per-type p10 turn floor. See research/08-baselines.md.
Trajectory judge (B3): Cross-model corroboration. Positive verdicts: 84% strict / 96% top-2. Negative verdicts model-dependent. No human gold. See research/09-cross-model.md.
Deterministic waste (B4): REPEATED-FAILED-RETRY (RFR) fired in 12 of 181 pool sessions (6.6%); REDUNDANT-READ (RR) fired in 20 of 181 (11.0%). Observable-invariant boundary documented. See research/10-deterministic-waste.md.
Generalization (B5): RFR and REDUNDANT-READ PATH-A validated across 172 developers (1,053 SWE-chat CC sessions). The rate gap (6.6% pool vs 1.4% SWE-chat) is explained by corpus characterization: the pool is a high-waste infra outlier. Cross-agent generalization is inconclusive (the parquet data lacks tool_result rows for OpenCode/Codex). See research/11-generalization.md.
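The pool rates quoted above follow directly from the fired/total counts. A small sketch of the arithmetic is shown here. The helper name is hypothetical, and the SWE-chat figure is used only as the rounded ~1.4% reported above, since the underlying count is not given.

```python
def rate_percent(fired: int, total: int) -> float:
    """Share of sessions in which a detector fired, as a percentage."""
    return 100 * fired / total


rfr_pool = rate_percent(12, 181)
rr_pool = rate_percent(20, 181)
print(f"RFR pool rate: {rfr_pool:.1f}%")  # RFR pool rate: 6.6%
print(f"RR pool rate:  {rr_pool:.1f}%")   # RR pool rate:  11.0%

# Ratio against the rounded SWE-chat rate; approximate because 1.4 is itself rounded.
print(f"Pool vs SWE-chat RFR ratio: about {rfr_pool / 1.4:.1f}x")
```

The calibration pool's retry rate is several times the SWE-chat rate, which is why the pool is described as a high-waste outlier rather than a typical reference.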

License

AGPL-3.0: free to use and self-host; if you offer a modified version to users over a network, you must make its source available under the same license.

Roadmap

Corpus de-biasing: voluntary opt-in digest contribution (no source code, redacted) to build a broader calibration baseline. Not built yet.
Smaller judge: a laptop-runnable quantized model for the trajectory axis (requires a new B3-equivalent corroboration run, not a swap).
Cross-agent support: adapters for OpenCode, Codex, Aider once tool_result data is available for re-validation.
tes install-hook: explicit opt-in SessionEnd hook for zero-latency scoring (modifies ~/.claude/settings.json only on user request).

Recommended user follow-ups (not built): register tracegauge.dev; lawyer review of AGPL terms before any commercial raise.

Version 0.1.0, released Jun 7, 2026.
