CLI for tracking AI agent task metrics: token cost, retry pressure, and outcome quality.

These details have not been verified by PyPI

Project description

ai-agents-metrics

Analyze your AI agent work history. Track spending. Optimize your workflow.

AI is writing more of your code. You still don't know:

How many attempts each task actually takes
Where the process breaks down and why
Whether your workflow is getting faster or generating more rework

ai-agents-metrics extracts these signals from your existing Claude Code or Codex history — no manual setup required. Point it at your history files and see what's happening: retry pressure, token cost, session timeline. For richer tracking, add explicit goal boundaries and outcome labels on top.

Why this exists

AI coding tools optimize for code generation. That is not the same as optimizing the development system around AI.

A coding agent can succeed at the individual task while the overall workflow degrades — more attempts per goal, more correction passes, more cost per shipped unit.

This project surfaces the signals that matter at the workflow level:

how many attempts goals require,
where retries and failures cluster,
whether outcomes are matching the requested result,
whether cost is trending in the right direction.

The key insight: these signals are already present in your agent history files. ai-agents-metrics extracts them. You do not need to set up manual tracking to get value — though you can add it for explicit goal classification and outcome labels.

It is not a benchmark, an eval framework, or a model comparison tool. It is a local analysis tool for real engineering work done with AI.

What It Tracks

From history files (no setup required):

Retry pressure — how often attempts fail or require correction, derived from session history
Token cost — input, output, and cached-input tokens per session, mapped to USD
Model usage — which model ran each session and what it cost
Session timeline — full activity history from first ingest

With optional manual tracking:

Goals and attempts — what was requested, how many passes it took, and whether each pass succeeded
Outcome quality — result-fit labels for product goals (exact_fit, partial_fit, miss)
Failure reasons — classified cause when an attempt does not succeed

Findings from real data

Running this pipeline against a single developer's 6-month history (3.85B tokens, 160 threads, 334 sessions) surfaced several mechanisms that affect anyone analyzing Claude Code or Codex history — not just this dataset. Each finding is reproducible from raw history files:

#	Finding	Headline
F-001	100% of Claude "retries" are subagent spawns, not user retries	`0 / 160` threads have `main_attempt_count > 1`
F-002	Claude's `role='user'` is mostly not human-typed	86.7% of `role='user'` events are `tool_result` or template
F-003	Naive practice-effectiveness split is size-confounded	20× naive gap collapses to ~2.5× after size-matching; ~half is subagent overhead
F-004	Cross-thread file-rework signal is detectable but N=66 too small for effectiveness claims	61% of implementation threads have a rework follow-up within 30 days
F-005	Discovery and commit-automation dominate real AI-coding practice usage	29% of threads use any practice; discovery=39% of events, code_review=14%
F-006	AI-agent retros overwhelmingly describe meta-tooling failures, not code failures	58% of 59 retros land in 4 meta-tooling themes (packaging, lifecycle, policy, data)
F-007	Within-thread, messages near a practice event use half the tokens of messages far from any practice	median 2.05× gap, 95% CI [1.23×, 2.47×], p=0.000456; Agent events (2.01× agg) ≫ Skill events (1.20× agg) as predicted by compression mechanism
F-008	Per-skill decomposition: subagent-style practices compress 2.5-3.25×, commit skills don't compress (point estimate <1.0×)	`Explore` 2.63× CI [1.71, 3.51] with 13/13 sign test; `code-reviewer` 3.25×; `code-review:code-review` 2.55×; vs `commit` 0.72×, `commit-push-pr` 0.76×

Full index: docs/findings/README.md.

These are N=1 findings about the data, not universal claims. The mechanisms (subagent-aliased retries, role='user' pollution, size-confounded splits) are properties of the underlying tools — they will show up on anyone's history.

Capabilities

Capability	Status
History ingestion from Claude Code and Codex transcripts	Available
Structural session-kind classification (main vs subagent)	Available
Retry pressure derived from session history	Available
Cost and token tracking from history	Available
Automatic cost sync from Claude Code telemetry	Available
Automatic cost sync from Codex telemetry	Available
Before/after workflow comparison	Available
Interactive HTML report with trend charts	Available
Goal and attempt lifecycle CLI (opt-in manual tracking)	Available
Append-only local event log (NDJSON)	Available
Repository bootstrap (`bootstrap` command)	Available
Optional markdown report export	Available
Shell completion (bash, zsh)	Available
Standalone binary packaging	Available
Hosted multi-user dashboards	Not planned
Centralized team analytics	Not planned

Quick Start

pipx install ai-agents-metrics   # macOS / modern Linux (recommended)
# pip install ai-agents-metrics  # inside a virtualenv

# Run the full history pipeline (reads ~/.codex + ~/.claude by default):
ai-agents-metrics history-update
# Or restrict to a single source:
ai-agents-metrics history-update --source codex    # Codex only (~/.codex)
ai-agents-metrics history-update --source claude   # Claude Code only (~/.claude)

# See retry pressure, token cost, and session timeline
ai-agents-metrics show

That's it — no prior setup required. If you want to also add explicit goal tracking on top, see Track a Session.

Example Output

$ ai-agents-metrics show

Codex Metrics Summary
Operational summary:
Closed goals:           17
Successes:              15
Fails:                  2
Success Rate:           88.24%
Known total cost (USD): 312.59
Known total tokens:     776,219,632
By model:
  claude-sonnet-4-6: 16 closed, 14 successes, 2 fails

History signals (warehouse):
  Project threads:           87  (worktrees merged)
  Threads with main_attempt > 1: 0 / 87 (structural retries)
  Subagent spawns:             156 sessions (Agent tool_use)
  Per-goal alignment:          16 / 17 ledger goals matched to history window

The History signals section is derived directly from session history files — no manual tracking required. The pipeline runs a deterministic filename-based classifier that separates main attempts (user-driven retries) from subagent spawns (internal delegation). A naive file-per-attempt count would conflate the two — see F-001.

Install

Install from PyPI:

pipx install ai-agents-metrics   # macOS / modern Linux (recommended)
pip install ai-agents-metrics    # inside a virtualenv

Install from source:

python -m pip install -e .

Install the standalone binary:

make package-standalone
./dist/standalone/ai-agents-metrics install-self

Analyze History

Extract metrics from existing agent session files in one command:

ai-agents-metrics history-update                   # reads ~/.codex + ~/.claude (all sources)
ai-agents-metrics history-update --source codex    # restrict to Codex only
ai-agents-metrics history-update --source claude   # restrict to Claude Code only

Or run the three stages individually (ingest → normalize → derive):

ai-agents-metrics history-ingest
ai-agents-metrics history-normalize
ai-agents-metrics history-derive

Then inspect results:

ai-agents-metrics show

Compare the structured event log against reconstructed history to find gaps:

ai-agents-metrics history-compare

Analyze before/after product metrics around each retrospective event:

ai-agents-metrics derive-retro-timeline

Inspect Metrics

Print a summary of retry pressure, costs, and session history:

ai-agents-metrics show

Explain missing cost coverage and check whether it is recoverable from local agent logs:

ai-agents-metrics audit-cost-coverage

Regenerate the optional markdown report (opt-in, requires manual tracking bootstrap):

ai-agents-metrics render-report

Generate a self-contained interactive HTML report with four trend charts (goal types, retry pressure, cost stacked by model, cost per success) and a summary strip showing total cost, success rate, and cost trend:

ai-agents-metrics render-html
ai-agents-metrics render-html --output report.html --days 30

The report reads token and retry data from the local warehouse when available (full history) and falls back to the event log. The output file has no external dependencies — open it in any browser.

Bootstrap a Repository (opt-in)

To also enable manual goal tracking, run once to scaffold ai-agents-metrics into a repository. Creates the event log, installs the policy document, and injects an agent instructions block:

ai-agents-metrics bootstrap --target-dir /path/to/repo --dry-run
ai-agents-metrics bootstrap --target-dir /path/to/repo

Safe to rerun on a partially initialized repository. Use --dry-run to preview what will be written without making changes.

Track a Session

History extraction gives you retry pressure, cost, and session timelines — but it cannot tell you which specific tasks had the worst outcomes or why a particular session failed. Manual goal tracking adds that layer: per-task breakdowns, outcome quality labels (exact_fit, partial_fit, miss), and classified failure reasons.

If your history shows 32% retry pressure, manual tracking tells you whether it's coming from unclear requirements, model mistakes, or scope problems — and on which tasks.

Concepts

Concept	Meaning
goal	One requested outcome
attempt	One implementation pass or retry for a goal
outcome	Final result: `success` or `fail`
result fit	Quality label: `exact_fit`, `partial_fit`, or `miss` — a goal can succeed but still be a partial fit
failure reason	Primary cause when an attempt fails: `model_mistake`, `unclear_task`, `validation_failed`, `environment_issue`, `scope_too_large`, `missing_context`, `tooling_issue`, `other`

Start a goal

Record a new goal before implementation begins:

ai-agents-metrics start-task --title "implement login endpoint" --task-type product

Goal types: product for delivery work, retro for retrospective writeups, meta for bookkeeping and tooling work.

Record a correction pass

If the first attempt needed correction, record the retry:

ai-agents-metrics continue-task --task-id 2026-04-08-001 --failure-reason model_mistake

Close the goal

When the goal is complete, close it with an outcome and optional quality label:

ai-agents-metrics finish-task --task-id 2026-04-08-001 --status success --result-fit exact_fit
ai-agents-metrics finish-task --task-id 2026-04-08-001 --status fail --failure-reason unclear_task

Ensure bookkeeping is in place

If work has already started without an active goal, use this to detect and create a recovery draft:

ai-agents-metrics ensure-active-task

Backfill cost data

Backfill token and cost data from local agent telemetry into existing goal records. Supports Claude Code and Codex automatically — no provider flag required:

ai-agents-metrics sync-usage

Audit the ledger

Flag suspicious ledger patterns — likely misses, stale in-progress goals, and low cost coverage:

ai-agents-metrics history-audit

Privacy and Storage

All data stays local. ai-agents-metrics writes only to:

.ai-agents-metrics/warehouse.db — local SQLite warehouse used by the history pipeline
metrics/events.ndjson — append-only event log for manual goal tracking (opt-in)
docs/ai-agents-metrics.md — optional markdown export (regenerated on demand)

No data is sent to any remote service. The event log is a plain NDJSON file you can read, audit, and version-control yourself.

Verify Your Install

make verify

Runs lint, security scan, typecheck, tests, and the public boundary check.

Public Boundary

This repository contains the public-safe core only. Private retrospectives, internal audits, and local metrics history are kept in a separate private overlay. The boundary is enforced automatically:

make verify-public-boundary

Repository

github.com/sg4tech/ai-agents-metrics

Releasing

Publishing to PyPI is automated via GitHub Actions. To release a new version:

Merge all changes to main on this repository
Create and push a tag: git tag v0.x.y && git push public v0.x.y
The publish workflow triggers automatically on the tag and publishes to PyPI via Trusted Publishing — no credentials required locally

Manual publishing with twine is not needed — the workflow handles it.

Contributing

Read CONTRIBUTING.md. In short: keep changes public-safe, run make verify, include tests for behavior changes.

Security

See SECURITY.md for how to report potential private-data leaks or security issues.

Changelog

Notable public changes are tracked in CHANGELOG.md.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.2.2

Apr 21, 2026

0.2.1

Apr 20, 2026

This version

0.2.0

Apr 20, 2026

0.1.5

Apr 13, 2026

0.1.4

Apr 12, 2026

0.1.3

Apr 12, 2026

0.1.2

Apr 12, 2026

0.1.1

Apr 10, 2026

0.1.0

Apr 9, 2026

0.0.0.dev0 pre-release

Apr 9, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ai_agents_metrics-0.2.0.tar.gz (296.0 kB view details)

Uploaded Apr 20, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ai_agents_metrics-0.2.0-py3-none-any.whl (148.3 kB view details)

Uploaded Apr 20, 2026 Python 3

File details

Details for the file ai_agents_metrics-0.2.0.tar.gz.

File metadata

Download URL: ai_agents_metrics-0.2.0.tar.gz
Upload date: Apr 20, 2026
Size: 296.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ai_agents_metrics-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`cdb8d65185e1d28661fdabdac0b4c3615a8ee9e8f21c6dd1023d9993c5ec7808`
MD5	`81f9ab589fdab4b8db229ac2485235d5`
BLAKE2b-256	`c158b5e4b42f185762f46e1e130679df0146d653cf2e27df011e4483ae0ae8ff`

See more details on using hashes here.

Provenance

The following attestation bundles were made for ai_agents_metrics-0.2.0.tar.gz:

Publisher: publish.yml on sg4tech/ai-agents-metrics

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ai_agents_metrics-0.2.0.tar.gz
- Subject digest: cdb8d65185e1d28661fdabdac0b4c3615a8ee9e8f21c6dd1023d9993c5ec7808
- Sigstore transparency entry: 1342323214
- Sigstore integration time: Apr 20, 2026
Source repository:
- Permalink: sg4tech/ai-agents-metrics@26dee6daeb02ec0397e48789c17852ab30c6a248
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/sg4tech
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@26dee6daeb02ec0397e48789c17852ab30c6a248
- Trigger Event: push

File details

Details for the file ai_agents_metrics-0.2.0-py3-none-any.whl.

File metadata

Download URL: ai_agents_metrics-0.2.0-py3-none-any.whl
Upload date: Apr 20, 2026
Size: 148.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ai_agents_metrics-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`79a054f40a6e37ddf53432e29740a2cb2635db7193c262327df5280361f020e8`
MD5	`ff2ab5a0f18fd79186e9ddcae055f082`
BLAKE2b-256	`c64cff0559d8eca6c98c87d139b8c77c420bbc8896e75db90fed337211679daf`

See more details on using hashes here.

Provenance

The following attestation bundles were made for ai_agents_metrics-0.2.0-py3-none-any.whl:

Publisher: publish.yml on sg4tech/ai-agents-metrics

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ai_agents_metrics-0.2.0-py3-none-any.whl
- Subject digest: 79a054f40a6e37ddf53432e29740a2cb2635db7193c262327df5280361f020e8
- Sigstore transparency entry: 1342323222
- Sigstore integration time: Apr 20, 2026
Source repository:
- Permalink: sg4tech/ai-agents-metrics@26dee6daeb02ec0397e48789c17852ab30c6a248
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/sg4tech
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@26dee6daeb02ec0397e48789c17852ab30c6a248
- Trigger Event: push

ai-agents-metrics 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

ai-agents-metrics

Why this exists

What It Tracks

Findings from real data

Capabilities

Quick Start

Example Output

Install

Analyze History

Inspect Metrics

Bootstrap a Repository (opt-in)

Track a Session

Concepts

Start a goal

Record a correction pass

Close the goal

Ensure bookkeeping is in place

Backfill cost data

Audit the ledger

Privacy and Storage

Verify Your Install

Public Boundary

Repository

Releasing

Contributing

Security

Changelog

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance