Local-first agent analytics with prompt diagnostics

AgentFluent

Local-first agent analytics with behavior-to-improvement diagnostics. The tools that exist tell you what your agent did — AgentFluent tells you how to make it better.

Badges: PyPI version | Python versions | CI | License: MIT

AI agents are in production at 57% of organizations, and quality is the single top barrier to deployment. When an agent misbehaves — wrong tool choice, retry loops, hallucinated outputs — developers iterate on prompts blind. Existing observability platforms show what happened: traces, latency, token counts. They don't tell you why the agent misbehaved or what in its configuration to change.

AgentFluent reads your local Claude Code and Claude Agent SDK session JSONL, extracts agent invocations and tool patterns, scores each agent's configuration against a best-practice rubric, and correlates observed behavior back to specific fixes — a prompt gap, a missing tool constraint, or a stale model selection. No cloud services, no API keys, no data leaves your machine.

Born from CodeFluent research that identified the agent-quality gap in 2026. See docs/AGENT_ANALYTICS_RESEARCH.md for additional market analysis.

What AgentFluent Scores

Every recommendation lands on one of three axes. CLI output prefixes each finding with [cost], [speed], or [quality] so you can prioritize by what matters right now:

| Axis | What it tracks | Example finding |
| --- | --- | --- |
| [cost] | tokens, cache efficiency, model fit, offload candidates | This agent uses Opus where Sonnet would do |
| [speed] | duration, retry density, tool-call churn, stuck patterns | This agent retries Bash 5× before giving up |
| [quality] | user mid-flight corrections, file rework, reviewer-caught rate | This agent ships work that gets immediately rewritten |

The three often trade off — saving cost can hurt quality, chasing speed can hurt cost. AgentFluent surfaces the trade-off rather than collapsing it to a single score.

Positioning

The agent observability space is crowded — several tools capture what agents do. None diagnose why they misbehave or what to change from locally-persisted session data. In the table below, "What's missing" is what the tool does not do (not what it provides):

| Tool | What it measures | What's missing |
| --- | --- | --- |
| Langfuse / LangSmith / Arize Phoenix | Production traces, latency, token counts, errors | Behavior-to-prompt diagnosis; local agent config audit |
| Braintrust / Galileo / DeepEval | LLM-as-judge scoring against rubrics | Requires cloud instrumentation and author-provided test sets; no local agent config audit |
| ccusage / claude-code-analytics / agents-observe | Usage stats, token counts, subagent trees | Quality scoring; actionable config recommendations |
| claude-code-otel | OpenTelemetry export of Claude Code sessions | Analysis itself; it is a bridge to other tools |
| Anthropic Console | Per-request cost, rate-limit tracking | Session-level diagnostics; agent config recommendations |

Use Langfuse/Phoenix for production traces, Braintrust for test-set evals, ccusage for usage dashboards, and AgentFluent for what in the agent's config to change. The question "my Agent SDK agent ran 500 sessions last week — were any of them actually good, and how do I update the configuration to make them better?" has no answer from the tools above. AgentFluent is built to answer it.

What makes AgentFluent different

  • The config is the agent. In interactive sessions the human course-corrects mid-flight; in programmatic agents the prompt and tool setup are the agent, and a flaw compounds at scale. AgentFluent scores description, allowed_tools / disallowedTools, model, and prompt on every agent definition, and audits MCP server configuration (configured-but-unused, observed-but-missing) against real tool usage.
  • Behavior-to-improvement, not just traces. When the agent retries Bash 40% of the time, AgentFluent tells you which prompt clause is missing — not just that the retry happened. Every diagnostic maps to a specific gap; see docs/AGENT_ANALYTICS_RESEARCH.md for the feasibility and positioning analysis.
  • JSON envelope as a contract. A stable {version, command, data} schema lets you build PR gates, trend dashboards, and regression detectors on top without tracking AgentFluent's internal refactors.
  • CLI-native and local by default. agentfluent analyze --format json | jq ... fits agent developer workflows (terminal, CI/CD, PR checks). No outbound network calls unless you explicitly opt in via --github (Tier 3 GitHub enrichment, v0.8+); no cloud services, no API key required.
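The MCP audit named in the first bullet (configured-but-unused, observed-but-missing) comes down to a set comparison. The following is a minimal sketch, not AgentFluent's implementation: it assumes MCP tool calls are named `mcp__<server>__<tool>`, and the function name `audit_mcp_servers` and its return shape are hypothetical.

```python
def audit_mcp_servers(configured: set[str], observed_tools: list[str]) -> dict[str, set[str]]:
    """Compare configured MCP servers with the servers seen in tool-call names."""
    observed: set[str] = set()
    for name in observed_tools:
        parts = name.split("__")
        # Assumed naming: mcp__<server>__<tool>. Anything else is not an MCP call.
        if len(parts) >= 3 and parts[0] == "mcp":
            observed.add(parts[1])
    return {
        "configured_but_unused": configured - observed,
        "observed_but_missing": observed - configured,
    }


# Made-up sample data for illustration only.
result = audit_mcp_servers(
    configured={"github", "postgres"},
    observed_tools=["Bash", "mcp__github__create_issue", "mcp__slack__post_message"],
)
print(result)
```

Here `postgres` would be flagged as configured but never used, and `slack` as used in sessions but absent from the configuration.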

Sibling: CodeFluent

CodeFluent and AgentFluent both read ~/.claude/projects/ session JSONL but answer different questions:

| | CodeFluent | AgentFluent |
| --- | --- | --- |
| Unit of analysis | Conversations in interactive sessions, plus the supporting .claude/ config (CLAUDE.md, rules, hooks, commands) | Agent definitions + their observed behavior |
| Scoring target | Developer's AI collaboration fluency and project-config maturity | Agent's prompt, tools, model, hooks |
| Feedback loop | Coaches the human to interact with Claude Code better | Tells the developer what config to change |
| Delivery | VS Code extension + web app | CLI-first (dashboard deferred) |
| API calls | Anthropic API for LLM-as-judge scoring | None by default; --github (v0.8+) opts in to GitHub-API quality signals via the local gh CLI |

If you write your own prompts each session, use CodeFluent. If your prompts live in ClaudeAgentOptions, AgentDefinition, or .claude/agents/*.md files, use AgentFluent.

Screenshots

Execution Analytics (agentfluent analyze --project <name> --no-diagnostics)

Execution Analytics: token usage, Cost by Model with parent vs subagent Origin column, tool frequency, and Agent Invocations tables

Behavior Diagnostics (agentfluent analyze --project <name>; diagnostics on by default)

Behavior Diagnostics: Diagnostic Signals + Top-N priority fixes summary above the aggregated Recommendations table + Offload Candidates section ranking parent-thread tool-burst clusters by estimated savings

Suggested Subagents with copy-paste-ready YAML draft (agentfluent analyze --project <name> --verbose)

Suggested Subagents: medium-confidence cluster + YAML subagent definition ready to save as ~/.claude/agents/<name>.md

Comparison Workflow (agentfluent diff baseline.json current.json)

agentfluent diff output: New / Resolved / Persisting recommendation row classes with severity, count_delta, priority_score_delta, plus token / cost / cache deltas and per-agent invocation deltas

Config Assessment (agentfluent config-check)

Config Assessment: per-agent 0-100 scoring across description, tools, model, prompt dimensions with recommendations

Screenshots are regenerated from real session data via scripts/generate_readme_screenshots.py.

Getting Started

Prerequisites

  • Python 3.12 or newer. Check with python --version.
  • Claude Code or Agent SDK session data. Generated automatically at ~/.claude/projects/ whenever you use Claude Code or run an Agent SDK script — nothing to configure.
  • Platforms: Linux, macOS, Windows. Pure-Python package; the path handling resolves ~/.claude/ on every platform.

Install

# Preferred — isolated tool install via uv (https://docs.astral.sh/uv/)
uv tool install agentfluent

# Fallback — pip into a venv of your choice
pip install agentfluent

# Zero-install one-shot
uvx agentfluent list

Optional extras

  • agentfluent[clustering] — installs scikit-learn and enables delegation clustering, which proposes new specialized subagents from recurring general-purpose invocations. Without this extra, agentfluent analyze --diagnostics still runs all other diagnostics, but delegation_suggestions is always empty in JSON output and the "Suggested Subagents" section is omitted from terminal output. Install with uv tool install 'agentfluent[clustering]' or pip install 'agentfluent[clustering]'.

Preserve your session history (one-time, recommended)

AgentFluent can only analyze the sessions Claude Code still has on disk — and Claude Code deletes session files older than 30 days by default. That default silently bounds every longitudinal analysis (regression detection, baselines, multi-month trends), and the deletion is unrecoverable (Claude Code deletes, it doesn't archive).

Before you accumulate history you'll want later, raise the retention window once by adding cleanupPeriodDays to ~/.claude/settings.json:

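{
  "cleanupPeriodDays": 3650
}

A value of 3650 is roughly 10 years, which keeps sessions around for long-term analysis. The note is kept out of the file itself because strict JSON parsers reject // comments.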

agentfluent analyze checks this setting at runtime and prints a warning when retention is at or below the default, quoting the exact file path to edit. Raising it does not recover already-deleted sessions — it only protects future ones — so it's worth doing on day one.
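The retention check described above can be approximated in a few lines. This is a sketch under stated assumptions rather than AgentFluent's code: the helper names are hypothetical, and the 30-day default is taken from the text above.

```python
import json
from pathlib import Path

DEFAULT_RETENTION_DAYS = 30  # Claude Code's default, per the paragraph above


def retention_days(settings: dict) -> int:
    """Effective retention window in days; falls back to the default when unset."""
    return int(settings.get("cleanupPeriodDays", DEFAULT_RETENTION_DAYS))


def needs_warning(settings: dict) -> bool:
    """True when retention is at or below the default."""
    return retention_days(settings) <= DEFAULT_RETENTION_DAYS


def load_settings(path: Path) -> dict:
    """Read a settings.json file, treating a missing file as empty settings."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


print(needs_warning({}))                            # setting absent, default applies
print(needs_warning({"cleanupPeriodDays": 3650}))   # raised retention
```

To check a real machine, pass `load_settings(Path.home() / ".claude" / "settings.json")` to `needs_warning`.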

First run

# Discover which projects have session data
agentfluent list

# Analyze agent behavior + cost in a specific project
agentfluent analyze --project myproject

# Score your agent definitions against the config rubric
agentfluent config-check

Commands

agentfluent list — discover projects and sessions

agentfluent list                                     # All projects
agentfluent list --project codefluent                # Sessions in one project
agentfluent list --format json | jq '.data.projects[].name'

Lists every Claude Code / Agent SDK project found under ~/.claude/projects/, with session counts, total size, and last-modified timestamp. Pass --project to drill into one project and list its individual session files.

agentfluent analyze — token, cost, and behavior metrics

agentfluent analyze --project codefluent                       # Full analysis with behavior diagnostics
agentfluent analyze --project codefluent --no-diagnostics      # Token + cost only (skip diagnostics pipeline)
agentfluent analyze --project codefluent --agent pm            # Filter to one subagent
agentfluent analyze --project codefluent --latest 5            # Last 5 sessions only
agentfluent analyze --project codefluent --since 7d            # Sessions from the last 7 days only (v0.6)
agentfluent analyze --project codefluent --since 2026-05-01 --json > baseline.json  # Time-scoped baseline for diff
agentfluent analyze --project codefluent -v                    # + YAML subagent drafts
agentfluent analyze --project codefluent --min-severity warning  # Hide info-level recs
agentfluent analyze --project codefluent --top-n 10            # Top-10 priority fixes summary block
agentfluent analyze --project codefluent --git                 # + Tier 2 local-git quality signals (v0.7)
agentfluent analyze --project codefluent --github              # + Tier 3 GitHub-API quality signals (v0.8, opt-in)
agentfluent analyze --project codefluent --format json | jq '.data.token_metrics.total_cost'

# Save the top-confidence cluster as a real subagent definition:
agentfluent analyze --project codefluent --format json \
  | jq -r '.data.diagnostics.delegation_suggestions[0].yaml_draft' \
  > ~/.claude/agents/new-agent.md

Produces a token-usage table, a per-model cost breakdown (labeled as API rate, since subscription plans differ), tool usage concentration, and an Agent Invocations table summarizing each subagent's tokens, duration, and tool-use count. The token-usage table includes model turns (model_turns; v0.9), the headline agent-efficiency metric: one model turn is one merged, non-synthetic assistant message, that is, a real model response. A Synthetic responses row breaks out Claude Code's local ghost turns (e.g. "No response requested.") that are netted out of the count. Per-agent-type rows show average turns per invocation alongside efficiency ratios (avg_tool_calls_per_turn, avg_tokens_per_turn). The duration column reports active time per call with raw wall-clock in parentheses (470.0s (2918.0s wall); v0.9), so an interactive agent whose wall-clock is inflated by user-approval waits no longer reads as slow. The Cost by Model table breaks out parent vs subagent rows (a model used in both shows two rows, with Origin distinguishing them), and the top-line Total cost / Total tokens are comprehensive: parent thread plus linked subagent runs combined. Behavior diagnostics run by default (pass --no-diagnostics to skip) and surface signals across three layers, plus quality signals in up to three tiers:

  • Metadata-level (from invocation summaries): tool-error keywords, token-per-tool-use outliers, duration outliers, TOOL_ORCHESTRATION_CHAIN (long tool-call chains whose large intermediate results pass through context — v0.9, INFO with a low-confidence caveat), and TOOL_INVENTORY_OVERSIZED (an agent declares >30 tools but exercises under half — v0.9).
  • Trace-level (from ~/.claude/projects/<session>/subagents/): retry loops, stuck patterns, permission failures, consecutive tool-error sequences, and PARAMETER_RETRY (a tool retried with a changed parameter shape after a validation error — v0.9, surfacing a paste-ready input_examples fix) — each with per-tool-call evidence.
  • Aggregate: model mismatch (complexity class wrong for declared/observed model), delegation clustering (recurring general-purpose patterns → proposed specialized subagents), MCP server audit (configured-but-unused, observed-but-missing).
  • Quality (v0.6): parent's mid-flight corrections (USER_CORRECTION), file rework density (FILE_REWORK), and reviewer-caught rate with parent_acted attribution (REVIEWER_CAUGHT). Recommendations carry a [quality] axis label and per-recommendation axis_scores annotations.
  • Quality Tier 2 (v0.7, opt-in via --git): local-git FEAT_FIX_PROXIMITY — pairs feat: commits with subsequent fix: commits on shared files and correlates back to whether the originating session used a review-style subagent.
  • Quality Tier 3 (v0.8, opt-in via --github): GitHub-API signals CI_FAILURE_FIRST_PUSH (PRs whose first commit's combined CI status is failure/error) and PR_REVIEW_COMMENT_DENSITY (external review-comment density above the configured threshold). Requires the gh CLI for auth, a file-backed response cache trims repeated GitHub calls, and a first-run consent prompt records opt-in under ~/.config/agentfluent/. The JSON envelope exposes tier3_degraded: bool so CI consumers can see when rate limits forced a partial fetch. Zero Tier 3 signals on a clean --github run is a healthy outcome, not a broken feature — a "push only on green CI" workflow or solo-author PRs legitimately produce no findings; tier3_degraded (not signal count) is the load-bearing "did the enrichment run" flag. See the CI_FAILURE_FIRST_PUSH and PR_REVIEW_COMMENT_DENSITY glossary entries for the per-signal healthy-silence patterns.
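As an illustration of how a metadata-level signal from the list above might be evaluated, here is a sketch of the TOOL_INVENTORY_OVERSIZED rule (more than 30 declared tools, under half exercised). The function name and inputs are hypothetical; only the two thresholds come from the text.

```python
def inventory_oversized(declared: list[str], used: set[str], max_tools: int = 30) -> bool:
    """True when an agent declares more than max_tools tools but exercises under half."""
    declared_set = set(declared)
    if len(declared_set) <= max_tools:
        return False
    exercised = declared_set & used  # ignore tools used but never declared
    return len(exercised) < len(declared_set) / 2


# Made-up agent with 40 declared tools, of which only two were ever called.
declared_tools = [f"tool_{i}" for i in range(40)]
print(inventory_oversized(declared_tools, used={"tool_0", "tool_1", "Bash"}))
```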

Above the Recommendations table, a Top-N priority fixes summary ranks findings by a composite priority_score that combines severity, occurrence count, cost impact (for target='model' mismatches), trace-evidence boost, and (v0.6) quality-evidence boost — so the highest-leverage changes surface first instead of asking the reader to scan a flat severity-sorted list. The sort key is part of the JSON envelope (aggregated_recommendations[].priority_score), so a CI gate can fail the run on priority regression. An Offload Candidates section calls out clusters of repeating tool-use patterns in the parent thread and proposes subagent / skill drafts that move that work onto cheaper-tier models — the dominant cost lever for users running agents at scale.

Each recommendation surfaces an axis label — [cost], [speed], or [quality] — naming which diagnostics axis triggered it. The JSON envelope exposes the same information per recommendation as axis_scores: {cost, speed, quality} and primary_axis, so a CI rule can target a specific dimension (e.g. fail only on new quality-primary findings).
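A CI rule that targets one axis could filter the envelope along these lines. This is a sketch only: it uses the `primary_axis`, `severity`, and `priority_score` fields described above, the function name is hypothetical, and it filters a single envelope (telling new findings from persisting ones additionally needs a baseline).

```python
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def quality_findings(envelope: dict, min_severity: str = "warning") -> list[dict]:
    """Quality-primary recommendations at or above min_severity, highest priority first."""
    recs = envelope["data"]["diagnostics"]["aggregated_recommendations"]
    floor = SEVERITY_RANK[min_severity]
    hits = [
        rec for rec in recs
        if rec["primary_axis"] == "quality" and SEVERITY_RANK[rec["severity"]] >= floor
    ]
    return sorted(hits, key=lambda rec: rec["priority_score"], reverse=True)


# Made-up envelope fragment for illustration.
sample = {"data": {"diagnostics": {"aggregated_recommendations": [
    {"agent_type": "general-purpose", "primary_axis": "quality", "severity": "warning", "priority_score": 226.0},
    {"agent_type": "pm", "primary_axis": "cost", "severity": "critical", "priority_score": 310.0},
    {"agent_type": "pm", "primary_axis": "quality", "severity": "info", "priority_score": 12.0},
]}}}

blocking = quality_findings(sample)
print([rec["agent_type"] for rec in blocking])
```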

Near-duplicate recommendations are aggregated per (agent, target, signal) shape into one row with an occurrence Count and metric range (e.g. "4 invocations (4.9x–8.0x above 5,064 mean). Consider adding more specific instructions..."). Each recommendation carries a specific config surface to change (prompt, tools, model, mcp) and a pointer to the file to edit. Recommendations for built-in agents (Explore, general-purpose, Plan, etc.) use concern-specific action text — wrapper subagent for scope issues, retry bounds on the delegating agent for recovery issues, reroute for tools/model — since built-in agents have no user-editable prompt or tool config.
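The aggregation step can be pictured as a group-by on the (agent, target, signal) shape that keeps a count and a metric range. A minimal sketch follows; the per-invocation record layout, the `ratio` field, and the signal names are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate(recs: list[dict]) -> list[dict]:
    """Collapse per-invocation recommendations into one row per (agent, target, signal)."""
    groups: dict[tuple, list[float]] = defaultdict(list)
    for rec in recs:
        groups[(rec["agent"], rec["target"], rec["signal"])].append(rec["ratio"])
    return [
        {"agent": agent, "target": target, "signal": signal,
         "count": len(ratios), "range": (min(ratios), max(ratios))}
        for (agent, target, signal), ratios in groups.items()
    ]


# Hypothetical raw findings: ratio is "times above the mean" for that invocation.
raw = [
    {"agent": "pm", "target": "prompt", "signal": "token_outlier", "ratio": 4.9},
    {"agent": "pm", "target": "prompt", "signal": "token_outlier", "ratio": 8.0},
    {"agent": "pm", "target": "tools", "signal": "retry_loop", "ratio": 2.0},
]
rows = aggregate(raw)
print(rows)
```

The first row here would render as a count of 2 with a range of 4.9x to 8.0x, analogous to the "4 invocations" example above.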

Cost numbers reflect current per-token pricing; historical sessions are priced at today's rates until #80 (time-series pricing) lands.

agentfluent diff — compare two analyze runs

agentfluent analyze --project codefluent --json > baseline.json   # before a prompt change
# ... edit agent prompts / tools / model ...
agentfluent analyze --project codefluent --json > current.json    # after the change

agentfluent diff baseline.json current.json                       # side-by-side report
agentfluent diff baseline.json current.json --fail-on critical    # CI gate: exit 3 only on new critical findings
agentfluent diff baseline.json current.json --json | jq '.data.regression_detected'

Compares two analyze --json envelopes and surfaces new, resolved, and persisting recommendations (keyed by (agent_type, target, signal_types)), token / cost deltas, per-agent invocation deltas, and (v0.9) model-turn deltas — both at the parent-session level and per agent type, so a prompt change that cuts the turns an agent takes for the same task shows up as a measurable delta. The --fail-on {info|warning|critical|off} flag gates exit code 3 on new findings at or above the chosen severity, so agentfluent diff slots into a PR check the same way a test runner does. Baselines are user-managed files — no internal cache — so re-running against an older snapshot at any time is just agentfluent diff old.json new.json.
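The new / resolved / persisting split and the exit-code gate can be sketched as follows. The key and the exit code 3 follow the description above; the function names, and the choice to sort `signal_types` so the key is order-insensitive, are assumptions.

```python
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def rec_key(rec: dict) -> tuple:
    """Identity of a recommendation: (agent_type, target, signal_types)."""
    return (rec["agent_type"], rec["target"], tuple(sorted(rec["signal_types"])))


def classify(baseline: list[dict], current: list[dict]) -> dict[str, list[dict]]:
    """Split recommendations into new, resolved, and persisting row classes."""
    base = {rec_key(r): r for r in baseline}
    cur = {rec_key(r): r for r in current}
    return {
        "new": [cur[k] for k in cur if k not in base],
        "resolved": [base[k] for k in base if k not in cur],
        "persisting": [cur[k] for k in cur if k in base],
    }


def exit_code(new_recs: list[dict], fail_on: str = "off") -> int:
    """Return 3 when any new finding is at or above fail_on, else 0."""
    if fail_on == "off":
        return 0
    floor = SEVERITY_RANK[fail_on]
    return 3 if any(SEVERITY_RANK[r["severity"]] >= floor for r in new_recs) else 0


# Made-up recommendations for illustration.
rec_a = {"agent_type": "pm", "target": "prompt", "signal_types": ["retry_loop"], "severity": "warning"}
rec_b = {"agent_type": "pm", "target": "model", "signal_types": ["model_mismatch"], "severity": "info"}
rec_c = {"agent_type": "general-purpose", "target": "subagent", "signal_types": ["user_correction"], "severity": "critical"}

report = classify(baseline=[rec_a, rec_b], current=[rec_a, rec_c])
print(exit_code(report["new"], fail_on="critical"))
```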

agentfluent report — render an analyze snapshot as Markdown

agentfluent analyze --project codefluent --json > snap.json   # capture a snapshot
agentfluent report snap.json                                  # Markdown to stdout
agentfluent report snap.json --output report.md               # ...or to a file
agentfluent analyze --project codefluent --json | agentfluent report /dev/stdin   # one-shot pipe

Renders an analyze --json snapshot envelope as a Markdown document — the same Summary / Token Metrics / Agent Metrics / Diagnostics / Offload / Reproduction sections that analyze prints to the terminal, but in a form you can paste into a PR comment, attach as a CI artifact, or commit alongside a prompt change as a checked-in review trail. report is a separate subcommand rather than analyze --format markdown so the rendering layer stays decoupled from session ingestion: snapshots round-trip through file storage without re-running analysis. The Reproduction footer always echoes the original agentfluent analyze command line so a downstream reader can reproduce the run.

agentfluent config-check — score agent definitions

agentfluent config-check                          # All user + project agents
agentfluent config-check --scope user             # Only ~/.claude/agents/
agentfluent config-check --agent pm --verbose     # One agent with detailed recs
agentfluent config-check --format json | jq '.data.scores[] | select(.overall_score < 60)'

Walks ~/.claude/agents/*.md and ./.claude/agents/*.md, parses each agent's YAML frontmatter and body, and scores against a 4-dimension rubric (description trigger quality, tool access appropriateness, model selection, prompt completeness). Outputs a score per agent plus ranked recommendations — e.g. "Prompt body doesn't mention error handling."

Glossary

analyze --diagnostics and config-check introduce AgentFluent-specific vocabulary (signal types, severity, confidence tiers, recommendation targets). docs/GLOSSARY.md defines every term that appears in CLI output, with worked examples and detection thresholds.

Configuration

AgentFluent is configured through CLI flags: there is no config file, and no environment variables beyond the defaults (the table below notes $CLAUDE_CONFIG_DIR as an alternative to --claude-config-dir). Sensible defaults keep most invocations flagless.

| Flag | Default | What it controls |
| --- | --- | --- |
| --project | (required on analyze) | Filter to a specific project slug or display name |
| --scope | all | config-check scope: user, project, or all |
| --agent | (none) | Filter analyze or config-check to one subagent type |
| --latest N | (all sessions) | analyze only the N most recent sessions |
| --since DATETIME | (none) | analyze/list: include sessions whose first message landed at or after this time. ISO 8601, date-only (YYYY-MM-DD), or relative (7d, 12h, 30m). Half-open interval with --until. Mutually exclusive with --session. |
| --until DATETIME | (none) | analyze/list: include sessions whose first message landed strictly before this time. Same formats as --since. |
| --session | (all) | analyze a specific session filename within the project. v0.7+: scope auto-applies to diagnostics, so signals, recommendations, and offload candidates are computed from this session only (v0.6 rolled them up across the project). Mutually exclusive with --latest and --since/--until. |
| --diagnostics / --no-diagnostics | on | analyze: behavior-correlation signals (default on; --no-diagnostics skips the pipeline) |
| --git | off | analyze: enable Tier 2 local-git quality signals (FEAT_FIX_PROXIMITY). Off by default: AgentFluent does not shell out to git unless explicitly opted in. The project's source directory must be a git repo; non-repo dirs silently skip. |
| --github | off | analyze: enable Tier 3 GitHub-API quality signals (CI_FAILURE_FIRST_PUSH, PR_REVIEW_COMMENT_DENSITY). Off by default: AgentFluent does not call GitHub unless explicitly opted in. Requires --diagnostics. Implies --git. Requires the gh CLI to be installed and authenticated; a first-run prompt records consent under ~/.config/agentfluent/. |
| --repo | (inferred) | analyze with --github: explicit OWNER/NAME override when the project's git remote does not point at GitHub or the source directory is not itself a git working tree. |
| --github-no-cache | off | analyze with --github: bypass the Tier 3 response cache for this run. Fresh data is fetched from GitHub and written back to the cache (next run sees the updated entries). No effect without --github. |
| --min-cluster-size | 5 | Delegation clustering: minimum invocations per cluster (requires agentfluent[clustering]) |
| --min-similarity | 0.7 | Delegation dedup: cosine-similarity threshold against existing agents |
| --top-n N | 5 | analyze: number of priority-ranked recommendations to summarize above the Recommendations table. Pass 0 to disable the summary block. |
| --min-severity | (none) | analyze: drop recommendations below info / warning / critical. Filters both the default table and the --verbose per-invocation surface; signals are not affected. |
| --fail-on | (none) | diff: gate exit code 3 on new recommendations at or above info / warning / critical, or off to disable. |
| --claude-config-dir | ~/.claude/ | Override the Claude config root (also honors $CLAUDE_CONFIG_DIR) |
| --format | table | Output format: table (Rich) or json (envelope). Shortcut: --json (equivalent to --format json) |
| --verbose | off | Extra detail: per-session breakdown, per-invocation detail, raw (un-aggregated) recommendations, and YAML subagent drafts for suggested clusters |
| --quiet | off | Suppress non-essential output (useful in CI) |

Output formats

Default (table): Rich-rendered tables in the terminal, designed to be readable at a glance. Colors auto-adapt to terminal theme.

JSON envelope (--format json, or the shortcut --json): Stable schema {version, command, data} intended as a contract — pipe to jq, integrate with CI, build regression gates on top. Example:

{
  "version": "2",
  "command": "analyze",
  "data": {
    "window": {
      "since": "2026-05-01T00:00:00Z", "until": null,
      "session_count_before_filter": 42, "session_count_after_filter": 12
    },
    "total_model_turns": 312,
    "total_synthetic_messages": 4,
    "token_metrics": {
      "total_cost": 41.11,
      "total_tokens": 54019983,
      "by_model": [
        {"model": "claude-opus-4-7", "origin": "parent",   "cost": 30.68, "input_tokens": 6829, ...},
        {"model": "claude-opus-4-7", "origin": "subagent", "cost":  1.50, "input_tokens": 1213, ...},
        {"model": "claude-opus-4-6", "origin": "subagent", "cost":  8.93, "input_tokens": 7825, ...}
      ]
    },
    "tool_usage": [...],
    "agent_invocations": [...],
    "diagnostics": {
      "aggregated_recommendations": [
        {
          "agent_type": "general-purpose",
          "target": "subagent",
          "signal_types": ["user_correction"],
          "primary_axis": "quality",
          "axis_scores": {"cost": 0.0, "speed": 0.0, "quality": 14.0},
          "priority_score": 226.0,
          "severity": "warning",
          "count": 7,
          "message": "user_correction: Consider delegating to a review-style subagent ..."
        },
        ...
      ],
      "offload_candidates": [...],
      "delegation_suggestions": [...],
      "delegation_suggestions_skipped_reason": null
    }
  }
}

Schema v2 (v0.5): token_metrics.by_model changed from a dict keyed by model name to a list of rows where each row carries an origin field ("parent" or "subagent"). Two rows can share a model with different origins (Opus used in both parent and subagent runs). Top-level total_cost and total_tokens are now comprehensive — they include subagent contributions. agentfluent diff reads both v1 and v2 envelopes (legacy v1 rows normalize as origin="parent"), so saved baselines remain diffable across the upgrade.

Schema v2 additions (v0.6, additive — no version bump): Each aggregated_recommendations row carries axis_scores: {cost, speed, quality} and primary_axis: "cost" | "speed" | "quality". The composite priority_score formula gains a quality_evidence_factor * W_QUALITY term that fires when a recommendation's signals map to the quality axis (per D021). analyze --json output also carries a top-level window: {since, until} block when --since/--until is set on the invocation (null for either bound when not specified). agentfluent diff reads pre-v0.6 envelopes without these annotations cleanly — the absence is treated as zeros / cost-primary, which preserves the existing comparison semantics.

Schema v2 additions (v0.7, additive — no version bump): analyze --json output carries a top-level scope_session: string | null field — populated with the filename when --session constrained the run, null otherwise. Lets consumers verify scope at a glance without re-counting sessions. diagnostics_version (the AgentFluent version that produced the envelope) is also stamped at the top level so agentfluent diff can warn on detector-sensitivity drift between baseline and current. agentfluent diff reads pre-v0.7 envelopes cleanly — the absent fields are treated as null / unknown, preserving comparison semantics.

Schema v2 additions (v0.9, additive, no version bump): a top-level total_model_turns (model turns aggregated across all analyzed sessions) sits alongside session_count in data, not nested under token_metrics, and each entry in data.sessions carries a per-session model_turns. A model turn is one merged, non-synthetic assistant message; Claude Code's <synthetic> ghost responses are excluded and tallied separately as a top-level total_synthetic_messages (and per-session synthetic_message_count), so model_turns equals token_metrics.api_call_count except in the rare case of a real-model turn with no usage block. Per-agent-type rows under agent_metrics.by_agent_type gain avg_turns_per_invocation, avg_tool_calls_per_turn, avg_tokens_per_turn, and estimated_avg_cost_per_turn_usd efficiency ratios. agentfluent diff reports model-turn deltas per agent type and at the parent-session level; pre-turn-era envelopes (pre-v0.9) degrade gracefully: the absent counts read as 0, so a diff against an old baseline still runs without crashing.
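The "absent counts read as 0" behavior has a practical consequence worth seeing in code. A small sketch with a hypothetical helper name, assuming only the top-level total_model_turns field described above:

```python
def model_turn_delta(baseline: dict, current: dict) -> int:
    """Delta in total_model_turns, reading a missing count as 0."""
    before = baseline.get("data", {}).get("total_model_turns", 0)
    after = current.get("data", {}).get("total_model_turns", 0)
    return after - before


old_envelope = {"version": "2", "data": {}}                        # pre-v0.9: no turn counts
new_envelope = {"version": "2", "data": {"total_model_turns": 312}}
print(model_turn_delta(old_envelope, new_envelope))
```

Against a pre-v0.9 baseline the whole current count shows up as the delta, so that first comparison is better read as "no baseline" than as a regression.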

No ANSI escapes in JSON output, guaranteed. The key total_cost is the pay-per-token equivalent; subscribers on Pro/Max/Team/Enterprise plans see a flat monthly charge regardless.

How It Works

flowchart LR
    subgraph Local["Local filesystem — nothing leaves this boundary"]
        S["Session JSONL<br/>~/.claude/projects/"]
        ST["Subagent traces<br/>&lt;session&gt;/subagents/"]
        A["Agent definitions<br/>~/.claude/agents/"]
        M["MCP config<br/>~/.claude.json<br/>.mcp.json"]
    end

    S --> P[Parser]
    ST --> TP[Trace Parser<br/>+ Linker]
    P --> X[Agent Extractor]
    P --> TM[Token &amp; Cost<br/>Metrics]
    P --> TU[Tool Usage<br/>Patterns]
    TP --> X
    A --> CS[Config Scanner]
    CS --> SC[Config Scorer]
    M --> MD[MCP Discovery]

    X --> DX[Delegation<br/>Clustering]
    X --> MR[Model-Routing<br/>Analysis]
    X --> SIG[Signal Extraction<br/>metadata + trace]
    SIG --> COR[Correlator]
    MR --> COR
    DX --> COR
    MD --> COR
    SC --> COR

    COR --> OUT["Rich tables<br/>or JSON envelope"]
    TM --> OUT
    TU --> OUT
    SC --> OUT

Step by step:

  1. Parse JSONL: core/parser.py reads each session file into typed SessionMessage objects. It handles streaming snapshot deduplication, plain-string vs. array content shapes, and Claude Code's real toolUseResult format (see CLAUDE.md for the format spec).
  2. Parse subagent traces: traces/parser.py reads per-session subagent files under <session>/subagents/agent-<agentId>.jsonl and reconstructs the internal tool-call sequence with is_error flags. traces/linker.py attaches each trace back to its parent invocation via agentId. traces/retry.py detects retry sequences within a trace.
  3. Discover projects and sessions: core/discovery.py enumerates ~/.claude/projects/ and surfaces friendly display names.
  4. Extract agent invocations: agents/extractor.py walks messages, pairs Agent tool_use blocks with their tool_result content blocks, and pulls per-invocation metadata (tokens, duration, tool-use count) from the containing user message's toolUseResult sibling.
  5. Compute token and cost metrics: analytics/tokens.py aggregates usage per model with <synthetic> sentinel filtering; analytics/pricing.py applies per-token rates labeled as API rate.
  6. Score agent configurations: config/scanner.py parses YAML frontmatter from each .md in .claude/agents/ and ~/.claude/agents/; config/scoring.py scores description, tools, model, and prompt on a 4-dimension rubric.
  7. Discover MCP servers: config/mcp_discovery.py reads mcpServers from ~/.claude.json (user + project-local scopes) and .mcp.json (project-shared), honoring the enabledMcpjsonServers / disabledMcpjsonServers gating arrays. Used by the audit phase to compare against observed mcp__* tool usage.
  8. Diagnose behavior: diagnostics/ extracts metadata signals (signals.py), trace-level signals (trace_signals.py: retry loops, stuck patterns, permission failures, error sequences), model-routing mismatches (model_routing.py), and MCP audit signals (mcp_assessment.py). correlator.py routes each signal to a config target (prompt/tools/model/mcp) and emits an actionable recommendation.
  9. Propose new subagents: diagnostics/delegation.py clusters recurring general-purpose invocations via TF-IDF + KMeans and drafts candidate subagent definitions with name, model, tool list, and prompt scaffold. Under --verbose, each draft is emitted as a copy-paste-ready YAML frontmatter block. Drafts are deduplicated against existing agents by cosine similarity.
  10. Render: cli/formatters/table.py emits Rich tables; cli/formatters/json.py emits the stable JSON envelope. Format is selected by --format.
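Step 4 is the least obvious of the steps above, so here is a simplified sketch of pairing Agent tool_use blocks with their tool_result blocks. It is not the code in agents/extractor.py: the record layout is a reduced assumption (a `message.content` list of typed blocks plus a `toolUseResult` sibling), the function names are hypothetical, and the sample values are made up.

```python
import json


def iter_blocks(record: dict, block_type: str):
    """Yield content blocks of one type; plain-string content has no blocks."""
    content = record.get("message", {}).get("content", [])
    if isinstance(content, str):
        return
    for block in content:
        if isinstance(block, dict) and block.get("type") == block_type:
            yield block


def pair_agent_invocations(lines: list[str]) -> list[dict]:
    """Match each Agent tool_use with the tool_result that answers it."""
    pending: dict[str, dict] = {}
    invocations: list[dict] = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        for block in iter_blocks(record, "tool_use"):
            if block.get("name") == "Agent":
                pending[block["id"]] = block
        for block in iter_blocks(record, "tool_result"):
            call = pending.pop(block.get("tool_use_id"), None)
            if call is not None:
                invocations.append({
                    "id": call["id"],
                    "input": call.get("input", {}),
                    # Per-invocation metadata lives on the containing record.
                    "metadata": record.get("toolUseResult"),
                })
    return invocations


sample_lines = [
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "tool_use", "id": "toolu_1", "name": "Agent", "input": {"subagent_type": "pm"}}]}}),
    json.dumps({"type": "user", "message": {"content": "plain string content"}}),
    json.dumps({"type": "user", "message": {"content": [
        {"type": "tool_result", "tool_use_id": "toolu_1", "content": "done"}]},
        "toolUseResult": {"status": "completed"}}),
]
paired = pair_agent_invocations(sample_lines)
print(paired)
```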

Everything runs locally. No outbound network calls unless you opt in via --github, and no API key is needed.

Features

  • Project and Session Discovery — Enumerates ~/.claude/projects/, groups sessions by project, shows per-project session count, total size, and last-modified timestamp. Handles Claude Code subagent sidechain files and Agent SDK sessions uniformly.

  • Execution Analytics — Token usage, model turns (one merged, non-synthetic assistant message — a real model response; v0.9, with <synthetic> ghost turns excluded and tallied separately), API-rate cost, cache efficiency, per-model breakdown, tool-call concentration, and per-agent invocation metrics (tokens, active + wall-clock duration, tool-use count). Cache creation and cache read tokens are tracked separately so you can see where your prompt caching is working. Model turns surface at every level — parent session, subagent invocation, and per-agent-type rollup with efficiency ratios (avg_tool_calls_per_turn, avg_tokens_per_turn) — because reducing turns for a fixed task is the dominant lever on both cost and latency.

  • Comprehensive Cost Attribution — Top-line total_cost and total_tokens reflect parent + subagent runs combined. The per-model breakdown decomposes by (model, origin) so a user who sees "100% Opus" can tell whether their Haiku-routed Explore subagent contributed cost, and whether Opus spend lives in the parent thread or a delegated agent. JSON envelope is at schema v2.

  • Agent Config Assessment — 4-dimension rubric (description, tools, model, prompt) applied to every .md file in ~/.claude/agents/ and ./.claude/agents/. Produces a 0–100 score plus ranked, specific recommendations ("Prompt body doesn't mention error handling"). Catches agents that are technically valid but miss well-known best practices.

  • Subagent Trace Parsing — Parses the internal tool-call sequences Claude Code emits under ~/.claude/projects/<session>/subagents/agent-<agentId>.jsonl, links them back to the delegating invocation, and detects retry sequences. Gives diagnostics per-call evidence (which tool, which attempt, which error) instead of just an invocation-level summary.

  • Behavior Diagnostics: --diagnostics emits signals across three layers. Metadata: tool-error keywords, token-per-tool-use outliers, duration outliers. Trace-level: retry loops, stuck patterns (same call repeated with no progress), permission failures, consecutive tool-error sequences. Aggregate: model mismatch (declared/observed model wrong for the workload's complexity), MCP server audit (configured-but-unused, observed-but-missing). Near-duplicate recommendations collapse into one row per (agent, target, signal) shape with an occurrence Count and metric range. Recommendations for built-in agents (Explore, general-purpose, Plan, code-reviewer, etc.) use concern-specific action text since built-ins have no user-editable config. Each signal routes to a target config surface (prompt, tools, model, or mcp), and the recommendation names the file to edit and the specific change to make.

  • Advanced Tool Use Diagnostics (v0.9) — Three signals grounded in Anthropic's Advanced Tool Use engineering research, each pointing at a specific platform feature that fixes the pattern. PARAMETER_RETRY (trace-level): a tool retried with a changed parameter shape after a validation error — the agent is guessing at the input format; when a later call succeeds, AgentFluent extracts it as a paste-ready input_examples entry (the Tool Use Examples fix lifts complex-parameter accuracy 72%→90%). TOOL_INVENTORY_OVERSIZED: an agent declares >30 tools but exercises fewer than half — a tool-selection-accuracy cliff the Tool Search Tool addresses. TOOL_ORCHESTRATION_CHAIN: long tool-call chains whose large intermediate results flow through the context window, where Programmatic Tool Calling (allowed_callers: ["code_execution_20250825"]) keeps intermediates out of context — shipped at INFO with an explicit low-confidence caveat (the metadata-only proxy can't yet distinguish a true chain from an agent that legitimately needs each intermediate; trace-level precision is tracked as #499).

  • Quality Axis (v0.6 → v0.8) — A third diagnostics axis alongside cost and speed, surfacing gaps that look "free" by token math but produce quality debt. Built out in three tiers, each adding a new data source on top of the previous one:

    • Tier 1 (v0.6, always on with --diagnostics) — JSONL-only signals: USER_CORRECTION (parent's mid-flight corrections like "no, do X instead"), FILE_REWORK (same file edited at or above the calibrated threshold within a session), and REVIEWER_CAUGHT (substantive findings from architect/security-review/tester subagents, with parent_acted attribution and a healthy-band interpretation that recognizes legitimate rejection as collaboration, not a defect).
    • Tier 2 (v0.7, opt-in via --git) — Local-git FEAT_FIX_PROXIMITY: pairs feat: commits with subsequent fix: commits sharing at least 2 code files (.md/.yaml/.yml excluded) and correlates back to whether the originating session used a review-style subagent.
    • Tier 3 (v0.8, opt-in via --github) — GitHub-API signals CI_FAILURE_FIRST_PUSH (PRs whose first commit's combined CI status is failure/error) and PR_REVIEW_COMMENT_DENSITY (external review-comment density above the configured threshold). Auth via the gh CLI (no AgentFluent token storage); file-backed TTL cache trims repeated calls; first-run consent prompt; rate-limit-degraded runs are visible to consumers via tier3_degraded: bool on the JSON envelope.

    Recommendations carry a [quality] axis label, and per-recommendation axis_scores: {cost, speed, quality} plus primary_axis annotations let CI rules and agentfluent diff reason about which dimension changed. Single-axis classification keeps the same threshold meaning the same thing across surfaces. Calibrated against the dogfood corpus (see scripts/calibration/ and the per-signal calibration markdowns under .claude/specs/analysis/).

  • Date-Range Filtering (v0.6): --since/--until on agentfluent analyze and agentfluent list scope analysis to a session window using ISO 8601, date-only (YYYY-MM-DD), or relative (7d, 12h, 30m) input. Half-open interval semantics (consistent with git log and time-series conventions). Closes the dogfood loop for "did my fix work?" workflows and enables retroactive baselines for diff. Analyze JSON output carries a window: {since, until} block when either flag is set.

  • Priority Ranking — A composite priority_score ranks recommendations by severity, occurrence count, cost impact (model-mismatch findings carry the dollar savings), trace-evidence boost, and (v0.6) quality-evidence boost when quality-axis signals fire. The default Recommendations table is sorted by priority desc, and a Top-N priority-fixes summary surfaces above the table so the highest-leverage changes are the first thing the reader sees. --top-n N controls the summary depth; --min-severity {info|warning|critical} filters the recommendation surface without touching the underlying signals.

  • Offload Candidates — Detects clusters of repeating tool-use patterns in the parent Claude Code thread, estimates the cost saved by routing them through a cheaper subagent or skill, and proposes a draft definition for each cluster. The dominant cost lever for users running agents at scale: a Sonnet thread that does 80 GitHub PR reviews per week is cheaper as a Haiku-routed pr-review subagent. Calibrated against real-world burst distributions (scripts/calibration/).

  • Comparison Workflow: agentfluent diff baseline.json current.json compares two analyze --json envelopes, classifying each recommendation as new / resolved / persisting, computing token / cost / cache deltas, and emitting per-agent invocation deltas. --fail-on {info|warning|critical} gates exit code 3 on new findings at or above the chosen severity, so agentfluent diff slots into a PR check the same way a test runner does. Baselines are user-managed files with no internal cache, so re-running against an older snapshot at any time is just one command. Reads both v1 (legacy) and v2 (current) JSON envelopes via a compatibility shim.

  • Delegation Clustering — TF-IDF + KMeans on recurring general-purpose invocations surfaces patterns that would benefit from their own specialized subagent. Proposes a complete draft: name, description, recommended model (with cost reasoning), tool list derived from the cluster's trace data, and a prompt-body scaffold. Under --verbose, each cluster emits a copy-paste-ready YAML subagent definition block (frontmatter + prompt body) that can be saved directly as ~/.claude/agents/<name>.md. Low-confidence clusters are kept but prefixed with a REVIEW BEFORE USE comment so loose groupings don't land in production blindly. Confidence tiers (high/medium/low) are calibrated against real-world cohesion distributions from multi-contributor datasets. Suppresses drafts that overlap existing agents and annotates the overlap. Requires the optional agentfluent[clustering] extra.

  • Hook Coverage Diagnostics (v0.10) — The recommendation engine reaches the hooks config surface it was previously blind to. A hook_inspector reads each agent definition's PostToolUse hooks and reports whether they surface a given field; when a duration_outlier fires on an agent with no duration_ms timing hook, DurationOutlierRule recommends adding one — surfaced through the new target=hooks recommendation surface — because slow tool calls go undetected at runtime with no hook gating on duration_ms. This is the first recommendation into the hooks surface (foundational and extensible, not exhaustive hook analysis); project-level hooks in .claude/settings.json are not yet inspected. See the target_hooks glossary entry.

  • Model-Routing Diagnostics — Per-agent-type classification of observed complexity (tool-call counts, token footprint, error rate, write-tool presence) compared against the agent's declared model tier. Flags overspec (complex model on simple workload — cost savings estimate included) and underspec (simple model struggling). Recommendations name a concrete one-tier-down target model (e.g. "route to claude-haiku-4-5") rather than a vague "use a faster model" (v0.10), falling back to a task-scoping suggestion only when the current model is unknown or already at the fastest tier. Consumes trace-based model inference when frontmatter is absent.

  • MCP Server Assessment — Reads configured MCP servers from ~/.claude.json (user + project-local) and .mcp.json (project-shared), honoring per-user enable/disable gating. Compares against observed mcp__<server>__* tool usage from both parent sessions and subagent traces. Emits MCP_UNUSED_SERVER (INFO, configured but zero calls) and MCP_MISSING_SERVER (WARNING, failing calls to an unconfigured server) signals with actionable recommendations.

  • JSON Output Envelope — Stable {version, command, data} schema. No ANSI escapes. Intended as a programmatic contract for CI integration, PR gates, and regression tracking.

  • Quiet and Verbose Modes: --quiet for CI-friendly one-line summaries; --verbose for per-session breakdown and per-invocation detail tables. Defaults target interactive humans.

Privacy and Security

AgentFluent is designed so data stays on your machine. The attack surface is small by construction — no web server, no HTML rendering, no webview, and no outbound network calls unless you explicitly opt in via --github (Tier 3 GitHub enrichment, v0.8+). This table summarizes the layers that protect it:

Layer | Mechanism | Protects Against
Network calls opt-in | All analysis is local by default. --github is off by default; when on, calls flow through your local gh CLI (no AgentFluent token storage) to the GitHub API only; a first-run consent prompt records the opt-in under ~/.config/agentfluent/. | Surprise data exfiltration
Path handling | All paths resolved within ~/.claude/ (or the override $CLAUDE_CONFIG_DIR) | Path traversal
Input validation | Pydantic models with strict type constraints | Malformed JSONL crashing the parser
Safe YAML loading | yaml.safe_load only | Arbitrary code execution via frontmatter
CI security review | Claude-powered review when needs-security-review label is added | New vulnerabilities
Automated testing | 1600+ unit tests incl. security-focused cases | Regressions
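
The path-handling row can be illustrated with a small containment check. This is a sketch under assumptions, not the project's code: resolve_within is a hypothetical helper that shows the general resolve-then-compare technique with pathlib.

```python
from pathlib import Path


def resolve_within(root: Path, candidate: str) -> Path:
    """Resolve candidate under root and reject anything that escapes it."""
    root = root.resolve()
    resolved = (root / candidate).resolve()
    # After resolution, ".." segments and absolute paths can no longer hide an escape.
    if not resolved.is_relative_to(root):
        raise ValueError(f"{candidate!r} escapes {root}")
    return resolved


config_dir = Path.home() / ".claude"
print(resolve_within(config_dir, "projects/demo/session.jsonl"))
try:
    resolve_within(config_dir, "../.ssh/id_rsa")
except ValueError as exc:
    print("rejected:", exc)
```

Resolving before comparing matters: a plain string prefix check would accept a path such as projects/../../.ssh/id_rsa.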

Secrets handling

Claude Code persists every tool output to ~/.claude/projects/<slug>/*.jsonl — including any .env, credentials.json, or shell rc file that Claude ever read. .gitignore does not protect against this. AgentFluent itself emits only aggregate metrics, so it cannot leak secrets that weren't already on disk — but because the tool reads that data, contributors working on AgentFluent risk re-leaking while they work.

This repo ships two Claude Code hooks in .claude/settings.json to reduce that risk:

  • PreToolUse block (.claude/hooks/block_secret_reads.py) — denies reads of .env*, .envrc, credentials.json, secrets.{yaml,yml,json}, *.pem, SSH private keys, and shell rc files. Blocks before execution, so the file's contents never enter the session transcript.
  • PostToolUse detect (.claude/hooks/detect_secrets_in_output.py) — scans tool output for sk-ant-*, sk-proj-*, ghp_*, github_pat_*, AKIA*, or AIza* patterns. If a match is found, blocks Claude from echoing or summarizing it. The raw value is already on disk at this point, so treat any caught value as compromised and rotate.
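
The PreToolUse rule above amounts to a filename match performed before the read happens. The sketch below shows only that matching step. The pattern list approximates the prose above (the SSH key and shell rc names are assumed examples), is_blocked is a hypothetical name, and the actual hook script and its wiring into Claude Code are not shown.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

BLOCKED_PATTERNS = [
    ".env*", ".envrc", "credentials.json",
    "secrets.yaml", "secrets.yml", "secrets.json",
    "*.pem",
    "id_rsa", "id_ed25519",            # assumed SSH private key names
    ".bashrc", ".zshrc", ".profile",   # assumed shell rc files
]


def is_blocked(path: str) -> bool:
    """Return True when the file name matches any blocked pattern."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in BLOCKED_PATTERNS)


for candidate in ["/home/u/app/.env.local", "certs/server.pem", "src/main.py"]:
    print(candidate, is_blocked(candidate))
```

Matching on the final path component keeps the rule independent of where the file lives.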

Any future AgentFluent feature that surfaces raw session content (diff viewers, prompt excerpts, recommendation snippets that quote session text) must re-apply secret-pattern redaction at the display layer — historical JSONL on users' machines may still contain pre-hook leaks.
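
A display-layer redaction pass of the kind described above could look like the following. This is a minimal sketch, not AgentFluent's code: redact and SECRET_RE are hypothetical names, the prefixes mirror the PostToolUse list, and the character class and minimum suffix length of 8 are assumptions rather than the real token formats.

```python
import re

# Known token prefixes followed by an assumed run of 8+ token-like characters.
SECRET_RE = re.compile(
    r"\b(?:sk-ant-|sk-proj-|ghp_|github_pat_|AKIA|AIza)[A-Za-z0-9_\-]{8,}"
)


def redact(text: str) -> str:
    """Replace anything that looks like a known token with a marker."""
    return SECRET_RE.sub("[REDACTED]", text)


print(redact("export GH_TOKEN=ghp_abcdefgh12345678 # from .envrc"))
```

Redacting at render time covers historical JSONL as well, since it does not depend on the hooks having been active when the session was recorded.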

See docs/SECURITY.md for the full policy: leak vector, defense architecture, discipline rules, historical-leak audit one-liner, user-scope deployment, and the bypass surface the hooks do not cover.

Tech Stack

  • Python 3.12+
  • Typer + Rich — CLI framework and terminal formatting
  • Pydantic v2 — data models across module boundaries
  • PyYAML — agent definition frontmatter parsing (safe_load only)
  • pytest + pytest-cov — 1600+ tests
  • mypy strict mode — full type coverage
  • ruff — linting and formatting
  • uv — package and dependency management

Project Structure

src/agentfluent/
├── cli/                 # Typer app, commands, formatters (table + JSON envelope)
├── core/                # JSONL parser, session models, project/session discovery
├── agents/              # Agent invocation extraction and AgentInvocation model
├── analytics/           # Token/cost metrics, tool patterns, model pricing
├── config/              # Agent definition scanner + scoring + MCP server discovery
├── traces/              # Subagent trace parsing, linking, and retry detection
└── diagnostics/         # Behavior signals (metadata + trace), correlation,
                         # model routing, delegation clustering, MCP audit

Full architecture and conventions are documented in CLAUDE.md.

Development

git clone https://github.com/frederick-douglas-pearce/agentfluent.git
cd agentfluent
uv sync
uv run agentfluent --help

Testing

uv run pytest -m "not integration"            # 1600+ unit tests (CI default)
uv run pytest                                 # Full suite incl. integration tests against your real ~/.claude/projects/
uv run pytest --cov=agentfluent               # With coverage

Integration tests (tests/integration/) are skipped in CI because they require real session data — they pass on contributor machines with populated ~/.claude/projects/.

Lint and type check

uv run ruff check src/ tests/
uv run mypy src/agentfluent/

Both must pass cleanly before a PR merges.

CI/CD

Five GitHub Actions workflows run automatically:

  • CI (ci.yml) — Every PR: ruff, mypy strict, full unit-test suite. Must pass to merge.
  • Security Review (security-review.yml) — Claude-powered security review of code-changing PRs, triggered by the needs-security-review label (re-trigger by removing and re-adding).
  • Claude Code Review (claude-review.yml) — AI-powered PR review, triggered by the needs-review label or @claude mentions.
  • Release Please (release-please.yml) — Auto-generates release PRs with changelog and version bumps from Conventional Commits.
  • Dependabot Auto-Merge (dependabot-auto-merge.yml) — Auto-merges dependabot PRs once CI passes.

Roadmap

Current release: v0.10.0 — "Close the Hook Gap" — the recommendation engine reaches a config surface it was previously blind to: hooks. A new hook_inspector detects when an agent has no duration_ms timing hook, and DurationOutlierRule now recommends adding one (the new target=hooks recommendation surface); model-routing recommendations name a concrete one-tier-down model instead of "a faster model." A parallel research stream makes first empirical contact with Agent SDK session data (epic #517) — the groundwork for the SDK-native work in v0.11. Next: v0.11 — per-turn diagnostic ratios pending dogfood validation, trace-level precision for the Advanced Tool Use signals, and SDK-format synthesis building on the v0.10 corpus.

See docs/ROADMAP.md for the full version history (release themes, headline features, design context per release) and CHANGELOG.md for the commit-level log. Browse open issues for the full backlog.

Troubleshooting

Problem | Solution
No projects found | Verify ~/.claude/projects/ exists and contains per-project subdirectories with .jsonl session files. Claude Code creates these automatically the first time you use it.
No agent invocations | Agent invocation rows require the session to actually call a subagent (Agent tool_use with a subagent_type). A session that never delegated has no agent data to analyze; this is not an error.
Zero tokens / dashes in Agent Invocations | If you're on AgentFluent ≤ 0.1.0, this is the #84 parser bug; upgrade with uv tool upgrade agentfluent.
Python version error | AgentFluent requires Python 3.12+. Check with python --version and upgrade if needed.
Non-default session path | Pass --claude-config-dir /path/to/.claude or set $CLAUDE_CONFIG_DIR before invoking any command. The override applies to project discovery, agent configs, and MCP server discovery together.
Malformed JSON at <file>:<line> warning | A session file has a corrupted line, usually null bytes left behind when Claude Code was killed mid-write. The parser skips the line and continues; analytics are unaffected. Safe to ignore, or delete the line with sed -i '<line>d' <file> to silence the warning.
Stale tool install after local build | If uv tool install --from <path> agentfluent seems to reuse cached code, run uv tool uninstall agentfluent && uv cache clean agentfluent before reinstalling.
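
The malformed-line behavior in the table above (warn, skip the bad line, continue) can be sketched as a tolerant JSONL reader. This is a minimal illustration, not the project's parser: iter_jsonl is a hypothetical name, and the warning text only imitates the message format shown in the table.

```python
import json
import sys
import tempfile
from collections.abc import Iterator
from pathlib import Path


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per valid line, skipping lines that fail to parse."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        for line_no, line in enumerate(handle, start=1):
            stripped = line.strip()
            if not stripped:
                continue  # blank lines are not an error
            try:
                record = json.loads(stripped)
            except json.JSONDecodeError:
                print(f"Malformed JSON at {path}:{line_no}", file=sys.stderr)
                continue
            if isinstance(record, dict):
                yield record


# Demo: a session file whose second line is null bytes from an interrupted write.
with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp) / "session.jsonl"
    session.write_text(
        '{"type": "user"}\n\x00\x00\x00\n{"type": "assistant"}\n',
        encoding="utf-8",
    )
    records = list(iter_jsonl(session))

print(len(records))
```

Because each line is parsed independently, one corrupted line costs a single record rather than the whole session.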

Research Foundations

AgentFluent's behavior-to-improvement approach is grounded in research on agent quality, observability gaps, and production failure modes.

Contributing

Contributions welcome. Start by reading CONTRIBUTING.md for dev setup, conventions, and the PR checklist. The architecture overview in CLAUDE.md is the canonical reference for package layout, naming, and the JSONL format.

Branching: feature/<issue>-description for features, fix/<issue>-description for bugs. Commit messages follow Conventional Commits — release-please uses them to cut versions and write the changelog automatically.

License

MIT
