Diagnose and auto-fix AI agent performance bottlenecks.
# agentslow

AI agent diagnostics CLI — diagnose, benchmark, and auto-fix AI agent performance.

Part of A.I Shovels — tools that dig into AI infrastructure problems.

```shell
pip install agentslow
```

## What it does

One command tells you why your AI agent is slow, expensive, or unreliable — and fixes it.

```shell
agentslow diagnose trace.yaml
```
agentslow classifies your agent into one of four performance regimes (Context-Bound, Reasoning-Bound, Tool-Bound, IO-Bound), computes novel metrics like Token Efficiency Ratio and Tool Re-entry Rate, then prescribes auto-applicable fixes with machine-readable config patches.
## Actual CLI Output (not marketing — run it yourself)

```
$ agentslow diagnose examples/openclaw_research_agent.yaml --entropy --fidelity \
    --config examples/agent_config_baseline.yaml --dry-run

═══ agentslow v0.8.0 ═══
Agent: openclaw_research_v2 (langgraph)
Task: Research competitor pricing for SaaS product and compile report
Status: ✓ Success

── REGIME CLASSIFICATION ──
Primary: CONTEXT-BOUND
Confidence: 90%

── KEY METRICS ──
Token Efficiency Ratio (TER): 0.0147
Tool Re-entry Rate: 0.2500
Time-to-First-Action: 1800ms
Reasoning Ratio: 0.0983
Total Cost: $0.5403
Total Duration: 32850ms

── PRESCRIPTIVE FIXES ──
1. [CRITICAL] [AUTO-FIX] Implement context compaction (summarization)
   Token Efficiency Ratio is 0.015 (healthy: >0.15). Your context is bloated
   with irrelevant tokens. Add a summarization step every N turns to compress
   conversation history.
   → Expected: 50-70% token reduction, major cost savings
2. [HIGH] [AUTO-FIX] Enable prompt/prefix caching
   Total input tokens: 110,100. Enable prefix caching to avoid re-processing
   the same system prompt on every LLM call.
   → Expected: 30-50% latency reduction on repeated calls
3. [MEDIUM] Tune RAG retrieval — retrieve less, retrieve better
   You're likely stuffing too many documents into context. Reduce top_k, add
   re-ranking, or switch to semantic chunking.
   → Expected: Fewer tokens = lower cost + faster inference

═══ CONTEXT ENTROPY ANALYSIS ═══
Session: openclaw_research_v2
Total Turns: 7

── ENTROPY METRICS ──
Average Entropy: 0.9374
Max Entropy: 1.0000
Entropy Trend: INCREASING
Semantic Drift Ratio: 0.1034
Noise Ratio: 0.2857
Compaction Integrity: 1.0000

── VERDICT ──
✗ CRITICAL — Context is critical

═══ DRY-RUN: Implement context compaction (summarization) ═══
context_compaction.enabled: false → true [CAUTION]
context_compaction.strategy: (none) → summarize_every_n [CAUTION]
context_compaction.n_turns: (none) → 5 [CAUTION]
Rollback: agentslow rollback --fix-id context-001

═══ DRY-RUN: Enable prompt/prefix caching ═══
enable_prompt_caching: false → true [SAFE]
Rollback: agentslow rollback --fix-id context-002

═══ GOLDEN SET FIDELITY TEST ═══
Tests Run: 15
Passed: 15 | Failed: 0
Overall Fidelity: 1.0000

── VERDICT ──
✓ PASSED — Safe to apply in production.
```
That's one command. Regime classification + metrics + fixes + entropy audit + dry-run diffs + fidelity verification (15 golden cases covering all 4 regimes).
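The headline TER number can be sanity-checked by hand. TER is output tokens divided by input tokens, so the 110,100 input tokens reported above combined with a TER of 0.0147 imply the agent emitted only about 1,600 output tokens. A quick back-of-envelope check:

```python
# Back-of-envelope check of the sample run above:
# TER = output_tokens / input_tokens, so output = TER * input.
input_tokens = 110_100        # "Total input tokens" from the report
ter = 0.0147                  # reported Token Efficiency Ratio
implied_output = round(ter * input_tokens)
print(implied_output)                        # → 1618
print(implied_output / input_tokens < 0.15)  # → True (far below the healthy threshold)
```

Paying for 110k input tokens to get ~1.6k tokens of output is exactly the Context-Bound pattern the tool flags.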
## Benchmark: Before/After Proof

```
$ agentslow benchmark examples/agent_config_baseline.yaml --compare --tasks 5

═══ BENCHMARK COMPARISON ═══
Tasks: 5

── BEFORE (baseline) ──
Tokens: 42,576 avg
Cost: $0.31 avg
Success: 80%

── AFTER (optimized) ──
Tokens: 23,218 avg (-45.5%)
Cost: $0.22 avg (-30.2%)
Success: 80%

Fixes Applied: 3
MICRO-EVAL GUARD: ALL CLEAR
```
## P99 Jitter Audit

```
$ agentslow benchmark examples/agent_config_baseline.yaml --jitter-audit --jitter-runs 5

═══ P99 JITTER AUDIT ═══
Runs: 5

── STABILITY ──
Overall: STABLE
Worst Jitter: 1.26x (p99/p50)
Production-ready: variance is within acceptable bounds.
```
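The p99/p50 ratio reported above can be sketched as follows. This is a hypothetical reimplementation using nearest-rank percentiles, not agentslow's internal code, and the latencies are made up:

```python
import math

def percentile(sorted_vals: list[float], q: float) -> float:
    # Nearest-rank percentile: smallest value covering at least q of the data.
    k = max(0, math.ceil(q * len(sorted_vals)) - 1)
    return sorted_vals[k]

def jitter_ratio(latencies_ms: list[float]) -> float:
    # p99/p50: how much worse the tail run is than the typical run.
    s = sorted(latencies_ms)
    return percentile(s, 0.99) / percentile(s, 0.50)

runs = [30500, 31200, 32850, 33100, 38400]  # made-up latencies for 5 runs
print(f"{jitter_ratio(runs):.2f}x")          # → 1.17x
```

A ratio near 1.0x means tail latency is close to median latency, i.e. the agent's runtime is stable across runs.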
## CI/CD Integration

```shell
# Fails the pipeline on CRITICAL entropy
agentslow diagnose trace.yaml --entropy --ci > junit_report.xml
echo $?  # exit code 1 on CRITICAL
```

The JUnit XML output integrates directly with GitHub Actions, GitLab CI, and Jenkins.
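If your pipeline is driven from Python rather than shell, the same gate can be wrapped with `subprocess`. The wrapper below is our sketch; only the CLI flags come from the README:

```python
import subprocess

def run_gate(cmd: list[str], report_path: str) -> int:
    """Run a diagnostic command, save its stdout as a report, return exit code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(report_path, "w") as f:
        f.write(result.stdout)  # JUnit XML is written to stdout per the README
    return result.returncode

# In CI you would call something like:
# code = run_gate(["agentslow", "diagnose", "trace.yaml", "--entropy", "--ci"],
#                 "junit_report.xml")
# raise SystemExit(code)  # non-zero on CRITICAL entropy fails the pipeline
```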
## Key Concepts

### Performance Regimes
| Regime | Symptom | Auto-Fix |
|---|---|---|
| Context-Bound | Low TER (<0.15), bloated context | Context compaction, prompt caching |
| Reasoning-Bound | High reasoning ratio (>0.5) | Reasoning budget limits, task decomposition |
| Tool-Bound | High tool re-entry (>0.3) | Tool call batching, result caching |
| IO-Bound | High TTA (>2000ms) | Parallel tool execution, connection pooling |
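A classifier over these four regimes can be sketched directly from the thresholds in the table. This is a hypothetical reimplementation for illustration, not agentslow's actual logic (in particular, the rule ordering and the single-label output are our assumptions):

```python
# Hypothetical regime classifier built from the table's thresholds.
# First matching rule wins; a trace hitting none of them is considered healthy.
def classify_regime(ter: float, reasoning_ratio: float,
                    tool_reentry: float, tta_ms: float) -> str:
    if ter < 0.15:
        return "CONTEXT-BOUND"      # bloated context: paying for ignored tokens
    if reasoning_ratio > 0.5:
        return "REASONING-BOUND"    # most compute spent thinking, not acting
    if tool_reentry > 0.3:
        return "TOOL-BOUND"         # retry/looping behavior on tool calls
    if tta_ms > 2000:
        return "IO-BOUND"           # slow to reach the first tool call
    return "HEALTHY"

# The sample run above (TER 0.0147) lands squarely in the first bucket:
print(classify_regime(0.0147, 0.0983, 0.25, 1800))  # → CONTEXT-BOUND
```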
### Novel Metrics

- Token Efficiency Ratio (TER): `output_tokens / input_tokens`. Healthy: >0.15. Below that, you're paying for context the model ignores.
- Tool Re-entry Rate: fraction of steps that re-call the same tool. High = your agent is retrying or looping.
- Time-to-First-Action (TTA): milliseconds from prompt to first tool call. Measures reasoning overhead.
- Reasoning Ratio: `reasoning_tokens / total_tokens`. How much compute goes to thinking vs. acting.
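As an illustration, all four metrics can be computed from a flat list of trace steps. The step schema below is our assumption for the sketch; real agentslow traces will differ:

```python
# Sketch of the four metrics over a list of step dicts (hypothetical schema:
# each step may carry input_tokens/output_tokens/reasoning_tokens, and tool
# steps carry a "tool" name plus a "t_ms" timestamp).
def compute_metrics(steps: list[dict]) -> dict:
    input_toks = sum(s.get("input_tokens", 0) for s in steps)
    output_toks = sum(s.get("output_tokens", 0) for s in steps)
    reasoning_toks = sum(s.get("reasoning_tokens", 0) for s in steps)
    tool_calls = [s["tool"] for s in steps if s.get("tool")]
    # Re-entry: consecutive calls to the same tool (retry/loop signature).
    reentries = sum(1 for a, b in zip(tool_calls, tool_calls[1:]) if a == b)
    first_tool = next((s for s in steps if s.get("tool")), None)
    total = input_toks + output_toks
    return {
        "ter": output_toks / input_toks if input_toks else 0.0,
        "tool_reentry_rate": reentries / len(tool_calls) if tool_calls else 0.0,
        "tta_ms": first_tool["t_ms"] if first_tool else None,
        "reasoning_ratio": reasoning_toks / total if total else 0.0,
    }
```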
### Context Entropy Monitor
Measures context health over long-running sessions:
- Per-turn entropy scores (0-1)
- Noise ratio (garbage accumulation)
- Semantic drift via TER sliding windows
- Compaction integrity validation
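One plausible definition of a per-turn score in [0, 1] is normalized Shannon entropy over the turn's word distribution. agentslow's exact formula isn't documented here, so treat this purely as a sketch of the idea:

```python
import math
from collections import Counter

def turn_entropy(text: str) -> float:
    """Normalized Shannon entropy of a turn's word distribution, in [0, 1]."""
    words = text.lower().split()
    if len(set(words)) <= 1:
        return 0.0  # empty or pure repetition carries no information
    counts = Counter(words)
    total = len(words)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))  # divide by max entropy for this vocabulary

print(turn_entropy("a b c d"))  # → 1.0 (all words distinct)
print(turn_entropy("a a a a"))  # → 0.0 (pure repetition)
```

Under this definition a monotonically increasing per-turn score, like the INCREASING trend in the sample run, would indicate the context accumulating noise faster than compaction removes it.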
## Safe Mode (Trust Architecture)

By default, agentslow shows fixes but doesn't write anything. The `--safe-mode` flag adds an extra preview layer; `--apply` is the explicit opt-in to write config patches.

```shell
# Preview only (default) — never writes files
agentslow diagnose trace.yaml --dry-run

# Safe mode — extra verbose preview with risk annotations
agentslow diagnose trace.yaml --safe-mode

# Apply — writes config patches after full safety chain
agentslow diagnose trace.yaml --apply
```
Six-gate safety chain: diagnose → classify → prescribe → golden-set verify → dry-run preview → human review → apply.
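The chain above can be pictured as a fail-fast pipeline. The sketch below is our illustration of that control flow, not agentslow internals; every gate must pass before the final apply step is allowed to write anything:

```python
# Fail-fast safety chain: each gate is a callable returning True to proceed.
def run_safety_chain(gates, apply_fn):
    for name, gate in gates:
        if not gate():
            return f"stopped at gate: {name}"  # nothing is written past a failed gate
    apply_fn()  # reached only when every gate has passed
    return "applied"
```

For example, a failed golden-set verification halts the run before the dry-run preview or apply step ever executes.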
## Auto-Fix Pipeline

1. Diagnose → classify regime + compute metrics
2. Fix → generate prescriptive fixes with config patches
3. Dry-Run → show diffs with SAFE/CAUTION/WARNING risk levels
4. Fidelity → verify fixes don't change agent behavior (15 golden cases)
5. Human Review → `--safe-mode` preview before any writes
6. Apply → machine-readable JSON config patches (explicit `--apply` opt-in)
7. CI Gate → non-zero exit on CRITICAL issues
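Applying a patch of the shape shown in the dry-run output (dotted key → new value) can be sketched as below. The patch format is inferred from that output and is our assumption; agentslow's real patch schema may differ:

```python
import copy

def apply_patch(config: dict, patch: dict) -> dict:
    """Apply a flat {dotted.key: value} patch, returning a new config dict."""
    new = copy.deepcopy(config)  # never mutate the original: enables rollback
    for dotted_key, value in patch.items():
        node = new
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})  # create nested sections as needed
        node[leaf] = value
    return new

# Keys taken from the dry-run diff shown earlier:
patch = {
    "context_compaction.enabled": True,
    "context_compaction.strategy": "summarize_every_n",
    "context_compaction.n_turns": 5,
}
cfg = apply_patch({"enable_prompt_caching": False}, patch)
print(cfg["context_compaction"]["n_turns"])  # → 5
```

Keeping the original config untouched is what makes a later `agentslow rollback --fix-id …` cheap: the pre-patch state is still available verbatim.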
## Framework Support
- LangGraph (primary) — native trace parser
- MCP — tool call analysis
- CrewAI — multi-agent traces
- Claude Code — session analysis
## Project Stats

- 4,510+ lines across 12 modules
- 101 tests, 15 golden fidelity cases
- 10 atomic commits (v0.1.0 → v0.8.0)
- CLI-first, no dashboard — the moat is auto-fixing
- Six-gate safety chain with `--safe-mode` trust architecture
## Quick Start

```shell
# Install
pip install agentslow

# Diagnose a trace
agentslow diagnose your_trace.yaml

# Full analysis with all features
agentslow diagnose your_trace.yaml \
    --entropy \
    --fidelity \
    --config your_config.yaml \
    --dry-run

# Safe mode — preview everything before writing
agentslow diagnose your_trace.yaml --safe-mode

# Apply fixes (explicit opt-in)
agentslow diagnose your_trace.yaml --apply

# Benchmark before/after
agentslow benchmark your_config.yaml --compare --tasks 5

# CI mode (JUnit XML + exit codes)
agentslow diagnose your_trace.yaml --entropy --ci
```
## License
MIT