Terminal AI agent with built-in execution tracing and observability

BlueClaw

Understand, debug, and control AI agent behavior.
Structured tracing, context management, and reproducible runs — all from the terminal.

Quickstart · Tracing · Testing · Models · Roadmap · License

Feature                       BlueClaw                           Typical agent frameworks
Structured execution traces   Every run, automatic               None or manual logging
Regression testing            YAML specs, TAP/JUnit, Wilson CI   Not available
Trace replay                  Step-through debugger              Not available
Trace diff                    A/B test prompt changes            Not available
Trace explain                 LLM post-hoc analysis              Not available
Aggregate stats               Cost, timing, failure rates        Not available
CLI-first debugging           No dashboards required             Dashboard or nothing

Quickstart

pip install blueclaw
blueclaw init
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env
blueclaw

Tracing & Observability

Every agent run produces a structured JSON trace. Nine CLI commands let you inspect runs after the fact — no dashboards, no external services, no setup.

See what happened: trace graph

$ blueclaw trace graph 20260315-054426

search for Python 3.13 new features
├── web_search (1ms) ✓  query: Python 3.13 new features
├── web_search (1ms) ✓  query: Python 3.13 new features list 2024
└── http_request (366ms) ✓  url: https://docs.python.org/3.13/whatsnew/3.13.html

Find the bottleneck: trace timeline

$ blueclaw trace timeline 20260315-054426

Goal: search for Python 3.13 new features
Model: claude-sonnet-4-6 · 3 steps · 1840 tokens · $0.0073

 #  Tool          Start    Duration  Cumulative  Bar
 1  web_search      +0ms      1ms         1ms    █
 2  web_search    +120ms      1ms         2ms    █
 3  http_request  +250ms    366ms       368ms    ██████████████████████

Tool time: 368ms · Wall time: 4100ms · Overhead: 91%
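
The overhead figure in the footer is consistent with the share of wall-clock time spent outside tool calls (model latency, orchestration, and so on). A minimal sketch of that arithmetic, assuming this is how the percentage is derived; `overhead_pct` is a hypothetical helper, not part of BlueClaw:

```python
def overhead_pct(tool_ms: float, wall_ms: float) -> float:
    """Share of wall-clock time spent outside tool calls, as a percentage."""
    if wall_ms <= 0:
        raise ValueError("wall_ms must be positive")
    return 100.0 * (wall_ms - tool_ms) / wall_ms


# Values from the timeline above: 368 ms of tool time in a 4100 ms run.
print(round(overhead_pct(368, 4100)))  # 91
```

A high overhead like this suggests the tools are not the bottleneck, so faster tools would barely change the run time.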

Understand why: trace explain

Feed a recorded trace to an LLM for post-hoc explanation.

$ blueclaw trace explain 20260315-054426

The agent searched for Python 3.13 features, found the results too generic,
refined its query to include "list 2024", then fetched the official changelog
from docs.python.org. The two-step search pattern suggests the first results
didn't contain enough detail...

Post-hoc explanation · not the agent's actual reasoning

Compare two runs: trace diff

$ blueclaw trace diff 20260315-054426 20260315-071830

Run A: 20260315-054426  Run B: 20260315-071830
Goal A: search for Python 3.13 new features
Goal B: search for Python 3.13 new features

Steps:  3 → 2 (-1)
Tokens: 1840 → 1200 (-640)
Cost:   $0.0073 → $0.0048
Time:   368ms → 420ms (+52ms)
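
Each line of the diff is a signed delta between the two runs. A small sketch of that formatting for integer metrics, assuming the sign convention is "B minus A"; `delta_line` is a hypothetical helper written for illustration only:

```python
def delta_line(label: str, a: int, b: int, unit: str = "") -> str:
    """Format 'label: a -> b (signed delta)' for two integer measurements."""
    return f"{label}: {a}{unit} → {b}{unit} ({b - a:+d}{unit})"


# Numbers taken from the diff output above.
print(delta_line("Steps", 3, 2))           # Steps: 3 → 2 (-1)
print(delta_line("Tokens", 1840, 1200))    # Tokens: 1840 → 1200 (-640)
print(delta_line("Time", 368, 420, "ms"))  # Time: 368ms → 420ms (+52ms)
```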

Debug step by step: trace replay

$ blueclaw trace replay 20260315-054426

Step 1: web_search (1ms) ✓
  input query: Python 3.13 new features
  output: Found 10 results...
[Enter] next · [q] quit >

Track performance: trace stats

$ blueclaw trace stats --since 7

Trace Stats · 23 runs · last 7 days

Overview
  Total runs:     23
  Total steps:    87
  Avg steps/run:  3.8
  Avg tokens/run: 2,450
  Avg cost/run:   $0.0082
  Total cost:     $0.19

Timing
  Avg duration:    5.1s
  Median duration: 4.2s
  p95 duration:    12.3s
  Avg tool time:   2.1s (41% of wall)

Top Tools (by frequency)
  shell_command        34 calls (39%)
  web_search           28 calls (32%)
  http_request         18 calls (21%)
  file_read             7 calls (8%)

Failed Steps (3 across 2 runs · 3.4% step failure rate)
  timeout              2 (67%)
  network              1 (33%)
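
The percentages in this report follow directly from the raw counts. The sketch below recomputes the tool shares and the step failure rate from the numbers shown above. It assumes simple rounding and is not BlueClaw's own code:

```python
from collections import Counter

# Call counts copied from the "Top Tools" section above.
tool_calls = Counter(
    {"shell_command": 34, "web_search": 28, "http_request": 18, "file_read": 7}
)
total_steps = sum(tool_calls.values())  # 87, matching "Total steps"

# Share of all steps per tool, rounded to whole percent.
shares = {tool: round(100 * n / total_steps) for tool, n in tool_calls.most_common()}

# 3 failed steps out of 87.
failure_rate = round(100 * 3 / total_steps, 1)

print(shares)        # {'shell_command': 39, 'web_search': 32, 'http_request': 21, 'file_read': 8}
print(failure_rate)  # 3.4
```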

All trace commands

Command                          Use case
trace list                       Find a run ID to inspect
trace show <id>                  Detailed step table with timing
trace graph <id>                 Quick tree view of tool sequence
trace timeline <id>              Find bottlenecks: where does time go?
trace explain <id>               LLM explains what happened and why
trace diff <id1> <id2>           Compare two runs (A/B test prompts)
trace replay <id>                Step-through debugger for tool calls
trace replay <id> --stub-tools   Re-run with recorded outputs, compare tool sequence
trace stats                      Aggregate performance across all runs
trace purge                      Delete old traces (default: 30 days)

Regression Testing

Define expected agent behavior in YAML, run it as a test suite, get CI-friendly output.

Test spec

# test-spec.yaml
tests:
  - goal: search for Python web frameworks and save to frameworks.txt
    expected_tools: [web_search, shell_command]
    expected_file_contains:
      frameworks.txt: "Django"
    tool_order: [web_search, shell_command]
    forbidden_tools: [http_request]
    max_steps: 5

  - goal: check the current weather in Tokyo using wttr.in
    expected_tools: [http_request]
    expected_output_contains: Tokyo
    max_cost: 0.05
    runs: 5
    threshold: 0.55

model: anthropic/claude-haiku-4-5-20251001
allowlist_domains:
  - wttr.in

Run tests

$ blueclaw test test-spec.yaml

TAP version 13
1..2
ok 1 - search for Python web frameworks and save to frameworks.txt
ok 2 - check the current weather in Tokyo using wttr.in
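
Because the output is plain TAP, a few lines of Python are enough to post-process it in a custom CI step. The sketch below counts results from TAP text. It assumes that inconclusive cases carry the `# INCONCLUSIVE` marker on the test line itself, which this README does not spell out; `summarize_tap` is a hypothetical helper:

```python
import re

# Matches TAP test lines such as "ok 1 - description" or "not ok 2 - description".
TAP_LINE = re.compile(r"^(ok|not ok)\s+(\d+)\s*-?\s*(.*)$")


def summarize_tap(text: str) -> dict:
    """Count passing, failing, and inconclusive test lines in TAP output."""
    counts = {"pass": 0, "fail": 0, "inconclusive": 0}
    for line in text.splitlines():
        match = TAP_LINE.match(line.strip())
        if not match:
            continue  # version line, plan line ("1..2"), or diagnostics
        status, _number, rest = match.groups()
        if "# INCONCLUSIVE" in rest.upper():  # assumed marker placement
            counts["inconclusive"] += 1
        elif status == "ok":
            counts["pass"] += 1
        else:
            counts["fail"] += 1
    return counts


sample = """TAP version 13
1..2
ok 1 - search for Python web frameworks and save to frameworks.txt
ok 2 - check the current weather in Tokyo using wttr.in
"""
print(summarize_tap(sample))  # {'pass': 2, 'fail': 0, 'inconclusive': 0}
```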

Assertions

Field                       Check
expected_tools              Every listed tool was called (subset match)
expected_output_contains    Case-insensitive substring match on response
max_steps                   Agent used no more than N tool calls
max_cost                    Run cost stayed under budget
forbidden_tools             None of these tools were called
expected_files              Each path exists in workspace after the run
expected_file_contains      File exists AND contains substring (case-insensitive)
forbidden_output_contains   Substring must NOT appear in response
output_regex                Regex pattern must match response
tool_order                  Listed tools appear in this order, as a subsequence of the calls made
max_duration_s              Wall-clock time under budget
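
To make the matching rules concrete, here is a rough Python sketch of three of the checks: the subset match, the subsequence match, and the case-insensitive substring match. The function names are hypothetical and the logic is one interpretation of the table above, not BlueClaw's implementation:

```python
def tools_subset(expected: list[str], called: list[str]) -> bool:
    """expected_tools: every listed tool was called at least once."""
    return set(expected) <= set(called)


def in_order(expected: list[str], called: list[str]) -> bool:
    """tool_order: `expected` is a subsequence of `called` (other calls may sit in between)."""
    remaining = iter(called)
    # `tool in remaining` advances the iterator, so each match must come after the previous one.
    return all(tool in remaining for tool in expected)


def contains_ci(needle: str, haystack: str) -> bool:
    """expected_output_contains: case-insensitive substring match."""
    return needle.lower() in haystack.lower()


called = ["web_search", "web_search", "shell_command"]
print(tools_subset(["web_search", "shell_command"], called))  # True
print(in_order(["web_search", "shell_command"], called))      # True
print(in_order(["shell_command", "web_search"], called))      # False
print(contains_ci("tokyo", "Weather for Tokyo"))              # True
```

Under this reading, tool_order tolerates extra or repeated calls between the listed tools, which keeps specs from breaking when the agent retries a search.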

Spec-level fields

Field               Purpose
model               Override model for all tests in the spec
allowlist_domains   Domains allowed for http_request (merged with blueclaw.yaml)

Multi-run with Wilson CI

LLMs are non-deterministic. Set runs: N (N > 1) to execute a test N times and judge its pass rate with a Wilson score confidence interval (the "Wilson CI"), giving a statistically grounded verdict instead of a brittle single-run pass/fail:

  • Pass — Wilson CI lower bound >= threshold
  • Fail — Wilson CI upper bound < threshold
  • Inconclusive — CI straddles the threshold (needs more runs)

Inconclusive tests exit 0 so they don't fail a CI pipeline, but they surface as # INCONCLUSIVE in TAP and <skipped> in JUnit XML.
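
The three verdicts can be sketched with the standard Wilson score interval. The example assumes a 95% confidence level (z = 1.96), which this README does not state, and the function names are hypothetical rather than BlueClaw's API:

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z = 1.96 (95%) is an assumption."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half


def verdict(passes: int, runs: int, threshold: float, z: float = 1.96) -> str:
    """Map the interval onto pass / fail / inconclusive as described above."""
    lower, upper = wilson_interval(passes, runs, z)
    if lower >= threshold:
        return "pass"
    if upper < threshold:
        return "fail"
    return "inconclusive"


# runs: 5 and threshold: 0.55, as in the weather test in the spec above.
print(verdict(5, 5, 0.55))  # pass
print(verdict(4, 5, 0.55))  # inconclusive
print(verdict(0, 5, 0.55))  # fail
```

With these assumptions, five passes out of five give a lower bound of about 0.566, just above the 0.55 threshold, while a single failed run already leaves the case inconclusive. More runs narrow the interval.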

Output formats

blueclaw test spec.yaml                          # TAP to stdout (default)
blueclaw test spec.yaml --format junit           # JUnit XML to stdout
blueclaw test spec.yaml -o results.xml -f junit  # write to file
blueclaw test spec.yaml --dry-run                # validate spec, no API calls
blueclaw test spec.yaml --keep-workspace         # preserve workspaces for inspection
blueclaw test spec.yaml --model anthropic/claude-haiku-4-5-20251001  # override model

Exit code: 0 when every test passes or is inconclusive, 1 if any test fails.

Per-run diagnostics

With --keep-workspace, each run directory contains .blueclaw/result.json — the full TestResult with verdict, failures, tools called, cost, and duration. Inspect individual runs to understand why a multi-run case passed or failed:

$ cat /tmp/blueclaw-test-.../case-007/run-002/.blueclaw/result.json
{
  "goal": "check the current weather in Tokyo using wttr.in",
  "passed": true,
  "verdict": "pass",
  "tools_called": ["http_request"],
  "cost": 0.009,
  "duration_s": 4.4
}
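
For a multi-run case it can be quicker to aggregate the per-run files than to read them one by one. The sketch below summarizes every result.json under one case directory. It relies only on the fields shown in the sample above (passed, cost, duration_s) and on the case-…/run-…/.blueclaw/ layout from that path. The demo workspace and its values are made up for illustration:

```python
import json
import tempfile
from pathlib import Path


def summarize_case(case_dir: Path) -> dict:
    """Aggregate every run-*/.blueclaw/result.json under one case directory."""
    results = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(case_dir.glob("run-*/.blueclaw/result.json"))
    ]
    return {
        "runs": len(results),
        "passed": sum(1 for r in results if r.get("passed")),
        "total_cost": round(sum(r.get("cost", 0.0) for r in results), 4),
        "slowest_s": max((r.get("duration_s", 0.0) for r in results), default=0.0),
    }


# Self-contained demo: build a throwaway workspace with two runs, then summarize it.
with tempfile.TemporaryDirectory() as tmp:
    case = Path(tmp) / "case-007"
    for name, passed, duration in [("run-001", False, 6.1), ("run-002", True, 4.4)]:
        run_dir = case / name / ".blueclaw"
        run_dir.mkdir(parents=True)
        record = {"passed": passed, "cost": 0.009, "duration_s": duration}
        (run_dir / "result.json").write_text(json.dumps(record), encoding="utf-8")
    summary = summarize_case(case)

print(summary)
```

Point `summarize_case` at a real case directory kept by --keep-workspace to see how many of its runs passed and what they cost in total.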

Stub replay

Re-run a recorded trace with stubbed tool outputs — no real execution, no API cost for tools:

$ blueclaw trace replay 20260315-054426 --stub-tools

Original: web_search -> http_request
Replayed: web_search -> http_request
Result: MATCH (same tool sequence)

Use --model to test whether a different model makes the same tool choices given the same context.

Model Support

blueclaw                                          # Anthropic (default)
blueclaw --model ollama/llama3                    # Ollama (local)
blueclaw --model openai/gpt-4.1-mini              # OpenAI
blueclaw --model litellm/gemini/gemini-2.0-flash  # Gemini via LiteLLM

Set API keys in .env:

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...

Configuration

blueclaw.yaml in your project root:

model:
  provider: anthropic
  model_id: claude-sonnet-4-6

workspace:
  path: ~/blueclaw/workspace/
  trace_retention_days: 30             # auto-purge old traces; 0 = keep forever

tools:
  - web
  - shell
  - pdf
  - mcp:https://localhost:8080/sse     # custom MCP server

allowlist_domains:
  - github.com
  - docs.python.org
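
How allowlist_domains is matched against a URL is not documented here. One plausible reading is sketched below: the project allowlist is merged with a test spec's list, and a URL is accepted when its host equals an allowed domain or is a subdomain of one. The subdomain rule and the `is_allowed` helper are assumptions for illustration only:

```python
from urllib.parse import urlsplit


def is_allowed(url: str, allowlist: list[str]) -> bool:
    """True if the URL's host is an allowed domain or a subdomain of one (assumed rule)."""
    host = (urlsplit(url).hostname or "").lower()
    domains = [d.lower() for d in allowlist]
    return any(host == d or host.endswith("." + d) for d in domains)


config_domains = ["github.com", "docs.python.org"]  # from blueclaw.yaml above
spec_domains = ["wttr.in"]                          # from the test spec
merged = sorted(set(config_domains) | set(spec_domains))

print(is_allowed("https://docs.python.org/3.13/whatsnew/3.13.html", merged))  # True
print(is_allowed("https://wttr.in/Tokyo", merged))                            # True
print(is_allowed("https://evilgithub.com/", merged))                          # False
```

Matching on the parsed hostname rather than on the raw URL string avoids accepting look-alike hosts such as evilgithub.com.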

Architecture

Module         Purpose
cli.py         Typer entrypoints, welcome banner, trace tooling
session.py     Config, model factory, agent, chat loop, background context updater
workspace.py   Sandbox enforcement, context/history/trace I/O
observer.py    Structured tool tracing + output truncation
models.py      Pydantic models, trace schema, cost calculation, error classification
testing.py     Test spec loading, runner, assertions, formatters, stub replay
tools/         Web, shell, MCP wiring (factory pattern)
approval.py    Shell command + domain allowlist hooks

Built on the Strands Agents SDK, which handles the agent loop, tool execution, streaming, and model switching.

Roadmap

See docs/roadmap.md for the full roadmap with milestone details.

Development

pip install -e ".[dev]"
pytest
flake8 blueclaw/ tests/
black --check blueclaw/ tests/

License

MIT
