Black-box flight recorder for AI agents — record, replay, and diff LLM sessions
Black-box flight recorder for AI agents — record every LLM call your agent makes, replay sessions deterministically, and export a redacted evidence report when something breaks.
FlightBox is local-first. Recordings live in SQLite. No hosted dashboard is required.
Why
An agent failed and nobody can reproduce it. The final answer is in a log, but the interesting evidence is scattered across LLM requests, tool calls, model responses, timing, tokens, and local notes.
FlightBox gives you a deterministic debugging trail:
- record OpenAI / Anthropic / LiteLLM calls
- replay the same responses later
- diff two runs
- export JSONL or pytest replay tests
- generate a redacted Markdown / HTML report for PRs, CI notes, and teammates
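To make the record / replay idea concrete, here is a minimal sketch of how deterministic replay can work in principle: key each request by a hash of its canonical JSON, store the response in SQLite, and serve the stored response on replay. This is not FlightBox's implementation; `TinyRecorder`, its table layout, and `fake_llm` are hypothetical names used only for illustration.

```python
import hashlib
import json
import sqlite3


def request_key(request: dict) -> str:
    """Hash a request's canonical JSON so identical requests share a key."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TinyRecorder:
    """Hypothetical record/replay store; not FlightBox's real schema or API."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS calls (key TEXT PRIMARY KEY, response TEXT)"
        )

    def record(self, request: dict, call):
        """Run the live call and persist its response under the request key."""
        response = call(request)
        self.db.execute(
            "INSERT OR REPLACE INTO calls VALUES (?, ?)",
            (request_key(request), json.dumps(response)),
        )
        self.db.commit()
        return response

    def replay(self, request: dict):
        """Return the stored response without making a live call."""
        row = self.db.execute(
            "SELECT response FROM calls WHERE key = ?", (request_key(request),)
        ).fetchone()
        if row is None:
            raise KeyError("no recording for this request")
        return json.loads(row[0])


def fake_llm(request: dict) -> dict:
    """Stand-in for a real provider call."""
    return {"content": "4", "model": request["model"]}


recorder = TinyRecorder()
req = {"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}
live = recorder.record(req, fake_llm)
replayed = recorder.replay(req)
```

Hashing the sorted-key JSON means two requests that differ only in key order map to the same recording, which is one simple way to keep replay lookups stable.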
Quick Start
pip install flightbox
Record
import flightbox
from openai import OpenAI
client = OpenAI()
with flightbox.record("debug-session") as rec:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)
print(f"Recorded as run: {rec.run_id}")
Replay
import flightbox
from openai import OpenAI
client = OpenAI()
# Pass the run ID printed when the session was recorded.
with flightbox.replay("abc123def4"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)
Inspect
flightbox list
flightbox show <run-id>
flightbox stats <run-id>
flightbox timeline <run-id>
flightbox diff <run-a> <run-b>
flightbox diff <run-a> <run-b> --ignore-field request
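As a rough illustration of what ignoring a field in a diff means, the sketch below compares two runs event by event and skips any top-level field named in `ignore`. The event shape (dicts with `model`, `request`, and `response` keys) and the `diff_runs` helper are assumptions made for this example, not FlightBox's internal format.

```python
from itertools import zip_longest


def diff_runs(run_a, run_b, ignore=()):
    """Return (event_index, field, value_a, value_b) for every differing field."""
    differences = []
    # zip_longest pads the shorter run with empty events so extra calls show up.
    for index, (a, b) in enumerate(zip_longest(run_a, run_b, fillvalue={})):
        fields = sorted((set(a) | set(b)) - set(ignore))
        for field in fields:
            if a.get(field) != b.get(field):
                differences.append((index, field, a.get(field), b.get(field)))
    return differences


run_a = [{"model": "gpt-4o", "request": {"temperature": 0.0}, "response": "4"}]
run_b = [{"model": "gpt-4o", "request": {"temperature": 0.7}, "response": "four"}]

all_diffs = diff_runs(run_a, run_b)
without_request = diff_runs(run_a, run_b, ignore=("request",))
```

With `request` ignored, only the response difference remains, which is the kind of noise reduction the `--ignore-field request` flag is described as providing.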
Export
# JSONL eval rows
flightbox export <run-id> -f jsonl -o eval_dataset.jsonl
# Raw payloads are opt-in; the default JSONL export redacts common secrets.
flightbox export <run-id> -f jsonl --raw -o private_fixture.jsonl
# pytest replay skeleton
flightbox export <run-id> -f pytest -o test_replay.py
# redacted evidence report
flightbox report <run-id> -f md -o evidence.md
flightbox report <run-id> -f html -o evidence.html
flightbox report <run-id> \
  --note "reproduced after retry-path patch" \
  --verify "pytest tests/test_agent.py -q" \
  --env repo=agent-demo \
  -o evidence.md
# compact redacted call timeline
flightbox timeline <run-id> -o timeline.md
# audit raw recordings before sharing evidence
flightbox audit <run-id>
flightbox audit <run-id> -f json -o audit.json
flightbox audit <run-id> --policy .flightboxignore
The report redacts common API keys, bearer tokens, GitHub tokens, and authorization headers before writing the file. It also records lightweight evidence metadata: notes, verification commands, Python version, platform, and optional KEY=VALUE environment facts.
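For readers who want a feel for pattern-based redaction, here is a minimal sketch. The regular expressions and the `redact` helper are illustrative assumptions; FlightBox's actual rule set and placeholder format are not documented here and may differ.

```python
import re

# Illustrative patterns only; these are not FlightBox's real rules.
SECRET_PATTERNS = {
    "openai-key": re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    "github-token": re.compile(r"gh[pousr]_[A-Za-z0-9]{36,}"),
    "bearer-token": re.compile(r"bearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE),
}


def redact(text: str) -> str:
    """Replace every match with a placeholder naming the pattern that hit."""
    for name, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text


sample = "Authorization: Bearer abc.def.ghi and key sk-" + "a" * 24
clean = redact(sample)
```

Naming the pattern in the placeholder keeps a redacted report useful for debugging: a reader can see that a token was present without seeing its value.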
The timeline is a shorter PR-friendly view: one row per recorded call, with provider, model, latency, token totals, error state, and redacted request / response previews.
The audit command scans the raw recording for common secret patterns and reports only the event, top-level field, JSON path, pattern, and redacted preview. For noisy but safe fields, add a .flightboxignore policy:
# Ignore a whole top-level recording field.
field:token_usage
# Ignore one JSON path. `*` matches list entries.
path:request.messages.*.content
# Disable a pattern by name.
pattern:github-token
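The three rule types above can be sketched as a small parser plus a JSON walker. This is an assumption-laden illustration, not FlightBox's code: the helper names, the single example pattern, and the choice to let `*` match exactly one list index are all hypothetical.

```python
import re

# One illustrative pattern; a real audit would carry many more.
PATTERNS = {"github-token": re.compile(r"gh[pousr]_[A-Za-z0-9]{36,}")}


def parse_policy(text):
    """Split policy lines into field, path, and pattern rule sets."""
    rules = {"field": set(), "path": set(), "pattern": set()}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, _, value = line.partition(":")
        if kind in rules and value:
            rules[kind].add(value.strip())
    return rules


def walk(node, path=()):
    """Yield (path_segments, string_value) for every string leaf."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (str(key),))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (str(index),))
    elif isinstance(node, str):
        yield path, node


def path_matches(rule, segments):
    """Match segment by segment; `*` in the rule stands for one list index."""
    parts = rule.split(".")
    return len(parts) == len(segments) and all(
        p == s or (p == "*" and s.isdigit()) for p, s in zip(parts, segments)
    )


def audit(event, rules):
    """Return sorted (json_path, pattern_name) findings the policy does not exclude."""
    findings = []
    for segments, value in walk(event):
        if segments and segments[0] in rules["field"]:
            continue
        if any(path_matches(rule, segments) for rule in rules["path"]):
            continue
        for name, pattern in PATTERNS.items():
            if name not in rules["pattern"] and pattern.search(value):
                findings.append((".".join(segments), name))
    return sorted(findings)


token = "ghp_" + "A" * 36
event = {
    "request": {"messages": [{"role": "user", "content": token}]},
    "response": {"text": token},
}
policy = "# ignore prompt bodies\npath:request.messages.*.content\n"
findings = audit(event, parse_policy(policy))
```

Here the policy suppresses the hit inside the prompt body, so the only remaining finding is the token in `response.text`.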
LiteLLM
pip install "flightbox[litellm]"
import flightbox
import litellm
with flightbox.record("router-debug") as rec:
    litellm.completion(
        model="openrouter/openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
    )
with flightbox.replay(rec.run_id):
    litellm.completion(
        model="openrouter/openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
    )
CLI Reference
flightbox list # List recorded runs
flightbox show <run-id> # Show run details and events
flightbox stats <run-id> # Summarize latency, tokens, and errors
flightbox timeline <run-id> # Render a compact redacted call timeline
flightbox audit <run-id> # Check raw payloads for common secret patterns
flightbox audit <run-id> --policy .flightboxignore
flightbox diff <run-a> <run-b> # Compare two runs
flightbox diff <run-a> <run-b> --ignore-field request # Hide expected volatile fields
flightbox export <run-id> # Export as redacted JSONL or pytest
flightbox export <run-id> --raw # Keep raw JSONL payloads for private fixtures
flightbox report <run-id> # Export a redacted evidence report
flightbox report <run-id> --note "..." --verify "pytest -q" --env os=windows
flightbox delete <run-id> # Delete a recording
Supported SDKs
- OpenAI Python SDK (openai>=1.0): sync and async
- Anthropic Python SDK (anthropic>=0.20)
- LiteLLM (litellm>=1.0): completion and acompletion
- SDKs and frameworks that call through those clients
Storage
Recordings are stored in .flightbox/recordings.db by default. You can pass a custom database path with --db in the CLI or by constructing RecordStore yourself.
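Because a recording is an ordinary SQLite file, the Python standard library can inspect it. The sketch below lists the tables in a database without assuming anything about FlightBox's schema, which is not described here; `list_tables` is a hypothetical helper. It opens the file read-only so that inspection cannot modify a recording.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


default_db = Path(".flightbox/recordings.db")
if default_db.exists():
    print(list_tables(default_db))
```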
Roadmap
Record / replay / diff / report are solid. The next steps are about covering more of what agents actually call and turning recordings into a real regression gate:
- Wider SDK coverage: Google GenAI, Cohere, and raw HTTP LLM clients, so a recording doesn't depend on which SDK an agent happens to use.
- A baseline assertion for CI: flightbox assert <run> --against baseline.jsonl, a command that fails the build when an agent's call sequence drifts from a recorded baseline, so behavior change is caught in review, not in production.
- Cost and latency trends: roll the per-call token/latency data already captured into a small cross-run summary, so a regression in spend is as visible as one in output.
- A local transcript viewer: a single-file HTML view of a run's call chain, for when a Markdown timeline isn't enough to see where two runs diverged.
It stays local-first throughout — no recording ever has to leave your machine.
Related projects
- AgentProbe — a pytest plugin for regression-testing AI agents
- agentcikit — CLI tools for AI-agent, MCP, and CI evidence and safety
- CoreCoder — a minimal AI coding agent you can read end to end
License
MIT