
Black-box flight recorder for AI agents — record, replay, and diff LLM sessions



Black-box flight recorder for AI agents — record every LLM call your agent makes, replay sessions deterministically, and export a redacted evidence report when something breaks.

FlightBox is local-first. Recordings live in SQLite. No hosted dashboard is required.

Why

An agent failed and nobody can reproduce it. The final answer is in a log, but the interesting evidence is scattered across LLM requests, tool calls, model responses, timing, tokens, and local notes.

FlightBox gives you a deterministic debugging trail:

  • record OpenAI / Anthropic / LiteLLM calls
  • replay the same responses later
  • diff two runs
  • export JSONL or pytest replay tests
  • generate a redacted Markdown / HTML report for PRs, CI notes, and teammates

Quick Start

pip install flightbox

Record

import flightbox
from openai import OpenAI

client = OpenAI()

with flightbox.record("debug-session") as rec:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)

print(f"Recorded as run: {rec.run_id}")

Replay

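import flightbox
from openai import OpenAI

client = OpenAI()

# Replace "abc123def4" with a run ID printed during recording.
with flightbox.replay("abc123def4"):
    # The call is answered from the recording.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)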
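The idea behind deterministic replay can be illustrated without any SDK. The following is a minimal sketch, not FlightBox's implementation: `Tape`, `fake_backend`, and the choice to key responses by a hash of the request are hypothetical and exist only for illustration. In record mode the wrapper stores each response under the request hash; in replay mode it returns the stored response and never calls a backend.

```python
import hashlib
import json


class Tape:
    """Hypothetical in-memory tape; FlightBox itself stores recordings in SQLite."""

    def __init__(self):
        self.responses = {}

    @staticmethod
    def key(request):
        # Canonical JSON so that dict key order does not change the hash.
        blob = json.dumps(request, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def call(self, request, backend=None):
        k = self.key(request)
        if backend is not None:
            # Record mode: call the real backend and keep its response.
            self.responses[k] = backend(request)
        elif k not in self.responses:
            raise KeyError("no recorded response for this request")
        return self.responses[k]


def fake_backend(request):
    # Stand-in for a real LLM call (hypothetical).
    return {"content": "4", "model": request["model"]}


request = {"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}
tape = Tape()
recorded = tape.call(request, backend=fake_backend)  # record
replayed = tape.call(request)                        # replay, no backend involved
```

A request that was never recorded raises `KeyError` in this sketch, which makes drift between the recorded and replayed call sequence visible instead of silently falling through to a live call.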

Inspect

flightbox list
flightbox show <run-id>
flightbox stats <run-id>
flightbox timeline <run-id>
flightbox diff <run-a> <run-b>
flightbox diff <run-a> <run-b> --ignore-field request
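The effect of `--ignore-field` can be pictured with a small sketch on plain dictionaries. This is an illustration, not FlightBox's diff code: the event shape (`request`, `response`, `latency_ms`) and the function name `diff_runs` are assumptions.

```python
def diff_runs(run_a, run_b, ignore_fields=()):
    """Return (index, field, value_a, value_b) for every differing field.

    Each run is a list of event dicts. Fields named in ignore_fields are skipped.
    """
    differences = []
    for index in range(max(len(run_a), len(run_b))):
        # A run that is shorter than the other contributes an empty event.
        event_a = run_a[index] if index < len(run_a) else {}
        event_b = run_b[index] if index < len(run_b) else {}
        for field in sorted(set(event_a) | set(event_b)):
            if field in ignore_fields:
                continue
            if event_a.get(field) != event_b.get(field):
                differences.append((index, field, event_a.get(field), event_b.get(field)))
    return differences


run_a = [{"request": {"id": "r1"}, "response": "4", "latency_ms": 120}]
run_b = [{"request": {"id": "r2"}, "response": "4", "latency_ms": 95}]

all_diffs = diff_runs(run_a, run_b)
quiet_diffs = diff_runs(run_a, run_b, ignore_fields=("request", "latency_ms"))
```

Here `all_diffs` reports the request and latency differences, while `quiet_diffs` is empty because the only fields that changed were the ones marked as expected to vary.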

Export

# JSONL eval rows
flightbox export <run-id> -f jsonl -o eval_dataset.jsonl

# Raw payloads are opt-in; the default JSONL export redacts common secrets.
flightbox export <run-id> -f jsonl --raw -o private_fixture.jsonl

# pytest replay skeleton
flightbox export <run-id> -f pytest -o test_replay.py

# redacted evidence report
flightbox report <run-id> -f md -o evidence.md
flightbox report <run-id> -f html -o evidence.html
flightbox report <run-id> \
  --note "reproduced after retry-path patch" \
  --verify "pytest tests/test_agent.py -q" \
  --env repo=agent-demo \
  -o evidence.md

# compact redacted call timeline
flightbox timeline <run-id> -o timeline.md

# audit raw recordings before sharing evidence
flightbox audit <run-id>
flightbox audit <run-id> -f json -o audit.json
flightbox audit <run-id> --policy .flightboxignore

The report redacts common API keys, bearer tokens, GitHub tokens, and authorization headers before writing the file. It also records lightweight evidence metadata: notes, verification commands, Python version, platform, and optional KEY=VALUE environment facts.

The timeline is a shorter PR-friendly view: one row per recorded call, with provider, model, latency, token totals, error state, and redacted request / response previews.

The audit command scans the raw recording for common secret patterns and reports only the event, top-level field, JSON path, pattern, and redacted preview.

For noisy but safe fields, add a .flightboxignore policy:

# Ignore a whole top-level recording field.
field:token_usage

# Ignore one JSON path. `*` matches list entries.
path:request.messages.*.content

# Disable a pattern by name.
pattern:github-token
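How an audit with such a policy might work can be sketched in a few lines. This is not FlightBox's implementation: the two regular expressions, the `sk-` key shape, and the names `walk`, `path_ignored`, and `audit` are assumptions for illustration. Only the reported fields (JSON path, pattern name, redacted preview) and the `*` wildcard for list entries follow the description above.

```python
import re

# Hypothetical patterns; a real scanner would ship many more.
PATTERNS = {
    "openai-key": re.compile(r"sk-[A-Za-z0-9]{20,}"),
    "bearer-token": re.compile(r"Bearer\s+[A-Za-z0-9._\-]{16,}"),
}


def walk(value, path=()):
    """Yield (path, string) for every string leaf in a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from walk(child, path + (str(key),))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from walk(child, path + (str(index),))
    elif isinstance(value, str):
        yield path, value


def path_ignored(path, rule):
    """Match a dotted rule against a path; `*` matches one list index."""
    parts = rule.split(".")
    if len(parts) != len(path):
        return False
    return all(r == p or (r == "*" and p.isdigit()) for r, p in zip(parts, path))


def audit(event, ignore_paths=(), disabled_patterns=()):
    findings = []
    for path, text in walk(event):
        if any(path_ignored(path, rule) for rule in ignore_paths):
            continue
        for name, pattern in PATTERNS.items():
            if name in disabled_patterns:
                continue
            match = pattern.search(text)
            if match:
                # Keep only a short prefix so the finding does not repeat the secret.
                preview = match.group(0)[:6] + "...[redacted]"
                findings.append({"path": ".".join(path), "pattern": name, "preview": preview})
    return findings


event = {
    "request": {
        "headers": {"Authorization": "Bearer abcdefghijklmnop1234"},
        "messages": [{"role": "user", "content": "my key is sk-" + "a" * 24}],
    }
}

findings = audit(event)
filtered = audit(event, ignore_paths=("request.messages.*.content",))
```

With no policy, the sketch flags both the header and the message content. With the `request.messages.*.content` rule, only the header finding remains.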

LiteLLM

pip install "flightbox[litellm]"

import flightbox
import litellm

with flightbox.record("router-debug") as rec:
    litellm.completion(
        model="openrouter/openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
    )

with flightbox.replay(rec.run_id):
    litellm.completion(
        model="openrouter/openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
    )

CLI Reference

flightbox list                    # List recorded runs
flightbox show <run-id>           # Show run details and events
flightbox stats <run-id>          # Summarize latency, tokens, and errors
flightbox timeline <run-id>       # Render a compact redacted call timeline
flightbox audit <run-id>          # Check raw payloads for common secret patterns
flightbox audit <run-id> --policy .flightboxignore
flightbox diff <run-a> <run-b>    # Compare two runs
flightbox diff <run-a> <run-b> --ignore-field request  # Hide expected volatile fields
flightbox export <run-id>         # Export as redacted JSONL or pytest
flightbox export <run-id> --raw   # Keep raw JSONL payloads for private fixtures
flightbox report <run-id>         # Export a redacted evidence report
flightbox report <run-id> --note "..." --verify "pytest -q" --env os=windows
flightbox delete <run-id>         # Delete a recording

Supported SDKs

  • OpenAI Python SDK (openai>=1.0) — sync and async
  • Anthropic Python SDK (anthropic>=0.20)
  • LiteLLM (litellm>=1.0) — completion and acompletion
  • SDKs and frameworks that call through those clients

Storage

Recordings are stored in .flightbox/recordings.db by default. You can pass a custom database path with --db in the CLI or by constructing RecordStore yourself.
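Because the store is an ordinary SQLite file, the Python standard library can open it for a quick look. The sketch below assumes only the default path stated above and makes no assumption about FlightBox's schema: it lists whatever tables exist by querying `sqlite_master`. The helper name `list_tables` is hypothetical.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names found in a SQLite file, in alphabetical order."""
    # Open read-only so an inspection can never modify a recording.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


default_db = Path(".flightbox") / "recordings.db"
if default_db.exists():
    print(list_tables(default_db))
```

Opening the file with `mode=ro` is a deliberate choice here: recordings are evidence, so a read-only connection keeps an exploratory query from altering them.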

Roadmap

Record / replay / diff / report are solid. The next steps are about covering more of what agents actually call and turning recordings into a real regression gate:

  • Wider SDK coverage — Google GenAI, Cohere, and raw HTTP LLM clients, so a recording doesn't depend on which SDK an agent happens to use.
  • A baseline assertion for CI: a flightbox assert <run> --against baseline.jsonl command that fails the build when an agent's call sequence drifts from a recorded baseline, so behavior changes are caught in review, not in production.
  • Cost and latency trends — roll the per-call token/latency data already captured into a small cross-run summary, so a regression in spend is as visible as one in output.
  • A local transcript viewer — a single-file HTML view of a run's call chain, for when a Markdown timeline isn't enough to see where two runs diverged.

It stays local-first throughout — no recording ever has to leave your machine.

Related projects

  • AgentProbe — a pytest plugin for regression-testing AI agents
  • agentcikit — CLI tools for AI-agent, MCP, and CI evidence and safety
  • CoreCoder — a minimal AI coding agent you can read end to end

License

MIT
