CI for AI agents - behavioral fingerprinting and drift detection
Spooled — Behavioral CI for AI Agents
One prompt edit quietly turned this customer-support agent into a refund machine. Spooled caught it on the PR.
A PM asks for "a more helpful tone for frustrated customers." An engineer adds one sentence to the system prompt: "Resolve their issue when possible." Unit tests pass. The reviewer approves. The PR is ready to merge.
But the LLM now interprets "resolve" liberally. On complaint tickets, the agent stops escalating refund requests to humans and starts issuing refunds itself. The agent's tool-call structure changed even though the prompt edit looked harmless.
Spooled diffs the agent's behavior against the committed baseline and posts this on the PR:
🚨 Merge blocked: agent now calls `issue_refund`
This tool was never observed in the baseline. It appears in
2 of 5 traces in this PR (~40%).
Triggered by a one-sentence change to the system prompt.
Caught content-blind — Spooled compared tool graphs, not language. It never saw a customer message or an LLM response.
Run it yourself in 60 seconds
pip install spooled-ai
spooled demo
Runs the entire scenario in your terminal — no API key, no setup, no files left behind. The variant agent differs from the baseline by exactly one line in the system prompt. The code is otherwise identical.
What It Does
Capture — wraps your LLM client and records the structural fingerprint of every agent run: which tools were called, in what order, how many times. Content-blind by architecture — prompts, customer data, and AI responses never leave your infrastructure.
Compare — diffs the current run against a committed baseline. Shows exactly what changed: tools added, tools removed, sequence reordered, token usage shifted.
Gate — posts a PR comment with the human-readable consequence as the headline. Blocks the merge if the policy says so. Resolution instructions included.
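The Capture and Compare steps above can be illustrated with a minimal sketch. It is not Spooled's implementation: the function names, the trace representation (a plain list of tool names per run), and the `lookup_order` tool are assumptions made for illustration. It only shows how a comparison can work on tool names alone, without any prompt or response text.

```python
import hashlib

def fingerprint(tool_calls):
    """Hash the ordered tool-name sequence of one run (hypothetical scheme)."""
    joined = "\x1f".join(tool_calls)  # unit separator keeps names unambiguous
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def diff_tools(baseline_traces, current_traces):
    """Report tools that appear or disappear between two groups of traces."""
    baseline_tools = {tool for trace in baseline_traces for tool in trace}
    current_tools = {tool for trace in current_traces for tool in trace}
    added = sorted(current_tools - baseline_tools)
    removed = sorted(baseline_tools - current_tools)
    # For each newly observed tool, count the current traces that contain it.
    hits = {tool: sum(tool in trace for trace in current_traces) for tool in added}
    return {"added": added, "removed": removed, "hits": hits}

# Hypothetical traces: each one is the ordered list of tools an agent run called.
baseline = [["lookup_order", "escalate_to_human"] for _ in range(5)]
current = [
    ["lookup_order", "issue_refund"],
    ["lookup_order", "issue_refund"],
    ["lookup_order", "escalate_to_human"],
    ["lookup_order", "escalate_to_human"],
    ["lookup_order", "escalate_to_human"],
]

report = diff_tools(baseline, current)
for tool, count in report["hits"].items():
    share = count / len(current)
    print(f"{tool}: never in baseline, {count} of {len(current)} traces (~{share:.0%})")
```

Two runs with the same tool sequence get the same fingerprint, so a changed fingerprint is a cheap first signal that the set-level diff is worth inspecting.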
Install
pip install spooled-ai
Quick Start
import spooled
from spooled.wrappers import wrap_openai
from openai import OpenAI

spooled.init(agent_id="my_agent")
client = wrap_openai(OpenAI())

# MY_TOOLS is your own list of OpenAI tool definitions (not shown here).
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Analyze this deal"}],
    tools=MY_TOOLS,
)

spooled.shutdown()
That's it. Every tool call is captured, and the trace is saved to `.spooled/traces/`. A hash chain signs every interaction at capture time.
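For readers unfamiliar with hash chains, here is a minimal sketch of the general technique. The record layout, the genesis value, and the function names are assumptions for illustration; Spooled's actual trace format is not described here. A plain hash chain like this one is tamper-evident, and a keyed signature would need additional machinery.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first record

def _digest(prev_hash, event):
    """Hash the previous record's hash together with this event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_record(chain, event):
    """Append an event whose hash depends on everything recorded before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})

def verify(chain):
    """Recompute every link; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True

chain = []
append_record(chain, {"seq": 0, "tool": "lookup_order"})       # hypothetical tool
append_record(chain, {"seq": 1, "tool": "escalate_to_human"})
print(verify(chain))
```

Because each hash covers the previous one, rewriting an early record after the fact invalidates that record and every record after it.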
CI Integration
# .github/workflows/spooled.yml
- name: Generate traces
  run: python ci_runner.py
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- name: Spooled behavioral check
  run: |
    pip install spooled-ai
    spooled ci compare .spooled/traces/*.jsonl \
      --baseline .github/baselines \
      --policy spooled-policy.yml \
      --enable-blocking
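A CI step blocks a merge by exiting with a non-zero status, which GitHub Actions treats as a failed step. The sketch below shows that gating idea in isolation. The `gate_exit_code` function, the notion of a forbidden-tool list, and the way the `enable_blocking` argument behaves are all assumptions for illustration; they are not the documented format of `spooled-policy.yml` or the semantics of `--enable-blocking`.

```python
def gate_exit_code(added_tools, forbidden_tools, enable_blocking):
    """Return a process exit code: 1 fails the CI step, 0 lets it pass."""
    violations = sorted(set(added_tools) & set(forbidden_tools))
    for tool in violations:
        print(f"Policy violation: agent now calls {tool}")
    # Assumed behavior: without blocking, violations are reported but not fatal.
    return 1 if violations and enable_blocking else 0

# Hypothetical policy: newly observed refund calls must not reach main.
code = gate_exit_code(
    added_tools=["issue_refund"],
    forbidden_tools={"issue_refund"},
    enable_blocking=True,
)
print(code)
```

In a real runner script the returned value would be passed to `sys.exit` so that the workflow step fails.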
Example PR comment:
## ❌ Spooled Behavioral CI: FAIL
> Spooled Score: 59/100 (D) 🔴
> [!CAUTION]
> ## 🚨 Merge blocked: agent now calls `issue_refund`
>
> This tool was **never observed in the baseline**. It appears in
> **2 of 5** traces in this PR (~40%).
**5** traces analyzed | ✅ **3** passed | ❌ **2** policy failures
### Trace Results
| Agent | Fingerprint | Status | Score |
|----------------|-----------------|---------------|-------|
| support_agent | `4d893b5cef...` | ⚠️ Behavior change | 59 |
<details>
<summary>🔧 Tool Changes (2 traces)</summary>
- ➕ `issue_refund` added
- ➖ `escalate_to_human` removed
</details>
What Spooled Catches
| Change type | Example | Unit tests | Spooled |
|---|---|---|---|
| Prompt tweak | "Be concise" drops compliance tools | ✅ Pass | Behavior change |
| Model swap | Model drops sanctions screening | ✅ Pass | Behavior change |
| Tool deprecation | Agent proceeds without critical data | ✅ Pass | Behavior change |
| KB refresh | Ticket response path changes | ✅ Pass | Behavior change |
| Schema migration | Field rename breaks detection | ✅ Pass | Behavior change |
| Upstream degradation | Retry paths appear in fingerprint | ✅ Pass | Behavior change |
Content-Blind Architecture
Spooled never captures prompts, customer data, or AI responses. Only structural metadata: tool names, call sequence, token counts, timing. This is enforced in code — content is stripped before the trace reaches disk.
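As an illustration of what stripping content before the trace reaches disk can look like, here is a minimal sketch. It assumes a response shaped like an OpenAI chat completion converted to a plain dict; the `structural_view` function and the output field names are hypothetical, not Spooled's schema.

```python
def structural_view(response, elapsed_ms):
    """Keep tool names, token counts, and timing; drop all text and arguments."""
    message = response["choices"][0]["message"]
    tool_calls = message.get("tool_calls") or []
    usage = response.get("usage", {})
    return {
        "tools": [call["function"]["name"] for call in tool_calls],
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "elapsed_ms": elapsed_ms,
    }

# Hypothetical response containing customer-specific content.
response = {
    "choices": [{
        "message": {
            "content": "I have refunded order 1234.",
            "tool_calls": [{
                "function": {
                    "name": "issue_refund",
                    "arguments": "{\"order_id\": \"1234\"}",
                }
            }],
        }
    }],
    "usage": {"prompt_tokens": 120, "completion_tokens": 18},
}

record = structural_view(response, elapsed_ms=850)
print(record)
```

The design choice is an allowlist: only named structural fields are copied out, so message text and tool arguments are never written unless someone adds them explicitly.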
Supported Libraries
LLM Providers (explicit wrappers):
- OpenAI (sync/async, streaming)
- Anthropic (sync/async, streaming)
HTTP & Cloud (auto-instrumented via hooks):
- AWS Bedrock
- requests, httpx, aiohttp
Frameworks (callback handlers):
- LangChain, LlamaIndex, AutoGen, CrewAI, LangGraph
Documentation
License
Proprietary.