# 🛠️ SpecOps: Agent Reliability Kit

Framework-agnostic, OTel-native toolkit for reliable, evaluatable, debuggable, and self-healing LLM agents in production.

Getting Started · Features · Simulation · Coordination · Roadmap · Contributing
## The Problem

LLM agents fail silently. They hallucinate, loop, drift off-task, and degrade without warning. Teams building agentic systems today lack:

- Observability – no standardized way to trace agent reasoning, tool calls, and decision paths
- Evaluation – no framework-agnostic way to measure whether agents actually do what they're supposed to
- Debugging – when agents fail, root-cause analysis is guesswork
- Self-healing – agents crash and stay crashed; no recovery patterns exist
- Simulation – no way to test for emergent failures before they hit production
## Getting Started

### Installation

```bash
pip install specops-ai
```

With framework adapters (quoted so the brackets survive shells like zsh):

```bash
pip install "specops-ai[langgraph]"   # LangGraph support
pip install "specops-ai[crewai]"      # CrewAI support
pip install "specops-ai[all]"         # All adapters
```
### One-Line Quickstart

```python
from specops_ai import trace_agent

@trace_agent(name="my-agent")
def agent(task: str) -> str:
    return "done"  # Your agent logic – now fully traced via OTel
```
### Trace Any Agent

```python
from specops_ai import trace_agent, trace_tool, trace_llm

@trace_tool(name="search")
def search(query: str) -> list[str]:
    return ["result1", "result2"]

@trace_llm(model="gpt-4o", provider="openai")
def call_llm(prompt: str) -> dict:
    return {"text": "...", "model": "gpt-4o", "input_tokens": 10, "output_tokens": 25}

@trace_agent(name="research-agent")
def agent(task: str) -> str:
    results = search(task)
    return call_llm(f"Summarize: {results}")["text"]
```
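Conceptually, a tracing decorator like the ones above wraps each call in a span that records name, duration, and outcome. Here is a minimal stdlib-only sketch of that pattern – recording into a plain list rather than a real OTel exporter, with names (`trace`, `SPANS`) that are illustrative, not the SpecOps internals:

```python
import functools
import time

SPANS: list[dict] = []  # stand-in for an OTel span exporter

def trace(kind: str, name: str):
    """Decorator that records one 'span' per call: kind, name, duration, status."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                SPANS.append({
                    "kind": kind,
                    "name": name,
                    "duration_s": time.perf_counter() - start,
                    "status": status,
                })
        return wrapper
    return decorator

@trace("tool", "search")
def search(query: str) -> list[str]:
    return ["result1", "result2"]

search("llm reliability")
print(SPANS[0]["name"], SPANS[0]["status"])  # search ok
```

The real SDK emits OpenTelemetry spans instead of appending to a list, which is what lets any OTel backend consume them.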
### Record & Replay

```python
from specops_ai import replayable, recording, replaying

@replayable
def call_llm(prompt: str) -> str:
    return "..."  # Your LLM call

# Record
with recording(session_id="session-1", seed=42) as session:
    result = call_llm("What is 2+2?")

# Replay deterministically
with replaying("session-1"):
    same_result = call_llm("What is 2+2?")  # Identical output
```
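The core idea behind deterministic replay is a "tape": in record mode the live result of each call is stored under a key derived from its arguments, and in replay mode that stored result is returned instead of making the live call. A minimal stdlib-only sketch of the mechanism (the `Session` class and its attributes are illustrative, not the SpecOps API):

```python
import json

class Session:
    """Record calls on the first run; replay cached results deterministically later."""

    def __init__(self):
        self.tape: dict[str, str] = {}
        self.mode = "record"

    def replayable(self, fn):
        def wrapper(prompt: str) -> str:
            key = json.dumps([fn.__name__, prompt])  # deterministic key per call
            if self.mode == "replay":
                return self.tape[key]    # no live call: identical output guaranteed
            result = fn(prompt)
            self.tape[key] = result      # record the live result
            return result
        return wrapper

session = Session()

@session.replayable
def call_llm(prompt: str) -> str:
    return f"answer to {prompt!r}"  # stand-in for a nondeterministic LLM call

first = call_llm("What is 2+2?")   # recorded
session.mode = "replay"
assert call_llm("What is 2+2?") == first  # replayed, identical
```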
### Self-Healing

```python
from specops_ai import self_healing, RetryPolicy, FallbackPolicy

@self_healing(
    retry=RetryPolicy(max_retries=3, base_delay=0.5),
    fallback=FallbackPolicy(fallback_fn=backup_llm),
)
def call_llm(prompt: str) -> str:
    ...  # Auto-retries, falls back if exhausted
```
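The retry-then-fallback pattern itself is simple: retry with exponential backoff, and hand off to a fallback function once retries are exhausted. A stdlib-only sketch of the pattern, assuming nothing about the SpecOps implementation (the decorator name and parameters here mirror the README example but are illustrative):

```python
import functools
import time

def self_healing(max_retries: int = 3, base_delay: float = 0.5, fallback=None):
    """Retry with exponential backoff; call the fallback once retries are exhausted."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        if fallback is not None:
                            return fallback(*args, **kwargs)
                        raise
                    time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
        return wrapper
    return decorator

calls = {"n": 0}

@self_healing(max_retries=2, base_delay=0.0, fallback=lambda p: "fallback answer")
def flaky_llm(prompt: str) -> str:
    calls["n"] += 1
    raise RuntimeError("provider down")

print(flaky_llm("hi"))  # fallback answer (after 3 failed attempts)
```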
### Simulation Sandbox

```python
from specops_ai import simulation

with simulation("loop-test", max_steps=50, loop_threshold=3) as sim:
    for action in agent_actions:
        event = sim.record("my-agent", action)
        if event.anomaly:
            print(f"Detected: {event.anomaly.value}")

result = sim.stop()
assert result.passed
```
### Multi-Agent Coordination

```python
from specops_ai import check_consensus, check_divergence, AgentOutput, BehaviorTrace

# Consensus check
result = check_consensus([
    AgentOutput(agent="a", output="yes"),
    AgentOutput(agent="b", output="yes"),
    AgentOutput(agent="c", output="no"),
], quorum=0.6)

# Divergence detection
result = check_divergence([
    BehaviorTrace(agent="a", actions=["search", "summarize", "respond"]),
    BehaviorTrace(agent="b", actions=["search", "summarize", "respond"]),
], max_edit_distance=2)
```
### Evaluation

```python
from specops_ai import eval_golden_set, EvalCase, llm_judge

results = eval_golden_set(
    agent_fn=my_agent,
    cases=[EvalCase(input="2+2", expected="4")],
)

verdict = llm_judge(output, criteria="correctness", judge_fn=my_llm)
```
### RCA Graph

```python
from specops_ai import build_rca_graph, to_dot

graph = build_rca_graph(spans)
print(f"Root causes: {[n.name for n in graph.root_causes]}")
dot_output = to_dot(graph, title="Failure Analysis")
```
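One common way to derive root causes from a span tree is to treat the deepest failures as the causes: a failed span none of whose children also failed. A stdlib-only sketch of that heuristic, assuming nothing about the SpecOps algorithm (the span dicts and `root_causes` helper are illustrative):

```python
def root_causes(spans: list[dict]) -> list[str]:
    """Failed spans with no failed children: the deepest points of failure."""
    failed = {s["id"] for s in spans if s["status"] == "error"}
    # Parents of failed spans also appear failed, but are symptoms, not causes.
    has_failed_child = {s["parent"] for s in spans
                        if s["id"] in failed and s["parent"] is not None}
    return sorted(failed - has_failed_child)

spans = [
    {"id": "agent", "parent": None,    "status": "error"},  # failed because child failed
    {"id": "tool",  "parent": "agent", "status": "error"},  # the actual root cause
    {"id": "llm",   "parent": "agent", "status": "ok"},
]
print(root_causes(spans))  # ['tool']
```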
> ⚠️ SpecOps is in early development (v0.2.0). APIs may change. See the Roadmap.
## Features

| Category | Status | Description |
|---|---|---|
| OTel Tracing | ✅ | Trace agent runs, tool calls, and LLM requests with OpenTelemetry spans |
| Replay Engine | ✅ | Record and replay agent sessions deterministically |
| Eval Harness | ✅ | Golden-set comparison + LLM-as-judge for behavioral evaluation |
| Self-Healing | ✅ | Retry with backoff, fallback chains, escalation, memory pruning |
| RCA Graphs | ✅ | Root-cause analysis from OTel spans, Graphviz DOT export |
| Simulation Sandbox | ✅ | Test for loops, drift, cascades, and token overflow in a sandbox |
| Coordination Checks | ✅ | Consensus, memory integrity, and divergence detection for multi-agent systems |
| Framework Adapters | ✅ | LangGraph, CrewAI, AutoGen adapters (auto-detected) |
## Simulation Sandbox

The simulation sandbox lets you test agent behaviors in a controlled environment before they hit production:

- Loop detection – catch agents stuck repeating the same action
- Budget enforcement – set max steps, duration, and token limits
- Cascade testing – simulate failure propagation across agent pipelines
- OTel integration – all simulation events produce spans for analysis
```python
from specops_ai import simulate, SimulationEnvironment

@simulate("my-scenario", max_steps=100, token_budget=10000)
def test_agent(sim: SimulationEnvironment):
    for task in tasks:
        sim.record("agent", task)
        sim.add_tokens(500)
```
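Loop detection, the first check listed above, reduces to spotting when the same action repeats a threshold number of times in a row. A minimal stdlib-only sketch of that idea (the `detect_loop` helper is illustrative, not the sandbox's actual detector, which may also catch longer repeating cycles):

```python
def detect_loop(actions: list[str], threshold: int = 3) -> bool:
    """True if any action repeats `threshold` or more times consecutively."""
    streak = 1
    for prev, cur in zip(actions, actions[1:]):
        streak = streak + 1 if cur == prev else 1
        if streak >= threshold:
            return True
    return False

assert detect_loop(["search", "search", "search"], threshold=3)       # stuck
assert not detect_loop(["search", "summarize", "respond"], threshold=3)  # healthy
```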
## Multi-Agent Coordination

Built-in checks for multi-agent systems:

| Check | Purpose |
|---|---|
| `check_consensus()` | Verify agents agree on outputs (configurable quorum) |
| `check_memory_integrity()` | Detect state divergence and stale reads |
| `check_divergence()` | Flag behavioral drift via edit distance |
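The first and last checks in the table rest on two simple computations: a quorum vote over agent outputs, and Levenshtein edit distance over action sequences. A stdlib-only sketch of both, assuming nothing about the SpecOps implementation (function names and signatures here are illustrative):

```python
from collections import Counter

def consensus_reached(outputs: dict[str, str], quorum: float = 0.6) -> bool:
    """True if the most common output covers at least `quorum` of the agents."""
    top_count = Counter(outputs.values()).most_common(1)[0][1]
    return top_count / len(outputs) >= quorum

def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two action sequences (single-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # delete x
                           cur[j - 1] + 1,         # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]

# 2 of 3 agents say "yes": 0.67 >= 0.6, so consensus holds
assert consensus_reached({"a": "yes", "b": "yes", "c": "no"}, quorum=0.6)
# One differing action out of two: distance 1
assert edit_distance(["search", "summarize"], ["search", "respond"]) == 1
```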
## Architecture

```
┌─────────────────────────────────────────────┐
│              Your Agent Code                │
│     (LangChain / CrewAI / Custom / etc.)    │
├─────────────────────────────────────────────┤
│             SpecOps SDK Layer               │
│   trace · eval · replay · heal · simulate   │
├─────────────────────────────────────────────┤
│           OpenTelemetry Protocol            │
│            spans · metrics · logs           │
├─────────────────────────────────────────────┤
│              Any OTel Backend               │
│       Jaeger · Grafana · Datadog · etc.     │
└─────────────────────────────────────────────┘
```
## Project Structure

```
specops/
├── src/specops_ai/        # Core library
│   ├── trace.py           # OTel tracing decorators
│   ├── replay.py          # Record/replay engine
│   ├── eval.py            # Evaluation harness
│   ├── heal.py            # Self-healing policies
│   ├── simulate.py        # Simulation sandbox
│   ├── coordinate.py      # Multi-agent coordination
│   ├── rca.py             # Root-cause analysis
│   └── adapters/          # Framework adapters
├── tests/                 # Test suite (120+ tests)
├── examples/              # Usage examples
├── docs/specs/            # Specifications
└── pyproject.toml         # Build config (hatch + ruff + pytest)
```
## Contributing

We use spec-driven development – every feature starts as a specification before code is written. See CONTRIBUTING.md for the full workflow.

```bash
# Setup
uv sync

# Run tests
uv run pytest

# Lint & format
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Type check
uv run mypy src/
```
## License