Framework-agnostic, OTel-native toolkit for reliable, evaluatable, debuggable, and self-healing LLM agents in production
Getting Started • Features • Examples • Simulation • Coordination • Roadmap • Contributing
SpecOps AI is a lightweight, framework-agnostic, OpenTelemetry-native toolkit that makes LLM agents and multi-agent systems truly reliable in production.
It gives you powerful primitives — deterministic replay, behavioral evaluation, self-healing policies, root-cause analysis graphs, simulation sandbox, and coordination checks — so you can stop worrying about hallucinations, infinite loops, memory drift, and mysterious failures that break most agents outside the lab.
Whether you're a new engineer just getting started with agents, an experienced builder shipping complex multi-agent workflows with LangGraph, CrewAI, AutoGen or Strands, or an enterprise team that needs production-grade observability and resilience, SpecOps provides the missing "reliability layer" that turns fragile demos into trustworthy systems.
Zero-config decorators, beautiful examples across OpenAI, Anthropic, and Grok, and full MIT-licensed freedom — install it in seconds and start building agents you can actually trust.
The Problem
LLM agents fail silently. They hallucinate, loop, drift off-task, and degrade without warning. Teams building agentic systems today lack:
- Observability — No standardized way to trace agent reasoning, tool calls, and decision paths
- Evaluation — No framework-agnostic way to measure if agents actually do what they're supposed to
- Debugging — When agents fail, root-cause analysis is guesswork
- Self-healing — Agents crash and stay crashed; no recovery patterns exist
- Simulation — No way to test for emergent failures before they hit production
Getting Started
Installation
```bash
pip install specops-ai
```
With framework adapters:
```bash
pip install specops-ai[langgraph]   # LangGraph support
pip install specops-ai[crewai]      # CrewAI support
pip install specops-ai[strands]     # Strands support
pip install specops-ai[all]         # All adapters
```
One-Line Quickstart
```python
from specops_ai import trace_agent

@trace_agent(name="my-agent")
def agent(task: str) -> str:
    return "done"  # Your agent logic — now fully traced via OTel
```
Trace Any Agent
```python
from specops_ai import trace_agent, trace_tool, trace_llm

@trace_tool(name="search")
def search(query: str) -> list[str]:
    return ["result1", "result2"]

@trace_llm(model="gpt-4o", provider="openai")
def call_llm(prompt: str) -> dict:
    return {"text": "...", "model": "gpt-4o", "input_tokens": 10, "output_tokens": 25}

@trace_agent(name="research-agent")
def agent(task: str) -> str:
    results = search(task)
    return call_llm(f"Summarize: {results}")["text"]
```
Record & Replay
```python
from specops_ai import replayable, recording, replaying

@replayable
def call_llm(prompt: str) -> str:
    return "..."  # Your LLM call

# Record
with recording(session_id="session-1", seed=42) as session:
    result = call_llm("What is 2+2?")

# Replay deterministically
with replaying("session-1"):
    same_result = call_llm("What is 2+2?")  # Identical output
```
Self-Healing
```python
from specops_ai import self_healing, RetryPolicy, FallbackPolicy

def backup_llm(prompt: str) -> str:
    ...  # Secondary model used once retries are exhausted

@self_healing(
    retry=RetryPolicy(max_retries=3, base_delay=0.5),
    fallback=FallbackPolicy(fallback_fn=backup_llm),
)
def call_llm(prompt: str) -> str:
    ...  # Auto-retries, falls back if exhausted
```
Simulation Sandbox
```python
from specops_ai import simulation

agent_actions = ["search", "summarize", "respond"]  # sample action stream

with simulation("loop-test", max_steps=50, loop_threshold=3) as sim:
    for action in agent_actions:
        event = sim.record("my-agent", action)
        if event.anomaly:
            print(f"Detected: {event.anomaly.value}")
    result = sim.stop()
    assert result.passed
```
Multi-Agent Coordination
```python
from specops_ai import check_consensus, check_divergence, AgentOutput, BehaviorTrace

# Consensus check
result = check_consensus([
    AgentOutput(agent="a", output="yes"),
    AgentOutput(agent="b", output="yes"),
    AgentOutput(agent="c", output="no"),
], quorum=0.6)

# Divergence detection
result = check_divergence([
    BehaviorTrace(agent="a", actions=["search", "summarize", "respond"]),
    BehaviorTrace(agent="b", actions=["search", "summarize", "respond"]),
], max_edit_distance=2)
```
Evaluation
```python
from specops_ai import eval_golden_set, EvalCase, llm_judge

results = eval_golden_set(
    agent_fn=my_agent,  # your agent callable
    cases=[EvalCase(input="2+2", expected="4")],
)

# output: an agent response to grade; my_llm: your judge-model callable
verdict = llm_judge(output, criteria="correctness", judge_fn=my_llm)
```
RCA Graph
```python
from specops_ai import build_rca_graph, to_dot

graph = build_rca_graph(spans)  # spans: OTel spans collected from a failed run
print(f"Root causes: {[n.name for n in graph.root_causes]}")
dot_output = to_dot(graph, title="Failure Analysis")
```
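The returned DOT string can be rendered with the standard Graphviz CLI, for example `dot -Tsvg analysis.dot -o analysis.svg`.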
⚠️ SpecOps is in early development (v0.3.1). APIs may change. See the Roadmap.
Features
| Category | Status | Description |
|---|---|---|
| OTel Tracing | ✅ | Trace agent runs, tool calls, LLM requests with OpenTelemetry spans |
| Replay Engine | ✅ | Record and replay agent sessions deterministically |
| Eval Harness | ✅ | Golden-set comparison + LLM-as-judge for behavioral evaluation |
| Self-Healing | ✅ | Retry with backoff, fallback chains, escalation, memory pruning |
| RCA Graphs | ✅ | Root-cause analysis from OTel spans, Graphviz DOT export |
| Simulation Sandbox | ✅ | Test for loops, drift, cascades, and token overflow in a sandbox |
| Coordination Checks | ✅ | Consensus, memory integrity, and divergence detection for multi-agent systems |
| Framework Adapters | ✅ | LangGraph, CrewAI, AutoGen, Strands adapters (auto-detected) |
Simulation Sandbox
The simulation sandbox lets you test agent behaviors in a controlled environment before they hit production:
- Loop detection — Catch agents stuck repeating the same action
- Budget enforcement — Set max steps, duration, and token limits
- Cascade testing — Simulate failure propagation across agent pipelines
- OTel integration — All simulation events produce spans for analysis
```python
from specops_ai import simulate, SimulationEnvironment

tasks = ["plan", "search", "summarize"]  # sample workload

@simulate("my-scenario", max_steps=100, token_budget=10000)
def test_agent(sim: SimulationEnvironment):
    for task in tasks:
        sim.record("agent", task)
        sim.add_tokens(500)
```
Multi-Agent Coordination
Built-in checks for multi-agent systems:
| Check | Purpose |
|---|---|
| `check_consensus()` | Verify agents agree on outputs (configurable quorum) |
| `check_memory_integrity()` | Detect state divergence and stale reads (sketched below) |
| `check_divergence()` | Flag behavioral drift via edit distance |
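The consensus and divergence checks are demonstrated in the quickstart above. `check_memory_integrity()` is not, so here is a hypothetical sketch; the per-agent snapshot records and the call shape are assumptions for illustration, not the confirmed API:

```python
from specops_ai import check_memory_integrity

# Hypothetical usage: the list-of-dicts input below is an assumption, not the
# confirmed API. Each agent reports its view of shared state, and the check
# flags snapshots that have diverged from the rest.
result = check_memory_integrity([
    {"agent": "a", "memory": {"ticket_id": 42, "status": "open"}},
    {"agent": "b", "memory": {"ticket_id": 42, "status": "open"}},
    {"agent": "c", "memory": {"ticket_id": 42, "status": "closed"}},  # stale read
])
```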
Architecture
```
┌─────────────────────────────────────────────┐
│               Your Agent Code               │
│    (LangChain / CrewAI / Custom / etc.)     │
├─────────────────────────────────────────────┤
│              SpecOps SDK Layer              │
│   trace · eval · replay · heal · simulate   │
├─────────────────────────────────────────────┤
│           OpenTelemetry Protocol            │
│           spans · metrics · logs            │
├─────────────────────────────────────────────┤
│               Any OTel Backend              │
│      Jaeger · Grafana · Datadog · etc.      │
└─────────────────────────────────────────────┘
```
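Because SpecOps speaks plain OpenTelemetry, you can route its spans to any backend with the stock OTel SDK. Below is a minimal sketch using the official `opentelemetry-sdk` and OTLP exporter packages; it assumes SpecOps picks up the global tracer provider, which the layering above implies:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Standard OTel wiring: batch spans and ship them to any OTLP endpoint
# (Jaeger, Grafana, Datadog, and so on).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

# From here on, spans created by @trace_agent / @trace_tool / @trace_llm
# flow through this provider to the configured backend.
```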
Project Structure
```
specops/
├── src/specops_ai/        # Core library
│   ├── trace.py           # OTel tracing decorators
│   ├── replay.py          # Record/replay engine
│   ├── eval.py            # Evaluation harness
│   ├── heal.py            # Self-healing policies
│   ├── simulate.py        # Simulation sandbox
│   ├── coordinate.py      # Multi-agent coordination
│   ├── rca.py             # Root-cause analysis
│   └── adapters/          # Framework adapters
├── tests/                 # Test suite (120+ tests)
├── examples/              # Usage examples
│   ├── providers/         # Provider-specific (require API keys)
│   │   ├── openai/        # OpenAI / LangGraph examples
│   │   ├── anthropic/     # Anthropic examples (coming soon)
│   │   └── grok/          # Grok examples (coming soon)
│   └── shared/            # Shared utilities (key loading, graceful skip)
├── docs/specs/            # Specifications
└── pyproject.toml         # Build config (hatch + ruff + pytest)
```
Running the Examples
SpecOps ships with a rich set of examples covering every module. All examples run with a single command — no complex setup required.
Quick Start
```bash
# 1. Install the package
uv sync

# 2. Run any core example immediately (no API keys needed)
uv run examples/plain_agent.py
```
Core Examples (No API Key Required)
These examples demonstrate SpecOps features using mocked LLM calls — perfect for learning and CI:
| Example | Module | Description |
|---|---|---|
| `plain_agent.py` | Tracing | Simple research agent with search + LLM tracing |
| `async_pipeline.py` | Tracing | Async multi-agent pipeline with nested spans |
| `langgraph_agent.py` | Adapters | StateGraph-style agent with tool routing |
| `crewai_agent.py` | Adapters | Multi-agent crew (researcher + writer) |
| `replay_basic.py` | Replay | Record and replay agent sessions deterministically |
| `replay_async_eval.py` | Replay + Eval | Async replay with evaluation harness |
| `eval_golden_set.py` | Eval | Golden-set evaluation with LLM-as-judge |
| `self_healing_basic.py` | Heal | Retry and fallback policies |
| `self_healing_advanced.py` | Heal | Escalation and memory pruning strategies |
| `rca_analysis.py` | RCA | Root-cause analysis graph from OTel spans |
| `simulation_loops.py` | Simulation | Detect agent loops in a sandbox |
| `simulation_cascade.py` | Simulation | Test cascading failures across agents |
| `simulation_demo.py` | Simulation | Full simulation sandbox walkthrough |
| `multi_agent_coordination.py` | Coordination | Consensus voting and divergence detection |
```bash
# Run any core example
uv run examples/replay_basic.py
uv run examples/self_healing_advanced.py
uv run examples/simulation_demo.py
```
Provider Examples (API Key Required)
Provider examples connect to real LLM APIs. Each provider directory contains the same five examples for easy comparison:
| Example | Framework | Description |
|---|---|---|
| `basic_agent.py` | Direct API | Simple traced agent call |
| `langgraph_agent.py` | LangGraph | StateGraph agent with tool routing |
| `crewai_agent.py` | CrewAI | Multi-agent crew orchestration |
| `autogen_agent.py` | AutoGen | Multi-agent conversation |
| `strands_agent.py` | Strands | Tool-use agent with Strands SDK |
Available Providers
| Provider | Directory | Required Key |
|---|---|---|
| OpenAI | `examples/providers/openai/` | `OPENAI_API_KEY` |
| Anthropic | `examples/providers/anthropic/` | `ANTHROPIC_API_KEY` |
| Grok (xAI) | `examples/providers/grok/` | `GROK_API_KEY` |
Setup
```bash
# 1. Copy the environment template
cp .env.example .env

# 2. Add your API key(s) — only the providers you need
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# GROK_API_KEY=xai-...

# 3. Run a provider example
uv run examples/providers/openai/basic_agent.py
uv run examples/providers/anthropic/langgraph_agent.py
uv run examples/providers/grok/crewai_agent.py
uv run examples/providers/openai/strands_agent.py
```
💡 Provider examples exit gracefully with a helpful message if the required API key is missing.
Mock Mode (No API Key Needed)
Run any provider example without a real API key using mock mode — ideal for CI pipelines and quick testing:
```bash
SPECOPS_EXAMPLE_MODE=mock uv run examples/providers/openai/langgraph_agent.py
SPECOPS_EXAMPLE_MODE=mock uv run examples/providers/anthropic/autogen_agent.py
SPECOPS_EXAMPLE_MODE=mock uv run examples/providers/grok/strands_agent.py
```
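The bundled examples get this behavior from `examples/shared`; if you want the same pattern in your own scripts, it looks roughly like the sketch below. This is illustrative only: `get_client` and `MockClient` are hypothetical names, not the actual `examples/shared` API.

```python
import os
import sys

class MockClient:
    """Canned-response stand-in used when SPECOPS_EXAMPLE_MODE=mock."""
    def complete(self, prompt: str) -> str:
        return f"[mock] response to: {prompt}"

def get_client():
    # Honor the same env var the bundled examples use.
    if os.environ.get("SPECOPS_EXAMPLE_MODE") == "mock":
        return MockClient()
    if not os.environ.get("OPENAI_API_KEY"):
        # Skip gracefully instead of crashing mid-run, as the examples do.
        sys.exit("OPENAI_API_KEY not set. Add it to .env or rerun with SPECOPS_EXAMPLE_MODE=mock.")
    from openai import OpenAI  # real SDK imported lazily; assumes the OpenAI provider
    return OpenAI()
```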
Viewing Traces
By default, traces are printed to the console. To send traces to an OTel-compatible backend like Jaeger:
```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
uv run examples/plain_agent.py
```
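If the endpoint points at a local Jaeger all-in-one instance, the resulting traces appear in the Jaeger UI, which serves on http://localhost:16686 by default.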
Contributing
We use spec-driven development — every feature starts as a specification before code is written. See CONTRIBUTING.md for the full workflow.
```bash
# Setup
uv sync

# Run tests
uv run pytest

# Lint & format
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Type check
uv run mypy src/
```
License
MIT. See the LICENSE file for details.