
๐Ÿ› ๏ธ SpecOps

Agent Reliability Kit

Framework-agnostic, OTel-native toolkit for reliable, evaluatable, debuggable, and self-healing LLM agents in production.


Getting Started • Features • Simulation • Coordination • Roadmap • Contributing


The Problem

LLM agents fail silently. They hallucinate, loop, drift off-task, and degrade without warning. Teams building agentic systems today lack:

  • Observability — No standardized way to trace agent reasoning, tool calls, and decision paths
  • Evaluation — No framework-agnostic way to measure if agents actually do what they're supposed to
  • Debugging — When agents fail, root-cause analysis is guesswork
  • Self-healing — Agents crash and stay crashed; no recovery patterns exist
  • Simulation — No way to test for emergent failures before they hit production

Getting Started

Installation

pip install specops-ai

With framework adapters:

pip install specops-ai[langgraph]   # LangGraph support
pip install specops-ai[crewai]      # CrewAI support
pip install specops-ai[all]         # All adapters

One-Line Quickstart

from specops_ai import trace_agent

@trace_agent(name="my-agent")
def agent(task: str) -> str:
    return "done"  # Your agent logic โ€” now fully traced via OTel

Trace Any Agent

from specops_ai import trace_agent, trace_tool, trace_llm

@trace_tool(name="search")
def search(query: str) -> list[str]:
    return ["result1", "result2"]

@trace_llm(model="gpt-4o", provider="openai")
def call_llm(prompt: str) -> dict:
    return {"text": "...", "model": "gpt-4o", "input_tokens": 10, "output_tokens": 25}

@trace_agent(name="research-agent")
def agent(task: str) -> str:
    results = search(task)
    return call_llm(f"Summarize: {results}")["text"]

Record & Replay

from specops_ai import replayable, recording, replaying

@replayable
def call_llm(prompt: str) -> str:
    return "..."  # Your LLM call

# Record
with recording(session_id="session-1", seed=42) as session:
    result = call_llm("What is 2+2?")

# Replay deterministically
with replaying("session-1"):
    same_result = call_llm("What is 2+2?")  # Identical output

Self-Healing

from specops_ai import self_healing, RetryPolicy, FallbackPolicy

def backup_llm(prompt: str) -> str:
    ...  # Your fallback model call, used once retries are exhausted

@self_healing(
    retry=RetryPolicy(max_retries=3, base_delay=0.5),
    fallback=FallbackPolicy(fallback_fn=backup_llm),
)
def call_llm(prompt: str) -> str:
    ...  # Auto-retries, falls back if exhausted

Simulation Sandbox

from specops_ai import simulation

with simulation("loop-test", max_steps=50, loop_threshold=3) as sim:
    for action in agent_actions:
        event = sim.record("my-agent", action)
        if event.anomaly:
            print(f"Detected: {event.anomaly.value}")
    result = sim.stop()
    assert result.passed

Multi-Agent Coordination

from specops_ai import check_consensus, check_divergence, AgentOutput, BehaviorTrace

# Consensus check
result = check_consensus([
    AgentOutput(agent="a", output="yes"),
    AgentOutput(agent="b", output="yes"),
    AgentOutput(agent="c", output="no"),
], quorum=0.6)

# Divergence detection
result = check_divergence([
    BehaviorTrace(agent="a", actions=["search", "summarize", "respond"]),
    BehaviorTrace(agent="b", actions=["search", "summarize", "respond"]),
], max_edit_distance=2)
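
The divergence check compares each agent's action sequence by edit distance. For intuition, here is that metric in plain Python (a standalone sketch, not SpecOps's internal implementation):

def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two action sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# The two identical traces above differ by 0 edits, well under max_edit_distance=2.
assert edit_distance(["search", "summarize", "respond"],
                     ["search", "summarize", "respond"]) == 0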

Evaluation

from specops_ai import eval_golden_set, EvalCase, llm_judge

results = eval_golden_set(
    agent_fn=my_agent,
    cases=[EvalCase(input="2+2", expected="4")],
)

verdict = llm_judge(output, criteria="correctness", judge_fn=my_llm)

RCA Graph

from specops_ai import build_rca_graph, to_dot

graph = build_rca_graph(spans)
print(f"Root causes: {[n.name for n in graph.root_causes]}")
dot_output = to_dot(graph, title="Failure Analysis")
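
The DOT string can be rendered with any Graphviz toolchain; for instance, with the third-party graphviz Python package (an environment assumption, not part of SpecOps):

import graphviz  # pip install graphviz; also requires the Graphviz system binaries

graphviz.Source(dot_output).render("rca_graph", format="png", cleanup=True)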

โš ๏ธ SpecOps is in early development (v0.2.0). APIs may change. See the Roadmap.

Features

Category Status Description
OTel Tracing ✅ Trace agent runs, tool calls, LLM requests with OpenTelemetry spans
Replay Engine ✅ Record and replay agent sessions deterministically
Eval Harness ✅ Golden-set comparison + LLM-as-judge for behavioral evaluation
Self-Healing ✅ Retry with backoff, fallback chains, escalation, memory pruning
RCA Graphs ✅ Root-cause analysis from OTel spans, Graphviz DOT export
Simulation Sandbox ✅ Test for loops, drift, cascades, and token overflow in a sandbox
Coordination Checks ✅ Consensus, memory integrity, and divergence detection for multi-agent systems
Framework Adapters ✅ LangGraph, CrewAI, AutoGen adapters (auto-detected)

Simulation Sandbox

The simulation sandbox lets you test agent behaviors in a controlled environment before they hit production:

  • Loop detection — Catch agents stuck repeating the same action
  • Budget enforcement — Set max steps, duration, and token limits
  • Cascade testing — Simulate failure propagation across agent pipelines
  • OTel integration — All simulation events produce spans for analysis

from specops_ai import simulate, SimulationEnvironment

@simulate("my-scenario", max_steps=100, token_budget=10000)
def test_agent(sim: SimulationEnvironment):
    for task in tasks:
        sim.record("agent", task)
        sim.add_tokens(500)

Multi-Agent Coordination

Built-in checks for multi-agent systems:

Check Purpose
check_consensus() Verify agents agree on outputs (configurable quorum)
check_memory_integrity() Detect state divergence and stale reads
check_divergence() Flag behavioral drift via edit distance

Architecture

┌────────────────────────────────────────────┐
│              Your Agent Code               │
│  (LangChain / CrewAI / Custom / etc.)      │
├────────────────────────────────────────────┤
│            SpecOps SDK Layer               │
│  trace · eval · replay · heal · simulate   │
├────────────────────────────────────────────┤
│         OpenTelemetry Protocol             │
│  spans · metrics · logs                    │
├────────────────────────────────────────────┤
│           Any OTel Backend                 │
│  Jaeger · Grafana · Datadog · etc.         │
└────────────────────────────────────────────┘
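
Because the SDK layer speaks plain OTel, wiring traces to a backend is the standard OTLP exporter setup rather than a SpecOps-specific integration. A sketch assuming an OTLP-capable collector (Jaeger, Grafana, Datadog Agent, etc.) listening locally on the default gRPC port:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)  # all SpecOps spans now ship to the collector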

Project Structure

specops/
├── src/specops_ai/       # Core library
│   ├── trace.py          # OTel tracing decorators
│   ├── replay.py         # Record/replay engine
│   ├── eval.py           # Evaluation harness
│   ├── heal.py           # Self-healing policies
│   ├── simulate.py       # Simulation sandbox
│   ├── coordinate.py     # Multi-agent coordination
│   ├── rca.py            # Root-cause analysis
│   └── adapters/         # Framework adapters
├── tests/                # Test suite (120+ tests)
├── examples/             # Usage examples
├── docs/specs/           # Specifications
└── pyproject.toml        # Build config (hatch + ruff + pytest)

Contributing

We use spec-driven development — every feature starts as a specification before code is written. See CONTRIBUTING.md for the full workflow.

# Setup
uv sync

# Run tests
uv run pytest

# Lint & format
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Type check
uv run mypy src/

License

MIT
