Reliability library for LLM agents you build (LangGraph / custom loops): detects meltdowns in real time and intercepts them before they spiral.

Project description

Sotis

Sotis watches your LLM agent and catches it before it spirals.

[Demo GIF: Sotis intercepting a live agent meltdown]

Live run: a real LLM agent spirals on a buggy codebase. Sotis detects three meltdowns in real time, intercepts each one, and degrades gracefully (Graceful Degradation Score, GDS: 1.0 → 0.4) instead of letting the agent burn.

📄 Full raw terminal transcript of this run: run_groq_llama70b_meltdown_20260601.txt

pip install sotis

Long-running agents fail in predictable ways — they loop on the same tool calls, flood their context with error traces, and spiral until the task collapses. Sotis detects these failure patterns in real time and transparently resets execution before they take hold.

Based on "Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents" (arXiv:2603.29231, April 2026)

Who it's for

Sotis is a library you add to agents you build. It lives inside your agent's execution loop, so it can do more than observe — it intercepts a spiraling agent, rolls back the files it touched, distills the context, and resumes it.

✅ LangGraph agents — drop in SotisLangGraphGuard as a guard node (USAGE.md)
✅ Custom ReAct / tool-calling loops — wrap your loop with SotisGuard
✅ Any LLM provider — OpenAI, Anthropic, DeepSeek, Google, or OpenAI-compatible endpoints (Groq, OpenRouter)

Sotis is not a plugin for closed agents you don't control — e.g. Claude Code or Codex. Those expose no hook into their loop, so the rollback/reset intervention isn't possible there. If you're building the agent, Sotis fits; if you're using someone else's finished agent, it doesn't.

Scope & limitations

Being clear about what Sotis does not do is as important as what it does:

Reliability, not adversarial security. Sotis assumes the agent is trying to do the right thing and degrading. Entropy/loop detection is a statistical signal, and any threshold-based signal is gameable by definition — an attacker who knows the logic can craft an injection that stays under it. Sotis is a circuit breaker for accidental failure, not a defense against a deliberate adversary. Pair it with a real injection/intent layer if that's your threat model.
It catches loud failures, not quiet corruption. Entropy + loop detection read the shape of the tool-call stream, so they catch the visible spiral — thrashing, repeats, edit storms. They are blind by construction to the quiet failure: an agent confidently proceeding on a state that silently went wrong several steps back (low entropy, no repeats, still corrupt). Catching that needs a semantic, world-state signal (goal-progress regression, contradiction detection) — on the roadmap, not shipped.
Checkpoint "goodness" can be invariant-verified (opt-in). By default Sotis rolls back to the last snapshot. Pass a checkpoint_invariant and rollback instead targets the most recent state that passed your invariant — so if a tool call silently corrupted state before the meltdown was visible, you don't roll back into the poison. A built-in python_imports_cleanly invariant ships (every tracked .py still parses); supply your own for other domains. See Tuning.
It bounds failure; it doesn't guarantee success. Sotis stops the spiral and hands you a clean, recoverable state — it does not make a weak model finish the task.

See the open issues for the roadmap addressing these.
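The invariant-verified rollback described in the list above can be pictured with a small sketch. It assumes a snapshot is a plain mapping of file path to file content and that an invariant is any callable returning a bool for such a mapping. The names Snapshot, python_sources_parse, and pick_rollback_target are hypothetical and are not part of the Sotis API; the library's real checkpoint objects may look different.

```python
import ast
from typing import Callable, Optional

# Hypothetical types for illustration; not Sotis's real checkpoint classes.
Snapshot = dict[str, str]                 # {relative path: file content}
Invariant = Callable[[Snapshot], bool]


def python_sources_parse(snapshot: Snapshot) -> bool:
    """Illustrative invariant: every tracked .py file must still parse."""
    for path, source in snapshot.items():
        if not path.endswith(".py"):
            continue
        try:
            ast.parse(source)
        except SyntaxError:
            return False
    return True


def pick_rollback_target(
    history: list[Snapshot], invariant: Optional[Invariant] = None
) -> Optional[Snapshot]:
    """Return the newest snapshot that passes the invariant.

    Without an invariant this falls back to the last snapshot,
    mirroring the default behaviour described in the text.
    """
    if not history:
        return None
    if invariant is None:
        return history[-1]
    for snapshot in reversed(history):
        if invariant(snapshot):
            return snapshot
    return None  # nothing passed; the caller decides what to do next


history = [
    {"src/main.py": "import math\n"},
    {"src/main.py": "import math\nprint(math.pi)\n"},
    {"src/main.py": "import math\nprint(math.pi\n"},  # corrupted: unclosed paren
]
default_target = pick_rollback_target(history)
verified_target = pick_rollback_target(history, python_sources_parse)
```

Here the default choice is the corrupted third snapshot, while the invariant-checked choice is the second one, which is the "don't roll back into the poison" behaviour in miniature.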

Architecture

An agent goal is decomposed into a DAG of subtasks and executed through the Sotis ReAct runtime or the LangGraph guard. Every step is fed to the core layer — the Shannon entropy monitor, the Jaccard/fingerprint loop detector, and the workspace density guard. When any of them flags a meltdown, the Checkpoint Manager rolls the workspace back to its last stable state, the Context Resetter distills a compact resumption prompt, and execution continues. Step telemetry streams to a JSON-L logger that the Streamlit dashboard renders.

Full architecture diagram, module descriptions, and design decisions: ARCHITECTURE.md
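To make the Jaccard/fingerprint loop detector more concrete, here is a minimal sketch of the general technique. It is not the Sotis implementation: the fingerprint format, the window size, the similarity cutoff, and the LoopDetector class are all assumptions chosen for illustration.

```python
from collections import deque


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fingerprint(tool: str, args: dict) -> set[str]:
    """Turn one tool call into a token set: the tool name plus key=value pairs."""
    return {f"tool={tool}"} | {f"{key}={value!r}" for key, value in args.items()}


class LoopDetector:
    """Flags a loop when the newest call closely matches enough recent calls."""

    def __init__(self, window: int = 5, similarity: float = 0.8, min_repeats: int = 2):
        self.recent: deque[set[str]] = deque(maxlen=window)
        self.similarity = similarity
        self.min_repeats = min_repeats

    def observe(self, tool: str, args: dict) -> bool:
        fp = fingerprint(tool, args)
        # Count earlier calls in the window that look (nearly) the same.
        repeats = sum(1 for old in self.recent if jaccard(fp, old) >= self.similarity)
        self.recent.append(fp)
        return repeats >= self.min_repeats


detector = LoopDetector()
calls = [
    ("write_file", {"path": "src/main.py", "content": "import math"}),
    ("run_tests", {"cmd": "pytest"}),
] * 3
flags = [detector.observe(tool, args) for tool, args in calls]
```

With these assumed settings the write/test pair is tolerated twice and flagged on the third repetition. A similarity cutoff below 1.0 also lets near-identical calls (for example, one changed argument out of many) count as repeats.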

Usage

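from sotis import SotisGuard

guard = SotisGuard()
max_steps = 50  # example step budget; `agent` and `tools` are your own objects

for step in range(max_steps):
    action = agent.decide()          # your agent picks the next tool call
    result = tools.execute(action)   # your tool layer runs it

    # Feed every step to Sotis; a truthy return value signals a meltdown
    meltdown = guard.watch(action.name, action.args, result.summary)

    if meltdown:
        guard.reset()  # rolls back files, distills context, resumes cleanly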

What it looks like in practice

Illustrative output of a healthy intercept-and-recover cycle. For a real, unedited run (where a weaker model spirals and Sotis intercepts 3 meltdowns), see the demo GIF and raw transcript above.

[Step 22] write_file -> {"path": "src/main.py", "content": "import math"} | SUCCESS
[Step 23] run_tests  -> {"cmd": "pytest"} | FAIL (ImportError)
[Step 24] write_file -> {"path": "src/main.py", "content": "import math"} | SUCCESS
[Step 25] run_tests  -> {"cmd": "pytest"} | FAIL (ImportError)

[WARNING]   Anomaly detected: Workspace edit storm and exact argument loops
[INTERCEPT] Sotis Meltdown Interception Triggered!
[RECOVER]   Restored workspace files to stable baseline (step 22 diff)
[RECOVER]   Distilled session context history (86% token savings)
[RESUME]    Injecting resumption briefing into agent context...

[Step 26] grep_search -> {"query": "math"} | Execution resumed cleanly

CLI

sotis dashboard    # Launch the Streamlit observability dashboard
sotis benchmark    # Run the empirical benchmark suite
sotis demo         # Run the built-in meltdown/recovery demo

The dashboard reads session telemetry from logs/. Point it elsewhere with sotis dashboard --logs <path> or the SOTIS_LOG_DIR environment variable. Set the same value when running your agent so that the agent writes to, and the dashboard reads from, the same directory regardless of where each is launched.

Full command reference and the LangGraph integration guide: USAGE.md

Tuning

The default entropy threshold (1.5 bits) is calibrated for agents that use 1-2 tools in tight loops. If your agent legitimately uses 3+ different tools in a short window, the default can fire false positives: an even mix of three tools has an entropy of log2(3) ≈ 1.585 bits, which exceeds 1.5.
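The arithmetic behind that false positive can be checked with a short worked example. It assumes the entropy is taken over tool names only, inside a 5-step window; Sotis may also weigh arguments or results, so treat the numbers as an illustration of the formula rather than of the library's exact signal. The tool names are placeholders.

```python
import math
from collections import Counter


def shannon_entropy(tools: list[str]) -> float:
    """Shannon entropy, in bits, of the tool-name distribution in a window."""
    counts = Counter(tools)
    total = len(tools)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


THRESHOLD = 1.5  # the default hard threshold quoted in the text

two_tools = ["write_file", "run_tests", "write_file", "run_tests", "write_file"]
three_skewed = ["read_file", "read_file", "read_file", "grep_search", "run_tests"]
three_even = ["read_file", "read_file", "grep_search", "grep_search", "run_tests"]

for name, window in [
    ("two tools", two_tools),
    ("three tools, skewed", three_skewed),
    ("three tools, near-even", three_even),
]:
    h = shannon_entropy(window)
    print(f"{name}: {h:.3f} bits, above threshold: {h > THRESHOLD}")
```

Under these assumptions only the near-even three-tool window (about 1.522 bits) crosses 1.5, while the two-tool loop (about 0.971 bits) and the skewed three-tool window (about 1.371 bits) stay below it. That is why a healthy agent that spreads its work across three or more tools can trip the default.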

Raise the threshold for multi-tool agents:

from sotis import SotisGuard
from sotis.core.entropy import EntropyConfig

guard = SotisGuard(entropy_config=EntropyConfig(hard_threshold=2.7))

Threshold	Behavior
`1.5` (default)	Catches tight loops fast. Will false-positive on diverse tool usage.
`2.0`	Good balance for agents using 3-4 tools regularly.
`2.7`	Permissive — only fires on genuine chaotic switching across 6+ tools.

Validated in the detection gauntlet: the default fired a false positive on healthy diverse work; raising the threshold to 2.7 eliminated it while keeping 100% true-positive detection.

Beyond the fixed threshold, three opt-in detectors (full details in USAGE.md):

Adaptive threshold — learns a per-agent baseline and triggers at mean + 2σ instead of a global number, so diverse agents stop false-positiving: EntropyConfig(adaptive=True).
Token-spike corroboration — flags a sudden jump in tokens/step (≥3× the rolling mean) as a corroborating signal, not a sole trigger: SotisGuard(token_spike_factor=3.0).
Invariant-verified checkpoints — rollback targets the most recent state that passed your invariant, not just the last snapshot (which may be the poisoned one): checkpoint_invariant=python_imports_cleanly. Ships a built-in Python invariant; supply any Callable[[dict[str,str]], bool].
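The first two detectors above reduce to simple statistics. The sketch below shows one way to compute a mean + 2σ cutoff and a 3× rolling-mean token check. The function names, the use of the population standard deviation, and the sample numbers are assumptions made for illustration, not details taken from Sotis.

```python
from statistics import mean, pstdev


def adaptive_threshold(baseline: list[float], k: float = 2.0) -> float:
    """Per-agent cutoff: mean of healthy entropy readings plus k standard deviations."""
    return mean(baseline) + k * pstdev(baseline)


def is_token_spike(history: list[int], current: int, factor: float = 3.0) -> bool:
    """True when the current step uses at least `factor` times the rolling mean."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return current >= factor * mean(history)


# Entropy readings (bits) collected while the agent was behaving normally.
healthy_entropy = [1.4, 1.6, 1.5, 1.7, 1.3]
cutoff = adaptive_threshold(healthy_entropy)

# Tokens consumed by the last few steps, then two candidate next steps.
recent_tokens = [800, 900, 1000, 900]
spike = is_token_spike(recent_tokens, 2700)
no_spike = is_token_spike(recent_tokens, 2000)
```

For this baseline the cutoff comes out near 1.78 bits, so an agent that routinely sits around 1.5 bits is no longer flagged by a fixed 1.5 threshold. The token check is kept as a separate boolean because the text describes it as a corroborating signal rather than a trigger on its own.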

Active Stabilization, Not Passive Tracing

Tools like LangSmith, Langfuse, and Helicone log what happened after your agent already spent $20 looping in production.

Sotis intervenes during execution. It intercepts spiraling tool calls, rolls back uncommitted file edits, distills conversation history, and redirects the model's reasoning loop — before the damage accumulates.

Capabilities

Capability	Description
Meltdown Detection	Sliding-window Shannon entropy + exact/semantic loop detection
Adaptive Threshold	Per-agent baseline (`mean + 2σ`) instead of a fixed global cutoff
Token-Spike Signal	Corroborates a meltdown when tokens/step suddenly jump
Workspace Density Guard	Detects infinite same-file edit cycles
Verified-Checkpoint Reset	Rolls back to a state proven good by your invariant, not just the last snapshot
Transparent Reset	Git-diff checkpointing + distilled context rebuild (~86% token savings)
Graceful Degradation	GDS scoring preserves partial progress across resets
LangGraph Integration	Native guard node — intercepts state, rolls back files
LLM Support	OpenAI, Anthropic, DeepSeek, Google, any OpenAI-compatible endpoint
Observability	Streamlit dashboard + structured JSON session logs

The Science

Sotis operationalizes the formal reliability framework from "Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents" (arXiv:2603.29231, April 2026).

Four key findings from the paper that Sotis directly addresses:

Meltdown Onset Point (MOP) — the paper quantifies the transition from coherent planning to chaotic looping via sliding-window Shannon entropy. Sotis implements this as a live runtime monitor over a 5-step window, with either a fixed or per-agent adaptive threshold.

Super-linear reliability decay — agent success rates decay faster than mathematically expected because errors are positively correlated across steps. A confused agent stays confused. Sotis acts as a circuit breaker that resets the error correlation coefficient by starting fresh from a verified checkpoint.

Episodic memory failures — the paper demonstrates that naive memory scaffolds universally degrade long-horizon performance by accumulating context overhead. Sotis uses controlled checkpointed resets instead of continuous memory accumulation.

Graceful Degradation Score (GDS) — rather than binary pass/fail, Sotis scores partial task completion using weighted subtask graphs, preserving measured progress across reset boundaries.
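As a rough illustration of partial-credit scoring in the spirit of GDS, the sketch below computes the weighted fraction of completed subtasks. It is a simplification under stated assumptions: it ignores the dependency edges of the subtask graph, and the function name, subtask names, and weights are hypothetical. The actual GDS definition is in the paper and in the Sotis source.

```python
def graceful_degradation_score(weights: dict[str, float], completed: set[str]) -> float:
    """Weighted fraction of subtasks finished, in the range [0, 1]."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    done = sum(weight for name, weight in weights.items() if name in completed)
    return done / total


# Hypothetical subtask weights for a bug-fix task.
weights = {
    "reproduce_bug": 1.0,
    "write_fix": 2.0,
    "add_tests": 1.0,
    "update_docs": 1.0,
}

full_score = graceful_degradation_score(weights, set(weights))
partial_score = graceful_degradation_score(weights, {"reproduce_bug", "add_tests"})
```

A run that finishes every subtask scores 1.0, while one that only reproduces the bug and adds tests scores 0.4. Because the score depends only on which subtasks are done, it can be carried across a reset without being recomputed from the discarded context.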

Performance

Metric	Result
Entropy + loop detection latency	< 0.2ms per step
Context distillation token reduction	86.14% (BPE cl100k_base)
Test suite	159 tests passing
Live recovery	Verified on circular-import and AST recursive-loop traps
Verified-checkpoint rollback (live)	On a real Groq Llama-3.3-70B run, rolled back to a verified-good checkpoint, not the corrupt snapshot (log)
Local model validation (mistral:latest via Ollama)	Caught a real TOOL_LOOP meltdown and corrected agent behavior after reset
Detection accuracy (6-scenario gauntlet)	100% true positive rate (0% false negatives)
Total API cost for full validation suite	< $0.01 (Groq free tier)

Full empirical ledger: performance_metrics.txt

Real-world validation logs: ExperimentLog/real_world_validation/

Local model run (Ollama — mistral:latest meltdown + gemma3:4b tool-binding notes): sotis_gemma_mistral_run_review.txt

Benchmarks

[Figure: Reliability Decay] Reliability decay in frontier LLM agents as task horizon grows (Khanal et al. 2026, arXiv:2603.29231)

[Figure: Token Reduction] Context distillation: token reduction after meltdown reset (measured with tiktoken BPE cl100k_base)

[Figure: Real Experiments] Real agent experiments: meltdown intercepts and outcomes (Gemini 3.5 · Groq Llama 70B · Mistral local · OpenRouter Gemini)

Project Structure

sotis/
  core/     # Entropy, loop detection, checkpoint, decomposition, GDS
  lib/      # ReAct runtime, LangGraph integration, LLM adapters
  obs/      # Streamlit dashboard + structured JSON logger
  bench/    # Benchmark harness and task generators

License

MIT

Project details

Release history

Version	Released
1.2.0 (this version)	Jun 5, 2026
1.1.4	Jun 1, 2026
1.1.3	Jun 1, 2026
1.1.2	May 31, 2026
1.1.1	May 31, 2026
1.1.0	May 31, 2026
1.0.2	May 28, 2026
1.0.1	May 28, 2026

Download files

Source Distribution

sotis-1.2.0.tar.gz (63.7 kB)

Uploaded Jun 5, 2026 Source

Built Distribution

sotis-1.2.0-py3-none-any.whl (66.3 kB)

Uploaded Jun 5, 2026 Python 3

File details

Details for the file sotis-1.2.0.tar.gz.

File metadata

Download URL: sotis-1.2.0.tar.gz
Upload date: Jun 5, 2026
Size: 63.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for sotis-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`339e2836f25c8b5a223d4c3bc8cd5ef4bcd0a070494655daa716c3abf2c1510a`
MD5	`b9af38cb20150bb323e413b2fe1775e2`
BLAKE2b-256	`fcf3f0e66baa9ec2338651793e5bab75849066ea673ed44469b77581a0301076`

File details

Details for the file sotis-1.2.0-py3-none-any.whl.

File metadata

Download URL: sotis-1.2.0-py3-none-any.whl
Upload date: Jun 5, 2026
Size: 66.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for sotis-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`257fdbc8754c02a7fa7d736038294da80d9b2f506686c644d3729c1873296390`
MD5	`d314a64d6b0c598579b0e8cf22f0faec`
BLAKE2b-256	`84482ffb8c21ded195a06ba3f4d15f9906fd8f8b1084b1891197e954ce8e21ab`
