triage
Classify why your agent failed. Recover intelligently.
pip install triage-agent
The problem
Current agent frameworks know that your agent failed. They don't know why — and without knowing why, every failure gets the same blunt response: retry from scratch or give up.
triage adds a classification-and-routing layer between the failure and the recovery:
agent fails → classify failure type → route to matching strategy → recover
It works with any async agent callable — OpenAI, LangGraph, CrewAI, raw LLM loops — without requiring you to change your framework.
Installation
# Core only
pip install triage-agent
# With framework adapters
pip install "triage-agent[langgraph]"
pip install "triage-agent[crewai]"
pip install "triage-agent[openai-agents]"
pip install "triage-agent[langchain]"
# With LLM-based classifier
pip install "triage-agent[anthropic]"
# With durable checkpoint storage
pip install "triage-agent[sqlite]"
pip install "triage-agent[redis]"
Python 3.10+ required. Core dependencies: anyio>=4.0, pydantic>=2.0.
Quick start
import triage
from triage.strategies.retry import retry_with_tool_manifest, backoff_and_retry
from triage.strategies.replan import replan
from triage.strategies.rollback import rollback_to_checkpoint
from triage.taxonomy import Step
# 1. Define your agent — it receives record_step and update_state callbacks
async def my_agent(task: str, *, record_step, update_state, _triage_hint=None, **kwargs):
    # ... your agent logic ...
    data = fetch_data(task)
    record_step(Step(index=0, action="called search", tool_called="search",
                     tool_input={"q": task}, tool_output=data))
    update_state({"data": data})  # persisted into checkpoints; restored on rollback
    return "done"
# 2. Declare a recovery policy
policy = triage.FailurePolicy(
    WRONG_TOOL_CALLED  = retry_with_tool_manifest(max_attempts=3),
    EXTERNAL_FAULT     = backoff_and_retry(max_attempts=5),
    LOOP_DETECTED      = replan(hint="Try a different approach."),
    HALLUCINATED_STATE = rollback_to_checkpoint(),
    default            = triage.FailurePolicy.escalate_by_default(),
)
# 3. Wrap and run
agent = triage.Agent(my_agent, policy=policy)
result = await agent.run("search for recent AI papers")
Or use the decorator form:
@triage.agent(policy=policy)
async def my_agent(task: str, *, record_step, **kwargs):
    ...
Framework adapters
Drop-in wrappers let you add triage to an existing agent without changing its internals.
LangGraph
from triage.adapters.langgraph import wrap_langgraph
agent = wrap_langgraph(compiled_graph, policy=policy)
result = await agent.run("your task")
Streams events via graph.astream_events(..., version="v2") to capture tool calls and LLM turns.
CrewAI
from triage.adapters.crewai import wrap_crewai
agent = wrap_crewai(crew, policy=policy)
result = await agent.run("your task")
Patches crew.step_callback for each run (original restored in finally).
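The patch-and-restore pattern described above can be sketched without CrewAI installed. This is an illustration only, not the adapter's actual code: FakeCrew, run_with_recorder and user_callback are hypothetical stand-ins, and the only detail taken from the description is that step_callback is replaced for the duration of a run and put back in a finally block.

```python
class FakeCrew:
    """Hypothetical stand-in for a crew object exposing step_callback and kickoff()."""

    def __init__(self):
        self.step_callback = None

    def kickoff(self, fail=False):
        if self.step_callback is not None:
            self.step_callback("step-1")
        if fail:
            raise RuntimeError("crew failed")
        return "done"


def run_with_recorder(crew, recorded, **kwargs):
    """Record steps during one run, then restore the user's own callback."""
    original = crew.step_callback

    def recorder(step):
        recorded.append(step)
        if original is not None:
            original(step)  # keep the user's callback working

    crew.step_callback = recorder
    try:
        return crew.kickoff(**kwargs)
    finally:
        crew.step_callback = original  # restored even if the run raised


user_seen = []


def user_callback(step):
    user_seen.append(step)


crew = FakeCrew()
crew.step_callback = user_callback
recorded = []

outcome = run_with_recorder(crew, recorded)
try:
    run_with_recorder(crew, recorded, fail=True)
except RuntimeError:
    pass  # the failure propagates, but the callback is still restored
```

Restoring in finally means a failed run cannot leave the recorder attached to the crew, which matters when the same crew object is reused across recovery attempts.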
OpenAI Agents SDK
from triage.adapters.openai_agents import wrap_openai_agents
agent = wrap_openai_agents(sdk_agent, policy=policy)
result = await agent.run("your task")
Uses Runner.run_streamed and iterates stream_events().
LangChain
from triage.adapters.langchain import wrap_langchain
agent = wrap_langchain(executor, policy=policy)
result = await agent.run("your task")
Injects a fresh BaseCallbackHandler per call via config={"callbacks": [...]}.
All adapters accept the same optional kwargs as triage.Agent: classifier, checkpoint_store, max_recovery_attempts, auto_checkpoint.
How it works
1. Record steps
Your agent calls record_step(Step(...)) for each observable action. triage injects the callback, so the only import you need is Step:
from triage.taxonomy import Step

async def my_agent(task: str, *, record_step, **kwargs):
    result = call_tool("search", {"q": task})  # call_tool stands for your own tool dispatch
    record_step(Step(
        index=0,
        action="called search tool",
        tool_called="search",
        tool_input={"q": task},
        tool_output=result,
    ))
2. Classify the failure
When your agent raises an exception, triage runs the classifier over the recorded trajectory and returns one of 10 FailureType values:
| FailureType | Trigger | Default recovery |
|---|---|---|
| WRONG_TOOL_CALLED | Error matches "tool not found" / "no tool named" | Retry with correct manifest |
| CONSTRAINT_IGNORED | LLM output contains a forbidden string | Replan with constraint reminder |
| LOOP_DETECTED | Last 3 steps identical tool + input | Replan or rollback |
| HALLUCINATED_STATE | Agent asserts facts contradicting tool output | Rollback to checkpoint |
| PLAN_INCOMPLETE | Success declared but sub-goals incomplete | Resume from subgoal |
| SCHEMA_MISMATCH | Error matches "validation error" / JSON parse failure | Retry with schema hint |
| CONTEXT_OVERFLOW | Agent lost earlier context | Replan with compressed context |
| GOAL_DRIFT | Agent making progress toward the wrong goal | Replan with goal restatement |
| EXTERNAL_FAULT | HTTP 429 / 500 / 502 / 503 in error | Exponential backoff + retry |
| UNKNOWN | None of the above | Escalate to human |
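To make the pattern-based triggers in the table concrete, here is a minimal sketch of how such rules could be written with the standard library. It is not the RulesClassifier source: the FailureType enum below is a local stand-in covering only a subset of the members, the lowercase enum values are an assumption, and steps are modelled as plain dicts rather than Step objects.

```python
import re
from enum import Enum


class FailureType(Enum):
    # Local stand-in for triage.taxonomy.FailureType (subset only).
    WRONG_TOOL_CALLED = "wrong_tool_called"
    SCHEMA_MISMATCH = "schema_mismatch"
    EXTERNAL_FAULT = "external_fault"
    LOOP_DETECTED = "loop_detected"
    UNKNOWN = "unknown"


# (pattern, failure type) pairs, checked in order against the error message.
ERROR_RULES = [
    (re.compile(r"tool not found|no tool named", re.I), FailureType.WRONG_TOOL_CALLED),
    (re.compile(r"validation error", re.I), FailureType.SCHEMA_MISMATCH),
    (re.compile(r"\b(429|500|502|503)\b"), FailureType.EXTERNAL_FAULT),
]


def is_loop(steps, window=3):
    """True when the last `window` steps used the same tool with the same input."""
    if len(steps) < window:
        return False
    tail = {(s["tool_called"], repr(s["tool_input"])) for s in steps[-window:]}
    return len(tail) == 1


def classify(error_message, steps):
    """Return the first matching failure type, or UNKNOWN when nothing matches."""
    for pattern, failure_type in ERROR_RULES:
        if pattern.search(error_message):
            return failure_type
    if is_loop(steps):
        return FailureType.LOOP_DETECTED
    return FailureType.UNKNOWN


# Three identical search calls in a row, as a loop would produce.
repeated = [{"tool_called": "search", "tool_input": {"q": "x"}}] * 3
```

Rules like these need no network access, which is why a pattern-based classifier can run on every failure at no cost. The semantic rows in the table (for example GOAL_DRIFT) cannot be expressed this way.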
The default RulesClassifier is pattern-based and makes zero API calls. For semantic classification use LLMClassifier, or use HybridClassifier, which runs the rules first and calls the LLM only when they return UNKNOWN:
from triage.classifier.llm import LLMClassifier
from triage.classifier.hybrid import HybridClassifier
# LLM only: every failure classified by Claude
agent = triage.Agent(
    my_agent,
    policy=policy,
    classifier=LLMClassifier(model="claude-haiku-4-5-20251001"),
)
# Hybrid: rules first, LLM only when rules return UNKNOWN (~20% of failures)
agent = triage.Agent(
    my_agent,
    policy=policy,
    classifier=HybridClassifier(llm=LLMClassifier()),
)
LLMClassifier supports Anthropic and any OpenAI-compatible provider. Configure via constructor args or env vars:
# Anthropic (default)
ANTHROPIC_API_KEY=sk-ant-... python my_agent.py
# Ollama (local, no key)
TRIAGE_LLM_BASE_URL=http://localhost:11434/v1 TRIAGE_LLM_MODEL=llama3.2 python my_agent.py
# Groq
TRIAGE_LLM_BASE_URL=https://api.groq.com/openai/v1 TRIAGE_LLM_API_KEY=gsk_... TRIAGE_LLM_MODEL=llama-3.1-8b-instant python my_agent.py
Or pass explicitly:
LLMClassifier(base_url="http://localhost:11434/v1", model="llama3.2")
LLMClassifier falls back to UNKNOWN silently on any error. Requires pip install "triage-agent[anthropic]" for Anthropic, or pip install openai for any OpenAI-compatible provider.
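The two behaviours described in this step, rules first with an LLM fallback and a silent fallback to UNKNOWN on error, can be sketched together. This is an assumption-laden illustration rather than the library's code: rules_classify, safe_llm_classify, hybrid_classify and the fake model callables are all hypothetical names, and the enum is a local stand-in.

```python
from enum import Enum


class FailureType(Enum):
    # Local stand-in for triage.taxonomy.FailureType (subset only).
    EXTERNAL_FAULT = "external_fault"
    GOAL_DRIFT = "goal_drift"
    UNKNOWN = "unknown"


def rules_classify(error_message):
    """Cheap pattern check standing in for the rules pass."""
    if "429" in error_message:
        return FailureType.EXTERNAL_FAULT
    return FailureType.UNKNOWN


def safe_llm_classify(llm_call, error_message):
    """Any exception from the model call is swallowed and reported as UNKNOWN."""
    try:
        return llm_call(error_message)
    except Exception:
        return FailureType.UNKNOWN


def hybrid_classify(llm_call, error_message):
    result = rules_classify(error_message)
    if result is not FailureType.UNKNOWN:
        return result  # rules matched, so no model call is made
    return safe_llm_classify(llm_call, error_message)


calls = []


def fake_llm(message):
    """Pretend model that records each call it receives."""
    calls.append(message)
    return FailureType.GOAL_DRIFT


def broken_llm(message):
    raise RuntimeError("provider unavailable")


by_rules = hybrid_classify(fake_llm, "HTTP 429 Too Many Requests")
calls_after_rules = list(calls)  # still empty: the rules answered
by_llm = hybrid_classify(fake_llm, "agent answered a different question")
on_error = hybrid_classify(broken_llm, "agent answered a different question")
```

Because a provider error surfaces as UNKNOWN rather than as an exception, it is worth mapping UNKNOWN (or default) to a strategy you are comfortable with when the model is unreachable.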
3. Dispatch to a strategy
The policy maps each FailureType to a strategy callable. The strategy returns a RecoveryAction that tells triage what to do next.
4. Execute the recovery
triage executes the action and re-runs your agent with injected context:
| Action | What happens |
|---|---|
| RETRY | Re-runs the agent; injects _triage_hint into kwargs |
| REPLAN | Re-runs the agent; injects _triage_hint with new plan instruction |
| ROLLBACK | Restores trajectory from checkpoint, re-runs agent |
| RESUME | Re-runs agent; injects _triage_subgoal pointing at incomplete subgoal |
| ESCALATE | Raises TriageEscalationError(message, context) |
| ABORT | Raises TriageAbortError(reason, context) |
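As a rough mental model of the table above, the loop below re-runs an agent with injected keyword arguments depending on the action kind. It is a sketch under stated assumptions, not triage's implementation: actions are modelled as (kind, payload) tuples, EscalationError is a local stand-in for TriageEscalationError, ROLLBACK and ABORT are left out, and raising once max_recovery_attempts is exhausted is an assumed behaviour.

```python
import asyncio


class EscalationError(Exception):
    """Local stand-in for triage.TriageEscalationError."""


async def run_with_recovery(agent, task, strategy, max_recovery_attempts=3):
    kwargs = {}
    for _ in range(max_recovery_attempts + 1):
        try:
            return await agent(task, **kwargs)
        except Exception as exc:
            kind, payload = strategy(exc)
            if kind in ("retry", "replan"):
                kwargs = {"_triage_hint": payload}
            elif kind == "resume":
                kwargs = {"_triage_subgoal": payload}
            else:  # "escalate" and anything unrecognised
                raise EscalationError(payload) from exc
    raise EscalationError("recovery attempts exhausted")


seen_hints = []


async def flaky_agent(task, *, _triage_hint=None, **kwargs):
    """Fails on the first call, succeeds once a hint is injected."""
    seen_hints.append(_triage_hint)
    if _triage_hint is None:
        raise RuntimeError("no tool named 'calc'")
    return f"{task}: ok"


def always_retry(exc):
    return ("retry", "Use only tools in the manifest.")


result = asyncio.run(run_with_recovery(flaky_agent, "demo", always_retry))
```

The agent sees no hint on the first call and the strategy's hint on the second, which is the contract the _triage_hint keyword relies on.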
Failure policy
FailurePolicy is a plain dataclass — one field per FailureType:
policy = triage.FailurePolicy(
    WRONG_TOOL_CALLED  = retry_with_tool_manifest(max_attempts=3),
    CONSTRAINT_IGNORED = replan(hint="Re-read the task constraints carefully."),
    LOOP_DETECTED      = replan(max_replans=2),
    HALLUCINATED_STATE = rollback_to_checkpoint(),
    PLAN_INCOMPLETE    = resume_from_subgoal(),
    SCHEMA_MISMATCH    = retry_with_tool_manifest(max_attempts=2),
    EXTERNAL_FAULT     = backoff_and_retry(max_attempts=5),
    default            = triage.FailurePolicy.escalate_by_default(),
)
Any FailureType not explicitly listed falls through to default. If default is also unset, triage escalates automatically.
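The fall-through order (explicit field, then default, then automatic escalation) can be sketched with a miniature dataclass. Policy, resolve, escalate and backoff here are hypothetical names for illustration; the real FailurePolicy has one field per FailureType and its lookup method may look different.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Strategy = Callable[[str], str]


def escalate(failure_name: str) -> str:
    """Last-resort strategy used when nothing else is configured."""
    return f"escalate:{failure_name}"


@dataclass
class Policy:
    # Hypothetical miniature: two failure-type fields plus a default.
    EXTERNAL_FAULT: Optional[Strategy] = None
    LOOP_DETECTED: Optional[Strategy] = None
    default: Optional[Strategy] = None

    def resolve(self, failure_name: str) -> Strategy:
        # Explicit field first, then default, then automatic escalation.
        specific = getattr(self, failure_name, None)
        return specific or self.default or escalate


def backoff(failure_name: str) -> str:
    return f"backoff:{failure_name}"


policy = Policy(EXTERNAL_FAULT=backoff)
with_default = Policy(default=backoff)
```

With this ordering, a policy that lists only the failure types you care about still behaves safely for every other type.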
Built-in strategies
triage.strategies.retry
from triage.strategies.retry import retry_with_tool_manifest, backoff_and_retry
# Retry with a hint to use the correct tool manifest
retry_with_tool_manifest(max_attempts=3)
# Retry with exponential backoff (2^attempt seconds). Good for rate limits.
backoff_and_retry(max_attempts=5)
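For a sense of what a 2^attempt schedule means in practice, the sketch below retries an async call and sleeps 2 ** attempt seconds between tries. It is not the backoff_and_retry source: retry_with_backoff is a hypothetical helper, counting attempts from 1 is an assumption, and the sleep function is injectable only so the example runs instantly.

```python
import asyncio


async def retry_with_backoff(call, max_attempts, sleep=asyncio.sleep):
    """Retry `call`, waiting 2 ** attempt seconds after each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            await sleep(2 ** attempt)


slept = []


async def fake_sleep(seconds):
    slept.append(seconds)  # record the delay instead of waiting


outcomes = iter([RuntimeError("HTTP 429"), RuntimeError("HTTP 503"), "ok"])


async def flaky():
    """Fails twice with transient errors, then succeeds."""
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = asyncio.run(retry_with_backoff(flaky, max_attempts=5, sleep=fake_sleep))
```

Under these assumptions, five attempts wait at most 2 + 4 + 8 + 16 = 30 seconds in total, since no sleep follows the final attempt.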
triage.strategies.replan
from triage.strategies.replan import replan, resume_from_subgoal
# Restart with a new plan, optionally injecting a hint
replan(hint="The previous approach used the wrong API endpoint.")
# Continue from the first incomplete sub-goal
resume_from_subgoal()
triage.strategies.rollback
from triage.strategies.rollback import rollback_to_checkpoint
# Restore to latest checkpoint (or a named one)
rollback_to_checkpoint()
rollback_to_checkpoint(checkpoint_id="before-api-call")
Checkpoints
Save agent state at key points so triage can roll back to them on failure.
In-memory (default)
from triage.checkpoint import InMemoryCheckpointStore
store = InMemoryCheckpointStore()
agent = triage.Agent(my_agent, policy=policy, checkpoint_store=store)
SQLite (persistent, single-process)
pip install "triage-agent[sqlite]"
from triage.checkpoint.sqlite import SQLiteCheckpointStore
store = SQLiteCheckpointStore("runs/checkpoints.db")
agent = triage.Agent(my_agent, policy=policy, checkpoint_store=store)
Redis (distributed)
pip install "triage-agent[redis]"
import redis.asyncio as aioredis
from triage.checkpoint.redis import RedisCheckpointStore
client = aioredis.Redis.from_url("redis://localhost:6379")
store = RedisCheckpointStore(client)
agent = triage.Agent(my_agent, policy=policy, checkpoint_store=store)
Auto-checkpoint
Enable automatic checkpointing after every successful step:
agent = triage.Agent(my_agent, policy=policy, checkpoint_store=store, auto_checkpoint=True)
Checkpoints are always awaited before run() returns or any recovery action executes, so a ROLLBACK always has a checkpoint available.
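To illustrate what rolling back to the latest or to a named checkpoint involves, here is a tiny synchronous store. It is a sketch only: TinyCheckpointStore and its save/load methods are hypothetical, whereas the real stores follow the CheckpointStore protocol and, since checkpoints are awaited, are presumably async.

```python
import copy


class TinyCheckpointStore:
    """Hypothetical in-memory store keyed by checkpoint id."""

    def __init__(self):
        self._checkpoints = {}  # dicts preserve insertion order

    def save(self, checkpoint_id, steps, state):
        # Re-saving an id moves it to the end so it counts as the latest.
        self._checkpoints.pop(checkpoint_id, None)
        # Deep copies protect the snapshot from later mutation by the agent.
        self._checkpoints[checkpoint_id] = (copy.deepcopy(steps), copy.deepcopy(state))

    def load(self, checkpoint_id=None):
        if not self._checkpoints:
            raise LookupError("no checkpoint saved")
        if checkpoint_id is None:
            checkpoint_id = next(reversed(self._checkpoints))  # most recent
        return copy.deepcopy(self._checkpoints[checkpoint_id])


store = TinyCheckpointStore()
store.save("before-api-call", ["step0"], {"data": [1]})

state = {"data": [1, 2]}
store.save("after-api-call", ["step0", "step1"], state)
state["data"].append(3)  # mutation after the save must not leak into the snapshot

latest_steps, latest_state = store.load()
named_steps, named_state = store.load("before-api-call")
```

Copying on both save and load is the design choice that makes a rollback trustworthy: the restored state is exactly what was recorded, whatever the agent did to its live objects afterwards.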
Recovery context in your agent
Two callbacks are injected on every call; recovery context is added when triage re-runs the agent after a failure:
async def my_agent(
    task: str,
    *,
    record_step,
    update_state,
    _triage_hint=None,
    _triage_subgoal=None,
    _triage_state=None,
    **kwargs,
):
    # On rollback, _triage_state contains the state saved at the checkpoint
    if _triage_state:
        data = _triage_state["data"]  # skip re-fetching, use restored state
    else:
        data = fetch_data(task)
        record_step(Step(index=0, action="fetch", tool_output=data))
        update_state({"data": data})  # saved into every auto_checkpoint
    if _triage_hint:
        print(f"Recovery hint: {_triage_hint}")
    if _triage_subgoal:
        task = _triage_subgoal
| Key | Set when |
|---|---|
| record_step | Always (injected on every call) |
| update_state | Always (injected on every call) |
| _triage_hint | RETRY, REPLAN, or ROLLBACK action |
| _triage_subgoal | RESUME action |
| _triage_state | ROLLBACK action, when checkpoint has non-empty state |
Attempt history
Strategies can inspect everything that was tried before they were called:
async def smart_strategy(ctx: triage.FailureContext) -> triage.RecoveryAction:
    # ctx.attempt_history is a list of (FailureType, action_kind) tuples
    replan_count = sum(1 for _, kind in ctx.attempt_history if kind == "replan")
    if replan_count >= 2:
        return triage.RecoveryAction.ESCALATE(message="Replanned twice, still failing.")
    return triage.RecoveryAction.REPLAN(hint="Try a different approach.")

policy = triage.FailurePolicy(GOAL_DRIFT=smart_strategy)
attempt_history is empty on the first failure and grows by one entry per recovery attempt. Each entry is (failure_type, action_kind) where action_kind is one of "retry", "replan", "rollback", "resume", "escalate", "abort".
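The way the history grows across attempts can be traced with plain tuples. This sketch mirrors the counting logic of the strategy above without the library: choose_action is a hypothetical helper, and the string "goal_drift" stands in for a FailureType member.

```python
def choose_action(attempt_history, max_replans=2):
    """Replan until `max_replans` replans have been tried, then escalate."""
    replans = sum(1 for _, kind in attempt_history if kind == "replan")
    return "escalate" if replans >= max_replans else "replan"


history = []    # empty on the first failure
decisions = []
for _ in range(3):  # simulate three consecutive failures
    kind = choose_action(history)
    decisions.append(kind)
    history.append(("goal_drift", kind))  # one entry per recovery attempt
```

The first two failures replan and the third escalates, because by then the history already holds two "replan" entries.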
Handling escalation and abort
try:
    result = await agent.run(task)
except triage.TriageEscalationError as exc:
    # exc.context is a FailureContext with the full trajectory and failure type
    print(f"Needs human review: {exc}")
    print(f"Failure type: {exc.context.failure_type.value}")
    print(f"Failed at step: {exc.context.critical_step_index}")
except triage.TriageAbortError as exc:
    print(f"Hard stop: {exc}")
Custom classifier
Any class implementing classify(trajectory, task) -> FailureType satisfies the protocol:
from triage.classifier.base import Classifier
from triage.taxonomy import FailureType
from triage.trajectory import Trajectory
class MyClassifier:
    def classify(self, trajectory: Trajectory, task: str) -> FailureType:
        ...

agent = triage.Agent(my_agent, policy=policy, classifier=MyClassifier())
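"Satisfies the protocol" refers to structural typing: a class qualifies by having the right method, with no base class required. The sketch below shows the idea with typing.Protocol from the standard library. ClassifierLike, KeywordClassifier and NotAClassifier are hypothetical, the trajectory is modelled as a plain list, and whether triage's own Classifier protocol is runtime-checkable is not something the README states.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ClassifierLike(Protocol):
    """Hypothetical protocol with the same method name as triage's Classifier."""

    def classify(self, trajectory, task): ...


class KeywordClassifier:
    # No inheritance from ClassifierLike: the method alone makes it conform.
    def classify(self, trajectory, task):
        if any("429" in str(step) for step in trajectory):
            return "external_fault"
        return "unknown"


class NotAClassifier:
    def label(self, trajectory, task):
        return "unknown"


conforms = isinstance(KeywordClassifier(), ClassifierLike)
rejected = isinstance(NotAClassifier(), ClassifierLike)
verdict = KeywordClassifier().classify(["GET /search -> 429"], "find papers")
```

Note that a runtime isinstance check against a protocol only verifies that the method exists, not its signature or return type, so a static type checker remains the better guard.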
Example: OpenAI tool-calling loop
See examples/raw_openai.py for a full working example. It deliberately triggers a WRONG_TOOL_CALLED failure on the first attempt and shows triage catching the failure and recovering from it automatically:
OPENAI_API_KEY=sk-... python examples/raw_openai.py
Expected output:
Task: What is 42 * 17?
[triage] wrong_tool_called detected at step 0
[triage] Dispatching: RecoveryAction.RETRY(hint='Re-run using only tools in the current manifest.', inject={'max_attempts': 3})
[triage] Attempt 1...
Result: 714
Project layout
triage/
    taxonomy.py              FailureType enum, Step, FailureContext
    trajectory.py            Trajectory (append / replay_from / last_n_steps)
    checkpoint/
        base.py              Checkpoint, CheckpointStore protocol, serialization helpers
        memory.py            InMemoryCheckpointStore
        sqlite.py            SQLiteCheckpointStore (requires aiosqlite)
        redis.py             RedisCheckpointStore (requires redis[asyncio])
    policy.py                RecoveryAction (6 constructors), FailurePolicy
    agent.py                 Agent class, TriageEscalationError, TriageAbortError, @agent decorator
    classifier/
        base.py              Classifier protocol
        rules.py             RulesClassifier: 6 rules, sync, zero API calls
        llm.py               LLMClassifier: Anthropic or OpenAI-compatible backend
        hybrid.py            HybridClassifier: rules first, LLM fallback on UNKNOWN
    strategies/
        retry.py             retry_with_tool_manifest(), backoff_and_retry()
        replan.py            replan(), resume_from_subgoal()
        rollback.py          rollback_to_checkpoint()
    adapters/
        langgraph.py         wrap_langgraph() (requires langgraph)
        crewai.py            wrap_crewai() (requires crewai)
        openai_agents.py     wrap_openai_agents() (requires openai-agents)
        langchain.py         wrap_langchain() (requires langchain)
License
MIT