# AgentFailureDiscovery

A framework for analyzing AI agent transcripts to automatically discover error patterns and build comprehensive error taxonomies.
## Overview
This framework helps you identify, categorize, and track errors in AI agent behavior by analyzing interaction transcripts. It uses an LLM-driven discovery pipeline that:
- Analyzes agent transcripts against policy documents and ground truth data
- Discovers new error types when failures don't match existing patterns
- Builds error taxonomies dynamically as more transcripts are processed
- Tracks error evolution across your agent's development lifecycle
## Features

- **Automated Error Discovery**: The LLM identifies failures and proposes new error types when patterns are meaningfully different
- **Dynamic Taxonomy Building**: The error registry grows organically as new failure modes are discovered
- **Policy-Aware Analysis**: Evaluates agent behavior against your specific policy documents
- **Flexible Dimensions**: Supports both pre-defined and custom error dimensions
- **Transparent Tracking**: Full audit trail of when and where each error type was discovered
- **JSON-Based**: Simple JSON input/output format for easy integration
## Installation

### From PyPI (once published)

```bash
pip install AgentFailureDiscovery
```

### From Source

```bash
git clone https://github.com/AmadeusITGroup/AgentFailureDiscovery.git
cd AgentFailureDiscovery
pip install -e .
```
## Quick Start
### 1. Set up your environment

Create a `.env` file with your Azure OpenAI credentials:

```bash
cp .env.example .env
```

Edit `.env` and add your credentials:

```
AZURE_OPENAI_API_KEY=your-api-key-here
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com
AZURE_OPENAI_API_VERSION=2024-05-01-preview
```
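Before running the pipeline, it can help to confirm those variables are actually set, since a missing credential otherwise only surfaces as an API error mid-run. A minimal sketch (the variable names mirror the `.env` example above; `missing_credentials` is an illustrative helper, not part of the package):

```python
import os

# The three Azure OpenAI variables from the .env example above.
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
]


def missing_credentials(env: dict) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

After `load_dotenv()`, call `missing_credentials(os.environ)` and abort early if the result is non-empty.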
### 2. Prepare your data

Create a JSON file with your tasks and agent simulations:

```json
{
  "tasks": [
    {
      "id": "task_001",
      "description": "User requests account balance",
      "expected_behavior": "Agent checks balance using get_balance tool"
    }
  ],
  "simulations": [
    {
      "task_id": "task_001",
      "trial": 0,
      "transcript": {
        "turns": [
          {"role": "user", "content": "What's my balance?"},
          {"role": "assistant", "content": "Your balance is $500", "tool_calls": []}
        ]
      },
      "reward": 0.0
    }
  ]
}
```
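If your transcripts come from another system, you can assemble this file programmatically instead of by hand. A sketch using only the standard library, with the field names taken from the example above:

```python
import json

# One task and one simulation, shaped like the example above.
task = {
    "id": "task_001",
    "description": "User requests account balance",
    "expected_behavior": "Agent checks balance using get_balance tool",
}
simulation = {
    "task_id": "task_001",
    "trial": 0,
    "transcript": {
        "turns": [
            {"role": "user", "content": "What's my balance?"},
            {"role": "assistant", "content": "Your balance is $500", "tool_calls": []},
        ]
    },
    "reward": 0.0,
}

data = {"tasks": [task], "simulations": [simulation]}
with open("data.json", "w") as f:
    json.dump(data, f, indent=2)
```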
### 3. Run the analysis

```bash
errordiscovery data.json --policy policy.md --output-dir ./results
```
### 4. Review results

The framework generates:

- `results/registry.json` - Error type taxonomy with all discovered patterns
- `results/run_output.json` - Detailed analysis results for each transcript
## Usage

### Command Line Interface

```bash
errordiscovery <json_file> [OPTIONS]
```

Required arguments:

- `json_file` - Path to JSON file containing tasks and simulations

Optional arguments:

- `--policy PATH` - Path to policy file (markdown or text)
- `--output-dir PATH` - Output directory for results (default: `./output`)
- `--model NAME` - LLM model deployment name (default: `gpt-5`)
- `--env-file PATH` - Path to `.env` file (default: `.env`)
- `--version` - Show version and exit
Example:

```bash
errordiscovery agent_transcripts.json \
  --policy my_policy.md \
  --output-dir ./analysis_results \
  --model gpt-5
```
### Python API

```python
from errordiscovery import run_discovery
from errordiscovery.utils import load_json_data, load_policy_file, save_registry
from dotenv import load_dotenv

# Load environment
load_dotenv()

# Load your data
tasks, simulations = load_json_data("data.json")
policy = load_policy_file("policy.md")

# Run discovery
output = run_discovery(
    policy=policy,
    tasks=tasks,
    simulations=simulations,
    model="gpt-5",
)

# Save results
save_registry(output["registry"], "registry.json")
print(output["registry_summary"])
```
## Input Format

### JSON Structure

Your input JSON should contain two main sections:
```json
{
  "tasks": [
    {
      "id": "unique_task_id",
      "description": "task description",
      "expected_behavior": "what should happen",
      "constraints": "any specific constraints",
      ...
    }
  ],
  "simulations": [
    {
      "task_id": "unique_task_id",
      "trial": 0,
      "transcript": {
        "turns": [
          {"role": "user", "content": "user message"},
          {"role": "assistant", "content": "agent response", "tool_calls": []}
        ]
      },
      "reward": 1.0
    }
  ]
}
```
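Catching a malformed input file before invoking the LLM pipeline saves a wasted run. This is an illustrative structural check, not part of the package API; it only verifies the keys shown above:

```python
def validate_input(data: dict) -> list:
    """Return a list of human-readable problems; an empty list means the shape looks OK."""
    problems = []
    task_ids = set()
    # Every task needs an id; collect the ids for cross-referencing.
    for i, task in enumerate(data.get("tasks", [])):
        if "id" not in task:
            problems.append(f"tasks[{i}] is missing 'id'")
        else:
            task_ids.add(task["id"])
    # Every simulation must point at a known task and carry a non-empty transcript.
    for i, sim in enumerate(data.get("simulations", [])):
        if sim.get("task_id") not in task_ids:
            problems.append(f"simulations[{i}] references unknown task_id {sim.get('task_id')!r}")
        if not sim.get("transcript", {}).get("turns"):
            problems.append(f"simulations[{i}] has an empty transcript")
    return problems
```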
### Policy File

A plain text or markdown file describing the rules and constraints your agent should follow:

```markdown
# Agent Policy

## Tool Usage

- Agents must verify account ownership before accessing balance
- All financial queries require authentication

## Communication

- Be concise and factual
- Do not make subjective comments
```
## Output Format

### Registry JSON

```json
{
  "error_types": {
    "DATA_HALLUCINATION": {
      "definition": "Stated facts that contradict or are absent from tool output",
      "dimension": "data_faithfulness",
      "is_new": false,
      "origin": null
    },
    "CUSTOM_ERROR_TYPE": {
      "definition": "New error discovered during analysis",
      "dimension": "new_dimension",
      "is_new": true,
      "origin": "task_042"
    }
  },
  "dimensions": {
    "data_faithfulness": {
      "question": "Did the agent accurately report data from tool outputs?",
      "fail_guidance": "Agent stated facts that contradict or don't exist in tool outputs",
      "pass_guidance": "Rounding ($99.99 → $100); correct summarization",
      "is_new": false,
      "origin": null
    }
  },
  "discovery_log": [
    {"task_id": "task_042", "name": "CUSTOM_ERROR_TYPE"}
  ]
}
```
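Because the registry is plain JSON, downstream tooling can consume it directly. As one example, a small helper (hypothetical, based only on the `is_new` and `origin` fields shown above) that lists the error types discovered during a run:

```python
def newly_discovered(registry: dict) -> list:
    """Return (name, origin) pairs for error types flagged as new in this registry."""
    return [
        (name, info.get("origin"))
        for name, info in registry.get("error_types", {}).items()
        if info.get("is_new")
    ]
```

Applied to the registry above, this would surface `CUSTOM_ERROR_TYPE` with its originating task, while skipping the pre-seeded `DATA_HALLUCINATION` type.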
## Base Error Dimensions

The framework comes with six pre-seeded dimensions:

### Interaction Dimensions

- `user_intent_adherence` - Did the agent honor preferences and correctly read user input?
- `user_question_fulfillment` - Did the agent answer the user's direct questions?

### Integrity Dimensions

- `policy_violation` - Did the agent's actions and decisions follow policy?
- `policy_faithfulness` - Did the agent correctly state policy rules?
- `data_faithfulness` - Did the agent accurately report data from tool outputs?
- `tool_efficiency` - Were the agent's tool calls reasonable given the available context?

New dimensions are discovered automatically when errors don't fit existing categories.
## Advanced Usage

### Custom Dimensions

Restrict analysis to specific dimensions:

```python
from errordiscovery.registry import INTERACTION_DIMS, INTEGRITY_DIMS

output = run_discovery(
    policy=policy,
    tasks=tasks,
    simulations=simulations,
    applicable_dims=INTERACTION_DIMS,  # only use interaction dimensions
)
```
### Ignore Specific Behaviors

Tell the analyzer to ignore known quirks:

```python
output = run_discovery(
    policy=policy,
    tasks=tasks,
    simulations=simulations,
    ignored_behaviors=[
        "concurrent tool calls (agent making more than one tool call in a single message)",
        "emoji usage in responses",
    ],
)
```
## Development

### Setup Development Environment

```bash
git clone https://github.com/AmadeusITGroup/AgentFailureDiscovery.git
cd AgentFailureDiscovery
pip install -e ".[dev]"
```

### Run Tests

```bash
pytest
```

### Code Formatting

```bash
black src/
ruff check src/
```
## How It Works

The error discovery pipeline:

1. **Initialize Registry** - Start with the base error types and dimensions
2. **Analyze Transcripts** - For each transcript, the LLM:
   - identifies all failures
   - maps failures to existing error types
   - proposes new types when failures are meaningfully different
   - records observations for borderline cases
3. **Update Registry** - New error types and dimensions are added dynamically
4. **Track Evolution** - The discovery log records when and where each type was found

The key innovation is the "relaxed gate" for new types: a discovery is made when a failure has a different root cause, policy impact, or user harm than existing types, not merely when it is superficially different.
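The relaxed gate can be pictured as a predicate: a candidate failure justifies a new type only if it differs from every existing type along at least one substantive axis. The sketch below is purely illustrative of that decision rule; the package's actual gate is LLM-driven rather than a field comparison, and the axis names here are assumptions:

```python
# Illustrative only: the real gate is an LLM judgment, not a dict comparison.
SUBSTANTIVE_AXES = ("root_cause", "policy_impact", "user_harm")


def warrants_new_type(candidate: dict, existing_types: list) -> bool:
    """A failure earns a new error type only when no existing type matches it on
    root cause, policy impact, AND user harm; surface wording is ignored."""
    for known in existing_types:
        if all(candidate.get(axis) == known.get(axis) for axis in SUBSTANTIVE_AXES):
            return False  # same substance as a known type, even if worded differently
    return True
```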
## Contributing
Contributions are welcome! Please open an issue or submit a pull request.
## License
Apache License 2.0 - see LICENSE file for details.
## Citation

If you use this framework in your research, please cite:

```bibtex
@software{agentfailurediscovery,
  title  = {AgentFailureDiscovery: Automated Error Discovery for AI Agents},
  author = {Driouich, Ilias and Cao, Hongliu},
  year   = {2026},
  url    = {https://github.com/AmadeusITGroup/AgentFailureDiscovery}
}
```
## Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
## Acknowledgments
Built with inspiration from AI agent evaluation frameworks and error taxonomy research.