
AgentFailureDiscovery

Agent Failure Discovery - A framework for analyzing AI agent transcripts to automatically discover error patterns and build comprehensive error taxonomies.


Overview

This framework helps you identify, categorize, and track errors in AI agent behavior by analyzing interaction transcripts. It uses an LLM-driven discovery pipeline that:

  1. Analyzes agent transcripts against policy documents and ground truth data
  2. Discovers new error types when failures don't match existing patterns
  3. Builds error taxonomies dynamically as more transcripts are processed
  4. Tracks error evolution across your agent's development lifecycle

Features

  • Automated Error Discovery: LLM identifies failures and proposes new error types when patterns are meaningfully different
  • Dynamic Taxonomy Building: Error registry grows organically as new failure modes are discovered
  • Policy-Aware Analysis: Evaluates agent behavior against your specific policy documents
  • Flexible Dimensions: Supports both pre-defined and custom error dimensions
  • Transparent Tracking: Full audit trail of when and where each error type was discovered
  • JSON-Based: Simple JSON input/output format for easy integration

Installation

From PyPI

pip install AgentFailureDiscovery

From Source

git clone https://github.com/AmadeusITGroup/AgentFailureDiscovery.git
cd AgentFailureDiscovery
pip install -e .

Quick Start

1. Set up your environment

Create a .env file with your Azure OpenAI credentials:

cp .env.example .env

Edit .env and add your credentials:

AZURE_OPENAI_API_KEY=your-api-key-here
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com
AZURE_OPENAI_API_VERSION=2024-05-01-preview

2. Prepare your data

Create a JSON file with your tasks and agent simulations:

{
  "tasks": [
    {
      "id": "task_001",
      "description": "User requests account balance",
      "expected_behavior": "Agent checks balance using get_balance tool"
    }
  ],
  "simulations": [
    {
      "task_id": "task_001",
      "trial": 0,
      "transcript": {
        "turns": [
          {"role": "user", "content": "What's my balance?"},
          {"role": "assistant", "content": "Your balance is $500", "tool_calls": []}
        ]
      },
      "reward": 0.0
    }
  ]
}

3. Run the analysis

errordiscovery data.json --policy policy.md --output-dir ./results

4. Review results

The framework generates:

  • results/registry.json - Error type taxonomy with all discovered patterns
  • results/run_output.json - Detailed analysis results for each transcript

Usage

Command Line Interface

errordiscovery <json_file> [OPTIONS]

Required Arguments:

  • json_file - Path to JSON file containing tasks and simulations

Optional Arguments:

  • --policy PATH - Path to policy file (markdown or text)
  • --output-dir PATH - Output directory for results (default: ./output)
  • --model NAME - LLM model deployment name (default: gpt-5)
  • --env-file PATH - Path to .env file (default: .env)
  • --version - Show version and exit

Example:

errordiscovery agent_transcripts.json \
    --policy my_policy.md \
    --output-dir ./analysis_results \
    --model gpt-5

Python API

from errordiscovery import run_discovery
from errordiscovery.utils import load_json_data, load_policy_file, save_registry
from dotenv import load_dotenv

# Load environment
load_dotenv()

# Load your data
tasks, simulations = load_json_data("data.json")
policy = load_policy_file("policy.md")

# Run discovery
output = run_discovery(
    policy=policy,
    tasks=tasks,
    simulations=simulations,
    model="gpt-5"
)

# Save results
save_registry(output["registry"], "registry.json")
print(output["registry_summary"])

Input Format

JSON Structure

Your input JSON should contain two main sections:

{
  "tasks": [
    {
      "id": "unique_task_id",
      "description": "task description",
      "expected_behavior": "what should happen",
      "constraints": "any specific constraints",
      ...
    }
  ],
  "simulations": [
    {
      "task_id": "unique_task_id",
      "trial": 0,
      "transcript": {
        "turns": [
          {"role": "user", "content": "user message"},
          {"role": "assistant", "content": "agent response", "tool_calls": []}
        ]
      },
      "reward": 1.0
    }
  ]
}
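Before running the pipeline it can help to sanity-check your file against this structure. The helper below is not part of the package; it is a minimal sketch that verifies only the keys shown above and that every simulation references a declared task:

```python
# Minimal structural check for the input JSON described above.
# Required keys are taken from the documented format; extra fields
# (constraints, metadata, ...) are passed through untouched.
REQUIRED_TASK_KEYS = {"id", "description"}
REQUIRED_SIM_KEYS = {"task_id", "trial", "transcript"}

def validate_input(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    task_ids = set()
    for i, task in enumerate(data.get("tasks", [])):
        missing = REQUIRED_TASK_KEYS - task.keys()
        if missing:
            problems.append(f"tasks[{i}] missing {sorted(missing)}")
        task_ids.add(task.get("id"))
    for i, sim in enumerate(data.get("simulations", [])):
        missing = REQUIRED_SIM_KEYS - sim.keys()
        if missing:
            problems.append(f"simulations[{i}] missing {sorted(missing)}")
        if sim.get("task_id") not in task_ids:
            problems.append(
                f"simulations[{i}] references unknown task_id {sim.get('task_id')!r}"
            )
    return problems
```

Run it on the parsed JSON before calling `errordiscovery` to catch malformed records early.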

Policy File

A plain text or markdown file describing the rules and constraints your agent should follow:

# Agent Policy

## Tool Usage
- Agents must verify account ownership before accessing balance
- All financial queries require authentication

## Communication
- Be concise and factual
- Do not make subjective comments

Output Format

Registry JSON

{
  "error_types": {
    "DATA_HALLUCINATION": {
      "definition": "Stated facts that contradict or are absent from tool output",
      "dimension": "data_faithfulness",
      "is_new": false,
      "origin": null
    },
    "CUSTOM_ERROR_TYPE": {
      "definition": "New error discovered during analysis",
      "dimension": "new_dimension",
      "is_new": true,
      "origin": "task_042"
    }
  },
  "dimensions": {
    "data_faithfulness": {
      "question": "Did the agent accurately report data from tool outputs?",
      "fail_guidance": "Agent stated facts that contradict or don't exist in tool outputs",
      "pass_guidance": "Rounding ($99.99→$100); correct summarization",
      "is_new": false,
      "origin": null
    }
  },
  "discovery_log": [
    {"task_id": "task_042", "name": "CUSTOM_ERROR_TYPE"}
  ]
}
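Because the registry is plain JSON, newly discovered types are easy to pull out with the standard library. A small sketch using the field names from the example above (the helper itself is not part of the package):

```python
def new_error_types(registry: dict) -> list[tuple[str, str]]:
    """Return (name, origin) pairs for every error type flagged is_new."""
    return [
        (name, entry.get("origin"))
        for name, entry in registry.get("error_types", {}).items()
        if entry.get("is_new")
    ]

# Using the registry fragment shown above:
registry = {
    "error_types": {
        "DATA_HALLUCINATION": {"is_new": False, "origin": None},
        "CUSTOM_ERROR_TYPE": {"is_new": True, "origin": "task_042"},
    }
}
print(new_error_types(registry))  # → [('CUSTOM_ERROR_TYPE', 'task_042')]
```

In practice you would `json.load` the generated `results/registry.json` first and feed the resulting dict to the helper.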

Base Error Dimensions

The framework comes with 6 pre-seeded dimensions:

Interaction Dimensions

  • user_intent_adherence - Did the agent honor preferences and correctly read user input?
  • user_question_fulfillment - Did the agent answer the user's direct questions?

Integrity Dimensions

  • policy_violation - Did the agent's actions and decisions follow policy?
  • policy_faithfulness - Did the agent correctly state policy rules?
  • data_faithfulness - Did the agent accurately report data from tool outputs?
  • tool_efficiency - Were the agent's tool calls reasonable given available context?

New dimensions are discovered automatically when errors don't fit existing categories.


Advanced Usage

Custom Dimensions

Restrict analysis to specific dimensions:

from errordiscovery.registry import INTERACTION_DIMS, INTEGRITY_DIMS

output = run_discovery(
    policy=policy,
    tasks=tasks,
    simulations=simulations,
    applicable_dims=INTERACTION_DIMS  # Only use interaction dimensions
)

Ignore Specific Behaviors

Tell the analyzer to ignore known quirks:

output = run_discovery(
    policy=policy,
    tasks=tasks,
    simulations=simulations,
    ignored_behaviors=[
        "concurrent tool calls (agent making more than one tool call in a single message)",
        "emoji usage in responses"
    ]
)

Development

Setup Development Environment

git clone https://github.com/AmadeusITGroup/AgentFailureDiscovery.git
cd AgentFailureDiscovery
pip install -e ".[dev]"

Run Tests

pytest

Code Formatting

black src/
ruff check src/

How It Works

The error discovery pipeline:

  1. Initialize Registry - Start with base error types and dimensions
  2. Analyze Transcripts - For each transcript:
    • LLM identifies all failures
    • Maps failures to existing error types
    • Proposes new types when failures are meaningfully different
    • Records observations for borderline cases
  3. Update Registry - New error types and dimensions are added dynamically
  4. Track Evolution - Discovery log tracks when/where each type was found

The key innovation is the "relaxed gate" for new types: discoveries are made when failures have different root causes, policy impacts, or user harms, not just when they are superficially different.
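The registry-update step of the loop can be sketched as follows. Everything here is illustrative: the real package drives this step with an LLM judgment, and the function name and `finding` shape are invented for the sketch (mirroring the registry.json format shown earlier).

```python
def update_registry(registry: dict, finding: dict, task_id: str) -> None:
    """Fold one failure finding into the registry (illustrative only).

    `finding` is assumed to carry a proposed error-type name, a
    definition, and a dimension, matching the registry.json fields.
    """
    name = finding["name"]
    if name in registry["error_types"]:
        return  # maps onto an existing type; nothing to add
    # Relaxed gate, stubbed out: in the real pipeline the LLM decides
    # whether the failure has a genuinely different root cause, policy
    # impact, or user harm before a new type is admitted.
    registry["error_types"][name] = {
        "definition": finding["definition"],
        "dimension": finding["dimension"],
        "is_new": True,
        "origin": task_id,
    }
    registry["discovery_log"].append({"task_id": task_id, "name": name})
```

The discovery log is append-only, which is what gives the framework its full audit trail of when and where each type was first seen.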


Contributing

Contributions are welcome! Please open an issue or submit a pull request.


License

Apache License 2.0 - see LICENSE file for details.


Citation

If you use this framework in your research, please cite:

@software{agentfailurediscovery,
  title = {AgentFailureDiscovery: Automated Error Discovery for AI Agents},
  author = {Driouich, Ilias and Cao, Hongliu},
  year = {2026},
  url = {https://github.com/AmadeusITGroup/AgentFailureDiscovery}
}

Acknowledgments

Built with inspiration from AI agent evaluation frameworks and error taxonomy research.
