
AgentFailureDiscovery

Agent Failure Discovery - A framework for analyzing AI agent transcripts to automatically discover error patterns and build comprehensive error taxonomies.


Overview

This framework helps you identify, categorize, and track errors in AI agent behavior by analyzing interaction transcripts. It uses an LLM-driven discovery pipeline that:

  1. Analyzes agent transcripts against policy documents and ground truth data
  2. Discovers new error types when failures don't match existing patterns
  3. Builds error taxonomies dynamically as more transcripts are processed
  4. Tracks error evolution across your agent's development lifecycle

Features

  • Automated Error Discovery: LLM identifies failures and proposes new error types when patterns are meaningfully different
  • Dynamic Taxonomy Building: Error registry grows organically as new failure modes are discovered
  • Policy-Aware Analysis: Evaluates agent behavior against your specific policy documents
  • Flexible Dimensions: Supports both pre-defined and custom error dimensions
  • Transparent Tracking: Full audit trail of when and where each error type was discovered
  • JSON-Based: Simple JSON input/output format for easy integration

Installation

From PyPI

pip install AgentFailureDiscovery

From Source

git clone https://github.com/AmadeusITGroup/AgentFailureDiscovery.git
cd AgentFailureDiscovery
pip install -e .

Quick Start

1. Set up your environment

Create a .env file with your Azure OpenAI credentials:

cp .env.example .env

Edit .env and add your credentials:

AZURE_OPENAI_API_KEY=your-api-key-here
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com
AZURE_OPENAI_API_VERSION=2024-05-01-preview

2. Prepare your data

Create a JSON file with your tasks and agent simulations:

{
  "tasks": [
    {
      "id": "task_001",
      "description": "User requests account balance",
      "expected_behavior": "Agent checks balance using get_balance tool"
    }
  ],
  "simulations": [
    {
      "task_id": "task_001",
      "trial": 0,
      "transcript": {
        "turns": [
          {"role": "user", "content": "What's my balance?"},
          {"role": "assistant", "content": "Your balance is $500", "tool_calls": []}
        ]
      },
      "reward": 0.0
    }
  ]
}

3. Run the analysis

errordiscovery data.json --policy policy.md --output-dir ./results

4. Review results

The framework generates:

  • results/registry.json - Error type taxonomy with all discovered patterns
  • results/run_output.json - Detailed analysis results for each transcript

Usage

Command Line Interface

errordiscovery <json_file> [OPTIONS]

Required Arguments:

  • json_file - Path to JSON file containing tasks and simulations

Optional Arguments:

  • --policy PATH - Path to policy file (markdown or text)
  • --output-dir PATH - Output directory for results (default: ./output)
  • --model NAME - LLM model deployment name (default: gpt-5)
  • --env-file PATH - Path to .env file (default: .env)
  • --version - Show version and exit

Example:

errordiscovery agent_transcripts.json \
    --policy my_policy.md \
    --output-dir ./analysis_results \
    --model gpt-5

Python API

from errordiscovery import run_discovery
from errordiscovery.utils import load_json_data, load_policy_file, save_registry
from dotenv import load_dotenv

# Load environment
load_dotenv()

# Load your data
tasks, simulations = load_json_data("data.json")
policy = load_policy_file("policy.md")

# Run discovery
output = run_discovery(
    policy=policy,
    tasks=tasks,
    simulations=simulations,
    model="gpt-5"
)

# Save results
save_registry(output["registry"], "registry.json")
print(output["registry_summary"])

Input Format

JSON Structure

Your input JSON should contain two main sections:

{
  "tasks": [
    {
      "id": "unique_task_id",
      "description": "task description",
      "expected_behavior": "what should happen",
      "constraints": "any specific constraints",
      ...
    }
  ],
  "simulations": [
    {
      "task_id": "unique_task_id",
      "trial": 0,
      "transcript": {
        "turns": [
          {"role": "user", "content": "user message"},
          {"role": "assistant", "content": "agent response", "tool_calls": []}
        ]
      },
      "reward": 1.0
    }
  ]
}
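Before running the pipeline it can help to sanity-check your file against this structure. The helper below is not part of the package; it is a minimal sketch that verifies only the keys shown above and that every simulation references a declared task:

```python
# Minimal structural check for the input JSON described above.
# Required keys are taken from the documented format; extra fields
# (constraints, metadata, ...) are passed through untouched.
REQUIRED_TASK_KEYS = {"id", "description"}
REQUIRED_SIM_KEYS = {"task_id", "trial", "transcript"}

def validate_input(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    task_ids = set()
    for i, task in enumerate(data.get("tasks", [])):
        missing = REQUIRED_TASK_KEYS - task.keys()
        if missing:
            problems.append(f"tasks[{i}] missing {sorted(missing)}")
        task_ids.add(task.get("id"))
    for i, sim in enumerate(data.get("simulations", [])):
        missing = REQUIRED_SIM_KEYS - sim.keys()
        if missing:
            problems.append(f"simulations[{i}] missing {sorted(missing)}")
        if sim.get("task_id") not in task_ids:
            problems.append(
                f"simulations[{i}] references unknown task_id {sim.get('task_id')!r}"
            )
    return problems
```

Run it on the parsed JSON before calling `errordiscovery` to catch malformed records early.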

Policy File

A plain text or markdown file describing the rules and constraints your agent should follow:

# Agent Policy

## Tool Usage
- Agents must verify account ownership before accessing balance
- All financial queries require authentication

## Communication
- Be concise and factual
- Do not make subjective comments

Output Format

Registry JSON

{
  "error_types": {
    "DATA_HALLUCINATION": {
      "definition": "Stated facts that contradict or are absent from tool output",
      "dimension": "data_faithfulness",
      "is_new": false,
      "origin": null
    },
    "CUSTOM_ERROR_TYPE": {
      "definition": "New error discovered during analysis",
      "dimension": "new_dimension",
      "is_new": true,
      "origin": "task_042"
    }
  },
  "dimensions": {
    "data_faithfulness": {
      "question": "Did the agent accurately report data from tool outputs?",
      "fail_guidance": "Agent stated facts that contradict or don't exist in tool outputs",
      "pass_guidance": "Rounding ($99.99→$100); correct summarization",
      "is_new": false,
      "origin": null
    }
  },
  "discovery_log": [
    {"task_id": "task_042", "name": "CUSTOM_ERROR_TYPE"}
  ]
}
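Because the registry is plain JSON, newly discovered types are easy to pull out with the standard library. A small sketch using the field names from the example above (the helper itself is not part of the package):

```python
def new_error_types(registry: dict) -> list[tuple[str, str]]:
    """Return (name, origin) pairs for every error type flagged is_new."""
    return [
        (name, entry.get("origin"))
        for name, entry in registry.get("error_types", {}).items()
        if entry.get("is_new")
    ]

# Using the registry fragment shown above:
registry = {
    "error_types": {
        "DATA_HALLUCINATION": {"is_new": False, "origin": None},
        "CUSTOM_ERROR_TYPE": {"is_new": True, "origin": "task_042"},
    }
}
print(new_error_types(registry))  # → [('CUSTOM_ERROR_TYPE', 'task_042')]
```

In practice you would `json.load` the generated `results/registry.json` first and feed the resulting dict to the helper.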

Base Error Dimensions

The framework comes with 6 pre-seeded dimensions:

Interaction Dimensions

  • user_intent_adherence - Did the agent honor preferences and correctly read user input?
  • user_question_fulfillment - Did the agent answer the user's direct questions?

Integrity Dimensions

  • policy_violation - Did the agent's actions and decisions follow policy?
  • policy_faithfulness - Did the agent correctly state policy rules?
  • data_faithfulness - Did the agent accurately report data from tool outputs?
  • tool_efficiency - Were the agent's tool calls reasonable given available context?

New dimensions are discovered automatically when errors don't fit existing categories.


Advanced Usage

Custom Dimensions

Restrict analysis to specific dimensions:

from errordiscovery.registry import INTERACTION_DIMS, INTEGRITY_DIMS

output = run_discovery(
    policy=policy,
    tasks=tasks,
    simulations=simulations,
    applicable_dims=INTERACTION_DIMS  # Only use interaction dimensions
)

Ignore Specific Behaviors

Tell the analyzer to ignore known quirks:

output = run_discovery(
    policy=policy,
    tasks=tasks,
    simulations=simulations,
    ignored_behaviors=[
        "concurrent tool calls (agent making more than one tool call in a single message)",
        "emoji usage in responses"
    ]
)

Development

Setup Development Environment

git clone https://github.com/AmadeusITGroup/AgentFailureDiscovery.git
cd AgentFailureDiscovery
pip install -e ".[dev]"

Run Tests

pytest

Code Formatting

black src/
ruff check src/

How It Works

The error discovery pipeline:

  1. Initialize Registry - Start with base error types and dimensions
  2. Analyze Transcripts - For each transcript:
    • LLM identifies all failures
    • Maps failures to existing error types
    • Proposes new types when failures are meaningfully different
    • Records observations for borderline cases
  3. Update Registry - New error types and dimensions are added dynamically
  4. Track Evolution - Discovery log tracks when/where each type was found

The key innovation is the "relaxed gate" for new types: discoveries are made when failures have different root causes, policy impacts, or user harms, not just when they are superficially different.
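The registry-update step of the loop can be sketched as follows. Everything here is illustrative: the real package drives this step with an LLM judgment, and the function name and `finding` shape are invented for the sketch (mirroring the registry.json format shown earlier).

```python
def update_registry(registry: dict, finding: dict, task_id: str) -> None:
    """Fold one failure finding into the registry (illustrative only).

    `finding` is assumed to carry a proposed error-type name, a
    definition, and a dimension, matching the registry.json fields.
    """
    name = finding["name"]
    if name in registry["error_types"]:
        return  # maps onto an existing type; nothing to add
    # Relaxed gate, stubbed out: in the real pipeline the LLM decides
    # whether the failure has a genuinely different root cause, policy
    # impact, or user harm before a new type is admitted.
    registry["error_types"][name] = {
        "definition": finding["definition"],
        "dimension": finding["dimension"],
        "is_new": True,
        "origin": task_id,
    }
    registry["discovery_log"].append({"task_id": task_id, "name": name})
```

The discovery log is append-only, which is what gives the framework its full audit trail of when and where each type was first seen.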


Contributing

Contributions are welcome! Please open an issue or submit a pull request.


License

Apache License 2.0 - see LICENSE file for details.


Citation

If you use this framework in your research, please cite:

@software{agentfailurediscovery,
  title = {AgentFailureDiscovery: Automated Error Discovery for AI Agents},
  author = {Driouich, Ilias and Cao, Hongliu},
  year = {2026},
  url = {https://github.com/AmadeusITGroup/AgentFailureDiscovery}
}

Acknowledgments

Built with inspiration from AI agent evaluation frameworks and error taxonomy research.
