AgentRx – Agentic trajectory analysis pipeline for invariant checking and LLM-as-a-Judge evaluation


AgentRx 🩺

Diagnosing AI Agent Failures from Execution Trajectories

AI agents often fail in ways that are difficult to localize — executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. AgentRx is an automated, domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints (invariants), evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence. An LLM-based judge uses this log to localize the critical step and classify the failure into a grounded 10-category taxonomy.

AgentRx improves step localization and failure attribution over existing baselines across three domains: structured API workflows (Tau-bench), incident management (Flash), and open-ended web/file tasks (Magentic-One).

Raw logs ──▶ Trajectory IR ──▶ Invariants ──▶ Checker ──▶ Judge ──▶ Reports
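The checking stage in this flow can be pictured with a small sketch. It is not AgentRx's implementation: `Step`, `Invariant`, `Violation`, and `check_trajectory` are hypothetical names, and the single invariant is a hand-written predicate rather than a synthesized one. It only illustrates the idea of evaluating constraints step by step and recording each violation with its evidence.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    index: int
    role: str      # assumed labels, e.g. "agent" or "tool"
    content: str


@dataclass
class Invariant:
    name: str
    # Returns None when the step satisfies the invariant, else an evidence string.
    check: Callable[[Step], Optional[str]]


@dataclass
class Violation:
    step_index: int
    invariant: str
    evidence: str


def check_trajectory(steps: List[Step], invariants: List[Invariant]) -> List[Violation]:
    """Evaluate every invariant on every step and collect a validation log."""
    log: List[Violation] = []
    for step in steps:
        for inv in invariants:
            evidence = inv.check(step)
            if evidence is not None:
                log.append(Violation(step.index, inv.name, evidence))
    return log


def no_empty_tool_output(step: Step) -> Optional[str]:
    """Hypothetical invariant: a tool step must return some content."""
    if step.role == "tool" and not step.content.strip():
        return "tool returned an empty result"
    return None


steps = [
    Step(0, "agent", "look up order 42"),
    Step(1, "tool", ""),
    Step(2, "agent", "your order has shipped"),
]
log = check_trajectory(steps, [Invariant("non_empty_tool_output", no_empty_tool_output)])
for v in log:
    print(f"step {v.step_index}: {v.invariant} ({v.evidence})")
```

In the real pipeline the judge, not the checker, decides which step is critical; a log like this is only the evidence it works from.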

Quick Start

# Setup
python -m venv .venv
.venv/Scripts/activate          # Windows; use `source .venv/bin/activate` on Linux/Mac
pip install -e .                # installs agentrx + all dependencies
cp .env.example .env            # Fill in your Azure or TRAPI endpoint details

# Local dev: skip ManagedIdentity IMDS probe
export AZURE_TOKEN_CREDENTIALS=dev  # or add to your .env file

# Run the full pipeline end-to-end
python run.py trajectory.json

# Specify domain explicitly
python run.py trajectory.json --domain tau

You can also install directly from GitHub without cloning:

pip install git+https://github.com/microsoft/AgentRx.git

All outputs are saved to runs/<run_name>/.

Step-by-Step Usage

You can run each stage individually and inspect the results between stages:

# 1. Normalize raw logs into Trajectory IR
python run.py trajectory.json --stage ir --run-name my_run

# 2. Generate static invariants
python run.py trajectory.json --stage static --run-dir runs/my_run

# 3. Generate dynamic (per-step) invariants
python run.py trajectory.json --stage dynamic --run-dir runs/my_run

# 4. Check all invariants against the trajectory
python run.py trajectory.json --stage check --run-dir runs/my_run

# 5. Run LLM judge for root-cause classification
python run.py trajectory.json --stage judge --run-dir runs/my_run

# 6. Generate report plots
python run.py trajectory.json --stage report --run-dir runs/my_run
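If you prefer to drive the six stages from a script, the commands above can be assembled in Python. This is a sketch under the assumption that `run.py` is in the current directory and that `--run-name my_run` creates `runs/my_run`; `build_commands` and `run_all` are hypothetical helper names, not part of the package.

```python
import subprocess
import sys
from typing import List

# Stage names as used with --stage in the commands above.
STAGES = ["ir", "static", "dynamic", "check", "judge", "report"]


def build_commands(trajectory: str, run_name: str) -> List[List[str]]:
    """Return one command per stage, mirroring the commands listed above."""
    commands = []
    for stage in STAGES:
        cmd = [sys.executable, "run.py", trajectory, "--stage", stage]
        if stage == "ir":
            # The first stage names the run.
            cmd += ["--run-name", run_name]
        else:
            # Later stages reuse the run directory (assumed to be runs/<run_name>).
            cmd += ["--run-dir", f"runs/{run_name}"]
        commands.append(cmd)
    return commands


def run_all(trajectory: str, run_name: str) -> None:
    """Run the stages in order and stop at the first one that fails."""
    for cmd in build_commands(trajectory, run_name):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_all("trajectory.json", "my_run")` runs the stages back to back; because `check=True` raises on a non-zero exit code, a failed stage leaves the earlier outputs in place for inspection.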

Pipeline Stages

#	Stage	Output
1	IR — Normalize raw logs into canonical Trajectory IR	`trajectory_ir.json`
2	Static — Generate policy/tool/structure invariants	`static_invariants.json`
3	Dynamic — Generate per-step context-aware invariants	`dynamic_invariants/`
4	Check — Evaluate invariants, record violations	`checker_results/`
5	Judge — LLM classifies root-cause failure (10-category taxonomy)	`judge_output/`
6	Report — Failure frequency plots	`plots/`
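When resuming a partially completed run, it can help to see which stage outputs already exist. The sketch below assumes the artifacts in the table sit directly under the run directory, which this README does not state explicitly; `completed_stages` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict

# Stage -> expected artifact name, taken from the table above.
STAGE_OUTPUTS = {
    "ir": "trajectory_ir.json",
    "static": "static_invariants.json",
    "dynamic": "dynamic_invariants",
    "check": "checker_results",
    "judge": "judge_output",
    "report": "plots",
}


def completed_stages(run_dir: str) -> Dict[str, bool]:
    """Report which stage artifacts are present in a run directory."""
    root = Path(run_dir)
    return {stage: (root / name).exists() for stage, name in STAGE_OUTPUTS.items()}
```

For example, `completed_stages("runs/my_run")` returns a mapping such as `{"ir": True, "static": True, ...}`, which shows the first stage that still needs to run.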

Directory Structure

AgentRx/
├── run.py                       # CLI entry point (backward-compatible)
├── pyproject.toml               # Package configuration (pip install -e .)
├── requirements.txt             # Python dependencies
├── agentrx/                     # Main package
│   ├── cli.py                   # Console script entry point
│   ├── ir/                      # Trajectory IR normalization
│   ├── invariants/              # Invariant generation & checking
│   ├── judge/                   # LLM-as-a-Judge evaluation
│   ├── llm_clients/             # Azure OpenAI & TRAPI clients
│   ├── pipeline/                # Config (globals.py), utilities
│   └── reports/                 # Analysis & visualization
├── data/                        # Domain policies, tool schemas, ground truth
├── trajectories/                # Sample trajectories (tau, magentic, test)
└── runs/                        # Pipeline outputs (one folder per run)

Supported Domains

Domain	Flag	Description
tau	`--domain tau`	Tau-bench retail customer service
magentic	`--domain magentic`	Magentic-One multi-agent
flash	`--domain flash`	Flash/orchestrator incident traces
(auto)	(default)	Auto-detected; unknown formats use LLM-based IR fallback

Configuration

LLM settings are loaded from environment variables, set either in a .env file or in the shell. To use a .env file, copy the template and fill in your values:

cp .env.example .env

# Azure OpenAI (default endpoint)
AGENT_VERIFY_ENDPOINT=                # e.g., "https://my-resource.openai.azure.com/"
AGENT_VERIFY_DEPLOYMENT=              # e.g., "gpt-5"

# TRAPI (Microsoft Research internal, use --endpoint trapi)
AGENT_VERIFY_TRAPI_INSTANCE=          # e.g., "my-instance/my-pool"
AGENT_VERIFY_TRAPI_DEPLOYMENT_NAME=   # e.g., "my-deployment-name"
SCOPE=                                # Azure AD scope for TRAPI

Both endpoints use Azure AD token-based auth (az login or Managed Identity).

Note: TRAPI is a Microsoft Research internal endpoint. External teams should use --endpoint azure (default).
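A quick pre-flight check can catch an incomplete configuration before the pipeline makes its first LLM call. The sketch below assumes that each endpoint needs exactly the variables grouped under it in the template above, which is an inference from the template rather than documented behavior; `REQUIRED` and `missing_settings` are hypothetical names. It reads the process environment only, so values kept in .env must already be exported or loaded.

```python
import os
from typing import List, Mapping

# Variables per endpoint, grouped as in the template above (assumed to be required).
REQUIRED = {
    "azure": ["AGENT_VERIFY_ENDPOINT", "AGENT_VERIFY_DEPLOYMENT"],
    "trapi": [
        "AGENT_VERIFY_TRAPI_INSTANCE",
        "AGENT_VERIFY_TRAPI_DEPLOYMENT_NAME",
        "SCOPE",
    ],
}


def missing_settings(endpoint: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the variables for `endpoint` that are unset or empty."""
    return [name for name in REQUIRED[endpoint] if not env.get(name, "").strip()]
```

`missing_settings("azure")` returns an empty list when both Azure OpenAI variables are set; otherwise it names the ones to fill in.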

Failure Taxonomy

#	Category	Description
1	Instruction/Plan Adherence Failure	Skips steps or adds unnecessary actions
2	Invention of New Information	Fabricates or omits ungrounded facts
3	Invalid Invocation	Malformed tool call (wrong args/types/schema)
4	Misinterpretation of Tool Output	Incorrect reasoning about tool results
5	Intent-Plan Misalignment	Pursues wrong objective
6	Underspecified User Intent	Missing information to proceed
7	Intent Not Supported	Action can't be performed with available tools
8	Guardrails Triggered	Blocked by safety/RAI/access policies
9	System Failure	Infra errors (timeouts, unreachable endpoints)
10	Inconclusive	Insufficient evidence to classify
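For downstream analysis it can be convenient to mirror the taxonomy in code. The enum below copies the numbering and names from the table; the identifiers themselves and the `tally` helper are hypothetical, and the format of the judge's actual output is not described here, so the input is assumed to be a plain list of category numbers.

```python
from collections import Counter
from enum import IntEnum
from typing import Dict, Iterable


class FailureCategory(IntEnum):
    INSTRUCTION_PLAN_ADHERENCE_FAILURE = 1
    INVENTION_OF_NEW_INFORMATION = 2
    INVALID_INVOCATION = 3
    MISINTERPRETATION_OF_TOOL_OUTPUT = 4
    INTENT_PLAN_MISALIGNMENT = 5
    UNDERSPECIFIED_USER_INTENT = 6
    INTENT_NOT_SUPPORTED = 7
    GUARDRAILS_TRIGGERED = 8
    SYSTEM_FAILURE = 9
    INCONCLUSIVE = 10


def tally(category_ids: Iterable[int]) -> Dict[str, int]:
    """Count how often each category occurs in a list of category numbers."""
    counts = Counter(FailureCategory(i) for i in category_ids)
    return {category.name: n for category, n in counts.items()}


print(tally([3, 3, 9]))
```

A number outside 1 to 10 raises `ValueError`, which surfaces malformed labels early instead of silently counting them.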

Running Individual Modules

Each module can also be run standalone:

Static Invariant Generator — generate policy/tool invariants:

python agentrx/invariants/static_invariant_generator.py --input-path trajectory.json --domain tau

Dynamic Invariant Generator — generate per-step context-aware invariants:

python agentrx/invariants/dynamic_invariant_generator.py --input-path trajectory.json --domain tau --mode stepbystep

Checker — evaluate invariants against a trajectory:

python agentrx/invariants/checker.py --input-path trajectory.json --static-invariants static_inv.json --dynamic-invariants-dir dyn_inv/

Judge — run LLM-as-a-Judge classification:

python agentrx/judge/judge.py --domain tau --log_file trajectory.json --mode combined

Third-Party Code

This project uses the following third-party open source packages (installed via requirements.txt):

openai — OpenAI Python client (MIT License)
azure-identity / azure-core — Azure SDK authentication (MIT License)
matplotlib — Plotting and visualization (PSF-based License)
tiktoken — Token counting (MIT License)
httpx — HTTP client (BSD License)

See requirements.txt for the full list of dependencies.

Troubleshooting

`DefaultAzureCredential` timeout on local machines

The Azure SDK's DefaultAzureCredential tries ManagedIdentityCredential before AzureCliCredential. On a local dev machine this probes the IMDS endpoint which doesn't exist locally, causing a ~5-10s timeout before falling back. This is expected behavior — the probe is how DefaultAzureCredential detects the hosting environment.

Fix: Set the AZURE_TOKEN_CREDENTIALS environment variable to dev to exclude deployed-service credentials (e.g. ManagedIdentityCredential, WorkloadIdentityCredential) from the chain, so DefaultAzureCredential skips straight to developer-tool credentials like AzureCliCredential:

# PowerShell
$env:AZURE_TOKEN_CREDENTIALS = "dev"

# Bash / Linux / macOS
export AZURE_TOKEN_CREDENTIALS=dev

Or add AZURE_TOKEN_CREDENTIALS=dev to your .env file.

Requires azure-identity >= 1.23.0. See "Exclude a credential type category" in the Azure Identity documentation for details.
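The same setting can be applied from Python, for example at the top of a notebook or a test script. This is a minimal sketch, assuming the variable is set before any `DefaultAzureCredential` is constructed; the credential lines are left as comments so the snippet runs without azure-identity installed.

```python
import os

# Keep an existing value if the shell or .env already provides one;
# otherwise restrict the chain to developer-tool credentials.
os.environ.setdefault("AZURE_TOKEN_CREDENTIALS", "dev")

# Construct the credential only after the variable is set
# (requires azure-identity >= 1.23.0):
# from azure.identity import DefaultAzureCredential
# credential = DefaultAzureCredential()

print("AZURE_TOKEN_CREDENTIALS =", os.environ["AZURE_TOKEN_CREDENTIALS"])
```

Using `setdefault` rather than plain assignment means a deployed environment that sets a different value is not overridden by this snippet.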

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

Please see CONTRIBUTING.md for detailed guidelines.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

License

Licensed under the MIT license.

Citation

If you use AgentRx, please cite:

@article{barke2026agentrx,
  title={AgentRx: Diagnosing AI Agent Failures from Execution Trajectories},
  author={Barke, Shraddha and Goyal, Arnav and Khare, Alind and Singh, Avaljot and Nath, Suman and Bansal, Chetan},
  journal={arXiv preprint arXiv:2602.02475},
  year={2026}
}

Project details


This version

0.1.0

Apr 28, 2026

Download files


Source Distribution

agentrx-0.1.0.tar.gz (306.9 kB)

Uploaded Apr 28, 2026 Source

Built Distribution


agentrx-0.1.0-py3-none-any.whl (330.4 kB)

Uploaded Apr 28, 2026 Python 3

File details

Details for the file agentrx-0.1.0.tar.gz.

File metadata

Download URL: agentrx-0.1.0.tar.gz
Upload date: Apr 28, 2026
Size: 306.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for agentrx-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e2e45f17d9a65f0eda4a71eb551e26dd0ab8615d46dec742ec438e074cf7e8ae`
MD5	`d2a733b1f02fdbe80d81be5592865a44`
BLAKE2b-256	`405150ce56aacbb6a637f6552cffc8fdb3f708e23fa37ca8dcec15a5155a7318`


File details

Details for the file agentrx-0.1.0-py3-none-any.whl.

File metadata

Download URL: agentrx-0.1.0-py3-none-any.whl
Upload date: Apr 28, 2026
Size: 330.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for agentrx-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`833b24caf3d540340a05e4880d6a98619fe4433a08906f00b0e639310bca5667`
MD5	`4aaf919883cbf5611a37a81f97c72f7e`
BLAKE2b-256	`ebd13083b29b9433195908fc59405726d992738afdbecc77fe20d369b73b4bc5`

