ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents

ARC-Eval: Debug, Evaluate, and Improve AI Agents

CLI-native, framework-agnostic evaluation for AI agents. Test real-world reliability, compliance, and risk, and improve performance with every run.

ARC-Eval tests any agent—regardless of framework—against 378 enterprise-grade scenarios in finance, security, and machine learning. Instantly spot risks like data leaks, bias, or compliance gaps. With four simple CLI workflows, ARC-Eval delivers actionable insights, continuous improvement, and audit-ready reports—no code changes required.

Because ARC-Eval is agent-agnostic, you can bring your own agent (BYOA) built on any framework (LangChain, OpenAI, Google, Agno, etc.) and get results with minimal setup.

Quick Start
Core Workflows
Key Features & Dashboard
Flexible Input & Auto-Detection
Scenario Libraries & Regulations
How It Works
Examples & Integrations
Advanced Usage
Tips & Troubleshooting
License & Links

⚡ Quick Start (2 minutes)

💡 Pro Tip: Use --quick-start for an instant, no-setup demo. See Flexible Input & Auto-Detection for all ingestion options.

# 1. Install ARC-Eval (Python 3.9+ required)
pip install arc-eval

# 2. Try it instantly with sample data (no local files needed!)
arc-eval compliance --domain finance --quick-start

# 3. See all available commands and options
arc-eval --help

Next Steps with Your Agent

⚠️ Important: Agent-as-a-judge evaluation needs an API key for your LLM provider. It is optional, but highly recommended for deeper insights.
See Flexible Input & Auto-Detection for details.

# Set the key for agent-as-a-judge evaluation
export ANTHROPIC_API_KEY="your-anthropic-api-key" # Or your preferred LLM provider API key

# For any agent framework - just point to your output file
arc-eval debug --input your_agent_trace.json
arc-eval compliance --domain security --input your_agent_outputs.json
arc-eval improve --from-evaluation latest

# Or get the complete picture in one command
arc-eval analyze --input your_agent_outputs.json --domain finance

# Get guided help and explore workflows anytime from the interactive menu
arc-eval

Core Workflows

See Scenario Libraries & Regulations for full coverage details.
See Flexible Input & Auto-Detection for all ingestion options.

Debug: "Why is my agent failing?"

arc-eval debug --input your_agent_trace.json

Compliance: "Does my agent meet requirements?"

arc-eval compliance --domain finance --input your_agent_outputs.json

# Or try it instantly with sample data (no local files or API keys needed!)
arc-eval compliance --domain security --quick-start

Improve: "How do I make it better?"

arc-eval improve --from-evaluation latest # Uses insights from your last evaluation

Analyze: "Give me the complete picture"

arc-eval analyze --input your_agent_outputs.json --domain finance

This command automatically runs the debug process, then the compliance checks, and finally presents the interactive menu for next steps.

Key Features & Dashboard

Agent-as-a-Judge: ARC-Eval uses LLMs as domain-specific judges to evaluate agent outputs, provide continuous feedback, and drive improvement. This is implemented in agent_eval/evaluation/judges/, with feedback and retraining handled by agent_eval/analysis/self_improvement.py and adaptive scenario generation in agent_eval/core/scenario_bank.py.

Interactive Menus

After each workflow (such as debug or compliance), an interactive menu guides you to logical next steps, making it easy to navigate the platform's capabilities.

🔍 What would you like to do?
════════════════════════════════════════

  [1]  Run compliance check on these outputs      (Recommended)
  [2]  Ask questions about failures               (Interactive Mode)
  [3]  Export debug report                        (PDF/CSV/JSON)
  [4]  View learning dashboard & submit patterns  (Improve ARC-Eval)

Learning Dashboard

The system tracks failure patterns and improvements over time, providing valuable insights:

Pattern Library: Captures and catalogs recurring failure patterns from your agent's runs.
Fix Catalog: Suggests specific code fixes or configuration changes for common, identified issues.
Performance Delta: Clearly shows improvement metrics (e.g., compliance pass rate increasing from 73% → 91% after applying fixes).
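
To make the Performance Delta figure concrete, here is a minimal sketch of the arithmetic behind a change such as 73% → 91%. The function names and the counts (73 and 91 passes out of 100 scenarios) are hypothetical and chosen only to reproduce the example percentages; this is not ARC-Eval's own implementation.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def delta_points(before: float, after: float) -> float:
    """Return the change in pass rate, in percentage points."""
    return after - before


# Hypothetical counts: 73 of 100 scenarios passed before fixes, 91 after.
before = pass_rate(73, 100)
after = pass_rate(91, 100)
delta = delta_points(before, after)

print(f"{before:.0f}% -> {after:.0f}% ({delta:+.0f} points)")
```

Reporting the change in percentage points avoids confusion with relative improvement, which would be a different number.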

Versatile Export Options

Easily share or archive your findings:

PDF: Professional, auditable reports ideal for compliance teams and stakeholder reviews.
CSV: Raw data suitable for spreadsheet analysis or custom charting.
JSON: Structured data perfect for integration with monitoring systems, CI/CD pipelines, or other internal tools.

Flexible Input & Auto-Detection

ARC-Eval is designed to seamlessly fit into your existing workflows with flexible input methods and intelligent format detection.

Multiple Ways to Provide Agent Outputs

You can feed your agent's traces and outputs to ARC-Eval using several convenient methods:

# 1. Direct file input (most common)
arc-eval compliance --domain finance --input your_agent_traces.json

# 2. Auto-scan current directory for JSON files
# Ideal when you have multiple trace files in a folder.
arc-eval compliance --domain finance --folder-scan

# 3. Paste traces directly from your clipboard
# Useful for quick, one-off evaluations. (Requires pyperclip: pip install pyperclip)
arc-eval compliance --domain finance --input clipboard

# 4. Instant demo with built-in sample data (no files needed!)
arc-eval compliance --domain finance --quick-start

# 5. For automation/CI-CD (skips interactive prompts)
arc-eval compliance --domain finance --input your_agent_traces.json --no-interactive
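
As a rough picture of what scanning a folder for JSON trace files involves, the sketch below collects every parseable *.json file in a directory. It is an illustration only: the helper name scan_folder is hypothetical, and ARC-Eval's actual --folder-scan behavior (which files it picks up, how it handles invalid ones) may differ.

```python
import json
import tempfile
from pathlib import Path


def scan_folder(folder):
    """Load every parseable *.json file in `folder`, keyed by file name.

    Hypothetical helper; files that are not valid JSON are skipped.
    """
    loaded = {}
    for path in sorted(Path(folder).glob("*.json")):
        try:
            loaded[path.name] = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # Skip files that only look like JSON
    return loaded


# Demonstrate on a throwaway directory with one valid trace file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run_a.json").write_text('{"output": "ok"}', encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("not a trace", encoding="utf-8")
    (Path(tmp) / "broken.json").write_text("{not json", encoding="utf-8")
    found = scan_folder(tmp)

print(sorted(found))
```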

Performance Optimization

For faster evaluation, include scenario_id in your agent outputs to limit evaluation to specific scenarios:

{
  "output": "Transaction approved after KYC verification",
  "scenario_id": "fin_001"
}

Performance Tips:

✅ Include scenario_id: Limits evaluation to specific scenarios (10x faster)
✅ Use --no-interactive: Essential for automation and CI/CD
✅ Use --quick-start: Instant demo with built-in sample data
✅ Batch processing: Automatically enabled for 5+ scenarios (50% cost savings)
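
The sketch below shows one way to emit records in the simple generic format with an optional scenario_id. The helper names are hypothetical, and it assumes the input file may hold a JSON array of such records; check the documentation for the exact file layout the CLI expects.

```python
import json
import tempfile
from pathlib import Path


def to_record(output, scenario_id=None, error=None, metadata=None):
    """Build one record in the generic format (hypothetical helper)."""
    record = {"output": output}
    if scenario_id is not None:
        record["scenario_id"] = scenario_id  # Limits evaluation to this scenario
    if error is not None:
        record["error"] = error
    if metadata:
        record["metadata"] = metadata
    return record


def write_records(records, path):
    """Write the records to `path` as a JSON array and return the path."""
    path = Path(path)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return path


records = [
    to_record("Transaction approved after KYC verification", scenario_id="fin_001"),
    to_record("Transfer blocked pending review", metadata={"user_id": "user123"}),
]
out_path = write_records(records, Path(tempfile.gettempdir()) / "your_agent_outputs.json")
print(out_path.name)
```

The resulting file can then be passed to the CLI with --input.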

Automatic Format Detection

No need to reformat your agent logs. ARC-Eval automatically detects and parses outputs from many common agent frameworks and LLM API responses. Just point ARC-Eval to your data, and it will handle the rest.

Examples of auto-detected formats:

// Simple, generic format (works with any custom agent)
{
  "output": "Transaction approved for account X9876.",
  "scenario_id": "fin_001", // Optional: for faster evaluation (limits to specific scenarios)
  "error": null, // Optional: include if an error occurred
  "metadata": {"user_id": "user123", "timestamp": "2024-05-27T10:30:00Z"} // Optional metadata
}

// OpenAI / Anthropic API style logs (and similar LLM provider formats)
{
  "id": "msg_abc123",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "tool_calls": [ // If your agent uses tools
          {"id": "call_def456", "type": "function", "function": {"name": "get_capital_city", "arguments": "{\"country\": \"France\"}"}}
        ]
      }
    }
  ],
  "usage": {"prompt_tokens": 50, "completion_tokens": 10}
}

// LangChain / CrewAI / LangGraph style traces (capturing intermediate steps)
{
  "input": "What is the weather in London?",
  "intermediate_steps": [
    [
      {
        "tool": "weather_api",
        "tool_input": "London",
        "log": "Invoking weather_api with London\n"
      },
      "Rainy, 10°C"
    ]
  ],
  "output": "The weather in London is Rainy, 10°C.",
  "metadata": {"run_id": "run_789"}
}

ARC-Eval intelligently extracts the core agent response, tool calls, and relevant metadata for evaluation. For adding custom parsers, see agent_eval/core/parser_registry.py.
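
To illustrate the general idea of format detection, here is a minimal key-based heuristic over the three example shapes above. It is an assumption-laden sketch, not the logic in agent_eval/core/parser_registry.py; the function names and format labels are hypothetical.

```python
def detect_format(record: dict) -> str:
    """Guess the record's shape from its top-level keys (hypothetical labels)."""
    if isinstance(record.get("choices"), list):
        return "llm_api_style"
    # Checked before "output" because trace records carry both keys.
    if "intermediate_steps" in record:
        return "trace_style"
    if "output" in record:
        return "generic"
    return "unknown"


def extract_output(record: dict):
    """Return the core agent response text, or None if it cannot be found."""
    fmt = detect_format(record)
    if fmt == "llm_api_style":
        choices = record["choices"]
        if not choices:
            return None
        return choices[0].get("message", {}).get("content")
    if fmt in ("trace_style", "generic"):
        return record.get("output")
    return None


generic = {"output": "Transaction approved for account X9876."}
api_log = {"choices": [{"message": {"role": "assistant", "content": "The capital of France is Paris."}}]}
trace = {"input": "What is the weather in London?", "intermediate_steps": [], "output": "Rainy, 10°C."}

for sample in (generic, api_log, trace):
    print(detect_format(sample), "->", extract_output(sample))
```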

Scenario Libraries & Regulations


Comprehensive Enterprise Test Suite

ARC-Eval provides 378 enterprise scenarios across three critical domains:

Domain	Scenarios	Key Regulations	Use Cases
Finance	110 scenarios	SOX, KYC/AML, PCI-DSS, GDPR, EU AI Act	Financial reporting, fraud detection, loan processing
Security	120 scenarios	OWASP LLM Top 10, NIST AI RMF, ISO 27001	Prompt injection, data leakage, model theft
ML/AI	148 scenarios	EU AI Act, IEEE P7000, Model Cards	Bias detection, explainability, model governance

Real-World Enterprise Scenarios

SOX Compliance: Detect earnings manipulation in SEC filings
AML/KYC: Identify synthetic identities and money laundering patterns
EU AI Act: Assess high-risk AI system compliance and bias detection
OWASP LLM: Test for prompt injection, training data poisoning, model theft
Data Privacy: GDPR, CCPA compliance validation and PII detection

Why This Matters: While many evaluation platforms focus on "helpfulness" and "harmlessness," ARC-Eval specializes in regulatory compliance and enterprise risk scenarios that can result in significant fines or regulatory action.

How It Works

graph LR
    A[Agent Output] --> B[Debug]
    B --> C[Compliance]
    C --> D[Dashboard & Report]
    D --> E[Improve]
    E --> F[Re-evaluate]
    F --> B

The Arc Loop: ARC-Eval learns from every failure to build smarter, more reliable agents.

Debug: You provide your agent's output (trace). ARC-Eval finds what's broken (errors, inefficiencies) and can suggest running a compliance check for deeper analysis.
Compliance: Your agent is measured against hundreds of real-world scenarios. The results populate the Learning Dashboard and generate a compliance report.
Dashboard & Report: Track your agent's learning progress, identify recurring patterns, and see compliance gaps. This insight guides the improvement plan.
Improve: Based on the findings, ARC-Eval generates prioritized fixes and an improvement plan, then prompts re-evaluation.
Re-evaluate: Test the suggested improvements. The new performance data feeds back into the loop, enabling continuous refinement.
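
One pass through the loop can be sketched as a sequence of the CLI commands shown earlier in this README. The helper below only builds and prints the commands rather than running them; the function name is hypothetical, and adding --no-interactive to the compliance step is an assumption suited to scripted use.

```python
import shlex


def arc_loop_commands(trace_path: str, domain: str) -> list:
    """Return the CLI commands for one debug -> compliance -> improve pass."""
    return [
        ["arc-eval", "debug", "--input", trace_path],
        ["arc-eval", "compliance", "--domain", domain,
         "--input", trace_path, "--no-interactive"],
        ["arc-eval", "improve", "--from-evaluation", "latest"],
    ]


commands = arc_loop_commands("your_agent_outputs.json", "finance")
for cmd in commands:
    # shlex.join quotes arguments safely for display or for pasting into a shell
    print(shlex.join(cmd))
```

After applying the suggested fixes to your agent, regenerate its outputs and run the same sequence again to re-evaluate.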

📖 Complete Implementation Guide: See Core Product Loops Documentation for detailed step-by-step instructions on implementing The Arc Loop and Data Flywheel in your development workflow.

🔄 ARC-Eval Data Flywheel: Complete End-to-End Flow

📊 Core Architecture Overview

  Static Domain Knowledge → Dynamic Learning → Performance Analysis → Adaptive Improvement
          ↓                      ↓                    ↓                      ↓
     finance.yaml          ScenarioBank      SelfImprovementEngine    FlywheelExperiment
     (110 scenarios)    (pattern learning)   (performance tracking)    (ACL curriculum)
          ↑                      ↑                    ↑                      ↑
          └──────────────── Continuous Feedback Loop ────────────────────────┘

Examples & Integrations


📚 Complete Documentation: See docs/ for comprehensive guides including:

🔄 Core Product Loops - The Arc Loop & Data Flywheel (Essential!)
Quick Start Guide - Get running in 5 minutes
Workflows Guide - Debug, compliance, and improvement workflows
Prediction System - Hybrid reliability prediction framework
Framework Integration - Support for 10+ agent frameworks
API Reference - Complete Python SDK documentation
Testing Guide - Comprehensive testing methodology and validation
Troubleshooting Guide - Common issues, solutions, and optimization
Enterprise Integration - CI/CD pipeline integration

🔧 Practical Examples: Explore examples/ for:

Framework-specific integration examples
CI/CD pipeline templates
Sample agent outputs and configurations
Prediction testing and validation

Advanced Usage


Python SDK for Programmatic Evaluation

from agent_eval.core import EvaluationEngine, AgentOutput
from agent_eval.core.types import EvaluationResult, EvaluationSummary

# Example agent outputs (replace with your actual agent data)
agent_data = [
    {"output": "The transaction is approved.", "metadata": {"scenario": "finance_scenario_1"}},
    {"output": "Access denied due to security policy.", "metadata": {"scenario": "security_scenario_3"}}
]
agent_outputs = [AgentOutput.from_raw(data) for data in agent_data]

# Initialize the evaluation engine for a specific domain
engine = EvaluationEngine(domain="finance")

# Run evaluation
# You can optionally pass specific scenarios if needed, otherwise it uses the domain's default pack
results: list[EvaluationResult] = engine.evaluate(agent_outputs=agent_outputs)

# Get a summary of the results
summary: EvaluationSummary = engine.get_summary(results)

print(f"Total Scenarios: {summary.total_scenarios}")
print(f"Passed: {summary.passed}")
print(f"Failed: {summary.failed}")
print(f"Pass Rate: {summary.pass_rate:.2f}%")

for result in results:
    if not result.passed:
        print(f"Failed Scenario: {result.scenario_name}, Reason: {result.failure_reason}")

See the GitHub Actions workflow example.

Tips & Troubleshooting

Pro Tip: Use --quick-start for instant demo evaluation with sample data.

Note: ARC-Eval auto-detects many common agent output formats—no need to reformat your logs.

Warning: For agent-as-a-judge evaluation, you must set your API key (see above for details).

License & Links

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

0.2.9: Jun 4, 2025 (this version)
0.2.8: Jun 1, 2025
0.2.7: May 30, 2025
0.2.6: May 29, 2025
0.2.5: May 28, 2025
0.2.4: May 28, 2025
0.2.3: May 27, 2025
0.2.2: May 27, 2025
0.2.1: May 26, 2025
0.2.0: May 25, 2025
0.1.0: May 24, 2025

