ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents

ARC-Eval: Agentic Workflow Reliability Platform

ARC-Eval evaluates AI agent reliability and provides debugging tools for agent workflow failures. Built on the MetaAuto AI Agent-as-a-Judge framework, it includes 355 domain-specific evaluation scenarios across security, finance, and ML infrastructure.

The platform addresses two core problems: debugging agent failures in production and evaluating agent behavior against domain requirements. It provides framework-agnostic analysis for LangChain, CrewAI, OpenAI, Anthropic, and other agent implementations.

Quick Start

pip install arc-eval
export ANTHROPIC_API_KEY="your-key-here"

# Debug agent failures
arc-eval --debug-agent --input agent_outputs.json

# Run domain evaluations
arc-eval --domain finance --input outputs.json --agent-judge

# Try demo
arc-eval --quick-start --domain security

Architecture

ARC-Eval implements Agent-as-a-Judge evaluation (Zhuge et al., 2024) with production extensions for reliability analysis and continuous improvement.

graph TB
    A[Agent Outputs<br/>JSON/Logs] --> B[Framework Detection]
    B --> C[Domain Evaluation<br/>355 Scenarios]
    C --> D[Agent Judge<br/>Claude Sonnet/Haiku]
    D --> E[Reliability Analysis]
    E --> F[Improvement Planning]
    F --> G[Performance Tracking]
    
    B --> B1[LangChain Parser]
    B --> B2[OpenAI Parser]
    B --> B3[CrewAI Parser]
    B --> B4[Custom Parser]
    
    C --> C1[Finance<br/>110 scenarios]
    C --> C2[Security<br/>120 scenarios]
    C --> C3[ML Safety<br/>125 scenarios]
    
    D --> D1[Multi-Judge<br/>Verification]
    D --> D2[Confidence<br/>Calibration]
    D --> D3[Interactive<br/>Analysis]
    
    E --> E1[Debug Reports]
    E --> E2[Compliance Audit]
    E --> E3[Performance Metrics]
    
    F --> F1[Curriculum Generation]
    F --> F2[Training Examples]
    F --> F3[Trend Analysis]
    
    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style E fill:#e8f5e8
    style F fill:#fff3e0

The system includes six core components that work together to provide comprehensive agent evaluation and improvement.

Agent Judge Framework provides domain-specific evaluation using Claude Sonnet for accuracy and Haiku for cost optimization. Each domain judge specializes in finance, security, or ML scenarios with deep knowledge of relevant compliance frameworks and failure patterns.

Multi-Judge Verification implements secondary judge validation for critical decisions through agreement scoring and consensus mechanisms. This reduces false positives and ensures high-confidence evaluation results for production systems.
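
As a rough illustration of agreement scoring between a primary and a secondary judge, the sketch below compares two verdicts on the same scenario. The `Verdict` type, the scoring rule, and the 0.8 threshold are assumptions made for this example; they are not ARC-Eval's actual data model or algorithm.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical judge verdict; not ARC-Eval's actual result type."""
    scenario_id: str
    passed: bool
    confidence: float  # judge-reported confidence in [0, 1]


def agreement_score(primary: Verdict, secondary: Verdict) -> float:
    """Return 0 when the verdicts differ, else 1 minus the confidence gap."""
    if primary.passed != secondary.passed:
        return 0.0
    return 1.0 - abs(primary.confidence - secondary.confidence)


def needs_review(primary: Verdict, secondary: Verdict, threshold: float = 0.8) -> bool:
    """Flag a scenario for human review when the judges do not agree strongly."""
    return agreement_score(primary, secondary) < threshold


primary = Verdict("fin_001", passed=False, confidence=0.9)
secondary = Verdict("fin_001", passed=False, confidence=0.8)
dissenting = Verdict("fin_001", passed=True, confidence=0.9)

score = agreement_score(primary, secondary)
print(f"agreement={score:.2f}, review={needs_review(primary, secondary)}")
print(f"disagreement review={needs_review(primary, dissenting)}")
```

Treating any pass/fail disagreement as a score of zero is a deliberately conservative choice: it routes every split decision to review rather than averaging it away.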

Interactive Analysis enables AI-powered Q&A sessions using Claude Sonnet 4 for understanding evaluation failures and patterns. Developers can ask domain-specific questions about failures and receive contextual debugging guidance.

Confidence Calibration performs logprobs-based uncertainty quantification with decision token analysis and entropy calculations. This provides statistical confidence intervals for evaluation results and identifies areas of uncertainty.
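
To make the entropy idea concrete, here is a minimal sketch that turns the log-probabilities of a judge's candidate decision tokens into an entropy value and a normalized confidence. The function names, the token labels, and the mapping from entropy to confidence are assumptions for illustration only and may differ from what ConfidenceCalibrator does internally.

```python
import math


def decision_entropy(logprobs: dict[str, float]) -> float:
    """Shannon entropy in bits over the candidate decision tokens.

    `logprobs` maps each candidate token (for example "PASS" or "FAIL") to
    its natural-log probability. Values are renormalized to sum to 1,
    because often only the top few candidates are available.
    """
    probs = [math.exp(lp) for lp in logprobs.values()]
    total = sum(probs)
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log2(q)
    return entropy


def entropy_confidence(logprobs: dict[str, float]) -> float:
    """Map entropy to [0, 1]: 1 means one token dominates, 0 means uniform."""
    if len(logprobs) < 2:
        return 1.0
    max_entropy = math.log2(len(logprobs))
    return 1.0 - decision_entropy(logprobs) / max_entropy


confident = {"PASS": math.log(0.98), "FAIL": math.log(0.02)}
uncertain = {"PASS": math.log(0.5), "FAIL": math.log(0.5)}
print(f"confident: {entropy_confidence(confident):.3f}")
print(f"uncertain: {entropy_confidence(uncertain):.3f}")
```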

Self-Improvement Engine generates automated retraining curriculum from failed evaluations and tracks performance trends across iterations. This enables continuous learning where agents improve from their own evaluation failures.

Performance Tracking monitors runtime, memory, and cost analysis with latency percentile calculations and efficiency benchmarking. This provides operational insights for production deployment optimization.
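
Latency percentiles can be computed in several ways; the sketch below uses the nearest-rank method on a list of per-call latencies. The function name and the sample data are hypothetical, and PerformanceTracker may use a different percentile definition.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds, including one slow outlier.
latencies_ms = [120.0, 95.0, 310.0, 101.0, 99.0, 150.0, 2200.0, 105.0, 98.0, 130.0]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)
```

With only ten samples, the high percentiles collapse onto the single outlier, which is why tail latencies are usually reported over much larger windows.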

Evaluation Domains

ARC-Eval includes three evaluation packs covering 355 scenarios designed for production agent systems across different domains.

Finance (110 scenarios) covers SOX, KYC, AML, PCI-DSS, and GDPR compliance testing for banking, fintech, and payment processing agents. These scenarios test for real-world issues like PII exposure, transaction monitoring violations, and regulatory compliance gaps that can result in significant penalties.

Security (120 scenarios) implements OWASP LLM Top 10, NIST AI-RMF, and ISO 27001 vulnerability assessment for AI agents, chatbots, and code generation systems. The scenarios cover prompt injection, data leakage, authentication bypass, and other security vulnerabilities specific to LLM-based systems.

ML (125 scenarios) evaluates EU AI Act, IEEE Ethics, and Model Cards compliance for ML safety, bias detection, and governance. These scenarios test for algorithmic bias, fairness violations, explainability requirements, and other ethical AI considerations.

Each domain pack includes detailed scenario definitions, compliance framework mappings, and specific remediation guidance tailored to the regulatory requirements and best practices for that domain.

Core Improvement Loop

ARC-Eval implements a complete evaluation and improvement cycle for agent performance. The system identifies failures, generates specific improvement plans, and measures progress through quantitative before/after comparisons.

graph LR
    A[Agent Outputs] --> B[Domain Evaluation<br/>355 Scenarios]
    B --> C[Failure Analysis<br/>Pattern Detection]
    C --> D[Improvement Planning<br/>Prioritized Actions]
    D --> E[Implementation<br/>Agent Updates]
    E --> F[Re-evaluation<br/>Progress Measurement]
    F --> G[Performance Comparison<br/>Baseline vs Improved]
    
    C --> H[Self-Improvement<br/>Engine]
    H --> I[Curriculum Generation]
    H --> J[Training Examples]
    H --> K[Trend Analysis]
    
    G --> L{Satisfactory<br/>Results?}
    L -->|No| C
    L -->|Yes| M[Production Deployment]
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#ffebee
    style D fill:#e8f5e8
    style E fill:#fff3e0
    style F fill:#f3e5f5
    style G fill:#e8f5e8
    style M fill:#e1f5fe

The improvement loop operates through three interconnected phases. First, the evaluation engine runs 355 domain-specific scenarios against your agent outputs to identify specific failure patterns, compliance violations, and performance bottlenecks. Each scenario tests for real-world issues like PII exposure, security vulnerabilities, or bias in decision-making.

Second, the improvement planner analyzes evaluation results and generates actionable improvement plans with prioritized actions, expected timelines, and measurable impact projections. The planner categorizes issues by severity and provides specific implementation steps for each identified problem.
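
The prioritization step can be pictured as a sort over failed scenarios by severity. The sketch below is only an illustration: the `Finding` record, the severity scale, and the sample issues are assumptions, not the planner's real schema or output.

```python
from dataclasses import dataclass

# Hypothetical severity scale; a lower rank means "fix first".
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    """Illustrative failure record, not ARC-Eval's actual schema."""
    scenario_id: str
    severity: str
    issue: str


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Order findings by severity, then by scenario ID for a stable plan."""
    return sorted(findings, key=lambda f: (SEVERITY_RANK[f.severity], f.scenario_id))


findings = [
    Finding("fin_014", "medium", "Missing audit trail entry"),
    Finding("fin_001", "critical", "SSN exposed in agent output"),
    Finding("fin_007", "high", "AML threshold check skipped"),
]
plan = prioritize(findings)
for step, finding in enumerate(plan, start=1):
    print(f"{step}. [{finding.severity}] {finding.scenario_id}: {finding.issue}")
```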

Third, the comparison engine measures improvement between baseline and updated implementations through quantitative before/after analysis. This creates a feedback loop where each iteration builds measurably better agent systems.
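
A before/after comparison boils down to computing pass rates for two runs over the same scenarios and listing which scenarios changed state. The sketch below assumes each result is a dict with `scenario_id` and `passed` fields; those field names and the synthetic data are hypothetical and do not describe ARC-Eval's output format.

```python
def pass_rate(results: list[dict]) -> float:
    """Percentage of scenarios whose (hypothetical) "passed" field is true."""
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(1 for r in results if r["passed"]) / len(results)


def compare_runs(baseline: list[dict], improved: list[dict]) -> dict:
    """Summarize the change between two runs over the same scenarios."""
    before = {r["scenario_id"]: r["passed"] for r in baseline}
    after = {r["scenario_id"]: r["passed"] for r in improved}
    shared = before.keys() & after.keys()
    return {
        "before_pct": pass_rate(baseline),
        "after_pct": pass_rate(improved),
        "delta_points": pass_rate(improved) - pass_rate(baseline),
        "fixed": sorted(s for s in shared if not before[s] and after[s]),
        "regressed": sorted(s for s in shared if before[s] and not after[s]),
    }


# Synthetic data: 20 scenarios, 8 passing before and 17 passing after.
baseline = [{"scenario_id": f"fin_{i:03d}", "passed": i <= 8} for i in range(1, 21)]
improved = [{"scenario_id": f"fin_{i:03d}", "passed": i <= 17} for i in range(1, 21)]
report = compare_runs(baseline, improved)
print(report["before_pct"], report["after_pct"], report["delta_points"])
```

The delta is expressed in percentage points. Reporting regressions alongside fixes matters because a higher overall pass rate can hide scenarios that started failing.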

Example: Finance Agent Improvement

Step 1: Baseline Evaluation

arc-eval --domain finance --input examples/sample-data/finance_baseline.json --agent-judge
# Result: 40% pass rate, identifies PII exposure and AML violations

Step 2: Generate Improvement Plan

arc-eval --improvement-plan --from-evaluation finance_evaluation_20250527_143022.json
# Generates: improvement_plan_20250527_143522.md with specific actions

Step 3: Implement Changes and Re-evaluate

arc-eval --domain finance --input examples/sample-data/finance_improved.json --baseline finance_evaluation_20250527_143022.json
# Result: 85% pass rate, a 45-percentage-point improvement over the 40% baseline

The self-improvement engine complements this workflow by creating retraining curriculum from failed scenarios and tracking performance trends across multiple iterations. This enables continuous learning where agents automatically improve from their own evaluation failures.

Before/After Example

Baseline Output (fails PII protection):

{
  "output": "Customer John Smith (SSN: 123-45-6789) approved for $50,000 loan"
}

Improved Output (passes compliance):

{
  "output": "Customer <REDACTED> approved for $50,000 loan based on verified income and credit score. PII protection protocols applied."
}

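This example shows the loop in action for PII exposure: the customer name and SSN present in the baseline output are redacted in the improved output. The same cycle applies to the other findings, such as the AML violations flagged in the baseline evaluation. Each evaluation cycle produces measurable improvements that compound over time, moving the agent toward a robust, compliant system suitable for production deployment.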
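
For intuition about what a PII check on these two outputs might look for, the sketch below flags US Social Security numbers with a regular expression. This is a simplified stand-in, assuming a plain `NNN-NN-NNNN` format; ARC-Eval's scenarios are judged by an LLM, and the helper names here are hypothetical.

```python
import re

# Matches the common NNN-NN-NNNN layout only; real PII detection needs far more.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def contains_ssn(text: str) -> bool:
    """Return True if the text contains something shaped like an SSN."""
    return SSN_PATTERN.search(text) is not None


def redact_ssn(text: str) -> str:
    """Replace every SSN-shaped substring with a redaction marker."""
    return SSN_PATTERN.sub("<REDACTED>", text)


baseline = "Customer John Smith (SSN: 123-45-6789) approved for $50,000 loan"
improved = (
    "Customer <REDACTED> approved for $50,000 loan based on verified income "
    "and credit score. PII protection protocols applied."
)
print(contains_ssn(baseline), contains_ssn(improved))
print(redact_ssn(baseline))
```

A pattern like this catches the SSN but not the customer name, which is one reason a judge model is used for the actual evaluation.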

CLI Usage

Reliability Analysis

# Debug agent failures with framework detection
arc-eval --debug-agent --input agent_outputs.json

# Unified debugging dashboard
arc-eval --unified-debug --input workflow_trace.json

# Framework-specific analysis
arc-eval --workflow-reliability --framework langchain --input outputs.json

Domain Evaluation

# Run compliance scenarios
arc-eval --domain finance --input outputs.json --agent-judge

# Export reports
arc-eval --domain security --input outputs.json --agent-judge --export pdf

Improvement Workflows

# Generate improvement plans
arc-eval --improvement-plan --from-evaluation results.json

# Compare judge configurations
arc-eval --compare-judges config/judge_comparison_templates.yaml --input outputs.json

# Full automation cycle
arc-eval --full-workflow --domain finance --input baseline.json

Interactive analysis starts automatically after evaluations unless --no-interaction is specified.

Python SDK

Advanced features are available through the Python SDK for programmatic access to the evaluation and improvement systems. These provide deeper integration capabilities beyond the CLI interface.

Interactive Analysis (SDK Only)

from agent_eval.analysis.interactive_analyst import InteractiveAnalyst

analyst = InteractiveAnalyst(domain="finance")
analyst.start_interactive_session(
    improvement_report=evaluation_results,
    judge_results=judge_results
)

Multi-Judge Verification (CLI: `--verify`)

from agent_eval.evaluation.verification_judge import VerificationJudge

verifier = VerificationJudge(domain="security")
verification = verifier.verify_judgment(primary_result, output, scenario)
print(f"Verified: {verification.verified}, Agreement: {verification.agreement_score}")

CLI equivalent: arc-eval --domain security --input outputs.json --agent-judge --verify

Confidence Calibration (CLI: `--confidence-calibration`)

from agent_eval.evaluation.confidence_calibrator import ConfidenceCalibrator

calibrator = ConfidenceCalibrator()
calibration = calibrator.calibrate_confidence(response_text, logprobs)
print(f"Calibrated confidence: {calibration.calibrated_confidence}")

CLI equivalent: arc-eval --domain finance --input outputs.json --agent-judge --confidence-calibration

Self-Improvement Engine (SDK Only)

from agent_eval.analysis.self_improvement import SelfImprovementEngine

engine = SelfImprovementEngine()
curriculum = engine.generate_retraining_curriculum(failed_scenarios)
trends = engine.get_performance_trends(agent_id="agent-1", domain="finance")

The CLI generates improvement plans with --improvement-plan, but full curriculum generation requires the SDK.

Judge Comparison (CLI: `--compare-judges`)

from agent_eval.analysis.judge_comparison import JudgeComparison

comparison = JudgeComparison()
results = comparison.run_comparison(
    judge_configs=config_data['judges'],
    agent_outputs=outputs
)

CLI equivalent: arc-eval --compare-judges config/judge_comparison_templates.yaml --input outputs.json

Results can be exported as JSON for integration with monitoring systems or as PDF reports suitable for internal audits and regulatory compliance documentation. The evaluation pipeline integrates with existing MLOps workflows and provides automated quality gates for agent deployments.

Performance Tracking (CLI: `--performance`)

from agent_eval.evaluation.performance_tracker import PerformanceTracker

tracker = PerformanceTracker()
with tracker.track_agent_execution():
    # Your agent execution code
    pass

metrics = tracker.get_performance_summary()

CLI equivalent: arc-eval --domain finance --input outputs.json --agent-judge --performance

Input Formats

ARC-Eval auto-detects agent outputs from the following frameworks through its parser registry:

// OpenAI/Anthropic format
[{
  "scenario_id": "fin_001",
  "messages": [...],
  "tool_calls": [...],
  "output": "Transaction approved for John Smith"
}]

// LangChain format
[{
  "intermediate_steps": [...],
  "agent_scratchpad": "...",
  "output": "Analysis complete"
}]

// CrewAI format
[{
  "crew_output": "...",
  "task_output": "...",
  "agent_execution": [...]
}]

The system supports custom formats through the parser registry in agent_eval/core/parser_registry.py. This enables evaluation of agents built with any framework or custom implementation.
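
The three shapes above differ in their top-level keys, so one simple way to picture auto-detection is key matching with a fallback to a custom parser. The sketch below is an assumption-laden illustration: `detect_framework`, the marker table, and the "custom" fallback label are hypothetical and are not taken from agent_eval/core/parser_registry.py.

```python
# Top-level keys taken from the example formats shown in this section.
FRAMEWORK_MARKERS = {
    "langchain": {"intermediate_steps", "agent_scratchpad"},
    "crewai": {"crew_output", "task_output", "agent_execution"},
    "openai": {"messages", "tool_calls"},
}


def detect_framework(record: dict) -> str:
    """Pick the framework whose marker keys overlap most with the record.

    Falls back to "custom" when no marker key is present.
    """
    keys = set(record)
    best_name, best_hits = "custom", 0
    for name, markers in FRAMEWORK_MARKERS.items():
        hits = len(keys & markers)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name


samples = [
    {"scenario_id": "fin_001", "messages": [], "tool_calls": [], "output": "ok"},
    {"intermediate_steps": [], "agent_scratchpad": "", "output": "Analysis complete"},
    {"crew_output": "", "task_output": "", "agent_execution": []},
    {"result": "plain text from an in-house agent"},
]
detected = [detect_framework(s) for s in samples]
print(detected)
```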

Research Foundation

ARC-Eval extends the Agent-as-a-Judge framework with domain-specific evaluation and production reliability features. The base framework (arXiv:2410.10934v2) provides LLM-based evaluation methodology.

ARC-Eval adds domain specialization with finance, security, and ML expertise tailored to real-world compliance requirements. Multi-judge consensus provides critical decision validation through secondary evaluation and agreement scoring. Interactive analysis enables AI-powered debugging sessions for understanding complex failure patterns. Confidence calibration uses logprobs for uncertainty quantification and statistical confidence intervals. Self-improvement loops provide automated curriculum generation and continuous learning from evaluation failures.

The system is designed for ML engineers and AI researchers working on production agent systems at scale. It addresses the gap between research evaluation frameworks and production reliability requirements for agent systems deployed in regulated industries.

CI/CD Integration

# GitHub Actions example
- name: Agent Reliability Check
  run: |
    arc-eval --domain finance --input agent_outputs.json --agent-judge
    arc-eval --compare-judges config/production_judges.yaml --input agent_outputs.json
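
A CI step usually needs an explicit pass/fail decision. One way to add it is a small script that reads exported results and returns a non-zero exit code when the run is not good enough. The sketch below assumes each result carries `passed` and `severity` fields and uses an 80% threshold; the field names, the threshold, and the function itself are hypothetical, so adapt them to the JSON your evaluation actually produces.

```python
def quality_gate(results: list[dict], min_pass_rate: float = 0.8) -> int:
    """Return a process exit code: 0 if the gate passes, 1 otherwise.

    Assumes each result has a boolean "passed" and a string "severity";
    both field names are hypothetical.
    """
    if not results:
        return 1  # no evidence is treated as a failure
    critical_failures = [
        r for r in results if not r["passed"] and r.get("severity") == "critical"
    ]
    if critical_failures:
        return 1  # any failed critical scenario blocks the deployment
    passed = sum(1 for r in results if r["passed"])
    return 0 if passed / len(results) >= min_pass_rate else 1


results = [
    {"scenario_id": "fin_001", "passed": True, "severity": "critical"},
    {"scenario_id": "fin_002", "passed": True, "severity": "high"},
    {"scenario_id": "fin_003", "passed": False, "severity": "low"},
    {"scenario_id": "fin_004", "passed": True, "severity": "medium"},
    {"scenario_id": "fin_005", "passed": True, "severity": "medium"},
]
exit_code = quality_gate(results, min_pass_rate=0.8)
print(exit_code)
```

In a pipeline, the returned value would be passed to `sys.exit` so that a failing gate stops the job.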

Documentation

Quick Start Guide provides setup instructions and basic usage examples for immediate implementation.

Examples contains demo data, integration examples, and CI/CD templates for common deployment patterns.

Advanced features like interactive analysis, judge comparison, and self-improvement are documented in the Python SDK examples with complete code samples and usage patterns.

Contributing

ARC-Eval is built for AI researchers and ML engineers working on production agent systems. Contributions are welcome for new domain evaluation packs that address specific industry requirements, additional framework parsers for emerging agent platforms, evaluation scenario improvements based on real-world failure patterns, and performance optimizations for large-scale deployments.

See CONTRIBUTING.md for detailed contribution guidelines and development setup instructions.

Support

Report issues and feature requests on GitHub Issues with detailed reproduction steps and system information.

Release history

0.2.9 (Jun 4, 2025)
0.2.8 (Jun 1, 2025)
0.2.7 (May 30, 2025)
0.2.6 (May 29, 2025)
0.2.5 (May 28, 2025, this version)
0.2.4 (May 28, 2025)
0.2.3 (May 27, 2025)
0.2.2 (May 27, 2025)
0.2.1 (May 26, 2025)
0.2.0 (May 25, 2025)
0.1.0 (May 24, 2025)

Download files

Source Distribution

arc_eval-0.2.5.tar.gz (243.7 kB)

Uploaded May 28, 2025 Source

Built Distribution

arc_eval-0.2.5-py3-none-any.whl (268.7 kB)

Uploaded May 28, 2025 Python 3

File details

Details for the file arc_eval-0.2.5.tar.gz.

File metadata

Download URL: arc_eval-0.2.5.tar.gz
Upload date: May 28, 2025
Size: 243.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.5.tar.gz
Algorithm	Hash digest
SHA256	`c5772710d1cd4f6d1fec53dfafc1a3c3c7ce13d04da8e79fef2a29678d6285fa`
MD5	`32dc89c18fabb49373b2f9f6bc996cb7`
BLAKE2b-256	`d055846344e4e45c76c1b01058c60d7fbc7ca23988a4a33adaf31c221e0c0cb5`

File details

Details for the file arc_eval-0.2.5-py3-none-any.whl.

File metadata

Download URL: arc_eval-0.2.5-py3-none-any.whl
Upload date: May 28, 2025
Size: 268.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`64187af761470ed64f97560c7e3bc106a334caa5b7fb7953d2fb4a8750f415d7`
MD5	`fce8a2b88b722d8c6af43a7eb77b1b2c`
BLAKE2b-256	`4c1b29b73d92f671d0c2fc5cb62639b53733ac5ccef39a35bea48c0ca629e8de`
