ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents

ARC-Eval

ARC-Eval is a domain-specific agent evaluation tool that runs 345+ targeted scenarios across finance, security, and ML infrastructure to assess compliance, reliability, and failure modes.

AI agents deployed in production need rigorous evaluation that accounts for domain-specific risks, especially in regulated industries that must satisfy SOX requirements, OWASP guidance, bias detection standards, and operational safety requirements.

Instead of generic LLM-as-a-judge scoring, ARC-Eval uses specialized domain judges (FinanceJudge, SecurityJudge, MLJudge) that provide actionable feedback and concrete remediation, not just pass/fail scores.

Key Features:

345 targeted scenarios across finance (SOX, KYC), security (OWASP), ML (bias detection)
Domain-specific judges with compliance expertise and detailed failure analysis
Core improvement loop: Evaluate → Generate improvement plan → Re-evaluate → Compare results
Built on Agent-as-a-Judge framework (arXiv:2410.10934v2)

Quick Start

# Install
pip install arc-eval

# Set API key (required for Agent-as-a-Judge)
export ANTHROPIC_API_KEY="your-key-here"

# Try with sample data
arc-eval --quick-start --domain finance --agent-judge

# Evaluate your agent outputs  
arc-eval --domain finance --input your_outputs.json --agent-judge

🚀 Intelligent Workflow (NEW!)

ARC-Eval now features intelligent workflow automation that eliminates manual filename copying and provides contextual guidance:

# Smart defaults activate automatically
arc-eval --domain finance --input your_data.json
# → Auto-enables Agent-as-a-Judge for large files
# → Auto-enables PDF export for finance/security domains  
# → Auto-enables verification for ML domain

# Zero-configuration workflow continuation
arc-eval --continue
# → Auto-detects latest evaluation
# → Suggests next step (plan → re-evaluate)
# → Interactive prompts with smart defaults
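
The "auto-detects latest evaluation" step can be pictured as picking the newest file by the timestamp embedded in its name. The sketch below is only an illustration of that idea, not ARC-Eval's actual implementation: the helper name latest_evaluation is hypothetical, and the file name pattern is assumed from the auto-save examples in this README (for instance finance_evaluation_20240527_143022.json).

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional

# Assumed pattern, taken from the auto-save examples in this README:
#   <domain>_evaluation_<YYYYMMDD>_<HHMMSS>.json
_EVAL_NAME = re.compile(r"^[a-z]+_evaluation_(?P<stamp>\d{8}_\d{6})\.json$")


def latest_evaluation(names: Iterable[str]) -> Optional[str]:
    """Return the name with the newest embedded timestamp, or None.

    Hypothetical helper for illustration; it is not part of arc-eval.
    """
    best_name: Optional[str] = None
    best_time: Optional[datetime] = None
    for name in names:
        match = _EVAL_NAME.match(Path(name).name)
        if match is None:
            continue  # not an evaluation file (e.g. an improvement plan)
        try:
            stamp = datetime.strptime(match["stamp"], "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # digits present but not a valid date/time
        if best_time is None or stamp > best_time:
            best_name, best_time = name, stamp
    return best_name


if __name__ == "__main__":
    files = [
        "finance_evaluation_20240527_143022.json",
        "improvement_plan_20240527_143025.md",
        "finance_evaluation_20240528_090000.json",
    ]
    print(latest_evaluation(files))
```

Parsing the timestamp from the name rather than reading file modification times keeps the choice stable when files are copied between machines; whether ARC-Eval does the same is not stated here.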

Enterprise Shortcuts

# Complete compliance audit workflow
arc-eval --domain finance --input data.json --audit

# Cost-optimized development workflow
arc-eval --domain security --input data.json --dev-mode

Core Loop Workflow

📊 Evaluate → 📋 Plan → 🔄 Re-evaluate → 📈 Compare

Traditional Manual Workflow

# Step 1: Initial Evaluation
arc-eval --domain finance --input baseline_data.json --agent-judge
# → Auto-saves: finance_evaluation_20240527_143022.json

# Step 2: Generate Improvement Plan (manual filename copying)
arc-eval --improvement-plan --from-evaluation finance_evaluation_20240527_143022.json
# → Auto-saves: improvement_plan_20240527_143025.md

# Step 3: Re-evaluate with Comparison
arc-eval --domain finance --input improved_data.json --baseline finance_evaluation_20240527_143022.json
# → Shows: before/after metrics, scenario-level improvements
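
When the three manual steps are scripted, it can help to assemble each command as an argument list rather than a hand-edited string. The following is a minimal sketch under that assumption: evaluate_cmd and plan_cmd are hypothetical helper names, and the only flags used are the ones shown in this README. Nothing is executed; the functions just build the command lines.

```python
import shlex
from typing import List, Optional

# Domains listed in this README.
DOMAINS = ("finance", "security", "ml")


def evaluate_cmd(
    domain: str,
    input_path: str,
    agent_judge: bool = False,
    baseline: Optional[str] = None,
) -> List[str]:
    """Build the evaluation command (step 1, or step 3 when baseline is set)."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    cmd = ["arc-eval", "--domain", domain, "--input", input_path]
    if agent_judge:
        cmd.append("--agent-judge")
    if baseline is not None:
        cmd += ["--baseline", baseline]
    return cmd


def plan_cmd(evaluation_path: str) -> List[str]:
    """Build the improvement-plan command (step 2)."""
    return ["arc-eval", "--improvement-plan", "--from-evaluation", evaluation_path]


if __name__ == "__main__":
    saved = "finance_evaluation_20240527_143022.json"
    steps = [
        evaluate_cmd("finance", "baseline_data.json", agent_judge=True),
        plan_cmd(saved),
        evaluate_cmd("finance", "improved_data.json", baseline=saved),
    ]
    for step in steps:
        # shlex.join quotes arguments only where the shell would need it.
        print(shlex.join(step))
```

Each list can be passed to subprocess.run as-is, which avoids shell quoting problems with paths that contain spaces.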

🚀 Intelligent Workflow (Recommended)

# Step 1: Initial evaluation with smart defaults
arc-eval --domain finance --input your_data.json
# → Smart defaults auto-activate based on domain and file size

# Step 2: Continue workflow (zero configuration)
arc-eval --continue
# → Auto-detects latest evaluation and workflow state
# → Guides you to next step with interactive prompts

# Step 3: Follow guided next steps
arc-eval --continue
# → Automatically suggests re-evaluation when improvement plan exists

Key Commands

# Essential commands
arc-eval --domain finance --input outputs.json --agent-judge     # Basic evaluation
arc-eval --improvement-plan --from-evaluation evaluation.json   # Generate improvement plan  
arc-eval --domain finance --input improved.json --baseline old.json  # Compare improvements

# Domain evaluation (pick one domain: finance, security, or ml)
arc-eval --domain ml --input data.json --agent-judge

# Export reports
arc-eval --domain finance --input data.json --agent-judge --export pdf

# Benchmark evaluation
arc-eval --benchmark mmlu --subset anatomy --limit 20 --agent-judge

Input Format

[
  {"scenario_id": "fin_001", "output": "Transaction approved for customer John Smith"},
  {"scenario_id": "fin_002", "output": "KYC verification completed successfully"}
]

ARC-Eval auto-detects output formats from OpenAI, Anthropic, LangChain, and custom agents. Run arc-eval --help-input for details.
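
If your agent's responses are already in memory, the input file shown above can be produced with a few lines of Python. This is a sketch only: build_inputs is a hypothetical helper, and it assumes just the two fields shown in the example (scenario_id and output).

```python
import json
from typing import Dict, List, Mapping


def build_inputs(outputs: Mapping[str, str]) -> List[Dict[str, str]]:
    """Turn {scenario_id: agent_output} into the list-of-records layout.

    Hypothetical helper for illustration; it is not part of arc-eval.
    """
    records = []
    for scenario_id, output in outputs.items():
        if not scenario_id:
            raise ValueError("scenario_id must be a non-empty string")
        if not isinstance(output, str):
            raise ValueError(f"output for {scenario_id!r} must be a string")
        records.append({"scenario_id": scenario_id, "output": output})
    return records


if __name__ == "__main__":
    records = build_inputs(
        {
            "fin_001": "Transaction approved for customer John Smith",
            "fin_002": "KYC verification completed successfully",
        }
    )
    # Write the file that is later passed to --input.
    with open("your_outputs.json", "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
```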

Example Output

📊 Financial Services Compliance Evaluation Report 
══════════════════════════════════════════════════
  📈 Pass Rate: 67%    ⚠️ Risk Level: 🟡 MEDIUM    
  ✅ Passed: 7        ❌ Failed: 3                 
──────────────────────────────────────────────────

⚖️ Compliance Framework Dashboard
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Framework  ┃   Status    ┃  Pass Rate  ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ SOX        │     ✅      │   100%      │
│ KYC        │ 🔴 CRITICAL │    33%      │
│ AML        │     ✅      │   100%      │
└────────────┴─────────────┴─────────────┘
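
A per-framework pass rate like the one in the dashboard is the share of passed scenarios among all scenarios tagged with that framework. The sketch below shows the arithmetic on made-up records. The (framework, passed) tuple layout and the pass_rates name are assumptions for illustration and do not reflect ARC-Eval's result schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, int]:
    """Map each framework to its pass rate as a whole-number percentage."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for framework, ok in results:
        total[framework] += 1
        if ok:
            passed[framework] += 1
    return {name: round(100 * passed[name] / total[name]) for name in total}


if __name__ == "__main__":
    # Made-up sample: SOX passes 4 of 4, KYC passes 1 of 3.
    sample = [("SOX", True)] * 4 + [("KYC", True), ("KYC", False), ("KYC", False)]
    print(pass_rates(sample))
```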

Evaluation Domains

Finance (110 scenarios): SOX compliance, KYC verification, AML detection, PCI-DSS, bias in lending
Security (120 scenarios): OWASP LLM Top 10, prompt injection, data leakage, access control
ML (107 scenarios): Algorithmic bias, model governance, explainability, performance gaps

CI/CD Integration

# GitHub Actions example
arc-eval --domain security --input "$CI_ARTIFACTS/agent_outputs.json" --agent-judge --export json

See examples/ci-cd/ for complete integration templates.

Built on Agent-as-a-Judge framework • arXiv:2410.10934v2

Release history

0.2.9    Jun 4, 2025
0.2.8    Jun 1, 2025
0.2.7    May 30, 2025
0.2.6    May 29, 2025
0.2.5    May 28, 2025
0.2.4    May 28, 2025 (this version)
0.2.3    May 27, 2025
0.2.2    May 27, 2025
0.2.1    May 26, 2025
0.2.0    May 25, 2025
0.1.0    May 24, 2025

Download files

Source Distribution

arc_eval-0.2.4.tar.gz (219.9 kB)

Uploaded May 28, 2025 Source

Built Distribution

arc_eval-0.2.4-py3-none-any.whl (238.1 kB)

Uploaded May 28, 2025 Python 3

File details

Details for the file arc_eval-0.2.4.tar.gz.

File metadata

Download URL: arc_eval-0.2.4.tar.gz
Upload date: May 28, 2025
Size: 219.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`038bf5f6b3b80344de642d07b4ea4ee9b47ad8c1b47e200ed53a03f3440e7e5d`
MD5	`3a39bb87940e027b315cb1859f51e2a9`
BLAKE2b-256	`007dd3ff28b7cd8cf7da0c3e47ba55aebb8c900ff0ca82dc72cbcdddd473c95f`

File details

Details for the file arc_eval-0.2.4-py3-none-any.whl.

File metadata

Download URL: arc_eval-0.2.4-py3-none-any.whl
Upload date: May 28, 2025
Size: 238.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`40b899bdd8d3d058e0f28190b510b3b58abfe730ee7598f0d1a54af2219d24f0`
MD5	`4ad7ed93cae70e26b2bc509921f58609`
BLAKE2b-256	`8451e7d08979811aebed400592c897aca3aad06a146b04f0bb2fc962b56e0cfe`
