ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents
Project description
ARC-Eval: Domain-Specific Agent Evaluation
Agent-as-a-Judge evaluation with domain-specific compliance assessment and improvement recommendations
ARC-Eval is a domain-specific agent evaluation tool that runs 345 targeted scenarios across security, finance, and ML infrastructure, using a single specialist LLM per domain as a judge to assess outputs for compliance, reliability, and failure modes.
As AI agents are deployed in critical production systems, teams lack rigorous, domain-aligned, and explainable evaluation frameworks to surface compliance gaps, security risks, and operational errors, especially at the depth demanded by regulated industries and research.
Instead of relying on generic LLM-as-a-judge scoring or crowd-sourced prompts, ARC-Eval offers deep, enterprise-mapped scenario packs with outputs reviewed by a dedicated domain expert agent (SecurityJudge, FinanceJudge, MLJudge). This enables actionable, fine-grained feedback and concrete remediation, not just pass/fail scores.
Built on the Agent-as-a-Judge framework from MetaAuto AI (arXiv:2410.10934v2)
Quick Start
# Install
pip install arc-eval
# Set up Agent-as-a-Judge (recommended)
export ANTHROPIC_API_KEY="your-key-here"
# Or add to .env file: ANTHROPIC_API_KEY=your-key-here
# Try with sample data
arc-eval --quick-start --domain finance --agent-judge
# Evaluate your agent outputs
arc-eval --domain finance --input your_outputs.json --agent-judge
# Generate compliance reports
arc-eval --domain security --input outputs.json --agent-judge --export pdf
Note: Agent-as-a-Judge requires ANTHROPIC_API_KEY. Traditional evaluation (without judge feedback) works without API keys.
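The note above implies a simple decision: pass --agent-judge only when the API key is available. A minimal Python sketch of that check follows. The helper name judge_flags is hypothetical and not part of ARC-Eval; it only reads environment variables and does not parse a .env file.

```python
import os


def judge_flags(env=None):
    """Return the extra CLI flags to use given the environment.

    Hypothetical helper, not part of ARC-Eval: it only mirrors the note
    that --agent-judge requires ANTHROPIC_API_KEY.
    """
    env = os.environ if env is None else env
    # Treat an empty or whitespace-only value as "not set".
    if env.get("ANTHROPIC_API_KEY", "").strip():
        return ["--agent-judge"]
    return []


if __name__ == "__main__":
    cmd = ["arc-eval", "--domain", "finance", "--input", "your_outputs.json"]
    # Print the command a wrapper script would run in this environment.
    print(" ".join(cmd + judge_flags()))
```

Without the key, the printed command falls back to traditional evaluation, which the note says works without API keys.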
Agent-as-a-Judge Framework
Based on the MetaAuto AI research, Agent-as-a-Judge provides contextual evaluation using domain-specific judge models that understand compliance requirements and failure modes.
Key Features
- Domain-Specific Judges: FinanceJudge, SecurityJudge, MLJudge with specialized knowledge
- 345 Evaluation Scenarios: Finance (110), Security (120), ML (107) covering real-world compliance
- Continuous Feedback: Actionable improvement recommendations with training signal generation
- Multi-Model Support: Claude Sonnet and Haiku, with automatic cost optimization
Evaluation Pipeline
Agent Output → Domain Judge → Compliance Assessment + Improvement Recommendations + Training Signals
Compliance Frameworks: Finance (SOX, KYC, AML) • Security (OWASP, MITRE) • ML (MLOps, EU AI Act)
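The data flow in the pipeline above can be pictured with a toy example. This is a sketch only: the Assessment class, the toy_finance_judge function, and the keyword rule are all hypothetical stand-ins invented for illustration. The real judges are LLM-based, and nothing here reflects ARC-Eval's internal API.

```python
from dataclasses import dataclass, field


@dataclass
class Assessment:
    """Hypothetical container for the three pipeline outputs."""
    compliant: bool
    recommendations: list = field(default_factory=list)
    training_signals: list = field(default_factory=list)


def toy_finance_judge(output: str) -> Assessment:
    """Stand-in for a domain judge: a keyword rule, not an LLM call."""
    text = output.lower()
    # Toy rule (an assumption for illustration): flag approvals that
    # do not mention a KYC check.
    if "approved" in text and "kyc" not in text:
        return Assessment(
            compliant=False,
            recommendations=["Confirm KYC verification before approving."],
            training_signals=[{"output": output, "label": "non_compliant"}],
        )
    return Assessment(compliant=True)
```

Each agent output yields one assessment carrying a verdict, remediation advice, and a labeled record that could feed later training.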
Usage Examples
# Agent-as-a-Judge evaluation (recommended)
arc-eval --domain finance --input outputs.json --agent-judge
# Academic benchmark evaluation
arc-eval --benchmark mmlu --subset anatomy --limit 20 --agent-judge
# Enhanced reliability with verification
arc-eval --domain security --input outputs.json --agent-judge --verify
# Confidence calibration for uncertainty quantification
arc-eval --domain ml --input outputs.json --agent-judge --confidence-calibration
# A/B test judge configurations
arc-eval --compare-judges config/templates.yaml --domain finance --input outputs.json
# Generate compliance reports
arc-eval --domain ml --input outputs.json --agent-judge --export pdf
# CI/CD integration with cost optimization
arc-eval --domain finance --input logs.json --agent-judge --judge-model claude-3-5-haiku
More Examples: See examples/ for detailed workflows, input formats, and CI/CD templates.
Input Format
{"output": "Transaction approved for customer John Smith"}
ARC-Eval auto-detects formats from OpenAI, Anthropic, LangChain, and custom agents. See examples/ for comprehensive format documentation.
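To produce a file in this shape from a script, something like the following may work. It is a sketch under assumptions: the {"output": ...} record comes from the example above, while wrapping several records in a JSON array and the helper name write_outputs are assumptions. Check examples/ for the formats ARC-Eval actually accepts.

```python
import json
from pathlib import Path


def write_outputs(texts, path):
    """Write agent responses as a JSON array of {"output": ...} records.

    Assumption: a list of such records is an accepted input shape.
    """
    records = [{"output": text} for text in texts]
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

For example, write_outputs(["Transaction approved for customer John Smith"], "outputs.json") would create a file that can then be passed to --input.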
Key Commands
# Essential flags
--domain finance|security|ml # Choose evaluation domain
--input file.json # Your agent outputs
--agent-judge # Enable Agent-as-a-Judge evaluation
--export pdf # Generate compliance reports
# Advanced evaluation
--benchmark mmlu|humeval|gsm8k # Academic benchmark evaluation
--verify # Secondary judge validation (reliability)
--confidence-calibration # Enhanced uncertainty quantification
--compare-judges config.yaml # A/B test judge configurations
# Useful options
--quick-start # Try with sample data
--judge-model auto|sonnet|haiku # Model selection for cost optimization
--summary-only # Executive summary only
--list-domains # See all evaluation scenarios
Full Reference: Run arc-eval --help or see examples/ for complete documentation.
Production Integration
CI/CD Pipeline
# Automated compliance gate
arc-eval --domain finance --input "$CI_ARTIFACTS/logs.json" --agent-judge || exit 1
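The same gate can be written in Python for pipelines that prefer a script over inline shell. This is a sketch: build_command and run_gate are hypothetical helper names, the flags are the ones listed under Key Commands, and the gate assumes, as the shell snippet does, that arc-eval exits non-zero when the evaluation fails.

```python
import subprocess
import sys

VALID_DOMAINS = ("finance", "security", "ml")


def build_command(domain, input_path, agent_judge=True):
    """Assemble an arc-eval invocation from flags shown in this README."""
    if domain not in VALID_DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    cmd = ["arc-eval", "--domain", domain, "--input", str(input_path)]
    if agent_judge:
        cmd.append("--agent-judge")
    return cmd


def run_gate(domain, input_path):
    """Run the evaluation and return its exit code.

    Assumption: a non-zero exit code signals a failed evaluation.
    """
    return subprocess.run(build_command(domain, input_path)).returncode


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python gate.py finance path/to/logs.json
    sys.exit(run_gate(sys.argv[1], sys.argv[2]))
```

Passing the command as a list avoids shell quoting issues with paths that contain spaces.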
Continuous Improvement Pipeline
ARC-Eval aims to turn evaluation outcomes into agent retraining data and RL environments, so that agents can improve iteratively against real-world, regulatory-grade benchmarks:
- 345 Evaluation Scenarios: Finance (110) • Security (120) • ML (107)
- Domain-Specific Judges: SecurityJudge, FinanceJudge, MLJudge with specialized knowledge
- Self-Improvement Engine: Training data generation and retraining triggers from evaluation feedback
- Compliance Reports: Automated report generation with regulatory framework mapping
- Model Optimization: Adaptive model selection (Claude Sonnet ↔ Haiku)
- Integration Templates: GitHub Actions, input formats, production deployment guides
Complete Integration Guide: See examples/ci-templates/ for production-ready CI/CD workflows.
Getting Started
- Set API Key: export ANTHROPIC_API_KEY="your-key-here" or add to .env file
- Try Demo: arc-eval --quick-start --domain finance --agent-judge
- Evaluate Agents: arc-eval --domain security --input outputs.json --agent-judge
- See Examples: examples/ for workflows and integration guides
Research & References
- Agent-as-a-Judge Framework: MetaAuto AI
- Research Paper: arXiv:2410.10934v2
- Domain Evaluation: 345 scenarios across Finance, Security, and ML compliance
ARC-Eval: Domain-specific agent evaluation using the Agent-as-a-Judge framework for continuous improvement.
MIT License • Documentation • GitHub
File details
Details for the file arc_eval-0.2.2.tar.gz.
File metadata
- Download URL: arc_eval-0.2.2.tar.gz
- Upload date:
- Size: 178.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 289c98e090d07fde32f18e9a3fd50d3eec028f5ab1d3b56f7728773303c99ac2 |
| MD5 | 5c6f903a3a3411a694e6f0599a42effd |
| BLAKE2b-256 | 3fc7f0138e2f9cd7433fdd1b5cba94e640aa11368e357bf968b91b4468373a82 |
File details
Details for the file arc_eval-0.2.2-py3-none-any.whl.
File metadata
- Download URL: arc_eval-0.2.2-py3-none-any.whl
- Upload date:
- Size: 191.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e5b4dabe58eecacad9f24a081b61ed7280ddbf96c3f9c7f4970146e6f5224138 |
| MD5 | 8490ef31c72e6b61df2b776aced05b6f |
| BLAKE2b-256 | 9e7150fc530aef341a4394f6bac76390492c6068aa7e6d5fed3e411ab43d6115 |