ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents

ARC-Eval: Domain-Specific Agent Evaluation

PyPI version • License: MIT • Python 3.9+

Agent-as-a-Judge evaluation with domain-specific compliance assessment and improvement recommendations

ARC-Eval is a domain-specific agent evaluation tool that runs over 345 targeted scenarios across security, finance, and ML infrastructure, using a single specialist LLM per domain as a judge to assess outputs for compliance, reliability, and failure modes.

As AI agents are deployed in critical production systems, teams lack rigorous, domain-aligned, and explainable evaluation frameworks to surface compliance gaps, security risks, and operational errors, especially at the depth that regulated industries and research demand.

Instead of relying on generic LLM-as-a-judge scoring or crowd-sourced prompts, ARC-Eval offers deep, enterprise-mapped scenario packs with outputs reviewed by a dedicated domain expert agent (SecurityJudge, FinanceJudge, MLJudge). This enables actionable, fine-grained feedback and concrete remediation, not just pass/fail scores.

Built on the Agent-as-a-Judge framework from MetaAuto AI (arXiv:2410.10934v2)

Quick Start

# Install
pip install arc-eval

# Set up Agent-as-a-Judge (recommended)
export ANTHROPIC_API_KEY="your-key-here"
# Or add to .env file: ANTHROPIC_API_KEY=your-key-here

# Try with sample data
arc-eval --quick-start --domain finance --agent-judge

# Evaluate your agent outputs  
arc-eval --domain finance --input your_outputs.json --agent-judge

# Generate compliance reports
arc-eval --domain security --input outputs.json --agent-judge --export pdf

Note: Agent-as-a-Judge requires ANTHROPIC_API_KEY. Traditional evaluation (without judge feedback) works without API keys.
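
The distinction in the note above can be sketched in Python: a wrapper script might add --agent-judge only when ANTHROPIC_API_KEY is set, and otherwise fall back to traditional evaluation. The helper name build_args is hypothetical and not part of ARC-Eval; it uses only flags shown in this README.

```python
import os
from typing import List, Mapping, Optional


def build_args(domain: str, input_path: str,
               env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build an arc-eval argument list (hypothetical helper, not part of ARC-Eval)."""
    env = os.environ if env is None else env
    args = ["arc-eval", "--domain", domain, "--input", input_path]
    # Agent-as-a-Judge needs ANTHROPIC_API_KEY; without it, run traditional evaluation.
    if env.get("ANTHROPIC_API_KEY"):
        args.append("--agent-judge")
    return args
```

The resulting list can be passed straight to subprocess.run, which avoids shell quoting issues with the input path.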

Agent-as-a-Judge Framework

Based on the MetaAuto AI research, Agent-as-a-Judge provides contextual evaluation using domain-specific judge models that understand compliance requirements and failure modes.

Key Features

  • Domain-Specific Judges: FinanceJudge, SecurityJudge, MLJudge with specialized knowledge
  • 345 Evaluation Scenarios: Finance (110), Security (120), ML (107) covering real-world compliance
  • Continuous Feedback: Actionable improvement recommendations with training signal generation
  • Multi-Model Support: Claude Sonnet, Haiku with automatic cost optimization

Evaluation Pipeline

Agent Output → Domain Judge → Compliance Assessment + Improvement Recommendations + Training Signals

Compliance Frameworks: Finance (SOX, KYC, AML) • Security (OWASP, MITRE) • ML (MLOps, EU AI Act)
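
As a rough mental model of the pipeline above, the sketch below passes agent outputs through a domain judge and returns the three kinds of result named in the diagram. Every name here (JudgeResult, keyword_judge, evaluate) is hypothetical, and the keyword check is only a stand-in for an LLM judge; this is not ARC-Eval's internal API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class JudgeResult:
    """Hypothetical container for the three pipeline outputs."""
    compliant: bool
    recommendations: List[str] = field(default_factory=list)
    training_signals: List[Dict[str, str]] = field(default_factory=list)


def keyword_judge(output: str) -> JudgeResult:
    """Toy stand-in for a domain judge: flags approvals that mention no KYC check."""
    text = output.lower()
    if "approved" in text and "kyc" not in text:
        return JudgeResult(
            compliant=False,
            recommendations=["Record the KYC check before approving."],
            training_signals=[{"output": output, "label": "non_compliant"}],
        )
    return JudgeResult(compliant=True)


def evaluate(outputs: List[str],
             judge: Callable[[str], JudgeResult]) -> List[JudgeResult]:
    """Agent Output -> Domain Judge -> assessment, recommendations, training signals."""
    return [judge(o) for o in outputs]
```

Passing the judge in as a callable mirrors the idea of swapping FinanceJudge, SecurityJudge, or MLJudge per domain without changing the surrounding loop.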

Usage Examples

# Agent-as-a-Judge evaluation (recommended)
arc-eval --domain finance --input outputs.json --agent-judge

# Academic benchmark evaluation
arc-eval --benchmark mmlu --subset anatomy --limit 20 --agent-judge

# Enhanced reliability with verification
arc-eval --domain security --input outputs.json --agent-judge --verify

# Confidence calibration for uncertainty quantification
arc-eval --domain ml --input outputs.json --agent-judge --confidence-calibration

# A/B test judge configurations
arc-eval --compare-judges config/templates.yaml --domain finance --input outputs.json

# Generate compliance reports
arc-eval --domain ml --input outputs.json --agent-judge --export pdf

# CI/CD integration with cost optimization
arc-eval --domain finance --input logs.json --agent-judge --judge-model claude-3-5-haiku

More Examples: See examples/ for detailed workflows, input formats, and CI/CD templates.

Input Format

{"output": "Transaction approved for customer John Smith"}

ARC-Eval auto-detects formats from OpenAI, Anthropic, LangChain, and custom agents. See examples/ for comprehensive format documentation.
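
A minimal sketch of preparing an input file in Python, assuming your agent's responses are already collected as strings. The record shape {"output": ...} comes from the example above; wrapping several such records in a JSON array is an assumption here, so check examples/ for the formats ARC-Eval actually accepts. The function name write_outputs is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable


def write_outputs(path: str, responses: Iterable[str]) -> int:
    """Write agent responses as {"output": ...} records; return the record count."""
    # Assumption: a JSON array of records is accepted as input.
    records = [{"output": text} for text in responses]
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)
```

The written file can then be passed to arc-eval through --input.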

Key Commands

# Essential flags
--domain finance|security|ml    # Choose evaluation domain
--input file.json               # Your agent outputs
--agent-judge                   # Enable Agent-as-a-Judge evaluation
--export pdf                    # Generate compliance reports

# Advanced evaluation
--benchmark mmlu|humeval|gsm8k  # Academic benchmark evaluation
--verify                        # Secondary judge validation (reliability)
--confidence-calibration        # Enhanced uncertainty quantification
--compare-judges config.yaml    # A/B test judge configurations

# Useful options  
--quick-start                   # Try with sample data
--judge-model auto|sonnet|haiku # Model selection for cost optimization
--summary-only                  # Executive summary only
--list-domains                  # See all evaluation scenarios

Full Reference: Run arc-eval --help or see examples/ for complete documentation.

Production Integration

CI/CD Pipeline

# Automated compliance gate
arc-eval --domain finance --input $CI_ARTIFACTS/logs.json --agent-judge
if [ $? -ne 0 ]; then exit 1; fi
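
The same gate can be written in Python for pipelines that prefer a script over inline shell. This is a sketch: compliance_gate and its runner parameter are hypothetical names, and it assumes, as the shell snippet above does, that arc-eval exits non-zero when the gate should fail.

```python
import subprocess
from typing import Callable, List


def compliance_gate(
    input_path: str,
    domain: str = "finance",
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run arc-eval and return its exit code (0 is treated as passing)."""
    cmd: List[str] = ["arc-eval", "--domain", domain,
                      "--input", input_path, "--agent-judge"]
    # check=False: the exit code is returned to the caller instead of raising.
    return runner(cmd, check=False).returncode
```

A CI step would pass the returned code to sys.exit so the job fails on a non-zero result. The runner parameter exists only so the gate can be exercised in tests without invoking the real CLI.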

Continuous Improvement Pipeline

ARC-Eval aims to turn evaluation outcomes into agent retraining data and RL environments, so that agents can improve iteratively against real-world, regulatory-grade benchmarks:

  • 345 Evaluation Scenarios: Finance (110) • Security (120) • ML (107)
  • Domain-Specific Judges: SecurityJudge, FinanceJudge, MLJudge with specialized knowledge
  • Self-Improvement Engine: Training data generation and retraining triggers from evaluation feedback
  • Compliance Reports: Automated report generation with regulatory framework mapping
  • Model Optimization: Adaptive model selection (Claude Sonnet ↔ Haiku)
  • Integration Templates: GitHub Actions, input formats, production deployment guides

Complete Integration Guide: See examples/ci-templates/ for production-ready CI/CD workflows.


Getting Started

  1. Set API Key: export ANTHROPIC_API_KEY="your-key-here" or add to .env file
  2. Try Demo: arc-eval --quick-start --domain finance --agent-judge
  3. Evaluate Agents: arc-eval --domain security --input outputs.json --agent-judge
  4. See Examples: examples/ for workflows and integration guides

Research & References

  • Agent-as-a-Judge Framework: MetaAuto AI
  • Research Paper: arXiv:2410.10934v2
  • Domain Evaluation: 345 scenarios across Finance, Security, and ML compliance

ARC-Eval: Domain-specific agent evaluation using the Agent-as-a-Judge framework for continuous improvement.

MIT License • Documentation • GitHub
