ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents

ARC-Eval: Domain-Specific Agent Evaluation

Agent-as-a-Judge evaluation with domain-specific compliance assessment and improvement recommendations

ARC-Eval is a domain-specific agent evaluation tool. It runs over 345 targeted scenarios across security, finance, and ML infrastructure, and uses one specialist LLM judge per domain to assess agent outputs for compliance, reliability, and failure modes.

As AI agents are deployed in critical production systems, teams lack rigorous, domain-aligned, and explainable evaluation frameworks for surfacing compliance gaps, security risks, and operational errors, especially at the depth that regulated industries and research demand.

Instead of relying on generic LLM-as-a-judge scoring or crowd-sourced prompts, ARC-Eval offers deep, enterprise-mapped scenario packs with outputs reviewed by a dedicated domain expert agent (SecurityJudge, FinanceJudge, MLJudge). This enables actionable, fine-grained feedback and concrete remediation, not just pass/fail scores.

Built on the Agent-as-a-Judge framework from MetaAuto AI (arXiv:2410.10934v2)

Quick Start

# Install
pip install arc-eval

# Set up Agent-as-a-Judge (recommended)
export ANTHROPIC_API_KEY="your-key-here"
# Or add to .env file: ANTHROPIC_API_KEY=your-key-here

# Try with sample data
arc-eval --quick-start --domain finance --agent-judge

# Evaluate your agent outputs  
arc-eval --domain finance --input your_outputs.json --agent-judge

# Generate compliance reports
arc-eval --domain security --input outputs.json --agent-judge --export pdf

Note: Agent-as-a-Judge requires ANTHROPIC_API_KEY. Traditional evaluation (without judge feedback) works without API keys.
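
The note above implies a simple decision: pass --agent-judge only when ANTHROPIC_API_KEY is available. A minimal Python sketch of that check follows. The helper name judge_flags is hypothetical; only the environment variable and the flag come from this README, and the sketch does not read a .env file.

```python
import os


def judge_flags(env=None):
    """Return the extra CLI flags to use, given an environment mapping.

    Hypothetical helper: only ANTHROPIC_API_KEY and --agent-judge come
    from the README; the function itself is an illustration.
    """
    env = os.environ if env is None else env
    # An empty or whitespace-only value is treated as "not set".
    if env.get("ANTHROPIC_API_KEY", "").strip():
        return ["--agent-judge"]
    return []


if __name__ == "__main__":
    flags = judge_flags()
    mode = "Agent-as-a-Judge" if flags else "traditional (no API key found)"
    print(f"Evaluation mode: {mode}")
```

Passing an explicit mapping instead of relying on os.environ keeps the check easy to exercise in tests or CI dry runs.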

Agent-as-a-Judge Framework

Based on the MetaAuto AI research, Agent-as-a-Judge provides contextual evaluation using domain-specific judge models that understand compliance requirements and failure modes.

Key Features

Domain-Specific Judges: FinanceJudge, SecurityJudge, MLJudge with specialized knowledge
345 Evaluation Scenarios: Finance (110), Security (120), ML (107) covering real-world compliance
Continuous Feedback: Actionable improvement recommendations with training signal generation
Multi-Model Support: Claude Sonnet and Haiku, with automatic cost optimization

Evaluation Pipeline

Agent Output → Domain Judge → Compliance Assessment + Improvement Recommendations + Training Signals

Compliance Frameworks: Finance (SOX, KYC, AML) • Security (OWASP, MITRE) • ML (MLOps, EU AI Act)

Usage Examples

# Agent-as-a-Judge evaluation (recommended)
arc-eval --domain finance --input outputs.json --agent-judge

# Academic benchmark evaluation
arc-eval --benchmark mmlu --subset anatomy --limit 20 --agent-judge

# Enhanced reliability with verification
arc-eval --domain security --input outputs.json --agent-judge --verify

# Confidence calibration for uncertainty quantification
arc-eval --domain ml --input outputs.json --agent-judge --confidence-calibration

# A/B test judge configurations
arc-eval --compare-judges config/templates.yaml --domain finance --input outputs.json

# Generate compliance reports
arc-eval --domain ml --input outputs.json --agent-judge --export pdf

# CI/CD integration with cost optimization
arc-eval --domain finance --input logs.json --agent-judge --judge-model claude-3-5-haiku

More Examples: See examples/ for detailed workflows, input formats, and CI/CD templates.

Input Format

{"output": "Transaction approved for customer John Smith"}

ARC-Eval auto-detects formats from OpenAI, Anthropic, LangChain, and custom agents. See examples/ for comprehensive format documentation.
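
As a rough illustration of the format above, the following Python sketch wraps raw agent responses in {"output": ...} records and writes them to a JSON file. Writing several records as a JSON list is an assumption (the README shows only a single object), and the helper names are hypothetical.

```python
import json
from pathlib import Path


def to_records(responses):
    """Wrap each raw agent response in the {"output": ...} shape shown above."""
    return [{"output": str(text)} for text in responses]


def write_outputs(responses, path):
    """Write the records to `path` as JSON and return them.

    Assumption: a JSON list of records is accepted as input; the README
    itself only shows one object.
    """
    records = to_records(responses)
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

For example, write_outputs(["Transaction approved for customer John Smith"], "your_outputs.json") would produce a file that could then be passed with --input your_outputs.json.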

Key Commands

# Essential flags
--domain finance|security|ml    # Choose evaluation domain
--input file.json               # Your agent outputs
--agent-judge                   # Enable Agent-as-a-Judge evaluation
--export pdf                    # Generate compliance reports

# Advanced evaluation
--benchmark mmlu|humeval|gsm8k  # Academic benchmark evaluation
--verify                        # Secondary judge validation (reliability)
--confidence-calibration        # Enhanced uncertainty quantification
--compare-judges config.yaml    # A/B test judge configurations

# Useful options  
--quick-start                   # Try with sample data
--judge-model auto|sonnet|haiku # Model selection for cost optimization
--summary-only                  # Executive summary only
--list-domains                  # See all evaluation scenarios

Full Reference: Run arc-eval --help or see examples/ for complete documentation.
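
When the command is assembled by a script rather than typed by hand, the flags above can be composed programmatically. The Python sketch below is a hypothetical helper (build_args is not part of ARC-Eval); it only assembles flags listed in this README and rejects domains other than finance, security, and ml.

```python
import shlex

DOMAINS = ("finance", "security", "ml")


def build_args(domain, input_path, agent_judge=True, verify=False, export=None):
    """Build an argv list for arc-eval from the flags documented above.

    Hypothetical helper; it only assembles flags that appear in this README.
    """
    if domain not in DOMAINS:
        raise ValueError(f"domain must be one of {DOMAINS}, got {domain!r}")
    args = ["arc-eval", "--domain", domain, "--input", str(input_path)]
    if agent_judge:
        args.append("--agent-judge")
    if verify:
        args.append("--verify")
    if export:
        args += ["--export", export]
    return args


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(shlex.join(build_args("security", "outputs.json", verify=True, export="pdf")))
```

Returning a list rather than a single string lets the result be passed to subprocess.run without shell quoting concerns.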

Production Integration

CI/CD Pipeline

# Automated compliance gate: fail the job when the evaluation exits non-zero
arc-eval --domain finance --input "$CI_ARTIFACTS/logs.json" --agent-judge || exit 1
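
The same gate can be expressed in Python for pipelines that orchestrate steps from a script. This is a minimal sketch with hypothetical function names. It assumes, as the shell snippet above does, that arc-eval signals a failed evaluation through a non-zero exit code.

```python
import subprocess
import sys


def run_gate(argv):
    """Run a command and return its exit code (0 is treated as a pass)."""
    return subprocess.run(argv, check=False).returncode


def gate(argv):
    """Return True if the command exited with 0; otherwise log and return False."""
    code = run_gate(argv)
    if code != 0:
        print(f"compliance gate failed (exit code {code})", file=sys.stderr)
    return code == 0


# Possible use in a CI step (assumes arc-eval is installed on the runner):
#   ok = gate(["arc-eval", "--domain", "finance",
#              "--input", "logs.json", "--agent-judge"])
#   sys.exit(0 if ok else 1)
```

Keeping the exit code check in one function makes it straightforward to add retries or reporting later without touching the evaluation command itself.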

Continuous Improvement Pipeline

ARC-Eval is building toward turning evaluation outcomes into agent retraining signals and RL environments, so that agents can improve iteratively against real-world, regulatory-grade benchmarks:

345 Evaluation Scenarios: Finance (110) • Security (120) • ML (107)
Domain-Specific Judges: SecurityJudge, FinanceJudge, MLJudge with specialized knowledge
Self-Improvement Engine: Training data generation and retraining triggers from evaluation feedback
Compliance Reports: Automated report generation with regulatory framework mapping
Model Optimization: Adaptive model selection (Claude Sonnet ↔ Haiku)
Integration Templates: GitHub Actions, input formats, production deployment guides

Complete Integration Guide: See examples/ci-templates/ for production-ready CI/CD workflows.

Getting Started

Set API Key: export ANTHROPIC_API_KEY="your-key-here" or add to .env file
Try Demo: arc-eval --quick-start --domain finance --agent-judge
Evaluate Agents: arc-eval --domain security --input outputs.json --agent-judge
See Examples: examples/ for workflows and integration guides

Research & References

Agent-as-a-Judge Framework: MetaAuto AI
Research Paper: arXiv:2410.10934v2
Domain Evaluation: 345 scenarios across Finance, Security, and ML compliance

ARC-Eval: Domain-specific agent evaluation using the Agent-as-a-Judge framework for continuous improvement.

MIT License • Documentation • GitHub

Release history

0.2.9 (Jun 4, 2025)
0.2.8 (Jun 1, 2025)
0.2.7 (May 30, 2025)
0.2.6 (May 29, 2025)
0.2.5 (May 28, 2025)
0.2.4 (May 28, 2025)
0.2.3 (May 27, 2025)
0.2.2 (May 27, 2025, this version)
0.2.1 (May 26, 2025)
0.2.0 (May 25, 2025)
0.1.0 (May 24, 2025)

Download files

Source Distribution

arc_eval-0.2.2.tar.gz (178.8 kB)

Uploaded May 27, 2025 Source

Built Distribution

arc_eval-0.2.2-py3-none-any.whl (191.6 kB)

Uploaded May 27, 2025 Python 3

File details

Details for the file arc_eval-0.2.2.tar.gz.

File metadata

Download URL: arc_eval-0.2.2.tar.gz
Upload date: May 27, 2025
Size: 178.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`289c98e090d07fde32f18e9a3fd50d3eec028f5ab1d3b56f7728773303c99ac2`
MD5	`5c6f903a3a3411a694e6f0599a42effd`
BLAKE2b-256	`3fc7f0138e2f9cd7433fdd1b5cba94e640aa11368e357bf968b91b4468373a82`

File details

Details for the file arc_eval-0.2.2-py3-none-any.whl.

File metadata

Download URL: arc_eval-0.2.2-py3-none-any.whl
Upload date: May 27, 2025
Size: 191.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e5b4dabe58eecacad9f24a081b61ed7280ddbf96c3f9c7f4970146e6f5224138`
MD5	`8490ef31c72e6b61df2b776aced05b6f`
BLAKE2b-256	`9e7150fc530aef341a4394f6bac76390492c6068aa7e6d5fed3e411ab43d6115`
