ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents
ARC-Eval: Agent-as-a-Judge Enterprise Platform
The first Agent-as-a-Judge platform for enterprise agent evaluation
Transform your agent compliance from static audits to continuous improvement. Get AI-powered feedback, CISO-ready reports, and actionable recommendations across 345 enterprise scenarios.
Quick Start
# Install
pip install arc-eval
# Try it instantly (no setup required)
arc-eval --quick-start --domain finance --agent-judge
# Evaluate your agent outputs
arc-eval --domain finance --input your_outputs.json --agent-judge
# Generate executive reports
arc-eval --domain security --input outputs.json --agent-judge --export pdf
Need an API key? Set the ANTHROPIC_API_KEY environment variable to enable Agent-as-a-Judge features, or run traditional evaluation without AI feedback.
Why Agent-as-a-Judge?
Traditional compliance tools give you pass/fail results. Agent-as-a-Judge gives you a path to improvement.
🎯 Value Delivered
- Continuous Feedback: AI judges provide actionable recommendations, not just scores
- Enterprise Scale: 345 scenarios across Finance (110), Security (120), ML (107) domains
- CISO-Ready: Executive reports with compliance framework mapping
- Cost Optimized: Smart model selection and fallbacks for production use
⚡ How It Works
Your Agent Output → AI Judge → Compliance Score + Improvement Plan + Training Signals → Self-Improvement Loop
Domains: Finance (SOX, KYC, AML) • Security (OWASP, MITRE) • ML (MLOps, EU AI Act)
Common Use Cases
# 🚀 Demo & Discovery
arc-eval --quick-start --domain finance --agent-judge
# 📊 Evaluate Your Agents
arc-eval --domain security --input outputs.json --agent-judge
# 🏢 Executive Reporting
arc-eval --domain ml --input outputs.json --agent-judge --export pdf --summary-only
# ⚙️ CI/CD Integration
arc-eval --domain finance --input logs.json --agent-judge --judge-model claude-3-5-haiku
More Examples: See examples/ for detailed workflows, input formats, and CI/CD templates.
Input Format
{"output": "Transaction approved for customer John Smith"}
ARC-Eval auto-detects formats from OpenAI, Anthropic, LangChain, and custom agents. See examples/ for comprehensive format documentation.
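As a minimal sketch of producing a file in the shape shown above, the snippet below writes a single agent output to JSON. The helper name `write_agent_output` is hypothetical and not part of ARC-Eval; only the `{"output": ...}` record shape comes from this README. For other accepted layouts, see examples/.

```python
import json
from pathlib import Path


def write_agent_output(text: str, path: str) -> Path:
    """Write one agent output using the {"output": ...} shape from the README.

    Hypothetical helper for illustration; ARC-Eval does not ship this function.
    """
    target = Path(path)
    record = {"output": text}
    target.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return target
```

Calling `write_agent_output("Transaction approved for customer John Smith", "your_outputs.json")` would create a file that can then be passed to `--input`.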
Key Commands
# Essential flags
--domain finance|security|ml # Choose evaluation domain
--input file.json # Your agent outputs
--agent-judge # Enable AI feedback
--export pdf # Generate reports
# Useful options
--quick-start # Try with sample data
--judge-model auto|sonnet|haiku # Cost optimization
--summary-only # Executive reports only
--list-domains # See all scenarios
Full Reference: Run arc-eval --help or see examples/ for complete documentation.
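If you assemble the command line from a script, a small builder can catch a mistyped domain before anything is invoked. This is only a sketch: `build_args` is a hypothetical helper, and it uses just the flags listed above. The `--judge-model` value is passed through unchecked because the accepted spellings are not fully specified here.

```python
from typing import List, Optional

# Domains listed under "Essential flags" above.
DOMAINS = ("finance", "security", "ml")


def build_args(domain: str,
               input_path: Optional[str] = None,
               agent_judge: bool = False,
               export_pdf: bool = False,
               judge_model: Optional[str] = None,
               summary_only: bool = False) -> List[str]:
    """Return an argv list for arc-eval (hypothetical helper, not part of the package)."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain {domain!r}; expected one of {DOMAINS}")
    args = ["arc-eval", "--domain", domain]
    if input_path is not None:
        args += ["--input", input_path]
    if agent_judge:
        args.append("--agent-judge")
    if judge_model is not None:
        # Passed through as given; this sketch does not validate model names.
        args += ["--judge-model", judge_model]
    if export_pdf:
        args += ["--export", "pdf"]
    if summary_only:
        args.append("--summary-only")
    return args
```

The returned list can be handed directly to `subprocess.run`, which avoids shell quoting issues with file paths.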
Enterprise Integration
CI/CD Pipeline
# Basic compliance gate
arc-eval --domain finance --input "$CI_ARTIFACTS/logs.json" --agent-judge
if [ $? -ne 0 ]; then exit 1; fi
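The same gate can be sketched in Python for pipelines that drive their steps from a script. It assumes, as the shell gate above implies, that arc-eval exits non-zero when the evaluation fails. `run_gate` and `main` are hypothetical names, and the `runner` parameter exists only so the logic can be exercised without the CLI installed.

```python
import os
import subprocess
from typing import Callable, Sequence


def run_gate(command: Sequence[str],
             runner: Callable[[Sequence[str]], int] = subprocess.call) -> int:
    """Run the evaluation command; return 0 on success and 1 on any non-zero exit."""
    exit_code = runner(command)
    return 0 if exit_code == 0 else 1


def main() -> int:
    """Mirror the shell gate: evaluate logs.json found under $CI_ARTIFACTS."""
    # Falling back to the current directory is an assumption of this sketch.
    artifacts = os.environ.get("CI_ARTIFACTS", ".")
    command = [
        "arc-eval", "--domain", "finance",
        "--input", os.path.join(artifacts, "logs.json"),
        "--agent-judge",
    ]
    return run_gate(command)
```

In a CI script, `sys.exit(main())` would fail the job on a non-zero result, matching the `exit 1` in the shell version.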
Enterprise Features
- 345 Enterprise Scenarios: Finance (110) • Security (120) • ML (107)
- AI Judge Framework: SecurityJudge, FinanceJudge, MLJudge with continuous feedback
- Self-Improvement Engine: Automatic training data generation and retraining triggers from evaluation feedback
- CISO-Ready Reports: Executive dashboards with compliance framework mapping
- Cost Optimization: Smart model selection (Claude Sonnet ↔ Haiku)
- Production Templates: GitHub Actions, input formats, enterprise onboarding
Complete Integration Guide: See examples/ci-templates/ for production-ready CI/CD workflows.
What's Next?
- Try the Demo: arc-eval --quick-start --domain finance --agent-judge
- Explore Examples: examples/ for workflows and CI/CD templates
- Enterprise Setup: examples/ci-templates/ for production deployment
- Get Support: Run arc-eval --help or visit our documentation
ARC-Eval: Transform agent compliance from static audits to continuous improvement with AI-powered feedback.
MIT License • Documentation • GitHub