
ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents


ARC-Eval: Agent-as-a-Judge Enterprise Platform

License: MIT • Python 3.9+

The first Agent-as-a-Judge platform for enterprise agent evaluation

Transform your agent compliance from static audits to continuous improvement. Get AI-powered feedback, CISO-ready reports, and actionable recommendations across 345 enterprise scenarios.

Quick Start

# Install
pip install arc-eval

# Try it instantly (no setup required)
arc-eval --quick-start --domain finance --agent-judge

# Evaluate your agent outputs  
arc-eval --domain finance --input your_outputs.json --agent-judge

# Generate executive reports
arc-eval --domain security --input outputs.json --agent-judge --export pdf

Need an API key? Set ANTHROPIC_API_KEY for Agent-as-a-Judge features, or use traditional evaluation without AI feedback.
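
A wrapper script can make this choice explicit by checking for the key before adding `--agent-judge`. The following is a minimal sketch: the helper name `judge_flags` is hypothetical, and it assumes that leaving out `--agent-judge` is what selects traditional evaluation.

```python
import os
from typing import List, Mapping, Optional


def judge_flags(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return ["--agent-judge"] only when ANTHROPIC_API_KEY is set and non-empty."""
    env = os.environ if env is None else env
    return ["--agent-judge"] if env.get("ANTHROPIC_API_KEY", "").strip() else []


# Hypothetical usage: assemble the quick-start command shown above.
command = ["arc-eval", "--quick-start", "--domain", "finance"] + judge_flags()
print(" ".join(command))
```

Passing a plain dictionary instead of the real environment keeps the helper easy to check in isolation.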

Why Agent-as-a-Judge?

Traditional compliance tools give you pass/fail results. Agent-as-a-Judge gives you a path to improvement.

🎯 Value Delivered

  • Continuous Feedback: AI judges provide actionable recommendations, not just scores
  • Enterprise Scale: 345 scenarios across Finance (110), Security (120), ML (107) domains
  • CISO-Ready: Executive reports with compliance framework mapping
  • Cost Optimized: Smart model selection and fallbacks for production use

⚡ How It Works

Your Agent Output → AI Judge → Compliance Score + Improvement Plan + Training Signals → Self-Improvement Loop

Domains: Finance (SOX, KYC, AML) • Security (OWASP, MITRE) • ML (MLOps, EU AI Act)

Common Use Cases

# 🚀 Demo & Discovery
arc-eval --quick-start --domain finance --agent-judge

# 📊 Evaluate Your Agents  
arc-eval --domain security --input outputs.json --agent-judge

# 🏢 Executive Reporting
arc-eval --domain ml --input outputs.json --agent-judge --export pdf --summary-only

# ⚙️ CI/CD Integration
arc-eval --domain finance --input logs.json --agent-judge --judge-model claude-3-5-haiku

More Examples: See examples/ for detailed workflows, input formats, and CI/CD templates.

Input Format

{"output": "Transaction approved for customer John Smith"}

ARC-Eval auto-detects formats from OpenAI, Anthropic, LangChain, and custom agents. See examples/ for comprehensive format documentation.
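
If your agent's responses are available as plain strings, a short script can wrap them in the `{"output": ...}` shape shown above. This is a sketch rather than the project's own tooling: `write_outputs` is a hypothetical helper, and it assumes a file holding a JSON array of such records is accepted. The README itself only shows a single object, so check examples/ for the supported layouts.

```python
import json
from pathlib import Path
from typing import Iterable, List


def write_outputs(texts: Iterable[str], path: str) -> List[dict]:
    """Wrap each agent response as {"output": ...} and save the records as JSON."""
    records = [{"output": text} for text in texts]
    # Assumption: a JSON array of records is a valid input file.
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

Calling `write_outputs(["Transaction approved for customer John Smith"], "your_outputs.json")` would then produce a file to pass to `--input`.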

Key Commands

# Essential flags
--domain finance|security|ml    # Choose evaluation domain
--input file.json               # Your agent outputs
--agent-judge                   # Enable AI feedback
--export pdf                    # Generate reports

# Useful options  
--quick-start                   # Try with sample data
--judge-model auto|sonnet|haiku # Cost optimization
--summary-only                  # Executive reports only
--list-domains                  # See all scenarios

Full Reference: Run arc-eval --help or see examples/ for complete documentation.
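
When the command is launched from a script, it may help to assemble the argument list in one place. The sketch below uses only the flags listed above. The function `build_command` and its parameter names are hypothetical, and the fallback to `--quick-start` when no input file is given is a design choice of this example, not documented behavior.

```python
from typing import List, Optional

DOMAINS = ("finance", "security", "ml")


def build_command(
    domain: str,
    input_path: Optional[str] = None,
    agent_judge: bool = True,
    judge_model: Optional[str] = None,
    export: Optional[str] = None,
    summary_only: bool = False,
) -> List[str]:
    """Assemble an arc-eval argument list from the flags listed above."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    cmd = ["arc-eval", "--domain", domain]
    # Without an input file, fall back to the bundled sample data.
    cmd += ["--input", input_path] if input_path else ["--quick-start"]
    if agent_judge:
        cmd.append("--agent-judge")
    if judge_model:
        cmd += ["--judge-model", judge_model]
    if export:
        cmd += ["--export", export]
    if summary_only:
        cmd.append("--summary-only")
    return cmd


print(" ".join(build_command("ml", "outputs.json", export="pdf", summary_only=True)))
```

The final line prints the same executive-reporting command shown under Common Use Cases. Returning a list rather than a string lets the result go straight to `subprocess.run` without shell quoting.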

Enterprise Integration

CI/CD Pipeline

# Basic compliance gate: fail the job when arc-eval exits non-zero
arc-eval --domain finance --input "$CI_ARTIFACTS/logs.json" --agent-judge || exit 1
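
For pipelines driven from Python rather than shell, the same gate can be sketched with `subprocess`. It assumes, as the shell version implies, that arc-eval exits non-zero when the evaluation fails. The function name `compliance_gate` and the 127 fallback code are choices made for this example.

```python
import subprocess
import sys
from typing import List


def compliance_gate(command: List[str]) -> int:
    """Run the command and return its exit code; a missing executable counts as failure."""
    try:
        return subprocess.run(command, check=False).returncode
    except FileNotFoundError:
        # The executable is not installed on this runner.
        print(f"command not found: {command[0]}", file=sys.stderr)
        return 127
```

A CI step could then end with `sys.exit(compliance_gate(["arc-eval", "--domain", "finance", "--input", "logs.json", "--agent-judge"]))` so that the job fails whenever the evaluation does.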

Enterprise Features

  • 345 Enterprise Scenarios: Finance (110) • Security (120) • ML (107)
  • AI Judge Framework: SecurityJudge, FinanceJudge, MLJudge with continuous feedback
  • Self-Improvement Engine: Automatic training data generation and retraining triggers from evaluation feedback
  • CISO-Ready Reports: Executive dashboards with compliance framework mapping
  • Cost Optimization: Smart model selection (Claude Sonnet ↔ Haiku)
  • Production Templates: GitHub Actions, input formats, enterprise onboarding

Complete Integration Guide: See examples/ci-templates/ for production-ready CI/CD workflows.


What's Next?

  1. Try the Demo: arc-eval --quick-start --domain finance --agent-judge
  2. Explore Examples: examples/ for workflows and CI/CD templates
  3. Enterprise Setup: examples/ci-templates/ for production deployment
  4. Get Support: Run arc-eval --help or visit our documentation

ARC-Eval: Transform agent compliance from static audits to continuous improvement with AI-powered feedback.

MIT License • Documentation • GitHub
