ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents

ARC-Eval CLI

PyPI version | License: MIT | Python 3.9+

Agent Reliability & Compliance evaluation for LLMs and AI agents

ARC-Eval is a CLI-first platform that lets teams check, with a single command, whether their agents are safe, reliable, and compliant. It returns actionable findings and audit-ready reports in seconds.

Quick Start

Installation

# Install from PyPI (recommended)
pip install arc-eval

# Or clone and install from source
git clone https://github.com/arc-computer/arc-eval
cd arc-eval  
pip install -e .

Try It Now (Zero Setup)

# Interactive demo with built-in sample data
arc-eval --quick-start

# Try different domains
arc-eval --quick-start --domain finance
arc-eval --quick-start --domain security  
arc-eval --quick-start --domain ml

# Generate executive report
arc-eval --quick-start --domain finance --export pdf --summary-only

Basic Usage

# Evaluate your agent outputs
arc-eval --domain finance --input your_outputs.json

# Validate input format first
arc-eval --validate --input your_outputs.json

# Generate audit-ready reports
arc-eval --domain finance --input outputs.json --export pdf --workflow

# Custom output location and format
arc-eval --domain finance --input outputs.json --export pdf --output-dir reports/ --format-template executive

How It Works

ARC-Eval evaluates your agent/LLM outputs against domain-specific compliance scenarios. It auto-detects input formats, runs evaluations, and generates executive-ready reports.

Input → Evaluation → Output

  1. Feed agent outputs (JSON file, pipe, or demo data)
  2. Select domain (finance, security, ml)
  3. Get results (terminal dashboard + optional exports)
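The three steps above can also be driven from a script. The following is a minimal sketch, not part of ARC-Eval itself: `build_command` and `run_evaluation` are hypothetical helper names, and the sketch uses only flags documented in this README (`--domain`, `--input`, `--export`, `--output-dir`). It assumes `arc-eval` is installed and on `PATH`.

```python
import subprocess
from typing import List, Optional

# Domains listed in this README.
DOMAINS = ("finance", "security", "ml")


def build_command(domain: str, input_path: str,
                  export: Optional[str] = None,
                  output_dir: Optional[str] = None) -> List[str]:
    """Assemble an arc-eval invocation from flags documented in this README."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    cmd = ["arc-eval", "--domain", domain, "--input", input_path]
    if export is not None:
        cmd += ["--export", export]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    return cmd


def run_evaluation(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs arc-eval on PATH)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works anywhere.
    print(" ".join(build_command("finance", "your_outputs.json", export="pdf")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file paths contain spaces.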

Key Capabilities

🚀 Zero-Friction Onboarding

  • Interactive demo mode with --quick-start
  • No API keys, accounts, or configuration required
  • Works completely offline

📋 Domain-Specific Evaluation Packs

  • Finance (15 scenarios): SOX, KYC, AML, PCI-DSS, GDPR, FFIEC, DORA, OFAC, CFPB, EU-AI-ACT
  • Security (15 scenarios): OWASP-LLM-TOP-10, NIST-AI-RMF, ISO-27001, SOC2-TYPE-II, MITRE-ATTACK
  • ML (15 scenarios): IEEE-ETHICS, MODEL-CARDS, ALGORITHMIC-ACCOUNTABILITY, MLOPS-GOVERNANCE

📊 Professional Output Formats

  • Rich Terminal UI: Executive dashboard with compliance framework breakdown
  • PDF Reports: Audit-ready with risk assessment and remediation guidance
  • CSV/JSON: Integration-friendly for CI/CD and data analysis
  • Format Templates: Executive, technical, compliance, or minimal styles

⚡ Power User Features

  • Custom Export Paths: --output-dir reports/ for organized file management
  • Executive Summary Mode: --summary-only for C-suite consumption
  • Performance Analytics: --timing with scaling projections and optimization insights
  • Input Validation: --validate to test formats before evaluation
  • Format Templates: --format-template executive for audience-specific reports

Usage Examples

Getting Started

# Try the interactive demo
arc-eval --quick-start --domain finance

# See all available domains and their coverage
arc-eval --list-domains

# Get help with input formats
arc-eval --help-input

Evaluation Workflows

# Basic evaluation
arc-eval --domain finance --input your_outputs.json

# With validation first
arc-eval --validate --input your_outputs.json
arc-eval --domain finance --input your_outputs.json

# Executive reporting
arc-eval --domain finance --input outputs.json --export pdf --summary-only --format-template executive

# Developer analysis
arc-eval --domain security --input outputs.json --dev --timing --verbose

# CI/CD integration
arc-eval --domain ml --input model_outputs.json --output json --output-dir reports/

Sample Output

 📊 Financial Services Compliance Evaluation Report
══════════════════════════════════════════════════════════════════════
  📈 Pass Rate:             53.3%         ⚠️  Risk Level:         🔴 HIGH RISK
  ✅ Passed:                  8           ❌ Failed:                  7
  🔴 Critical:                3           🟡 High:                    3
  🔵 Medium:                  1           📊 Total:                   15

⚖️  Compliance Framework Dashboard
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Framework      ┃   Status    ┃ Scenarios ┃  Pass Rate  ┃ Issues              ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ AML            │ 🔴 CRITICAL │    4/8    │    50.0%    │ 🔴 3 Critical       │
│ KYC            │ 🔴 CRITICAL │    0/3    │    0.0%     │ 🔴 2 Critical       │
│ SOX            │ 🔴 CRITICAL │    2/4    │    50.0%    │ 🔴 1 Critical       │
│ PCI-DSS        │     ✅      │    1/1    │   100.0%    │ No issues           │
└────────────────┴─────────────┴───────────┴─────────────┴─────────────────────┘

📄 Audit Report: reports/arc-eval_finance_2024-05-24_executive_summary.pdf
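The headline figures in the sample report can be cross-checked with simple arithmetic: 8 passed out of 15 total gives 53.3%, and the per-severity failures (3 critical, 3 high, 1 medium) add up to the 7 failed scenarios. The sketch below reproduces that check. `pass_rate` is a hypothetical helper written for illustration and is not ARC-Eval's own code.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


# Figures copied from the sample report above.
summary = {"passed": 8, "failed": 7, "total": 15}
severities = {"critical": 3, "high": 3, "medium": 1}

overall = pass_rate(summary["passed"], summary["total"])
failures_accounted_for = sum(severities.values()) == summary["failed"]

if __name__ == "__main__":
    print(f"Pass rate: {overall}%")
    print(f"Failures match severity counts: {failures_accounted_for}")
```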

Command Reference

Core Options

  • --domain - Select evaluation domain: finance, security, ml
  • --input - Input file with agent outputs (JSON format)
  • --stdin - Read from pipe instead of file
  • --quick-start - Demo mode with built-in sample data

Export & Output

  • --export - Export format: pdf, csv, json
  • --output-dir - Custom directory for exported files
  • --format-template - Report style: executive, technical, compliance, minimal
  • --summary-only - Generate executive summary only (skip detailed scenarios)
  • --workflow - Audit/compliance reporting mode

Analysis & Debugging

  • --dev - Developer mode with verbose technical details
  • --timing - Performance analytics with scaling projections
  • --verbose - Detailed logging and debugging information
  • --validate - Test input format without running evaluation

Help & Discovery

  • --list-domains - Show all available domains and their coverage
  • --help-input - Input format documentation with examples

Input Formats

ARC-Eval auto-detects and processes multiple input formats. Save your agent outputs to a JSON file or pipe them directly.

Universal Format (Recommended)

{"output": "Transaction approved for customer John Smith"}

Batch Processing

[
  {"output": "KYC verification completed successfully"},
  {"output": "Transaction flagged for manual review"},
  {"output": "Payment processing failed - insufficient funds"}
]
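Because an input file may hold either a single object or an array of them, a script that prepares inputs may want to treat both the same way. The sketch below is a local convenience under that assumption; `load_records` is a hypothetical name, and it is not a substitute for `arc-eval --validate`, whose exact rules are not described here.

```python
import json
from typing import Any, Dict, List


def load_records(text: str) -> List[Dict[str, Any]]:
    """Parse JSON text into a list of records.

    A single object becomes a one-element list; an array is returned as-is.
    """
    data = json.loads(text)
    if isinstance(data, dict):
        return [data]
    if isinstance(data, list) and all(isinstance(item, dict) for item in data):
        return data
    raise ValueError("expected a JSON object or an array of objects")


if __name__ == "__main__":
    single = '{"output": "Transaction approved for customer John Smith"}'
    print(len(load_records(single)))
```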

Framework Auto-Detection

ARC-Eval automatically handles outputs from:

OpenAI API

{"choices": [{"message": {"content": "Processing wire transfer..."}}]}

Anthropic API

{"content": "Transaction flagged for review..."}

LangChain

{"llm_output": "Customer identity verified", "agent_scratchpad": "..."}

Custom Agents

{"output": "Result", "metadata": {"confidence": 0.9, "model": "gpt-4"}}
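To illustrate what auto-detection amounts to, the sketch below maps each of the shapes shown above onto the universal `{"output": ...}` form. It is an illustration only: `extract_output` and `normalize` are hypothetical names, the key checks are inferred from the examples in this section, and ARC-Eval's actual parser may work differently. Real API payloads may also carry more fields than these simplified shapes.

```python
from typing import Any, Dict


def extract_output(record: Dict[str, Any]) -> str:
    """Pull the response text out of one of the shapes shown above."""
    if "output" in record:          # universal format / custom agents
        return str(record["output"])
    if "choices" in record:         # OpenAI-style response
        return str(record["choices"][0]["message"]["content"])
    if "llm_output" in record:      # LangChain-style record
        return str(record["llm_output"])
    if "content" in record:         # Anthropic-style record, as shown above
        return str(record["content"])
    raise ValueError("unrecognized record shape")


def normalize(record: Dict[str, Any]) -> Dict[str, str]:
    """Return the record in the universal {"output": ...} form."""
    return {"output": extract_output(record)}


if __name__ == "__main__":
    openai_style = {"choices": [{"message": {"content": "Processing wire transfer..."}}]}
    print(normalize(openai_style))
```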

Integration Patterns

CI/CD Pipeline Integration

# Basic compliance check
arc-eval --domain finance --input "$CI_ARTIFACTS/agent_logs.json" --output json
status=$?
if [ "$status" -ne 0 ]; then
  # Exit code 1 = critical failures, 2 = invalid input or configuration
  echo "arc-eval failed with exit code $status"
  exit 1
fi

# Generate compliance reports
arc-eval --domain security --input outputs.json --export pdf --output-dir reports/

Exit Codes

  • 0 - All scenarios passed
  • 1 - Critical failures detected
  • 2 - Invalid input or configuration
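A pipeline written in Python can branch on these exit codes rather than treating every non-zero status the same way. The sketch below assumes only the three codes listed above; `describe_exit` and `gate` are hypothetical helper names, and any other code is reported as unexpected rather than guessed at.

```python
import subprocess
import sys
from typing import List

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: "All scenarios passed",
    1: "Critical failures detected",
    2: "Invalid input or configuration",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into the meaning documented above."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}")


def gate(cmd: List[str]) -> int:
    """Run a command, log what its exit code means, and return the code."""
    result = subprocess.run(cmd, check=False)
    print(describe_exit(result.returncode), file=sys.stderr)
    return result.returncode


if __name__ == "__main__":
    # Requires arc-eval on PATH; the flags are the ones used in this README.
    sys.exit(gate(["arc-eval", "--domain", "finance", "--input", "outputs.json"]))
```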

Real-time Monitoring

# Pipe live agent outputs
tail -f agent.log | jq '.response' | arc-eval --domain ml --stdin

# Process API responses
curl -s https://my-agent.com/api/outputs | arc-eval --domain finance --stdin
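The same piping can be done from inside a Python process, which avoids writing an intermediate file. This is a sketch under assumptions: `stdin_command`, `stdin_payload`, and `pipe_records` are hypothetical names, the payload uses the batch shape from the Input Formats section, and `arc-eval` must be on `PATH` for the final call to work.

```python
import json
import subprocess
from typing import Dict, List


def stdin_command(domain: str) -> List[str]:
    """Build the arc-eval invocation that reads from standard input."""
    return ["arc-eval", "--domain", domain, "--stdin"]


def stdin_payload(records: List[Dict[str, str]]) -> str:
    """Serialize records as a JSON array for piping."""
    return json.dumps(records)


def pipe_records(records: List[Dict[str, str]], domain: str) -> int:
    """Send records to arc-eval over stdin and return its exit code."""
    completed = subprocess.run(
        stdin_command(domain),
        input=stdin_payload(records),
        text=True,
        check=False,
    )
    return completed.returncode


if __name__ == "__main__":
    sample = [{"output": "Transaction flagged for manual review"}]
    print(stdin_command("finance"), stdin_payload(sample))
```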

Architecture

System Design

Input (JSON) → Parser → Evaluation Engine → Results → Exporters → Output
     ↓              ↓            ↓            ↓           ↓
  Auto-detect → Normalize → Domain Pack → Analysis → PDF/CSV/JSON

Project Structure

agent_eval/
├── core/              # Evaluation engine and types
├── domains/           # YAML evaluation packs (45 scenarios)
├── exporters/         # PDF, CSV, JSON report generators
└── cli.py             # Command-line interface

Domain Coverage

Finance Domain (15 scenarios)

  • Identity verification & KYC compliance
  • Sanctions & AML screening
  • Transaction monitoring & fraud detection
  • Data protection (PCI-DSS, GDPR)
  • Financial reporting accuracy (SOX, DORA)

Security Domain (15 scenarios)

  • Prompt injection & data leakage
  • Code security & access control
  • AI agent safety & OWASP compliance
  • Infrastructure security (ISO-27001, SOC2)

ML Domain (15 scenarios)

  • Bias detection & algorithmic fairness
  • Model governance & ethics compliance
  • Data governance & safety alignment
  • MLOps best practices

Development

Local Development

git clone https://github.com/arc-computer/arc-eval
cd arc-eval
pip install -e .

# Test your changes
arc-eval --quick-start --domain finance

Running Tests

pip install -e ".[dev]"
pytest tests/

License

MIT License - see LICENSE file for details.

ARC-Eval: Boardroom-ready trust for autonomous software. Run, audit, fix.

Download files

Source Distribution

arc_eval-0.2.0.tar.gz (49.2 kB)

Uploaded Source

Built Distribution

arc_eval-0.2.0-py3-none-any.whl (51.0 kB)

Uploaded Python 3

File details

Details for the file arc_eval-0.2.0.tar.gz.

File metadata

  • Download URL: arc_eval-0.2.0.tar.gz
  • Upload date:
  • Size: 49.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.0.tar.gz
Algorithm Hash digest
SHA256 001bb9f5362ce9613906ee6a6a971074ae7a12b077823acfd12affb6545ff286
MD5 f493f90059584a116df4d116977b3484
BLAKE2b-256 5a853d0d13212816ae0353570902a3c3b906db4848f3f3e5a7cc508b26a3179c

File details

Details for the file arc_eval-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: arc_eval-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 51.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a5ec57da84059f383f39c2aaac4743c9e4c10b542f3556529247723144a84686
MD5 7267e9d78f27dcee9860197e9d45dda4
BLAKE2b-256 ea898a16fcaea62b26d247f9459142b7ac71e6300b98c1cf006000e72272ea7c
