ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents
ARC-Eval CLI
Agent Reliability & Compliance evaluation for LLMs and AI agents
ARC-Eval is a CLI-first platform that lets teams check, with a single command, whether their agents are safe, reliable, and compliant. It produces actionable findings and audit-ready reports in seconds.
Quick Start
Installation
# Install from PyPI (recommended)
pip install arc-eval
# Or clone and install from source
git clone https://github.com/arc-computer/arc-eval
cd arc-eval
pip install -e .
Try It Now (Zero Setup)
# Interactive demo with built-in sample data
arc-eval --quick-start
# Try different domains
arc-eval --quick-start --domain finance
arc-eval --quick-start --domain security
arc-eval --quick-start --domain ml
# Generate executive report
arc-eval --quick-start --domain finance --export pdf --summary-only
Basic Usage
# Evaluate your agent outputs
arc-eval --domain finance --input your_outputs.json
# Validate input format first
arc-eval --validate --input your_outputs.json
# Generate audit-ready reports
arc-eval --domain finance --input outputs.json --export pdf --workflow
# Custom output location and format
arc-eval --domain finance --input outputs.json --export pdf --output-dir reports/ --format-template executive
How It Works
ARC-Eval evaluates your agent/LLM outputs against domain-specific compliance scenarios. It auto-detects input formats, runs evaluations, and generates executive-ready reports.
Input → Evaluation → Output
- Feed agent outputs (JSON file, pipe, or demo data)
- Select domain (finance, security, ml)
- Get results (terminal dashboard + optional exports)
Key Capabilities
Zero-Friction Onboarding
- Interactive demo mode with --quick-start
- No API keys, accounts, or configuration required
- Works completely offline
Domain-Specific Evaluation Packs
- Finance (15 scenarios): SOX, KYC, AML, PCI-DSS, GDPR, FFIEC, DORA, OFAC, CFPB, EU-AI-ACT
- Security (15 scenarios): OWASP-LLM-TOP-10, NIST-AI-RMF, ISO-27001, SOC2-TYPE-II, MITRE-ATTACK
- ML (15 scenarios): IEEE-ETHICS, MODEL-CARDS, ALGORITHMIC-ACCOUNTABILITY, MLOPS-GOVERNANCE
Professional Output Formats
- Rich Terminal UI: Executive dashboard with compliance framework breakdown
- PDF Reports: Audit-ready with risk assessment and remediation guidance
- CSV/JSON: Integration-friendly for CI/CD and data analysis
- Format Templates: Executive, technical, compliance, or minimal styles
Power User Features
- Custom Export Paths: --output-dir reports/ for organized file management
- Executive Summary Mode: --summary-only for C-suite consumption
- Performance Analytics: --timing with scaling projections and optimization insights
- Input Validation: --validate to test formats before evaluation
- Format Templates: --format-template executive for audience-specific reports
Usage Examples
Getting Started
# Try the interactive demo
arc-eval --quick-start --domain finance
# See all available domains and their coverage
arc-eval --list-domains
# Get help with input formats
arc-eval --help-input
Evaluation Workflows
# Basic evaluation
arc-eval --domain finance --input your_outputs.json
# With validation first
arc-eval --validate --input your_outputs.json
arc-eval --domain finance --input your_outputs.json
# Executive reporting
arc-eval --domain finance --input outputs.json --export pdf --summary-only --format-template executive
# Developer analysis
arc-eval --domain security --input outputs.json --dev --timing --verbose
# CI/CD integration
arc-eval --domain ml --input model_outputs.json --output json --output-dir reports/
Sample Output
Financial Services Compliance Evaluation Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Pass Rate: 53.3%    ⚠️ Risk Level: 🔴 HIGH RISK
✅ Passed: 8    ❌ Failed: 7
🔴 Critical: 3    🟡 High: 3
🔵 Medium: 1    Total: 15
Compliance Framework Dashboard
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Framework ┃ Status      ┃ Scenarios ┃ Pass Rate ┃ Issues        ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ AML       │ 🔴 CRITICAL │ 4/8       │ 50.0%     │ 🔴 3 Critical │
│ KYC       │ 🔴 CRITICAL │ 0/3       │ 0.0%      │ 🔴 2 Critical │
│ SOX       │ 🔴 CRITICAL │ 2/4       │ 50.0%     │ 🔴 1 Critical │
│ PCI-DSS   │ ✅          │ 1/1       │ 100.0%    │ No issues     │
└───────────┴─────────────┴───────────┴───────────┴───────────────┘
Audit Report: reports/arc-eval_finance_2024-05-24_executive_summary.pdf
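The percentages in the sample follow directly from the pass/total counts it reports. A minimal Python sketch of that arithmetic, using only the numbers shown in the sample (the helper name `pass_rate` is illustrative and not part of ARC-Eval):

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


# Overall counts from the sample report: 8 passed out of 15 scenarios.
overall = pass_rate(8, 15)

# Per-framework rows from the sample dashboard: (passed, total).
frameworks = {"AML": (4, 8), "KYC": (0, 3), "SOX": (2, 4), "PCI-DSS": (1, 1)}
per_framework = {name: pass_rate(p, t) for name, (p, t) in frameworks.items()}

print(overall)
print(per_framework)
```

How ARC-Eval derives the risk level from these numbers is not stated here, so the sketch stops at the pass rates.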
Command Reference
Core Options
- --domain - Select evaluation domain: finance, security, ml
- --input - Input file with agent outputs (JSON format)
- --stdin - Read from pipe instead of file
- --quick-start - Demo mode with built-in sample data
Export & Output
- --export - Export format: pdf, csv, json
- --output-dir - Custom directory for exported files
- --format-template - Report style: executive, technical, compliance, minimal
- --summary-only - Generate executive summary only (skip detailed scenarios)
Analysis & Debugging
- --dev - Developer mode with verbose technical details
- --timing - Performance analytics with scaling projections
- --verbose - Detailed logging and debugging information
- --validate - Test input format without running evaluation
Help & Discovery
- --list-domains - Show all available domains and their coverage
- --help-input - Input format documentation with examples
- --workflow - Audit/compliance reporting mode
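When scripting evaluations, it can help to assemble the command line from the options listed in this reference instead of concatenating strings by hand. The following is a rough sketch, not part of ARC-Eval: `build_command` is a hypothetical helper, and it only knows the flags and values named in this reference.

```python
import shlex
from typing import List, Optional

DOMAINS = {"finance", "security", "ml"}
EXPORT_FORMATS = {"pdf", "csv", "json"}
TEMPLATES = {"executive", "technical", "compliance", "minimal"}


def build_command(
    domain: str,
    input_path: Optional[str] = None,
    export: Optional[str] = None,
    output_dir: Optional[str] = None,
    format_template: Optional[str] = None,
    summary_only: bool = False,
) -> List[str]:
    """Assemble an arc-eval argument list from documented flags only."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    if export is not None and export not in EXPORT_FORMATS:
        raise ValueError(f"unknown export format: {export}")
    if format_template is not None and format_template not in TEMPLATES:
        raise ValueError(f"unknown format template: {format_template}")

    cmd = ["arc-eval", "--domain", domain]
    if input_path is not None:
        cmd += ["--input", input_path]
    if export is not None:
        cmd += ["--export", export]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if format_template is not None:
        cmd += ["--format-template", format_template]
    if summary_only:
        cmd.append("--summary-only")
    return cmd


cmd = build_command(
    "finance",
    input_path="outputs.json",
    export="pdf",
    output_dir="reports/",
    format_template="executive",
)
# shlex.join renders the list as a shell-safe command string.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` without going through a shell, which avoids quoting problems with unusual file names.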
Input Formats
ARC-Eval auto-detects and processes multiple input formats. Save your agent outputs to a JSON file or pipe them directly.
Universal Format (Recommended)
{"output": "Transaction approved for customer John Smith"}
Batch Processing
[
{"output": "KYC verification completed successfully"},
{"output": "Transaction flagged for manual review"},
{"output": "Payment processing failed - insufficient funds"}
]
Framework Auto-Detection
ARC-Eval automatically handles outputs from:
OpenAI API
{"choices": [{"message": {"content": "Processing wire transfer..."}}]}
Anthropic API
{"content": "Transaction flagged for review..."}
LangChain
{"llm_output": "Customer identity verified", "agent_scratchpad": "..."}
Custom Agents
{"output": "Result", "metadata": {"confidence": 0.9, "model": "gpt-4"}}
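To make the idea of format detection concrete, here is a minimal Python sketch that maps each of the shapes shown above onto the universal `{"output": ...}` form. It is an illustration only and not ARC-Eval's actual parser; the function names `extract_text` and `normalize` are hypothetical, and the key checks are assumptions based solely on the examples in this section.

```python
from typing import Any, Dict, List


def extract_text(record: Dict[str, Any]) -> str:
    """Pull the agent's text out of one record in any of the shapes shown above."""
    if "output" in record:        # universal format and custom agents
        return str(record["output"])
    if "choices" in record:       # OpenAI-style response
        return str(record["choices"][0]["message"]["content"])
    if "llm_output" in record:    # LangChain-style record
        return str(record["llm_output"])
    if "content" in record:       # Anthropic-style response, as shown above
        return str(record["content"])
    raise ValueError(f"unrecognized record keys: {sorted(record)}")


def normalize(payload: Any) -> List[Dict[str, str]]:
    """Convert a single record or a batch into a list of {"output": ...} objects."""
    records = payload if isinstance(payload, list) else [payload]
    return [{"output": extract_text(record)} for record in records]


examples = [
    {"choices": [{"message": {"content": "Processing wire transfer..."}}]},
    {"content": "Transaction flagged for review..."},
    {"llm_output": "Customer identity verified", "agent_scratchpad": "..."},
    {"output": "Result", "metadata": {"confidence": 0.9, "model": "gpt-4"}},
]
for item in normalize(examples):
    print(item)
```

Checking for `output` first means a record that already uses the universal format is passed through unchanged, even if it carries extra metadata.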
Integration Patterns
CI/CD Pipeline Integration
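# Basic compliance check (exit code 1 = critical failures, 2 = invalid input or configuration)
arc-eval --domain finance --input "$CI_ARTIFACTS/agent_logs.json" --output json
status=$?
if [ "$status" -ne 0 ]; then
echo "arc-eval exited with code $status (critical failures or invalid input)"
exit "$status"
fi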
# Generate compliance reports
arc-eval --domain security --input outputs.json --export pdf --output-dir reports/
Exit Codes
- 0 - All scenarios passed
- 1 - Critical failures detected
- 2 - Invalid input or configuration
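A pipeline step written in Python can branch on these exit codes instead of treating every nonzero status the same way. The sketch below assumes `arc-eval` is on the PATH when `run_check` is called; `describe_exit_code` and `run_check` are hypothetical helper names, and only the three codes listed above are interpreted.

```python
import subprocess
from typing import List

# Meanings copied from the exit code list above.
EXIT_MEANINGS = {
    0: "All scenarios passed",
    1: "Critical failures detected",
    2: "Invalid input or configuration",
}


def describe_exit_code(code: int) -> str:
    """Translate an arc-eval exit code into the documented meaning."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}")


def run_check(args: List[str]) -> int:
    """Run arc-eval with the given arguments and print what the exit code means."""
    result = subprocess.run(["arc-eval", *args], check=False)
    print(describe_exit_code(result.returncode))
    return result.returncode


# Show the mapping without invoking the CLI.
for code in (0, 1, 2):
    print(code, describe_exit_code(code))
```

A CI job could then call `run_check(["--domain", "finance", "--input", "outputs.json"])` and fail the build only when the returned code is 1, while reporting code 2 as a setup problem.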
Real-time Monitoring
# Pipe live agent outputs
tail -f agent.log | jq '.response' | arc-eval --domain ml --stdin
# Process API responses
curl -s https://my-agent.com/api/outputs | arc-eval --domain finance --stdin
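For logs that are not clean JSON, the `jq '.response'` step can be replaced by a small filter that skips unparseable lines. This is a sketch under assumptions: the log is taken to hold one JSON object per line with a `response` field (as the jq filter above implies), the sample lines are invented for illustration, and `log_to_batch` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable, List


def log_to_batch(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Collect the "response" field of each JSON log line into batch format."""
    batch: List[Dict[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(entry, dict) and "response" in entry:
            batch.append({"output": str(entry["response"])})
    return batch


# Invented sample log lines, including noise that should be ignored.
sample_log = [
    '{"response": "Bias check passed for model v2"}',
    "plain text line that is not JSON",
    '{"level": "debug"}',
    "",
    '{"response": "Prediction withheld pending review"}',
]
batch = log_to_batch(sample_log)
print(json.dumps(batch, indent=2))
```

The result has the same shape as the batch format described under Input Formats, so it can be written to a file and passed with `--input`.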
Architecture
System Design
Input (JSON) → Parser → Evaluation Engine → Results → Exporters → Output
     ↓            ↓             ↓              ↓           ↓
Auto-detect → Normalize → Domain Pack → Analysis → PDF/CSV/JSON
Project Structure
agent_eval/
├── core/        # Evaluation engine and types
├── domains/     # YAML evaluation packs (45 scenarios)
├── exporters/   # PDF, CSV, JSON report generators
└── cli.py       # Command-line interface
Domain Coverage
Finance Domain (15 scenarios)
- Identity verification & KYC compliance
- Sanctions & AML screening
- Transaction monitoring & fraud detection
- Data protection (PCI-DSS, GDPR)
- Financial reporting accuracy (SOX, DORA)
Security Domain (15 scenarios)
- Prompt injection & data leakage
- Code security & access control
- AI agent safety & OWASP compliance
- Infrastructure security (ISO-27001, SOC2)
ML Domain (15 scenarios)
- Bias detection & algorithmic fairness
- Model governance & ethics compliance
- Data governance & safety alignment
- MLOps best practices
Development
Local Development
git clone https://github.com/arc-computer/arc-eval
cd arc-eval
pip install -e .
# Test your changes
arc-eval --quick-start --domain finance
Running Tests
pip install -e ".[dev]"
pytest tests/
License
MIT License - see LICENSE file for details.
ARC-Eval: Boardroom-ready trust for autonomous software. Run, audit, fix.
File details
Details for the file arc_eval-0.2.0.tar.gz.
File metadata
- Download URL: arc_eval-0.2.0.tar.gz
- Upload date:
- Size: 49.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 001bb9f5362ce9613906ee6a6a971074ae7a12b077823acfd12affb6545ff286 |
| MD5 | f493f90059584a116df4d116977b3484 |
| BLAKE2b-256 | 5a853d0d13212816ae0353570902a3c3b906db4848f3f3e5a7cc508b26a3179c |
File details
Details for the file arc_eval-0.2.0-py3-none-any.whl.
File metadata
- Download URL: arc_eval-0.2.0-py3-none-any.whl
- Upload date:
- Size: 51.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a5ec57da84059f383f39c2aaac4743c9e4c10b542f3556529247723144a84686 |
| MD5 | 7267e9d78f27dcee9860197e9d45dda4 |
| BLAKE2b-256 | ea898a16fcaea62b26d247f9459142b7ac71e6300b98c1cf006000e72272ea7c |
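The SHA256 digests listed above can be used to check a manually downloaded file before installing it. A minimal sketch using Python's standard `hashlib` follows; the helper names `sha256_of_file` and `verify_download` are illustrative, and the expected digests are copied from the file details above.

```python
import hashlib
import sys
from pathlib import Path

# SHA256 digests as listed in the file details above.
EXPECTED_SHA256 = {
    "arc_eval-0.2.0.tar.gz": "001bb9f5362ce9613906ee6a6a971074ae7a12b077823acfd12affb6545ff286",
    "arc_eval-0.2.0-py3-none-any.whl": "a5ec57da84059f383f39c2aaac4743c9e4c10b542f3556529247723144a84686",
}


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of_file(path) == expected


if __name__ == "__main__":
    # Usage: python verify.py arc_eval-0.2.0.tar.gz
    for name in sys.argv[1:]:
        status = "OK" if verify_download(Path(name)) else "MISMATCH"
        print(f"{name}: {status}")
```

When installing with pip, the same check can be enforced through a requirements file that pins hashes, so this script is mainly useful for files fetched by other means.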