ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents

ARC-Eval CLI

Agent Reliability & Compliance evaluation for LLMs and AI agents

ARC-Eval is a CLI-first platform that lets teams check, with a single command, whether their agents are safe, reliable, and compliant. It produces actionable insights and audit-ready reports in seconds.

Quick Start

Installation

# Install from PyPI (recommended)
pip install arc-eval

# Or clone and install from source
git clone https://github.com/arc-computer/arc-eval
cd arc-eval
pip install -e .

Try It Now (Zero Setup)

# Interactive demo with built-in sample data
arc-eval --quick-start

# Try different domains
arc-eval --quick-start --domain finance
arc-eval --quick-start --domain security
arc-eval --quick-start --domain ml

# Generate executive report
arc-eval --quick-start --domain finance --export pdf --summary-only

Basic Usage

# Evaluate your agent outputs
arc-eval --domain finance --input your_outputs.json

# Validate input format first
arc-eval --validate --input your_outputs.json

# Generate audit-ready reports
arc-eval --domain finance --input outputs.json --export pdf --workflow

# Custom output location and format
arc-eval --domain finance --input outputs.json --export pdf --output-dir reports/ --format-template executive

How It Works

ARC-Eval evaluates your agent/LLM outputs against domain-specific compliance scenarios. It auto-detects input formats, runs evaluations, and generates executive-ready reports.

Input → Evaluation → Output

1. Feed agent outputs (JSON file, pipe, or demo data)
2. Select domain (finance, security, ml)
3. Get results (terminal dashboard + optional exports)
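
As a rough illustration of the three steps above, the following Python sketch assembles an arc-eval invocation from flags documented in this README (--domain, --input, --stdin, --export, --output-dir). The helper name build_command is hypothetical and not part of ARC-Eval; it only builds the argument list and does not run anything.

```python
DOMAINS = ("finance", "security", "ml")
EXPORT_FORMATS = ("pdf", "csv", "json")


def build_command(domain, input_path=None, export=None, output_dir=None):
    """Build an arc-eval argument list (hypothetical helper, not part of ARC-Eval)."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    argv = ["arc-eval", "--domain", domain]

    # Step 1: feed agent outputs from a file, or from a pipe when no file is given.
    if input_path is not None:
        argv += ["--input", str(input_path)]
    else:
        argv.append("--stdin")

    # Step 3: optional exports on top of the terminal dashboard.
    if export is not None:
        if export not in EXPORT_FORMATS:
            raise ValueError(f"unknown export format: {export!r}")
        argv += ["--export", export]
    if output_dir is not None:
        argv += ["--output-dir", str(output_dir)]
    return argv


print(build_command("finance", "your_outputs.json", export="pdf"))
```

The resulting list can be handed to subprocess.run once arc-eval is installed.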

Key Capabilities

🚀 Zero-Friction Onboarding

Interactive demo mode with --quick-start
No API keys, accounts, or configuration required
Works completely offline

📋 Domain-Specific Evaluation Packs

Finance (15 scenarios): SOX, KYC, AML, PCI-DSS, GDPR, FFIEC, DORA, OFAC, CFPB, EU-AI-ACT
Security (15 scenarios): OWASP-LLM-TOP-10, NIST-AI-RMF, ISO-27001, SOC2-TYPE-II, MITRE-ATTACK
ML (15 scenarios): IEEE-ETHICS, MODEL-CARDS, ALGORITHMIC-ACCOUNTABILITY, MLOPS-GOVERNANCE

📊 Professional Output Formats

Rich Terminal UI: Executive dashboard with compliance framework breakdown
PDF Reports: Audit-ready with risk assessment and remediation guidance
CSV/JSON: Integration-friendly for CI/CD and data analysis
Format Templates: Executive, technical, compliance, or minimal styles

⚡ Power User Features

Custom Export Paths: --output-dir reports/ for organized file management
Executive Summary Mode: --summary-only for C-suite consumption
Performance Analytics: --timing with scaling projections and optimization insights
Input Validation: --validate to test formats before evaluation
Format Templates: --format-template executive for audience-specific reports

Usage Examples

Getting Started

# Try the interactive demo
arc-eval --quick-start --domain finance

# See all available domains and their coverage
arc-eval --list-domains

# Get help with input formats
arc-eval --help-input

Evaluation Workflows

# Basic evaluation
arc-eval --domain finance --input your_outputs.json

# With validation first
arc-eval --validate --input your_outputs.json
arc-eval --domain finance --input your_outputs.json

# Executive reporting
arc-eval --domain finance --input outputs.json --export pdf --summary-only --format-template executive

# Developer analysis
arc-eval --domain security --input outputs.json --dev --timing --verbose

# CI/CD integration
arc-eval --domain ml --input model_outputs.json --output json --output-dir reports/

Sample Output

 📊 Financial Services Compliance Evaluation Report 
══════════════════════════════════════════════════════════════════════
  📈 Pass Rate:             53.3%         ⚠️  Risk Level:         🔴 HIGH RISK   
  ✅ Passed:                  8           ❌ Failed:                  7         
  🔴 Critical:                3           🟡 High:                    3         
  🔵 Medium:                  1           📊 Total:                   15        

⚖️  Compliance Framework Dashboard
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Framework      ┃   Status    ┃ Scenarios ┃  Pass Rate  ┃ Issues              ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ AML            │ 🔴 CRITICAL │    4/8    │    50.0%    │ 🔴 3 Critical       │
│ KYC            │ 🔴 CRITICAL │    0/3    │    0.0%     │ 🔴 2 Critical       │
│ SOX            │ 🔴 CRITICAL │    2/4    │    50.0%    │ 🔴 1 Critical       │
│ PCI-DSS        │     ✅      │    1/1    │   100.0%    │ No issues           │
└────────────────┴─────────────┴───────────┴─────────────┴─────────────────────┘

📄 Audit Report: reports/arc-eval_finance_2024-05-24_executive_summary.pdf
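
The headline figures in the sample report are simple ratios, and the severity counts add up to the failure count. The short Python check below reproduces that arithmetic from the numbers shown above; the function name pass_rate is illustrative only and says nothing about how ARC-Eval computes its figures internally.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total, 1)


# Figures copied from the sample report above.
total, passed, failed = 15, 8, 7
critical, high, medium = 3, 3, 1

print(pass_rate(passed, total))            # overall: 8 of 15 scenarios -> 53.3
print(pass_rate(4, 8))                     # AML row: 4 of 8 -> 50.0
print(passed + failed == total)            # True
print(critical + high + medium == failed)  # True
```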

Command Reference

Core Options

--domain - Select evaluation domain: finance, security, ml
--input - Input file with agent outputs (JSON format)
--stdin - Read from pipe instead of file
--quick-start - Demo mode with built-in sample data

Export & Output

--export - Export format: pdf, csv, json
--output-dir - Custom directory for exported files
--format-template - Report style: executive, technical, compliance, minimal
--summary-only - Generate executive summary only (skip detailed scenarios)

Analysis & Debugging

--dev - Developer mode with verbose technical details
--timing - Performance analytics with scaling projections
--verbose - Detailed logging and debugging information
--validate - Test input format without running evaluation

Help & Discovery

--list-domains - Show all available domains and their coverage
--help-input - Input format documentation with examples
--workflow - Audit/compliance reporting mode

Input Formats

ARC-Eval auto-detects and processes multiple input formats. Save your agent outputs to a JSON file or pipe them directly.

Universal Format (Recommended)

{"output": "Transaction approved for customer John Smith"}

Batch Processing

[
  {"output": "KYC verification completed successfully"},
  {"output": "Transaction flagged for manual review"},
  {"output": "Payment processing failed - insufficient funds"}
]
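
If the agent outputs already live in Python as plain strings, a batch file in the shape shown above can be produced with the standard json module. This is a minimal sketch; write_batch_file is a hypothetical helper name, and the file it writes is meant to be passed to --input.

```python
import json
from pathlib import Path


def write_batch_file(outputs, path):
    """Write a list of output strings as a JSON array of {"output": ...} records."""
    records = [{"output": text} for text in outputs]
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records


# Example call (commented out so the sketch has no side effects when imported):
# write_batch_file(
#     ["KYC verification completed successfully",
#      "Transaction flagged for manual review"],
#     "your_outputs.json",
# )
```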

Framework Auto-Detection

ARC-Eval automatically handles outputs from:

OpenAI API

{"choices": [{"message": {"content": "Processing wire transfer..."}}]}

Anthropic API

{"content": "Transaction flagged for review..."}

LangChain

{"llm_output": "Customer identity verified", "agent_scratchpad": "..."}

Custom Agents

{"output": "Result", "metadata": {"confidence": 0.9, "model": "gpt-4"}}
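
To make the idea of auto-detection concrete, here is a small Python sketch that maps each of the record shapes shown above onto the universal {"output": ...} form. It is an illustration under assumptions, not ARC-Eval's actual parser: the function names and the order in which keys are checked are invented for this example.

```python
from typing import Any


def extract_text(record: dict[str, Any]) -> str:
    """Pull the agent's text out of one record in any of the shapes shown above."""
    if "output" in record:        # universal format and custom agents
        return str(record["output"])
    if "choices" in record:       # OpenAI-style chat completion
        return str(record["choices"][0]["message"]["content"])
    if "llm_output" in record:    # LangChain-style record
        return str(record["llm_output"])
    if "content" in record:       # Anthropic-style record as shown above
        return str(record["content"])
    raise ValueError(f"unrecognised record shape: {sorted(record)}")


def normalize(payload):
    """Accept a single record or a list of records; return universal-format records."""
    records = payload if isinstance(payload, list) else [payload]
    return [{"output": extract_text(record)} for record in records]


print(normalize({"choices": [{"message": {"content": "Processing wire transfer..."}}]}))
```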

Integration Patterns

CI/CD Pipeline Integration

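# Basic compliance check
arc-eval --domain finance --input "$CI_ARTIFACTS/agent_logs.json" --output json
status=$?
if [ "$status" -eq 1 ]; then
  echo "Critical compliance failures detected"
  exit 1
elif [ "$status" -ne 0 ]; then
  echo "Invalid input or configuration (exit code $status)"
  exit "$status"
fi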

# Generate compliance reports
arc-eval --domain security --input outputs.json --export pdf --output-dir reports/

Exit Codes

0 - All scenarios passed
1 - Critical failures detected
2 - Invalid input or configuration
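
When arc-eval is driven from a Python script rather than a shell step, the exit codes listed above can be turned into readable messages. The sketch below assumes only the three codes documented here; describe_exit_code and run_check are hypothetical helper names, and a stand-in child process is used so the example runs without arc-eval installed.

```python
import subprocess
import sys

# Meanings copied from the exit code list above.
EXIT_MEANINGS = {
    0: "All scenarios passed",
    1: "Critical failures detected",
    2: "Invalid input or configuration",
}


def describe_exit_code(code: int) -> str:
    """Translate an exit code into the meaning documented above."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}")


def run_check(argv: list[str]) -> tuple[int, str]:
    """Run a command and return its exit code together with a description."""
    completed = subprocess.run(argv, check=False)
    return completed.returncode, describe_exit_code(completed.returncode)


# Stand-in command that exits with status 2; replace with an arc-eval argument list.
code, meaning = run_check([sys.executable, "-c", "raise SystemExit(2)"])
print(code, meaning)
```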

Real-time Monitoring

# Pipe live agent outputs
tail -f agent.log | jq '.response' | arc-eval --domain ml --stdin

# Process API responses
curl -s https://my-agent.com/api/outputs | arc-eval --domain finance --stdin

Architecture

System Design

Input (JSON) → Parser → Evaluation Engine → Results → Exporters → Output
     ↓              ↓            ↓            ↓           ↓
  Auto-detect → Normalize → Domain Pack → Analysis → PDF/CSV/JSON
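
The stages in the diagram can be read as a chain of small functions. The following Python sketch shows that shape only: every name in it is hypothetical, and the "scenario" is a toy forbidden-phrase check standing in for a real domain pack, so it should not be taken as ARC-Eval's implementation.

```python
import json
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    scenario_id: str
    passed: bool


def parse(raw: str) -> list:
    """Parser stage: decode JSON and always return a list of records."""
    data = json.loads(raw)
    return data if isinstance(data, list) else [data]


def normalize(records: list) -> list:
    """Normalize stage: reduce each record to its output text."""
    return [str(record.get("output", "")) for record in records]


def evaluate(texts: list, scenarios: dict) -> list:
    """Evaluation stage (toy): a scenario passes if its phrase appears in no output."""
    return [
        ScenarioResult(sid, all(phrase.lower() not in text.lower() for text in texts))
        for sid, phrase in scenarios.items()
    ]


def analyze(results: list) -> dict:
    """Results stage: summarise pass and fail counts."""
    passed = sum(result.passed for result in results)
    return {"total": len(results), "passed": passed, "failed": len(results) - passed}


def export_json(summary: dict) -> str:
    """Exporter stage: serialise the summary as JSON."""
    return json.dumps(summary, sort_keys=True)


def run_pipeline(raw: str, scenarios: dict) -> str:
    return export_json(analyze(evaluate(normalize(parse(raw)), scenarios)))


raw = '[{"output": "Transaction approved for customer John Smith"}]'
print(run_pipeline(raw, {"toy-1": "approved", "toy-2": "password"}))
```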

Project Structure

agent_eval/
├── core/              # Evaluation engine and types
├── domains/           # YAML evaluation packs (45 scenarios)
├── exporters/         # PDF, CSV, JSON report generators
└── cli.py             # Command-line interface

Domain Coverage

Finance Domain (15 scenarios)

Identity verification & KYC compliance
Sanctions & AML screening
Transaction monitoring & fraud detection
Data protection (PCI-DSS, GDPR)
Financial reporting accuracy (SOX, DORA)

Security Domain (15 scenarios)

Prompt injection & data leakage
Code security & access control
AI agent safety & OWASP compliance
Infrastructure security (ISO-27001, SOC2)

ML Domain (15 scenarios)

Bias detection & algorithmic fairness
Model governance & ethics compliance
Data governance & safety alignment
MLOps best practices

Development

Local Development

git clone https://github.com/arc-computer/arc-eval
cd arc-eval
pip install -e .

# Test your changes
arc-eval --quick-start --domain finance

Running Tests

pip install -e ".[dev]"
pytest tests/

License

MIT License - see LICENSE file for details.

ARC-Eval: Boardroom-ready trust for autonomous software—run, audit, fix.

Release history

0.2.9 - Jun 4, 2025
0.2.8 - Jun 1, 2025
0.2.7 - May 30, 2025
0.2.6 - May 29, 2025
0.2.5 - May 28, 2025
0.2.4 - May 28, 2025
0.2.3 - May 27, 2025
0.2.2 - May 27, 2025
0.2.1 - May 26, 2025
0.2.0 - May 25, 2025 (this version)
0.1.0 - May 24, 2025

Download files

Source Distribution

arc_eval-0.2.0.tar.gz (49.2 kB)

Uploaded May 25, 2025 Source

Built Distribution

arc_eval-0.2.0-py3-none-any.whl (51.0 kB)

Uploaded May 25, 2025 Python 3

File details

Details for the file arc_eval-0.2.0.tar.gz.

File metadata

Download URL: arc_eval-0.2.0.tar.gz
Upload date: May 25, 2025
Size: 49.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`001bb9f5362ce9613906ee6a6a971074ae7a12b077823acfd12affb6545ff286`
MD5	`f493f90059584a116df4d116977b3484`
BLAKE2b-256	`5a853d0d13212816ae0353570902a3c3b906db4848f3f3e5a7cc508b26a3179c`

File details

Details for the file arc_eval-0.2.0-py3-none-any.whl.

File metadata

Download URL: arc_eval-0.2.0-py3-none-any.whl
Upload date: May 25, 2025
Size: 51.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arc_eval-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a5ec57da84059f383f39c2aaac4743c9e4c10b542f3556529247723144a84686`
MD5	`7267e9d78f27dcee9860197e9d45dda4`
BLAKE2b-256	`ea898a16fcaea62b26d247f9459142b7ac71e6300b98c1cf006000e72272ea7c`
