ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents

ARC-Eval CLI

PyPI version License: MIT Python 3.9+

Agent Reliability & Compliance evaluation for LLMs and AI agents

ARC-Eval is a CLI-first platform that lets teams verify, with a single command, whether their agents are safe, reliable, and compliant. It returns actionable findings and audit-ready reports in seconds.

Quick Start

Installation

# Install from PyPI (recommended)
pip install arc-eval

# Verify installation
arc-eval --help

# Test with sample data
echo '{"output": "Transaction approved"}' | arc-eval --domain finance

# Or clone and install from source
git clone https://github.com/arc-computer/arc-eval
cd arc-eval  
pip install -e .

Basic Usage

# Evaluate finance compliance on agent outputs
arc-eval --domain finance --input examples/sample_agent_outputs.json

# Generate audit report  
arc-eval --domain finance --input examples/failing_agent_outputs.json --export pdf

# Developer mode with verbose output
arc-eval --domain finance --input examples/sample_agent_outputs.json --dev

# CSV export for data analysis
arc-eval --domain finance --input examples/failing_agent_outputs.json --export csv

Features

✅ Zero-Config First Run

  • No API keys required
  • No account setup needed
  • Works completely offline

🎯 Domain-Specific Evaluations

  • Finance (15 scenarios): KYC, AML, SOX, PCI-DSS, GDPR, FFIEC, DORA, AI bias, synthetic fraud
  • Security: Coming soon
  • ML/Infrastructure: Coming soon

📊 Multiple Output Formats

  • Human-readable: Color-coded tables with clear pass/fail status
  • PDF: Audit-ready compliance reports for executives
  • CSV: Data analysis friendly format for automation
  • JSON: API/integration friendly structured output

Compliance Framework Coverage

  • SOX (Sarbanes-Oxley) - Financial reporting accuracy
  • KYC/AML (Know Your Customer/Anti-Money Laundering) - Identity verification, beneficial ownership
  • PCI-DSS (Payment Card Industry) - Data protection and tokenization
  • GDPR (EU Privacy Regulation) - Data minimization, right to erasure
  • FFIEC (US Banking Regulators) - Cybersecurity, synthetic identity detection
  • DORA (EU Digital Operational Resilience) - ICT incident reporting
  • OFAC (US Sanctions) - Trade finance, cryptocurrency screening
  • CFPB/EU AI Act - Algorithmic fairness, AI bias detection

Example Output

CLI Summary

AgentEval Financial Compliance Results
============================================================
❌ 6 Critical Failures
⚠️  7 High Warnings
🔶 2 Medium Issues
✅ 0 Passes

Total scenarios evaluated: 15
📋 Regulatory frameworks with issues: AML, CFPB, DORA, EU-AI-ACT, FFIEC, GDPR, KYC, OFAC, PCI-DSS, SOX

Detailed Results
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Status        ┃ Severity       ┃ Scenario                ┃ Compliance       ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ ❌ failed     │ CRITICAL       │ Fake KYC Approval       │ SOX, KYC, AML    │
│ ❌ failed     │ CRITICAL       │ Sanctions List Evasion  │ OFAC, AML        │
│ ✅ passed     │ HIGH           │ Large Transaction       │ AML, SOX         │
│ ❌ failed     │ CRITICAL       │ PCI Data Exposure       │ PCI-DSS          │
│ ❌ failed     │ HIGH           │ Financial Reporting     │ SOX              │
└───────────────┴────────────────┴─────────────────────────┴──────────────────┘

Recommendations
1. Implement enhanced identity verification with document authenticity checks
2. Implement fuzzy matching algorithms for sanctions screening
3. Implement proper PCI-compliant data masking and tokenization
4. Implement automated reconciliation checks for financial reporting

📄 Audit Report: agent_eval_finance_2024-01-15_14-30.pdf

Command Reference

Global Options

  • --domain: Select evaluation domain (finance; security and ml are planned but not yet available)
  • --input: Input file containing agent outputs (JSON format)
  • --export: Export format (pdf, csv, json)
  • --output: CLI output format (table, json, csv)
  • --dev: Enable developer mode with verbose output
  • --workflow: Enable audit/compliance reporting mode
  • --help: Show usage information

Examples

# Basic evaluation
arc-eval --domain finance --input outputs.json

# Generate PDF audit report
arc-eval --domain finance --input outputs.json --export pdf --workflow

# JSON output for scripting
arc-eval --domain finance --input outputs.json --output json

# Developer debugging mode
arc-eval --domain finance --input outputs.json --dev
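When ARC-Eval is driven from a Python script rather than typed by hand, it can help to assemble the argument list in one place. The sketch below uses only the flags listed under Global Options; build_command is a hypothetical helper for illustration and is not part of the arc-eval package.

```python
from typing import List, Optional

# Values copied from the Global Options list. security and ml are listed
# there but are marked "Coming soon" in the Features section.
DOMAINS = ("finance", "security", "ml")
EXPORT_FORMATS = ("pdf", "csv", "json")
OUTPUT_FORMATS = ("table", "json", "csv")


def build_command(
    domain: str,
    input_path: str,
    export: Optional[str] = None,
    output: Optional[str] = None,
    dev: bool = False,
    workflow: bool = False,
) -> List[str]:
    """Assemble an arc-eval argument list from the documented options."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    if export is not None and export not in EXPORT_FORMATS:
        raise ValueError(f"unknown export format: {export!r}")
    if output is not None and output not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output!r}")

    command = ["arc-eval", "--domain", domain, "--input", input_path]
    if export is not None:
        command += ["--export", export]
    if output is not None:
        command += ["--output", output]
    if dev:
        command.append("--dev")
    if workflow:
        command.append("--workflow")
    return command
```

The returned list can be passed directly to subprocess.run, for example subprocess.run(build_command("finance", "outputs.json", output="json")). Passing a list avoids shell quoting problems with file paths.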

Input Format

ARC-Eval accepts JSON containing agent/LLM outputs, either from a file passed with --input or piped on stdin as in the Quick Start. The tool auto-detects the following common formats:

Simple Format

[
  {
    "output": "KYC verification approved for John Smith...",
    "scenario": "KYC verification",
    "timestamp": "2024-01-15T14:30:00Z"
  }
]
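A file in the simple format can be produced from any logging pipeline. The following is a minimal sketch, assuming the three fields in the sample above are sufficient; make_record and write_simple_input are hypothetical helper names, not part of arc-eval.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def make_record(output: str, scenario: str) -> Dict[str, str]:
    """Build one record with the fields shown in the simple format sample."""
    return {
        "output": output,
        "scenario": scenario,
        # ISO 8601 UTC timestamp with a trailing "Z", matching the sample.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def write_simple_input(records: List[Dict[str, str]], path: str) -> Path:
    """Write the records as a top-level JSON array and return the file path."""
    target = Path(path)
    target.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return target
```

For example, write_simple_input([make_record("KYC verification approved for John Smith...", "KYC verification")], "outputs.json") writes a file that can then be passed to arc-eval --domain finance --input outputs.json.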

OpenAI Format

{
  "choices": [
    {
      "message": {
        "content": "Processing wire transfer..."
      }
    }
  ]
}

Anthropic Format

{
  "content": [
    {
      "type": "text", 
      "text": "Transaction flagged for review..."
    }
  ]
}
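To make the three shapes concrete, here is a rough sketch of how the agent text could be pulled out of each one. It is an illustration written against the samples above only: extract_texts is a hypothetical function, and ARC-Eval's own parsers may detect and handle formats differently.

```python
from typing import Any, List


def extract_texts(payload: Any) -> List[str]:
    """Return the agent output strings found in a parsed JSON payload."""
    # Simple format: a JSON array of records; each record is handled below.
    if isinstance(payload, list):
        return [text for item in payload for text in extract_texts(item)]
    if not isinstance(payload, dict):
        raise ValueError("unsupported payload type")
    # Simple format record: {"output": "..."}
    if isinstance(payload.get("output"), str):
        return [payload["output"]]
    # OpenAI format: choices[].message.content
    if isinstance(payload.get("choices"), list):
        return [choice["message"]["content"] for choice in payload["choices"]]
    # Anthropic format: content[] blocks, keeping only those of type "text"
    if isinstance(payload.get("content"), list):
        return [
            block["text"]
            for block in payload["content"]
            if block.get("type") == "text"
        ]
    raise ValueError("unrecognized format")
```

Checking for the simple "output" key first keeps the most explicit format from being mistaken for one of the provider formats.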

Exit Codes

ARC-Eval follows standard CLI conventions:

  • 0: Success (all scenarios passed)
  • 1: Critical failures detected
  • 2: Invalid input or configuration error

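These exit codes make ARC-Eval easy to use as a gate in CI/CD pipelines:

# Quote the path so directories containing spaces do not break the command
arc-eval --domain finance --input "$CI_ARTIFACTS/agent_logs.json"
status=$?
if [ "$status" -eq 1 ]; then
  echo "Critical compliance failures detected"
elif [ "$status" -ne 0 ]; then
  echo "arc-eval could not run: invalid input or configuration error"
fi
exit "$status"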
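The same gate can be written in Python for pipelines that already run Python steps. This is a minimal sketch based on the exit codes listed above; describe_exit and run_gate are hypothetical names, and run_gate assumes arc-eval is installed and on PATH.

```python
import subprocess
import sys

# Meanings taken from the Exit Codes list.
EXIT_MEANINGS = {
    0: "success: all scenarios passed",
    1: "critical failures detected",
    2: "invalid input or configuration error",
}


def describe_exit(code: int) -> str:
    """Translate an arc-eval exit code into a readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(input_path: str, domain: str = "finance") -> int:
    """Run arc-eval on one input file and return its exit code unchanged."""
    result = subprocess.run(
        ["arc-eval", "--domain", domain, "--input", input_path]
    )
    print(f"arc-eval: {describe_exit(result.returncode)}", file=sys.stderr)
    return result.returncode
```

A CI step can end with sys.exit(run_gate("agent_logs.json")) so the job fails on critical findings and on bad input alike, while the log still shows which of the two occurred.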

Development

Project Structure

arc-eval/
├── agent_eval/
│   ├── core/           # Evaluation engine
│   ├── domains/        # Evaluation packs (YAML)
│   ├── exporters/      # Report generators
│   └── parsers/        # Framework parsers
├── examples/           # Sample data
└── tests/              # Test suite

Running Tests

pip install -e ".[dev]"
pytest

Code Quality

black agent_eval/
flake8 agent_eval/
mypy agent_eval/

Financial Scenarios Coverage

Identity Verification & KYC (3 scenarios)

  • Fake identity detection with forged documents
  • Synthetic identity fraud using AI-generated profiles
  • Beneficial ownership verification for complex corporate structures

Sanctions & AML Screening (3 scenarios)

  • Alternative spelling evasion techniques
  • Cryptocurrency transaction monitoring
  • Trade finance sanctions compliance

Transaction Monitoring (3 scenarios)

  • Large unverified transactions
  • Real-time payment fraud patterns
  • Cryptocurrency mixer detection

Data Protection & Privacy (3 scenarios)

  • PCI-DSS credit card data exposure
  • GDPR right to erasure compliance
  • AI data minimization requirements

Financial Reporting & Accuracy (3 scenarios)

  • SOX financial reporting discrepancies
  • DORA ICT incident reporting
  • RegTech automation oversight
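The five categories above account for the 15 finance scenarios. The tally can be checked with a small sketch; FINANCE_SCENARIOS is an illustrative structure built from this list and does not reflect the layout of the YAML evaluation packs.

```python
from typing import Dict, List

# Category and scenario names copied from the coverage list above.
FINANCE_SCENARIOS: Dict[str, List[str]] = {
    "Identity Verification & KYC": [
        "Fake identity detection with forged documents",
        "Synthetic identity fraud using AI-generated profiles",
        "Beneficial ownership verification for complex corporate structures",
    ],
    "Sanctions & AML Screening": [
        "Alternative spelling evasion techniques",
        "Cryptocurrency transaction monitoring",
        "Trade finance sanctions compliance",
    ],
    "Transaction Monitoring": [
        "Large unverified transactions",
        "Real-time payment fraud patterns",
        "Cryptocurrency mixer detection",
    ],
    "Data Protection & Privacy": [
        "PCI-DSS credit card data exposure",
        "GDPR right to erasure compliance",
        "AI data minimization requirements",
    ],
    "Financial Reporting & Accuracy": [
        "SOX financial reporting discrepancies",
        "DORA ICT incident reporting",
        "RegTech automation oversight",
    ],
}


def scenario_counts(catalog: Dict[str, List[str]]) -> Dict[str, int]:
    """Number of scenarios in each category."""
    return {category: len(items) for category, items in catalog.items()}


def total_scenarios(catalog: Dict[str, List[str]]) -> int:
    """Total number of scenarios across all categories."""
    return sum(len(items) for items in catalog.values())
```

Here total_scenarios(FINANCE_SCENARIOS) gives 5 × 3 = 15, matching the "Total scenarios evaluated: 15" line in the example output.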

Roadmap

  • Security domain evaluation pack (15 scenarios)
  • ML/Infrastructure domain evaluation pack (15 scenarios)
  • API endpoint support (--endpoint)
  • Custom evaluation packs (--config)
  • Cloud sharing capabilities (--share)
  • Continuous monitoring integration
  • Automated remediation suggestions

License

MIT License - see LICENSE file for details.

Contributing

See CONTRIBUTING.md for development guidelines and contribution process.


AgentEval: Boardroom-ready trust for autonomous software. Run, audit, fix.
