ARC-Eval: Agent Reliability & Compliance evaluation platform for LLMs and AI agents
ARC-Eval CLI
Agent Reliability & Compliance evaluation for LLMs and AI agents
ARC-Eval is a CLI-first platform that lets teams prove, with a single command, whether their agents are safe, reliable, and compliant. It produces actionable insights and audit-ready reports in seconds.
Quick Start
Installation
# Install from PyPI (recommended)
pip install arc-eval
# Verify installation
arc-eval --help
# Test with sample data
echo '{"output": "Transaction approved"}' | arc-eval --domain finance
# Or clone and install from source
git clone https://github.com/arc-computer/arc-eval
cd arc-eval
pip install -e .
Basic Usage
# Evaluate finance compliance on agent outputs
arc-eval --domain finance --input examples/sample_agent_outputs.json
# Generate audit report
arc-eval --domain finance --input examples/failing_agent_outputs.json --export pdf
# Developer mode with verbose output
arc-eval --domain finance --input examples/sample_agent_outputs.json --dev
# CSV export for data analysis
arc-eval --domain finance --input examples/failing_agent_outputs.json --export csv
Features
Zero-Config First Run
- No API keys required
- No account setup needed
- Works completely offline
Domain-Specific Evaluations
- Finance (15 scenarios): KYC, AML, SOX, PCI-DSS, GDPR, FFIEC, DORA, AI bias, synthetic fraud
- Security: Coming soon
- ML/Infrastructure: Coming soon
Multiple Output Formats
- Human-readable: Color-coded tables with clear pass/fail status
- PDF: Audit-ready compliance reports for executives
- CSV: Data analysis friendly format for automation
- JSON: API/integration friendly structured output
Compliance Framework Coverage
- SOX (Sarbanes-Oxley) - Financial reporting accuracy
- KYC/AML (Know Your Customer/Anti-Money Laundering) - Identity verification, beneficial ownership
- PCI-DSS (Payment Card Industry) - Data protection and tokenization
- GDPR (EU Privacy Regulation) - Data minimization, right to erasure
- FFIEC (US Banking Regulators) - Cybersecurity, synthetic identity detection
- DORA (EU Digital Operational Resilience) - ICT incident reporting
- OFAC (US Sanctions) - Trade finance, cryptocurrency screening
- CFPB/EU AI Act - Algorithmic fairness, AI bias detection
Example Output
CLI Summary
AgentEval Financial Compliance Results
============================================================
❌ 6 Critical Failures
⚠️ 7 High Warnings
🔶 2 Medium Issues
✅ 0 Passes
Total scenarios evaluated: 15
Regulatory frameworks with issues: AML, CFPB, DORA, EU-AI-ACT, FFIEC, GDPR, KYC, OFAC, PCI-DSS, SOX
Detailed Results
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Status    ┃ Severity ┃ Scenario               ┃ Compliance    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ ❌ failed │ CRITICAL │ Fake KYC Approval      │ SOX, KYC, AML │
│ ❌ failed │ CRITICAL │ Sanctions List Evasion │ OFAC, AML     │
│ ✅ passed │ HIGH     │ Large Transaction      │ AML, SOX      │
│ ❌ failed │ CRITICAL │ PCI Data Exposure      │ PCI-DSS       │
│ ❌ failed │ HIGH     │ Financial Reporting    │ SOX           │
└───────────┴──────────┴────────────────────────┴───────────────┘
Recommendations
1. Implement enhanced identity verification with document authenticity checks
2. Implement fuzzy matching algorithms for sanctions screening
3. Implement proper PCI-compliant data masking and tokenization
4. Implement automated reconciliation checks for financial reporting
Audit Report: agent_eval_finance_2024-01-15_14-30.pdf
Command Reference
Global Options
- --domain: Select evaluation domain (finance, security, ml)
- --input: Input file containing agent outputs (JSON format)
- --export: Export format (pdf, csv, json)
- --output: CLI output format (table, json, csv)
- --dev: Enable developer mode with verbose output
- --workflow: Enable audit/compliance reporting mode
- --help: Show usage information
Examples
# Basic evaluation
arc-eval --domain finance --input outputs.json
# Generate PDF audit report
arc-eval --domain finance --input outputs.json --export pdf --workflow
# JSON output for scripting
arc-eval --domain finance --input outputs.json --output json
# Developer debugging mode
arc-eval --domain finance --input outputs.json --dev
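When arc-eval is driven from a script rather than typed by hand, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the options and values listed under Global Options; the helper name build_command is hypothetical and is not part of ARC-Eval.

```python
from typing import List, Optional

# Allowed values, copied from the option descriptions above.
DOMAINS = ("finance", "security", "ml")
EXPORT_FORMATS = ("pdf", "csv", "json")
OUTPUT_FORMATS = ("table", "json", "csv")


def build_command(
    domain: str,
    input_path: str,
    export: Optional[str] = None,
    output: Optional[str] = None,
    dev: bool = False,
    workflow: bool = False,
) -> List[str]:
    """Assemble an arc-eval argument list (hypothetical helper)."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    argv = ["arc-eval", "--domain", domain, "--input", input_path]

    if export is not None:
        if export not in EXPORT_FORMATS:
            raise ValueError(f"unknown export format: {export!r}")
        argv += ["--export", export]

    if output is not None:
        if output not in OUTPUT_FORMATS:
            raise ValueError(f"unknown output format: {output!r}")
        argv += ["--output", output]

    # Boolean switches take no value.
    if dev:
        argv.append("--dev")
    if workflow:
        argv.append("--workflow")
    return argv


if __name__ == "__main__":
    # Mirrors the "Generate PDF audit report" example above.
    print(" ".join(build_command("finance", "outputs.json", export="pdf", workflow=True)))
```

Returning a list instead of a single string means the result can be passed to subprocess.run without shell quoting concerns.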
Input Format
ARC-Eval accepts JSON files containing agent/LLM outputs. The tool auto-detects common frameworks:
Simple Format
[
  {
    "output": "KYC verification approved for John Smith...",
    "scenario": "KYC verification",
    "timestamp": "2024-01-15T14:30:00Z"
  }
]
OpenAI Format
{
  "choices": [
    {
      "message": {
        "content": "Processing wire transfer..."
      }
    }
  ]
}
Anthropic Format
{
  "content": [
    {
      "type": "text",
      "text": "Transaction flagged for review..."
    }
  ]
}
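The three shapes above can all be reduced to a flat list of text outputs. The following is a minimal sketch of that kind of shape detection, not ARC-Eval's actual parser: the function name extract_outputs is hypothetical, and it handles only the Simple, OpenAI, and Anthropic shapes shown in this section.

```python
import json
from typing import Any, List


def extract_outputs(payload: Any) -> List[str]:
    """Return the text outputs found in one parsed JSON payload.

    Hypothetical helper, not part of ARC-Eval.
    """
    # A top-level list is treated as a batch of payloads.
    if isinstance(payload, list):
        texts: List[str] = []
        for item in payload:
            texts.extend(extract_outputs(item))
        return texts

    if not isinstance(payload, dict):
        raise ValueError("unsupported payload type: " + type(payload).__name__)

    # Simple format: {"output": "..."}
    if isinstance(payload.get("output"), str):
        return [payload["output"]]

    # OpenAI format: {"choices": [{"message": {"content": "..."}}]}
    if isinstance(payload.get("choices"), list):
        return [
            choice["message"]["content"]
            for choice in payload["choices"]
            if isinstance(choice.get("message", {}).get("content"), str)
        ]

    # Anthropic format: {"content": [{"type": "text", "text": "..."}]}
    if isinstance(payload.get("content"), list):
        return [
            block["text"]
            for block in payload["content"]
            if block.get("type") == "text" and isinstance(block.get("text"), str)
        ]

    raise ValueError("unrecognized payload shape")


if __name__ == "__main__":
    sample = json.loads('[{"output": "Transaction approved"}]')
    print(extract_outputs(sample))
```

Checking for the most specific key first ("output", then "choices", then "content") keeps the branches from shadowing one another, and an unrecognized shape raises an error instead of silently yielding nothing.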
Exit Codes
ARC-Eval follows standard CLI conventions:
- 0: Success (all scenarios passed)
- 1: Critical failures detected
- 2: Invalid input or configuration error
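This makes ARC-Eval straightforward to use as a CI/CD gate:
arc-eval --domain finance --input "$CI_ARTIFACTS/agent_logs.json"
status=$?
if [ "$status" -ne 0 ]; then
  # 1 = critical failures, 2 = invalid input or configuration error
  echo "ARC-Eval exited with status $status"
  exit "$status"
fi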
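The same gate can be written in Python when the pipeline step is already a script. This is a minimal sketch that relies only on the three exit codes documented above; the names EXIT_MEANINGS, describe_exit_code, and run_and_report are hypothetical, and any other exit code is reported as unexpected rather than guessed at.

```python
import subprocess
from typing import List

# Meanings copied from the Exit Codes list above.
EXIT_MEANINGS = {
    0: "all scenarios passed",
    1: "critical failures detected",
    2: "invalid input or configuration error",
}


def describe_exit_code(code: int) -> str:
    """Translate an arc-eval exit code into a readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_and_report(argv: List[str]) -> int:
    """Run a command, print what its exit code means, and return the code."""
    # check=False: a non-zero exit is an expected outcome, not an exception.
    completed = subprocess.run(argv, check=False)
    print(f"{argv[0]}: {describe_exit_code(completed.returncode)}")
    return completed.returncode


if __name__ == "__main__":
    code = run_and_report(
        ["arc-eval", "--domain", "finance", "--input", "agent_logs.json"]
    )
    raise SystemExit(code)
```

Propagating the original code, rather than collapsing everything to 1, lets later pipeline stages tell a compliance failure apart from a misconfigured run.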
Development
Project Structure
arc-eval/
├── agent_eval/
│   ├── core/          # Evaluation engine
│   ├── domains/       # Evaluation packs (YAML)
│   ├── exporters/     # Report generators
│   └── parsers/       # Framework parsers
├── examples/          # Sample data
└── tests/             # Test suite
Running Tests
pip install -e ".[dev]"
pytest
Code Quality
black agent_eval/
flake8 agent_eval/
mypy agent_eval/
Financial Scenarios Coverage
Identity Verification & KYC (3 scenarios)
- Fake identity detection with forged documents
- Synthetic identity fraud using AI-generated profiles
- Beneficial ownership verification for complex corporate structures
Sanctions & AML Screening (3 scenarios)
- Alternative spelling evasion techniques
- Cryptocurrency transaction monitoring
- Trade finance sanctions compliance
Transaction Monitoring (3 scenarios)
- Large unverified transactions
- Real-time payment fraud patterns
- Cryptocurrency mixer detection
Data Protection & Privacy (3 scenarios)
- PCI-DSS credit card data exposure
- GDPR right to erasure compliance
- AI data minimization requirements
Financial Reporting & Accuracy (3 scenarios)
- SOX financial reporting discrepancies
- DORA ICT incident reporting
- RegTech automation oversight
Roadmap
- Security domain evaluation pack (15 scenarios)
- ML/Infrastructure domain evaluation pack (15 scenarios)
- API endpoint support (--endpoint)
- Custom evaluation packs (--config)
- Cloud sharing capabilities (--share)
- Continuous monitoring integration
- Automated remediation suggestions
License
MIT License - see LICENSE file for details.
Contributing
See CONTRIBUTING.md for development guidelines and contribution process.
AgentEval: Boardroom-ready trust for autonomous software. Run, audit, fix.