Skip to main content

3DMark for AI Agents - Benchmark and measure AI coding agent reliability

Project description

Janus Labs

CI Python 3.12+ License

3DMark for AI Agents — Benchmark and measure AI coding agent reliability with standardized, reproducible tests.

What is Janus Labs?

Janus Labs provides a benchmarking framework for AI coding assistants, similar to how 3DMark benchmarks graphics cards. It enables:

  • Standardized Testing: Compare agents using the same behavior specifications
  • Reproducible Results: Consistent measurement across runs and environments
  • Trust Elasticity Scoring: Governance-aware metrics that measure reliability under constraints
  • Leaderboard Reports: HTML exports showing scores, grades, and comparisons

Built on DeepEval for LLM evaluation and designed for integration with the Janus Protocol governance framework.

Quick Start (5 minutes)

1. Install

# Clone the repository
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Run Your First Benchmark

# Run the built-in "refactor-storm" suite
python -m cli run --suite refactor-storm --format both

# This creates:
#   result.json  - Machine-readable results
#   result.html  - Visual leaderboard report

3. View Results

Open result.html in your browser to see:

  • Headline score (0-100) with letter grade (S/A/B/C/D/F)
  • Per-behavior breakdown
  • Configuration badge showing default vs custom agent config

4. Compare Runs

# Save current result as baseline
python -m cli baseline update result.json --output baseline.json

# Run again and compare
python -m cli run --suite refactor-storm
python -m cli compare baseline.json result.json

CLI Reference

run - Execute a Benchmark Suite

python -m cli run --suite <suite-id> [options]

Options:
  --suite       Suite ID to run (required)
  --output, -o  Output file path (default: result.json)
  --format      Output format: json, html, or both (default: json)

compare - Regression Detection

python -m cli compare <baseline.json> <current.json> [options]

Options:
  --threshold   Regression threshold percentage (default: 5.0)
  --config, -c  Custom threshold config YAML file
  --output, -o  Save comparison result to JSON
  --format      Output: text, json, or github (default: text)

Exit codes:

  • 0 - No regression detected
  • 1 - Regression detected (score dropped beyond threshold)
  • 2 - HALT condition (governance intervention required)

export - Convert Result Formats

python -m cli export <input.json> --format <html|json> [-o output]

baseline - Manage Baselines

# Promote a result to baseline
python -m cli baseline update <result.json> [-o baseline.json] [--force]

# Show baseline info
python -m cli baseline show <baseline.json>

Built-in Suites

Refactor Storm

Tests agent reliability during complex refactoring tasks:

Behavior Description Threshold
Test Cheating Detection Agent fixes code, not tests 7.0
Loop Detection Compliance Agent responds to governance signals 8.0
Context Retention Agent maintains context across iterations 6.0

Creating Custom Behaviors

Define behaviors using BehaviorSpec:

from forge.behavior import BehaviorSpec

MY_BEHAVIOR = BehaviorSpec(
    behavior_id="BHV-100-my-behavior",
    name="My Custom Behavior",
    description="Agent should do X without doing Y",
    rubric={
        1: "Completely failed",
        5: "Partial success with issues",
        10: "Perfect execution",
    },
    threshold=7.0,
    disconfirmers=["Agent did Y", "Agent skipped X"],
    taxonomy_code="O-1.01",  # See docs/TAXONOMY.md
    version="1.0.0",
)

Architecture

janus-labs/
├── cli/           # Command-line interface
├── config/        # Configuration detection
├── forge/         # Behavior specifications
├── gauge/         # DeepEval integration + Trust Elasticity
├── governance/    # Janus Protocol bridge (optional)
├── harness/       # Test execution sandbox
├── probe/         # Behavior discovery (Phoenix integration)
├── suite/         # Suite definitions + exporters
└── tests/         # Test suite (67 tests)

Integration

GitHub Actions

- name: Run Janus Labs Benchmark
  run: |
    python -m cli run --suite refactor-storm
    python -m cli compare baseline.json result.json --format github

With Janus Protocol

Full governance integration is available when running within the AoP framework. The governance/ module bridges to Janus v3.6 for trust-elasticity tracking.

Requirements

  • Python 3.12+ (3.12–3.13 recommended, 3.14 supported)
  • Core dependencies: DeepEval, GitPython, PyYAML, Pydantic

Note: Phoenix telemetry is optional and requires Python <3.14. To enable Phoenix, run:

pip install -r requirements-phoenix.txt

Third-Party Licenses

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache 2.0 - See LICENSE

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

janus_labs-0.1.2.tar.gz (90.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

janus_labs-0.1.2-py3-none-any.whl (78.9 kB view details)

Uploaded Python 3

File details

Details for the file janus_labs-0.1.2.tar.gz.

File metadata

  • Download URL: janus_labs-0.1.2.tar.gz
  • Upload date:
  • Size: 90.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for janus_labs-0.1.2.tar.gz
Algorithm Hash digest
SHA256 9e7efd403e46752f434ebf4674a3aaf8baab666abbb839df62aad68d062f9200
MD5 e8b17f3f1487a61a65c88b7c739ee740
BLAKE2b-256 b8a3f0031392e3ec11a404a772d41d9ec1f9204e798eff870f71332a354fdc59

See more details on using hashes here.

File details

Details for the file janus_labs-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: janus_labs-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 78.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for janus_labs-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 bb235702c5ee6b28d8d34b60f1471a62b2136bafaee36f5831a3416f3211500e
MD5 5c5367cc965b54a7d57d5fc27041e54b
BLAKE2b-256 4625b004bc3a2dee176e9d30de9d94063dd83b4819f42e5e51673f6c2b2edfa9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page