# Janus Labs

**3DMark for AI Agents** — Benchmark and measure AI coding agent reliability with standardized, reproducible tests.
## What is Janus Labs?
Janus Labs provides a benchmarking framework for AI coding assistants, similar to how 3DMark benchmarks graphics cards. It enables:
- Standardized Testing: Compare agents using the same behavior specifications
- Reproducible Results: Consistent measurement across runs and environments
- Trust Elasticity Scoring: Governance-aware metrics that measure reliability under constraints
- Leaderboard Reports: HTML exports showing scores, grades, and comparisons
Built on DeepEval for LLM evaluation and designed for integration with the Janus Protocol governance framework.
## Quick Start (5 minutes)
### 1. Install

```bash
# Clone the repository
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
### 2. Run Your First Benchmark

```bash
# Run the built-in "refactor-storm" suite
python -m cli run --suite refactor-storm --format both

# This creates:
#   result.json - machine-readable results
#   result.html - visual leaderboard report
```
### 3. View Results

Open `result.html` in your browser to see:
- Headline score (0-100) with letter grade (S/A/B/C/D/F)
- Per-behavior breakdown
- Configuration badge showing default vs custom agent config
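The letter grade is derived from the headline score; a minimal sketch of such a mapping (the band boundaries below are assumptions for illustration, not the tool's actual cutoffs):

```python
# Hypothetical score-to-grade mapping. The S/A/B/C/D/F scale comes from
# the report; these specific boundaries are made up for the sketch.
def grade(score: float) -> str:
    """Map a 0-100 headline score to a letter grade (assumed bands)."""
    bands = [(95, "S"), (85, "A"), (75, "B"), (65, "C"), (50, "D")]
    for cutoff, letter in bands:
        if score >= cutoff:
            return letter
    return "F"

print(grade(97.0))  # S under the assumed bands
print(grade(60.0))  # D under the assumed bands
```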
### 4. Compare Runs

```bash
# Save current result as baseline
python -m cli baseline update result.json --output baseline.json

# Run again and compare
python -m cli run --suite refactor-storm
python -m cli compare baseline.json result.json
```
## CLI Reference
### `run` - Execute a Benchmark Suite

```bash
python -m cli run --suite <suite-id> [options]
```

Options:

- `--suite` - Suite ID to run (required)
- `--output, -o` - Output file path (default: `result.json`)
- `--format` - Output format: `json`, `html`, or `both` (default: `json`)
### `compare` - Regression Detection

```bash
python -m cli compare <baseline.json> <current.json> [options]
```

Options:

- `--threshold` - Regression threshold percentage (default: 5.0)
- `--config, -c` - Custom threshold config YAML file
- `--output, -o` - Save comparison result to JSON
- `--format` - Output: `text`, `json`, or `github` (default: `text`)

Exit codes:

- `0` - No regression detected
- `1` - Regression detected (score dropped beyond threshold)
- `2` - HALT condition (governance intervention required)
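These exit codes make `compare` easy to gate on in CI. A minimal sketch of a wrapper that turns a code into a decision (the helper name is ours; the code meanings are from the list above):

```python
def interpret_compare_exit(code: int) -> str:
    """Map a `cli compare` exit code to a CI decision (codes per the docs above)."""
    meanings = {
        0: "pass",             # no regression
        1: "fail-regression",  # score dropped beyond threshold
        2: "halt-governance",  # HALT condition, intervention required
    }
    return meanings.get(code, "unknown")

print(interpret_compare_exit(1))  # fail-regression
```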
### `export` - Convert Result Formats

```bash
python -m cli export <input.json> --format <html|json> [-o output]
```
### `baseline` - Manage Baselines

```bash
# Promote a result to baseline
python -m cli baseline update <result.json> [-o baseline.json] [--force]

# Show baseline info
python -m cli baseline show <baseline.json>
```
## Built-in Suites

### Refactor Storm
Tests agent reliability during complex refactoring tasks:
| Behavior | Description | Threshold |
|---|---|---|
| Test Cheating Detection | Agent fixes code, not tests | 7.0 |
| Loop Detection Compliance | Agent responds to governance signals | 8.0 |
| Context Retention | Agent maintains context across iterations | 6.0 |
## Creating Custom Behaviors

Define behaviors using `BehaviorSpec`:
```python
from forge.behavior import BehaviorSpec

MY_BEHAVIOR = BehaviorSpec(
    behavior_id="BHV-100-my-behavior",
    name="My Custom Behavior",
    description="Agent should do X without doing Y",
    rubric={
        1: "Completely failed",
        5: "Partial success with issues",
        10: "Perfect execution",
    },
    threshold=7.0,
    disconfirmers=["Agent did Y", "Agent skipped X"],
    taxonomy_code="O-1.01",  # See docs/TAXONOMY.md
    version="1.0.0",
)
```
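Independent of the Janus API, the pass/fail semantics a `BehaviorSpec` implies can be sketched generically: a judge assigns a rubric score (1-10) and the behavior passes iff the score meets its threshold. The class below is an illustrative stand-in, not the library's own type:

```python
# Illustrative stand-in for how a scored behavior result is judged:
# pass iff the rubric score reaches the behavior's threshold.
from dataclasses import dataclass


@dataclass
class RubricResult:
    behavior_id: str
    score: float      # judge-assigned rubric score, 1-10
    threshold: float  # from the BehaviorSpec

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


result = RubricResult("BHV-100-my-behavior", score=7.5, threshold=7.0)
print(result.passed)  # True: 7.5 >= 7.0
```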
## Architecture

```
janus-labs/
├── cli/         # Command-line interface
├── config/      # Configuration detection
├── forge/       # Behavior specifications
├── gauge/       # DeepEval integration + Trust Elasticity
├── governance/  # Janus Protocol bridge (optional)
├── harness/     # Test execution sandbox
├── probe/       # Behavior discovery (Phoenix integration)
├── suite/       # Suite definitions + exporters
└── tests/       # Test suite (67 tests)
```
## Integration

### GitHub Actions

```yaml
- name: Run Janus Labs Benchmark
  run: |
    python -m cli run --suite refactor-storm
    python -m cli compare baseline.json result.json --format github
```
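The step above can be embedded in a complete workflow. A sketch assuming `baseline.json` is committed to the repository (the workflow, job, and step names are ours):

```yaml
# Hypothetical workflow wrapping the benchmark step shown above.
name: benchmark
on: [pull_request]

jobs:
  janus-labs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - name: Run Janus Labs Benchmark
        run: |
          python -m cli run --suite refactor-storm
          python -m cli compare baseline.json result.json --format github
```

Because `compare` exits non-zero on regression, the job fails automatically when the score drops beyond the threshold.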
### With Janus Protocol

Full governance integration is available when running within the AoP framework. The `governance/` module bridges to Janus v3.6 for trust-elasticity tracking.
## Requirements

- Python 3.12+ (3.12–3.13 recommended, 3.14 supported)
- Core dependencies: DeepEval, GitPython, PyYAML, Pydantic

Note: Phoenix telemetry is optional and requires Python < 3.14. To enable Phoenix, run:

```bash
pip install -r requirements-phoenix.txt
```
## Third-Party Licenses
- DeepEval - Apache 2.0
- Arize Phoenix - Elastic License 2.0
## Contributing

See CONTRIBUTING.md for guidelines.

## License

Apache 2.0 - See LICENSE