Janus Labs
3DMark for AI Agents — Benchmark and measure AI coding agent reliability with standardized, reproducible tests.
What is Janus Labs?
Janus Labs provides a benchmarking framework for AI coding assistants, similar to how 3DMark benchmarks graphics cards. It enables:
- Standardized Testing: Compare agents using the same behavior specifications
- Reproducible Results: Consistent measurement across runs and environments
- Trust Elasticity Scoring: Governance-aware metrics that measure reliability under constraints
- Leaderboard Reports: HTML exports showing scores, grades, and comparisons
Built on DeepEval for LLM evaluation and designed for integration with the Janus Protocol governance framework.
Quick Start
Option A: Install from PyPI (Recommended)
pip install janus-labs
Then run:
# If janus-labs is in your PATH:
janus-labs bench --suite refactor-storm
# If not in PATH (common on Windows), use module syntax:
python -m janus_labs bench --suite refactor-storm
Windows PATH Troubleshooting
On Windows, pip install --user places console scripts in a directory that is not on PATH by default:
%APPDATA%\Python\Python3XX\Scripts
Option 1: Use module syntax (no PATH changes needed)
python -m janus_labs bench --suite refactor-storm
Option 2: Add the Scripts folder to PATH
# Find your Scripts folder
python -c "import sysconfig; print(sysconfig.get_path('scripts'))"
# Add that path to your system PATH environment variable
Option 3: Use a virtual environment
python -m venv .venv
.venv\Scripts\activate
pip install janus-labs
janus-labs bench # Works because the venv's Scripts folder is on PATH while activated
Option B: Install from Source (Development)
# Clone the repository
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install in editable mode
pip install -e .
Run Your First Benchmark
# Zero-friction benchmark (detects your config, scores, shows result)
janus-labs bench
# Or with module syntax:
python -m janus_labs bench
# Full suite run with HTML report:
janus-labs run --suite refactor-storm --format both
# This creates:
# result.json - Machine-readable results
# result.html - Visual leaderboard report
View Results
Open result.html in your browser to see:
- Headline score (0-100) with letter grade (S/A/B/C/D/F)
- Per-behavior breakdown
- Configuration badge showing default vs custom agent config
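If you ran with the default JSON-only output, you can generate the HTML report afterward from the saved JSON using the export command (documented under CLI Reference):
# Convert an existing result.json into the HTML leaderboard report
janus-labs export result.json --format html -o result.html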
Compare Runs
# Save current result as baseline
python -m janus_labs baseline update result.json --output baseline.json
# Run again and compare
python -m janus_labs run --suite refactor-storm
python -m janus_labs compare baseline.json result.json
CLI Reference
All commands can be run as janus-labs <command> or python -m janus_labs <command>.
bench - Zero-Friction Benchmark (Recommended)
janus-labs bench [options]
Options:
--suite Suite ID (default: refactor-storm)
--behavior Behavior ID (default: BHV-001-test-cheating)
--submit Submit results to public leaderboard
--github GitHub handle for attribution (requires --submit)
--model LLM model for judge scoring (default: gpt-4o)
--no-copy Don't copy share URL to clipboard
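For example, to score with a specific judge model and submit the result under your GitHub handle (your-handle below is only a placeholder):
# Judge with gpt-4o and submit to the public leaderboard
janus-labs bench --suite refactor-storm --model gpt-4o --submit --github your-handle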
run - Execute a Benchmark Suite
janus-labs run --suite <suite-id> [options]
Options:
--suite Suite ID to run (required)
--output, -o Output file path (default: result.json)
--format Output format: json, html, or both (default: json)
--judge Use LLM-as-judge scoring (slower, but differentiates agent quality more finely)
--model LLM model for judge scoring (default: gpt-4o)
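For example, to run the full suite with LLM-as-judge scoring and write both output formats:
# Judge-scored run producing result.json and result.html
janus-labs run --suite refactor-storm --judge --format both -o result.json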
submit - Submit to Leaderboard
janus-labs submit <result.json> [options]
Options:
--dry-run Show payload without submitting
--github GitHub handle for attribution
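To inspect what would be sent before actually submitting:
# Preview the leaderboard payload without uploading it
janus-labs submit result.json --dry-run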
compare - Regression Detection
janus-labs compare <baseline.json> <current.json> [options]
Options:
--threshold Regression threshold percentage (default: 5.0)
--config, -c Custom threshold config YAML file
--output, -o Save comparison result to JSON
--format Output: text, json, or github (default: text)
Exit codes:
0 - No regression detected
1 - Regression detected (score dropped beyond threshold)
2 - HALT condition (governance intervention required)
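A minimal shell sketch of how these exit codes could gate a CI step (the 2.0 threshold is only an example value):
janus-labs compare baseline.json result.json --threshold 2.0
status=$?
if [ "$status" -eq 2 ]; then
  echo "HALT: governance intervention required"
  exit 2
elif [ "$status" -eq 1 ]; then
  echo "Regression: score dropped beyond threshold"
  exit 1
fi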
export - Convert Result Formats
janus-labs export <input.json> --format <html|json> [-o output]
baseline - Manage Baselines
# Promote a result to baseline
janus-labs baseline update <result.json> [-o baseline.json] [--force]
# Show baseline info
janus-labs baseline show <baseline.json>
Built-in Suites
Refactor Storm
Tests agent reliability during complex refactoring tasks:
| Behavior | Description | Threshold |
|---|---|---|
| Test Cheating Detection | Agent fixes code, not tests | 7.0 |
| Loop Detection Compliance | Agent responds to governance signals | 8.0 |
| Context Retention | Agent maintains context across iterations | 6.0 |
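To score just one behavior from this suite, pass its ID to bench (BHV-001-test-cheating is the documented ID for the test-cheating behavior; the other behavior IDs are not listed here):
# Quick check of test-cheating detection only
janus-labs bench --suite refactor-storm --behavior BHV-001-test-cheating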
Creating Custom Behaviors
Define behaviors using BehaviorSpec:
from forge.behavior import BehaviorSpec
MY_BEHAVIOR = BehaviorSpec(
behavior_id="BHV-100-my-behavior",
name="My Custom Behavior",
description="Agent should do X without doing Y",
rubric={
1: "Completely failed",
5: "Partial success with issues",
10: "Perfect execution",
},
threshold=7.0,
disconfirmers=["Agent did Y", "Agent skipped X"],
taxonomy_code="O-1.01", # See docs/TAXONOMY.md
version="1.0.0",
)
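Assuming your custom behavior is registered so the framework can discover it by ID (the registration mechanism is not shown here), you could then target it directly:
# Run the zero-friction benchmark against the custom behavior defined above
janus-labs bench --behavior BHV-100-my-behavior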
Architecture
janus-labs/
├── janus_labs/ # Python package (for python -m janus_labs)
├── cli/ # Command-line interface
├── config/ # Configuration detection
├── forge/ # Behavior specifications
├── gauge/ # DeepEval integration + Trust Elasticity
├── governance/ # Janus Protocol bridge (optional)
├── harness/ # Test execution sandbox
├── probe/ # Behavior discovery (Phoenix integration)
├── scaffold/ # Task workspace templates
├── suite/ # Suite definitions + exporters
└── tests/ # Test suite
Integration
GitHub Actions
- name: Run Janus Labs Benchmark
  run: |
    pip install janus-labs
    janus-labs run --suite refactor-storm
    janus-labs compare baseline.json result.json --format github
With Janus Protocol
Full governance integration is available when running within the AoP framework. The governance/ module bridges to Janus v3.6 for trust-elasticity tracking.
Requirements
- Python 3.12+ (3.12–3.13 recommended, 3.14 supported)
- Core dependencies: DeepEval, GitPython, PyYAML, Pydantic
Note: Phoenix telemetry is optional and requires Python <3.14. To enable Phoenix, run:
pip install -r requirements-phoenix.txt
Third-Party Licenses
- DeepEval - Apache 2.0
- Arize Phoenix - Elastic License 2.0
Contributing
See CONTRIBUTING.md for guidelines.
License
Apache 2.0 - See LICENSE
File details
Details for the file janus_labs-0.2.0.tar.gz.
File metadata
- Download URL: janus_labs-0.2.0.tar.gz
- Upload date:
- Size: 97.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fda450882471a0cb62f6aa6b52624c89a633677c48fcca27cc861fa69c8db456 |
| MD5 | 05551e139851dc0ac64fbd6875b97d1b |
| BLAKE2b-256 | f6dd3eddbc226ddacf1f77c800b6bc190cacf64310639dea0bff56eb282c9980 |
File details
Details for the file janus_labs-0.2.0-py3-none-any.whl.
File metadata
- Download URL: janus_labs-0.2.0-py3-none-any.whl
- Upload date:
- Size: 88.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2aa30f0b59fc00133aab845babde7d2c509fc1b492bb092fb5953eb29fd2b0b6 |
| MD5 | 903170db1493222fd99c83641c6cb6aa |
| BLAKE2b-256 | 7ede40512a559237669ae09fe62469a30e9ac98c831e9e376bc0eaf318e69dcd |