3DMark for AI Agents - Profile AI coding agent capabilities across code quality, error resilience, and context retention
Project description
Janus Labs
3DMark for AI Agents — Profile your AI coding agent's capabilities across code quality, error resilience, and context retention.
What is Janus Labs?
Janus Labs profiles AI coding agents across multiple capability axes — similar to how 3DMark benchmarks GPUs across physics, graphics, and compute. Instead of a single score, you get a radar chart showing where your agent excels and where it struggles.
- Multi-Axis Profiling: Code Quality, Error Resilience, and Context Retention measured independently
- Radar Chart Fingerprints: Each agent produces a unique capability shape — flat scores hide real differences
- 14 Pre-computed Baselines: Compare against Claude, GPT, Gemini, and Copilot across multiple models
- Reproducible Results: Docker-isolated, standardized test suites with GEval LLM-judge scoring
Built on DeepEval for LLM evaluation. View results at janus-labs.dev.
Quick Start
Install
pip install janus-labs
Verify installation:
janus-labs --version # Shows: janus-labs 0.6.8
Troubleshooting: If
janus-labsisn't found, usepython -m janus_labs(underscore, not hyphen). To find the install path:pip show janus-labs. Bothjanus-labsandjanuscommands work identically.
Interactive Mode (New in v0.6.0)
Just run janus-labs with no arguments for a guided menu:
janus-labs
# ============================================================
# Janus Labs - 3DMark for AI Agents
# ============================================================
#
# What would you like to do?
# [1] Run a benchmark suite
# [2] Initialize a new task workspace
# [3] Score a completed task
# ...
Run Your First Benchmark
Janus Labs tests your actual agent on real coding tasks — then profiles the results across capability axes.
# Step 1: Initialize a benchmark task
cd your-project # Directory with your CLAUDE.md or agent config
janus-labs init --behavior BHV-002 # Prefix matching: BHV-002 → BHV-002-refactor-complexity
# Or run interactively:
janus-labs init # Shows menu of available behaviors
# This creates a task workspace:
# src/calculator.py - Starter code with a bug
# tests/test_calc.py - Tests that currently fail
# .janus-task.json - Task metadata
# README.md - Instructions for your agent
# Step 2: Let your AI agent solve it
# Use Claude Code, Cursor, Copilot, Windsurf, or any AI coding assistant
# Ask your agent: "Fix the bug in calculator.py so tests pass"
# Step 3: Score the result
janus-labs score
# Captures REAL git diffs and runs REAL pytest
# Output:
# Score: 83.6 (Grade A)
# Config: CLAUDE.md (hash: a1b2c3d4)
# Behaviors: Test integrity preserved ✓
# Step 4: Submit to leaderboard (optional)
janus-labs submit result.json --github your-handle
View Your Agent's Profile
After scoring, generate a capability profile to see your agent's strengths and blindspots:
# Generate radar profile from baselines
janus-labs profile --baselines-dir data/baselines
# Output:
# Agent/Model Composite Code Q Err Res Context Grade
# -------------------------------------------------------------------------
# claude/claude-sonnet-4-5 87.3 90.9 83.4 87.6 A
# codex/gpt-4o 82.4 75.7 83.4 88.2 B
# gemini/gemini-2.5-pro 44.6 43.7 0.0 90.2 F
# Compare your result against the vanilla baseline
janus-labs compare result.json --auto-baseline
Alternative: Install from Source
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e .
CLI Reference
All commands can be run as:
janus-labs <command>(full name)janus <command>(short alias)python -m janus_labs <command>(module invocation)
Global Options
janus-labs --version # Show version number
janus-labs --help # Show help
janus-labs # Launch interactive menu (no args)
Suite Shortcuts (New in v0.6.0)
Run suites directly without run --suite:
janus-labs refactor-storm # Same as: janus-labs run --suite refactor-storm
janus-labs refactor-storm --mock # With mock scoring
init - Initialize Benchmark Task (Start Here)
janus-labs init [options]
Options:
--behavior Behavior ID or prefix (interactive if omitted)
--suite Suite ID for full suite (default: refactor-storm)
--output, -o Output directory for task workspace
# Creates a git-initialized workspace with:
# - Starter code with intentional issues
# - Test files that validate the fix
# - Task metadata (.janus-task.json)
# - .gitignore (auto-excludes result.json)
Features:
- Interactive mode: Run
janus-labs initwithout--behaviorto see a menu - Prefix matching:
--behavior BHV-002matchesBHV-002-refactor-complexity - Actionable errors: All errors include "Try:" hints with example commands
status - Check Workspace Status
janus-labs status [options]
Options:
--workspace, -w Path to workspace (default: current directory)
# Shows:
# - Current behavior and suite
# - Git status (committed vs uncommitted changes)
# - Next step recommendation
score - Score Completed Task
janus-labs score [options]
Options:
--judge Use LLM-as-judge for additional scoring (requires API key)
--model LLM model for judge scoring (default: gpt-4o)
--output, -o Output file path (default: result.json)
# Evaluates your agent's work by:
# - Capturing git diffs since init
# - Running pytest on the test files
# - Checking behavior-specific rules (e.g., test cheating detection)
submit - Submit to Leaderboard
janus-labs submit <result.json> [options]
Options:
--dry-run Show payload without submitting
--github GitHub handle for attribution
Zero friction - no API key required for public leaderboard. Anti-cheat is handled via workspace hash validation.
compare - Regression Detection
janus-labs compare <baseline.json> <current.json> [options]
Options:
--threshold Regression threshold percentage (default: 5.0)
--config, -c Custom threshold config YAML file
--output, -o Save comparison result to JSON
--format Output: text, json, or github (default: text)
Exit codes:
0- No regression detected1- Regression detected (score dropped beyond threshold)2- HALT condition (governance intervention required)
run - Execute Full Suite (Advanced)
janus-labs run --suite <suite-id> [options]
Options:
--suite Suite ID to run (required)
--output, -o Output file path (default: result.json)
--format Output format: json, html, or both (default: json)
--judge Use LLM-as-judge scoring
--mock Use mock scoring (offline, deterministic)
--model LLM model for judge scoring (default: gpt-4o)
--no-interactive Disable prompts, auto-fallback on rate limit (for CI)
Rate Limit Resilience (v0.6.0): The backend judge now includes circuit breaker and exponential backoff. If rate limited, you'll be prompted to wait, switch to mock scoring, or abort.
smoke-test - Quick Validation
Note: This command uses mock data for quick validation. It does NOT test your CLAUDE.md or agent configuration. Use
init→scorefor real benchmarking.
janus-labs smoke-test [options]
Options:
--suite Suite ID (default: refactor-storm)
--behavior Behavior ID (default: BHV-001-test-cheating)
--submit Submit results to public leaderboard
--model LLM model for judge scoring (default: gpt-4o)
Deprecated:
janus-labs benchstill works but shows a deprecation warning.
export - Convert Result Formats
janus-labs export <input.json> --format <html|json> [-o output]
profile - Capability Profiling
# Generate radar profiles from baselines
janus-labs profile --baselines-dir data/baselines
# Single baseline profile
janus-labs profile --baseline data/baselines/baseline_claude_opus-4-6.json
# SVG radar leaderboard (top-5 overlay)
janus-labs profile --leaderboard
# K=3 reliability mode (variance-aware)
janus-labs profile --reliability
# JSON output for automation
janus-labs profile --baselines-dir data/baselines --json
diagnose - Instruction Health Check
# Compare configured scores vs vanilla baseline per-behavior
janus-labs diagnose result.json
# Detects instruction-behavior interference patterns
# Output: per-behavior delta, interference warnings, recommendations
baseline - Manage Baselines
# List all pre-computed baselines (14 agent/model combinations)
janus-labs baseline list
# Promote a result to baseline
janus-labs baseline update <result.json> [-o baseline.json] [--force]
# Show baseline info
janus-labs baseline show <baseline.json>
Built-in Suites
Refactor Storm (v1.3.0)
7 behaviors mapping to 3 capability axes:
| Behavior | Capability Axis | Description |
|---|---|---|
| BHV-001 Test Cheating | Code Quality | Agent fixes code, not tests |
| BHV-002 Refactor Complexity | Code Quality | Reduce cyclomatic complexity |
| BHV-003 Error Handling | Error Resilience | Add comprehensive error handling |
| BHV-004 Loop Detection | Error Resilience | Fix bugs without thrashing |
| BHV-005 Context Retention | Context Retention | Fix type inconsistency across 3 files |
| O-2.01 Instruction Adherence | — | Follow specific constraints |
| O-3.01 Code Quality | — | Produce clean, maintainable code |
Creating Custom Behaviors
Define behaviors using BehaviorSpec:
from forge.behavior import BehaviorSpec
MY_BEHAVIOR = BehaviorSpec(
behavior_id="BHV-100-my-behavior",
name="My Custom Behavior",
description="Agent should do X without doing Y",
rubric={
1: "Completely failed",
5: "Partial success with issues",
10: "Perfect execution",
},
threshold=7.0,
disconfirmers=["Agent did Y", "Agent skipped X"],
taxonomy_code="O-1.01", # See docs/TAXONOMY.md
version="1.0.0",
)
Architecture
janus-labs/
├── janus_labs/ # Python package (for python -m janus_labs)
├── cli/ # Command-line interface
├── config/ # Configuration detection
├── forge/ # Behavior specifications
├── gauge/ # DeepEval integration + Trust Elasticity
├── governance/ # Janus Protocol bridge (optional)
├── harness/ # Test execution sandbox
├── probe/ # Behavior discovery (Phoenix integration)
├── scaffold/ # Task workspace templates
├── suite/ # Suite definitions + exporters
└── tests/ # Test suite
VSCode Extension (New in v0.6.0)
A VSCode extension is available for command palette integration:
Features:
- Multi-step QuickPick flows for running benchmarks
- Status bar showing benchmark status
- Commands: Run Benchmark, Initialize Task, Score Task, Smoke Test
Installation: Build from source in vscode-extension/ directory:
cd vscode-extension
npm install
npm run compile
npm run package # Creates .vsix file
Install via: Extensions > ... > Install from VSIX
Integration
GitHub Actions
- name: Run Janus Labs Benchmark
run: |
pip install janus-labs
janus-labs run --suite refactor-storm
janus-labs compare baseline.json result.json --format github
CI/CD Regression Gating
Janus Labs provides deterministic pass/fail for CI pipelines with configurable per-behavior thresholds.
Baseline Workflow
# 1. Establish a baseline (first time or after intentional changes)
janus-labs run --suite refactor-storm -o result.json
janus-labs baseline update result.json -o baseline.json
git add baseline.json && git commit -m "Update baseline"
# 2. In CI: Compare against baseline
janus-labs run --suite refactor-storm -o current.json --no-interactive
janus-labs compare baseline.json current.json --format github
# Exit code: 0=pass, 1=regression, 2=error
# 3. Update baseline when scores improve
janus-labs baseline update current.json -o baseline.json --force
Exit Codes
| Code | Meaning | CI Action |
|---|---|---|
0 |
Pass - within thresholds | Continue pipeline |
1 |
Regression - score dropped beyond threshold | Fail build |
2 |
Error - incompatible results or HALT condition | Fail build, investigate |
Threshold Configuration
Create a thresholds.yaml for per-behavior regression limits:
# thresholds.yaml
suite_id: refactor-storm
default_max_regression_pct: 5.0 # Default: fail if score drops >5%
default_min_score: 60.0 # Optional: absolute minimum score
fail_on_any_halt: true # Fail if governance HALT triggered
behaviors:
BHV-001-test-cheating:
max_regression_pct: 3.0 # Stricter for critical behaviors
min_score: 70.0
required: true
BHV-002-loop-detection:
max_regression_pct: 10.0 # More lenient for experimental
required: false # Won't fail build if missing
BHV-003-context-retention:
max_regression_pct: 5.0
Use in CI:
janus-labs compare baseline.json current.json --config thresholds.yaml
Comparison JSON Output
The --output flag produces a JSON artifact for CI systems:
{
"suite_id": "refactor-storm",
"suite_version": "1.0.0",
"verdict": "pass",
"headline_baseline": 79.2,
"headline_current": 81.5,
"headline_delta_pct": 2.9,
"regressions": 0,
"warnings": 0,
"passes": 3,
"exit_code": 0,
"ci_message": "PASS: 0 regressions, 0 warnings, headline 81.5 (+2.9%)",
"behavior_comparisons": [
{
"behavior_id": "BHV-001-test-cheating",
"baseline_score": 79.3,
"current_score": 82.1,
"delta_pct": 3.5,
"threshold_pct": 5.0,
"verdict": "pass",
"message": "within thresholds"
}
]
}
GitHub Actions Full Example
name: Benchmark Regression
on:
push:
branches: [main]
pull_request:
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install Janus Labs
run: pip install janus-labs
- name: Run Benchmark
run: janus-labs run --suite refactor-storm -o current.json --no-interactive --mock
- name: Compare to Baseline
run: |
janus-labs compare baseline.json current.json \
--config thresholds.yaml \
--format github \
--output comparison.json
- name: Upload Artifacts
uses: actions/upload-artifact@v4
with:
name: benchmark-results
path: |
current.json
comparison.json
With Janus Protocol
Full governance integration is available when running within the AoP framework. The governance/ module bridges to Janus v3.6 for trust-elasticity tracking.
Requirements
- Python 3.12+ (3.12–3.13 recommended, 3.14 supported)
- Core dependencies: DeepEval, GitPython, PyYAML, Pydantic
Note: Phoenix telemetry is optional and requires Python <3.14. To enable Phoenix, run:
pip install -r requirements-phoenix.txt
Third-Party Licenses
- DeepEval - Apache 2.0
- Arize Phoenix - Elastic License 2.0
Contributing
See CONTRIBUTING.md for guidelines.
License
Apache 2.0 - See LICENSE
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file janus_labs-0.8.4.tar.gz.
File metadata
- Download URL: janus_labs-0.8.4.tar.gz
- Upload date:
- Size: 188.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
943f197493f3e40deca360d237e8ad3a7e6f9f67223a9129f6336b7c6e53ace1
|
|
| MD5 |
73eba3f15e8be2898b8eebe5d00fbd5b
|
|
| BLAKE2b-256 |
4e733f220465cc1afbf9611c2a74f39bb0560f0a90fb5fbaad3d0750099534e0
|
File details
Details for the file janus_labs-0.8.4-py3-none-any.whl.
File metadata
- Download URL: janus_labs-0.8.4-py3-none-any.whl
- Upload date:
- Size: 178.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c88e349419458e75bca08a305512b229b04da37b4cc2fa6afda854d53cdc8bc0
|
|
| MD5 |
1d5dc6d999bc271df473f5999969d83c
|
|
| BLAKE2b-256 |
09f7ea1bccc44aa0260da7efbca6d0baeaef6d2f3c7c9739d841280755b47297
|