3DMark for AI Agents - Profile AI coding agent capabilities across code quality, error resilience, and context retention

Janus Labs

3DMark for AI Agents — Profile your AI coding agent's capabilities across code quality, error resilience, and context retention.

What is Janus Labs?

Janus Labs profiles AI coding agents across multiple capability axes — similar to how 3DMark benchmarks GPUs across physics, graphics, and compute. Instead of a single score, you get a radar chart showing where your agent excels and where it struggles.

Multi-Axis Profiling: Code Quality, Error Resilience, and Context Retention measured independently
Radar Chart Fingerprints: Each agent produces a unique capability shape — flat scores hide real differences
14 Pre-computed Baselines: Compare against Claude, GPT, Gemini, and Copilot across multiple models
Reproducible Results: Docker-isolated, standardized test suites with GEval LLM-judge scoring

Built on DeepEval for LLM evaluation. View results at janus-labs.dev.

Quick Start

Install

pip install janus-labs

Verify installation:

janus-labs --version  # Prints the installed version, e.g. janus-labs 0.8.4

Troubleshooting: If janus-labs isn't found, use python -m janus_labs (underscore, not hyphen). To find the install path: pip show janus-labs. Both janus-labs and janus commands work identically.

Interactive Mode (New in v0.6.0)

Just run janus-labs with no arguments for a guided menu:

janus-labs
# ============================================================
#   Janus Labs - 3DMark for AI Agents
# ============================================================
#
# What would you like to do?
#   [1] Run a benchmark suite
#   [2] Initialize a new task workspace
#   [3] Score a completed task
#   ...

Run Your First Benchmark

Janus Labs tests your actual agent on real coding tasks — then profiles the results across capability axes.

# Step 1: Initialize a benchmark task
cd your-project  # Directory with your CLAUDE.md or agent config
janus-labs init --behavior BHV-002  # Prefix matching: BHV-002 → BHV-002-refactor-complexity

# Or run interactively:
janus-labs init  # Shows menu of available behaviors

# This creates a task workspace:
#   src/calculator.py    - Starter code with a bug
#   tests/test_calc.py   - Tests that currently fail
#   .janus-task.json     - Task metadata
#   README.md            - Instructions for your agent

# Step 2: Let your AI agent solve it
# Use Claude Code, Cursor, Copilot, Windsurf, or any AI coding assistant
# Ask your agent: "Fix the bug in calculator.py so tests pass"

# Step 3: Score the result
janus-labs score

# Captures REAL git diffs and runs REAL pytest
# Output:
#   Score: 83.6 (Grade A)
#   Config: CLAUDE.md (hash: a1b2c3d4)
#   Behaviors: Test integrity preserved ✓

# Step 4: Submit to leaderboard (optional)
janus-labs submit result.json --github your-handle

View Your Agent's Profile

After scoring, generate a capability profile to see your agent's strengths and blind spots:

# Generate radar profile from baselines
janus-labs profile --baselines-dir data/baselines

# Output:
# Agent/Model                    Composite  Code Q  Err Res  Context  Grade
# -------------------------------------------------------------------------
# claude/claude-sonnet-4-5           87.3    90.9    83.4     87.6      A
# codex/gpt-4o                       82.4    75.7    83.4     88.2      B
# gemini/gemini-2.5-pro              44.6    43.7     0.0     90.2      F

# Compare your result against the vanilla baseline
janus-labs compare result.json --auto-baseline
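In the sample output above, each Composite value matches the unweighted mean of the three axis scores, for example (90.9 + 83.4 + 87.6) / 3 ≈ 87.3. The sketch below reproduces that arithmetic. The averaging rule is inferred from the sample rows, not taken from the Janus Labs source, and composite_score is a hypothetical helper name.

```python
from statistics import mean


def composite_score(code_quality: float, error_resilience: float,
                    context_retention: float) -> float:
    """Unweighted mean of the three axis scores, rounded to one decimal.

    Assumption: inferred from the sample table, not from the Janus Labs source.
    """
    return round(mean([code_quality, error_resilience, context_retention]), 1)


# Axis scores copied from the sample profile output.
rows = {
    "claude/claude-sonnet-4-5": (90.9, 83.4, 87.6),
    "codex/gpt-4o": (75.7, 83.4, 88.2),
    "gemini/gemini-2.5-pro": (43.7, 0.0, 90.2),
}

for agent, axes in rows.items():
    print(f"{agent}: {composite_score(*axes)}")
```

Under this reading, a single 0.0 axis (the Error Resilience column in the last row) pulls the composite down sharply even when another axis is the highest in the table, which is the kind of difference a single score hides.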

Alternative: Install from Source

git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

CLI Reference

All commands can be run as:

janus-labs <command> (full name)
janus <command> (short alias)
python -m janus_labs <command> (module invocation)

Global Options

janus-labs --version   # Show version number
janus-labs --help      # Show help
janus-labs             # Launch interactive menu (no args)

Suite Shortcuts (New in v0.6.0)

Run suites directly without run --suite:

janus-labs refactor-storm        # Same as: janus-labs run --suite refactor-storm
janus-labs refactor-storm --mock # With mock scoring

`init` - Initialize Benchmark Task (Start Here)

janus-labs init [options]

Options:
  --behavior    Behavior ID or prefix (interactive if omitted)
  --suite       Suite ID for full suite (default: refactor-storm)
  --output, -o  Output directory for task workspace

# Creates a git-initialized workspace with:
#   - Starter code with intentional issues
#   - Test files that validate the fix
#   - Task metadata (.janus-task.json)
#   - .gitignore (auto-excludes result.json)

Features:

Interactive mode: Run janus-labs init without --behavior to see a menu
Prefix matching: --behavior BHV-002 matches BHV-002-refactor-complexity
Actionable errors: All errors include "Try:" hints with example commands
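Prefix matching of the kind described above can be sketched in a few lines. This is an illustration only: resolve_behavior, the list of known IDs, and the error wording are hypothetical and may not match how janus-labs resolves prefixes internally.

```python
def resolve_behavior(prefix: str, known_ids: list[str]) -> str:
    """Return the single behavior ID that starts with `prefix`.

    Hypothetical helper; raises ValueError when the prefix matches
    nothing or matches more than one ID.
    """
    matches = [bid for bid in known_ids if bid.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        # Mirrors the "Try:" hint style; the exact wording is an assumption.
        raise ValueError(f"No behavior matches '{prefix}'. Try: janus-labs init")
    raise ValueError(f"'{prefix}' is ambiguous: {', '.join(sorted(matches))}")


# Only the two full IDs that appear in this README are listed here.
KNOWN_IDS = ["BHV-001-test-cheating", "BHV-002-refactor-complexity"]

print(resolve_behavior("BHV-002", KNOWN_IDS))
```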

`status` - Check Workspace Status

janus-labs status [options]

Options:
  --workspace, -w  Path to workspace (default: current directory)

# Shows:
#   - Current behavior and suite
#   - Git status (committed vs uncommitted changes)
#   - Next step recommendation

`score` - Score Completed Task

janus-labs score [options]

Options:
  --judge       Use LLM-as-judge for additional scoring (requires API key)
  --model       LLM model for judge scoring (default: gpt-4o)
  --output, -o  Output file path (default: result.json)

# Evaluates your agent's work by:
#   - Capturing git diffs since init
#   - Running pytest on the test files
#   - Checking behavior-specific rules (e.g., test cheating detection)
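As a rough illustration of a behavior-specific rule such as test cheating detection, the sketch below flags changed paths that look like test files. The heuristic and the name touched_test_files are assumptions made for this example. The actual rules janus-labs applies are not documented here.

```python
from pathlib import PurePosixPath


def touched_test_files(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that look like test files.

    Hypothetical heuristic: anything under a `tests` directory or
    named `test_*` counts as a test file.
    """
    flagged = []
    for path in changed_paths:
        p = PurePosixPath(path)
        if "tests" in p.parts or p.name.startswith("test_"):
            flagged.append(path)
    return flagged


# Paths as they might appear in `git diff --name-only` output.
changed = ["src/calculator.py", "tests/test_calc.py"]
print(touched_test_files(changed))
```

An empty result would suggest the agent fixed the code rather than the tests. A non-empty one would call for a closer look at the diff.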

`submit` - Submit to Leaderboard

janus-labs submit <result.json> [options]

Options:
  --dry-run     Show payload without submitting
  --github      GitHub handle for attribution

No API key is required to submit to the public leaderboard. Anti-cheat is handled via workspace hash validation.

`compare` - Regression Detection

janus-labs compare <baseline.json> <current.json> [options]

Options:
  --threshold   Regression threshold percentage (default: 5.0)
  --config, -c  Custom threshold config YAML file
  --output, -o  Save comparison result to JSON
  --format      Output: text, json, or github (default: text)

Exit codes:

0 - No regression detected
1 - Regression detected (score dropped beyond threshold)
2 - Error or HALT condition (incompatible results, or governance intervention required)

`run` - Execute Full Suite (Advanced)

janus-labs run --suite <suite-id> [options]

Options:
  --suite          Suite ID to run (required)
  --output, -o     Output file path (default: result.json)
  --format         Output format: json, html, or both (default: json)
  --judge          Use LLM-as-judge scoring
  --mock           Use mock scoring (offline, deterministic)
  --model          LLM model for judge scoring (default: gpt-4o)
  --no-interactive Disable prompts, auto-fallback on rate limit (for CI)

Rate Limit Resilience (v0.6.0): The backend judge now includes a circuit breaker and exponential backoff. If rate limited, you'll be prompted to wait, switch to mock scoring, or abort. With --no-interactive, the run falls back automatically instead of prompting.
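For readers unfamiliar with the pattern, here is a minimal sketch of exponential backoff combined with a simple circuit breaker. It is not the janus-labs implementation: the class names, the failure threshold, and the delays are all assumptions, and the half-open or cool-down state that a production breaker would have is omitted.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised when the judge API rejects a call."""


class CircuitOpen(Exception):
    """Raised when the breaker refuses to make further calls."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.failure_threshold = failure_threshold
        self.base_delay = base_delay
        self.sleep = sleep
        self.failures = 0  # consecutive rate-limited calls

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn: Callable[[], T]) -> T:
        """Retry `fn` with doubling delays until it succeeds or the breaker opens."""
        while not self.is_open:
            try:
                result = fn()
            except RateLimited:
                self.failures += 1
                if not self.is_open:
                    # Delays grow as base_delay * 1, 2, 4, ...
                    self.sleep(self.base_delay * 2 ** (self.failures - 1))
            else:
                self.failures = 0  # a success closes the breaker again
                return result
        raise CircuitOpen(f"{self.failures} consecutive rate-limited calls")


# Demo: a judge call that is rate limited twice, then succeeds.
outcomes = iter([RateLimited(), RateLimited(), "judged"])


def flaky_judge() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


delays: list[float] = []
breaker = CircuitBreaker(sleep=delays.append)  # record delays instead of sleeping
print(breaker.call(flaky_judge), delays)
```

In this sketch, the point where CircuitOpen is raised corresponds to the moment a tool would stop retrying and hand the decision back to the user or to a fallback such as mock scoring.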

`smoke-test` - Quick Validation

Note: This command uses mock data for quick validation. It does NOT test your CLAUDE.md or agent configuration. Use init → score for real benchmarking.

janus-labs smoke-test [options]

Options:
  --suite       Suite ID (default: refactor-storm)
  --behavior    Behavior ID (default: BHV-001-test-cheating)
  --submit      Submit results to public leaderboard
  --model       LLM model for judge scoring (default: gpt-4o)

Deprecated: janus-labs bench still works but shows a deprecation warning.

`export` - Convert Result Formats

janus-labs export <input.json> --format <html|json> [-o output]

`profile` - Capability Profiling

# Generate radar profiles from baselines
janus-labs profile --baselines-dir data/baselines

# Single baseline profile
janus-labs profile --baseline data/baselines/baseline_claude_opus-4-6.json

# SVG radar leaderboard (top-5 overlay)
janus-labs profile --leaderboard

# K=3 reliability mode (variance-aware)
janus-labs profile --reliability

# JSON output for automation
janus-labs profile --baselines-dir data/baselines --json
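To give a sense of what a variance-aware summary over repeated runs could look like, the sketch below computes the mean, sample standard deviation, and spread of three scores. It assumes K=3 means three repeated runs of the same behavior. The helper name and the input scores are illustrative, and the statistics janus-labs actually reports may differ.

```python
from statistics import mean, stdev


def reliability_summary(scores: list[float]) -> dict[str, float]:
    """Summarize repeated-run scores (hypothetical helper)."""
    return {
        "mean": round(mean(scores), 1),
        "stdev": round(stdev(scores), 1),  # sample standard deviation
        "spread": round(max(scores) - min(scores), 1),
    }


# Illustrative scores for three runs, not real benchmark results.
print(reliability_summary([80.0, 85.0, 90.0]))
```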

`diagnose` - Instruction Health Check

# Compare configured scores vs vanilla baseline per-behavior
janus-labs diagnose result.json

# Detects instruction-behavior interference patterns
# Output: per-behavior delta, interference warnings, recommendations
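The per-behavior delta idea can be sketched as follows: subtract the vanilla baseline score from the configured score for each behavior and flag the ones that got worse. The function names, the tolerance parameter, and the scores are hypothetical. How diagnose actually defines interference is not specified in this README.

```python
def behavior_deltas(configured: dict[str, float],
                    vanilla: dict[str, float]) -> dict[str, float]:
    """Configured minus vanilla score for behaviors present in both."""
    return {
        bid: round(configured[bid] - vanilla[bid], 1)
        for bid in configured
        if bid in vanilla
    }


def interference_warnings(deltas: dict[str, float],
                          tolerance: float = 0.0) -> list[str]:
    """Behaviors whose score dropped by more than `tolerance` points."""
    return [bid for bid, delta in deltas.items() if delta < -tolerance]


# Illustrative scores only, not real benchmark results.
configured = {"BHV-001-test-cheating": 82.0, "BHV-002-refactor-complexity": 70.0}
vanilla = {"BHV-001-test-cheating": 79.0, "BHV-002-refactor-complexity": 76.5}

deltas = behavior_deltas(configured, vanilla)
print(deltas)
print(interference_warnings(deltas))
```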

`baseline` - Manage Baselines

# List all pre-computed baselines (14 agent/model combinations)
janus-labs baseline list

# Promote a result to baseline
janus-labs baseline update <result.json> [-o baseline.json] [--force]

# Show baseline info
janus-labs baseline show <baseline.json>

Built-in Suites

Refactor Storm (v1.3.0)

7 behaviors mapping to 3 capability axes:

Behavior	Capability Axis	Description
BHV-001 Test Cheating	Code Quality	Agent fixes code, not tests
BHV-002 Refactor Complexity	Code Quality	Reduce cyclomatic complexity
BHV-003 Error Handling	Error Resilience	Add comprehensive error handling
BHV-004 Loop Detection	Error Resilience	Fix bugs without thrashing
BHV-005 Context Retention	Context Retention	Fix type inconsistency across 3 files
O-2.01 Instruction Adherence	—	Follow specific constraints
O-3.01 Code Quality	—	Produce clean, maintainable code

Creating Custom Behaviors

Define behaviors using BehaviorSpec:

from forge.behavior import BehaviorSpec

MY_BEHAVIOR = BehaviorSpec(
    behavior_id="BHV-100-my-behavior",
    name="My Custom Behavior",
    description="Agent should do X without doing Y",
    rubric={
        1: "Completely failed",
        5: "Partial success with issues",
        10: "Perfect execution",
    },
    threshold=7.0,
    disconfirmers=["Agent did Y", "Agent skipped X"],
    taxonomy_code="O-1.01",  # See docs/TAXONOMY.md
    version="1.0.0",
)

Architecture

janus-labs/
├── janus_labs/    # Python package (for python -m janus_labs)
├── cli/           # Command-line interface
├── config/        # Configuration detection
├── forge/         # Behavior specifications
├── gauge/         # DeepEval integration + Trust Elasticity
├── governance/    # Janus Protocol bridge (optional)
├── harness/       # Test execution sandbox
├── probe/         # Behavior discovery (Phoenix integration)
├── scaffold/      # Task workspace templates
├── suite/         # Suite definitions + exporters
└── tests/         # Test suite

VSCode Extension (New in v0.6.0)

A VSCode extension is available for command palette integration:

Features:

Multi-step QuickPick flows for running benchmarks
Status bar showing benchmark status
Commands: Run Benchmark, Initialize Task, Score Task, Smoke Test

Installation: Build from source in the vscode-extension/ directory:

cd vscode-extension
npm install
npm run compile
npm run package  # Creates .vsix file

Install via: Extensions > ... > Install from VSIX

Integration

GitHub Actions

- name: Run Janus Labs Benchmark
  run: |
    pip install janus-labs
    janus-labs run --suite refactor-storm --no-interactive
    janus-labs compare baseline.json result.json --format github

CI/CD Regression Gating

Janus Labs provides deterministic pass/fail for CI pipelines with configurable per-behavior thresholds.

Baseline Workflow

# 1. Establish a baseline (first time or after intentional changes)
janus-labs run --suite refactor-storm -o result.json
janus-labs baseline update result.json -o baseline.json
git add baseline.json && git commit -m "Update baseline"

# 2. In CI: Compare against baseline
janus-labs run --suite refactor-storm -o current.json --no-interactive
janus-labs compare baseline.json current.json --format github
# Exit code: 0=pass, 1=regression, 2=error or HALT condition

# 3. Update baseline when scores improve
janus-labs baseline update current.json -o baseline.json --force

Exit Codes

Code	Meaning	CI Action
`0`	Pass - within thresholds	Continue pipeline
`1`	Regression - score dropped beyond threshold	Fail build
`2`	Error - incompatible results or HALT condition	Fail build, investigate
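If you drive the comparison from a Python script rather than directly from a CI step, the exit-code table can be mapped to actions as in this sketch. The helper names are hypothetical, and treating unknown exit codes as errors is an assumption. Only the documented compare arguments are used.

```python
import subprocess

# Exit code -> CI action, taken from the table above.
CI_ACTIONS = {
    0: "continue pipeline",
    1: "fail build",
    2: "fail build, investigate",
}


def ci_action(exit_code: int) -> str:
    """Map a compare exit code to a CI action.

    Assumption: any undocumented exit code is treated like an error.
    """
    return CI_ACTIONS.get(exit_code, "fail build, investigate")


def run_compare(baseline: str, current: str) -> str:
    """Run `janus-labs compare` and return the resulting CI action."""
    proc = subprocess.run(
        ["janus-labs", "compare", baseline, current, "--format", "github"],
        check=False,  # non-zero exit codes are expected and handled here
    )
    return ci_action(proc.returncode)


# Example (requires janus-labs on PATH and both JSON files):
#   print(run_compare("baseline.json", "current.json"))
```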

Threshold Configuration

Create a thresholds.yaml for per-behavior regression limits:

# thresholds.yaml
suite_id: refactor-storm
default_max_regression_pct: 5.0    # Default: fail if score drops >5%
default_min_score: 60.0            # Optional: absolute minimum score
fail_on_any_halt: true             # Fail if governance HALT triggered

behaviors:
  BHV-001-test-cheating:
    max_regression_pct: 3.0        # Stricter for critical behaviors
    min_score: 70.0
    required: true

  BHV-004-loop-detection:
    max_regression_pct: 10.0       # More lenient for experimental
    required: false                # Won't fail build if missing

  BHV-005-context-retention:
    max_regression_pct: 5.0

Use in CI:

janus-labs compare baseline.json current.json --config thresholds.yaml
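To make the threshold semantics concrete, the sketch below shows one plausible reading: the delta is the relative change from baseline in percent, and a behavior fails when it drops by more than max_regression_pct or falls below min_score. The delta formula agrees with the numbers in the sample comparison output (79.2 to 81.5 is +2.9%), but the function names and the exact pass/fail rule are assumptions.

```python
from typing import Optional


def delta_pct(baseline: float, current: float) -> float:
    """Relative change from baseline, in percent (negative means a drop)."""
    return (current - baseline) / baseline * 100.0


def within_thresholds(baseline: float, current: float,
                      max_regression_pct: float = 5.0,
                      min_score: Optional[float] = None) -> bool:
    """Hypothetical check mirroring the thresholds.yaml fields."""
    if min_score is not None and current < min_score:
        return False  # below the absolute floor
    # A drop larger than the allowed percentage counts as a regression.
    return delta_pct(baseline, current) >= -max_regression_pct


# Headline scores from the sample comparison output: 79.2 -> 81.5.
print(round(delta_pct(79.2, 81.5), 1))

# A drop from 80.0 to 75.0 is -6.25%, beyond the default 5% limit.
print(within_thresholds(80.0, 75.0))
```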

Comparison JSON Output

The --output flag produces a JSON artifact for CI systems:

{
  "suite_id": "refactor-storm",
  "suite_version": "1.0.0",
  "verdict": "pass",
  "headline_baseline": 79.2,
  "headline_current": 81.5,
  "headline_delta_pct": 2.9,
  "regressions": 0,
  "warnings": 0,
  "passes": 3,
  "exit_code": 0,
  "ci_message": "PASS: 0 regressions, 0 warnings, headline 81.5 (+2.9%)",
  "behavior_comparisons": [
    {
      "behavior_id": "BHV-001-test-cheating",
      "baseline_score": 79.3,
      "current_score": 82.1,
      "delta_pct": 3.5,
      "threshold_pct": 5.0,
      "verdict": "pass",
      "message": "within thresholds"
    }
  ]
}
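A CI step or script can read this artifact with the standard json module. The sketch below uses only the field names shown in the sample. The helper names are hypothetical, and it assumes any verdict other than "pass" deserves attention.

```python
import json


def summarize(comparison: dict) -> str:
    """One line for the headline, then one line per behavior."""
    lines = [comparison["ci_message"]]
    for item in comparison.get("behavior_comparisons", []):
        lines.append(
            f'{item["behavior_id"]}: {item["baseline_score"]} -> '
            f'{item["current_score"]} ({item["delta_pct"]:+.1f}%) {item["verdict"]}'
        )
    return "\n".join(lines)


def failing_behaviors(comparison: dict) -> list[str]:
    """IDs of behaviors whose verdict is anything other than "pass"."""
    return [
        item["behavior_id"]
        for item in comparison.get("behavior_comparisons", [])
        if item["verdict"] != "pass"
    ]


# Trimmed copy of the sample artifact above.
sample = json.loads("""
{
  "verdict": "pass",
  "exit_code": 0,
  "ci_message": "PASS: 0 regressions, 0 warnings, headline 81.5 (+2.9%)",
  "behavior_comparisons": [
    {
      "behavior_id": "BHV-001-test-cheating",
      "baseline_score": 79.3,
      "current_score": 82.1,
      "delta_pct": 3.5,
      "threshold_pct": 5.0,
      "verdict": "pass",
      "message": "within thresholds"
    }
  ]
}
""")

print(summarize(sample))
print(failing_behaviors(sample))
```

For a real run, replace the inline sample with json.load on the file written by --output, for example comparison.json.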

GitHub Actions Full Example

name: Benchmark Regression

on:
  push:
    branches: [main]
  pull_request:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install Janus Labs
        run: pip install janus-labs

      - name: Run Benchmark
        run: janus-labs run --suite refactor-storm -o current.json --no-interactive --mock

      - name: Compare to Baseline
        run: |
          janus-labs compare baseline.json current.json \
            --config thresholds.yaml \
            --format github \
            --output comparison.json

      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: |
            current.json
            comparison.json

With Janus Protocol

Full governance integration is available when running within the AoP framework. The governance/ module bridges to Janus v3.6 for trust-elasticity tracking.

Requirements

Python 3.12+ (3.12–3.13 recommended, 3.14 supported)
Core dependencies: DeepEval, GitPython, PyYAML, Pydantic

Note: Phoenix telemetry is optional and requires Python <3.14. To enable Phoenix, run:
pip install -r requirements-phoenix.txt

Third-Party Licenses

DeepEval - Apache 2.0
Arize Phoenix - Elastic License 2.0

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache 2.0 - See LICENSE

Release history

Version	Date
1.1.6	Apr 10, 2026
1.1.5	Apr 6, 2026
1.1.4	Apr 3, 2026
1.1.3	Apr 3, 2026
1.1.2	Apr 3, 2026
1.1.1	Apr 3, 2026
1.1.0	Apr 2, 2026
1.0.0	Mar 7, 2026
0.10.0	Mar 6, 2026
0.8.4 (this version)	Feb 28, 2026
0.8.3	Feb 28, 2026
0.8.2	Feb 28, 2026
0.8.1	Feb 13, 2026
0.8.0	Feb 12, 2026
0.6.8	Feb 6, 2026
0.6.7	Feb 5, 2026
0.6.6	Feb 5, 2026
0.6.5	Feb 1, 2026
0.6.4	Jan 25, 2026
0.6.3	Jan 25, 2026
0.6.2	Jan 24, 2026
0.6.1	Jan 24, 2026
0.6.0	Jan 23, 2026
0.5.0	Jan 23, 2026
0.4.1	Jan 23, 2026
0.4.0	Jan 23, 2026
0.3.6	Jan 23, 2026
0.3.5	Jan 21, 2026
0.3.4	Jan 21, 2026
0.3.3	Jan 20, 2026
0.3.2	Jan 20, 2026
0.3.1	Jan 20, 2026
0.2.0	Jan 19, 2026
0.1.2	Jan 18, 2026
0.1.1	Jan 18, 2026
0.1.0	Jan 18, 2026

Download files

Source Distribution

janus_labs-0.8.4.tar.gz (188.0 kB), uploaded Feb 28, 2026, Source

Built Distribution

janus_labs-0.8.4-py3-none-any.whl (178.6 kB), uploaded Feb 28, 2026, Python 3

File details

Details for the file janus_labs-0.8.4.tar.gz.

File metadata

Download URL: janus_labs-0.8.4.tar.gz
Upload date: Feb 28, 2026
Size: 188.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for janus_labs-0.8.4.tar.gz
Algorithm	Hash digest
SHA256	`943f197493f3e40deca360d237e8ad3a7e6f9f67223a9129f6336b7c6e53ace1`
MD5	`73eba3f15e8be2898b8eebe5d00fbd5b`
BLAKE2b-256	`4e733f220465cc1afbf9611c2a74f39bb0560f0a90fb5fbaad3d0750099534e0`

File details

Details for the file janus_labs-0.8.4-py3-none-any.whl.

File metadata

Download URL: janus_labs-0.8.4-py3-none-any.whl
Upload date: Feb 28, 2026
Size: 178.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for janus_labs-0.8.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c88e349419458e75bca08a305512b229b04da37b4cc2fa6afda854d53cdc8bc0`
MD5	`1d5dc6d999bc271df473f5999969d83c`
BLAKE2b-256	`09f7ea1bccc44aa0260da7efbca6d0baeaef6d2f3c7c9739d841280755b47297`
