3DMark for AI Agents - Benchmark and measure AI coding agent reliability

These details have not been verified by PyPI

Project links

Project description

Janus Labs

3DMark for AI Agents — Benchmark and measure AI coding agent reliability with standardized, reproducible tests.

What is Janus Labs?

Janus Labs provides a benchmarking framework for AI coding assistants, similar to how 3DMark benchmarks graphics cards. It enables:

Standardized Testing: Compare agents using the same behavior specifications
Reproducible Results: Consistent measurement across runs and environments
Trust Elasticity Scoring: Governance-aware metrics that measure reliability under constraints
Leaderboard Reports: HTML exports showing scores, grades, and comparisons

Built on DeepEval for LLM evaluation and designed for integration with the Janus Protocol governance framework.

Quick Start

Option A: Install from PyPI (Recommended)

pip install janus-labs

Then run:

# If janus-labs is in your PATH:
janus-labs bench --suite refactor-storm

# If not in PATH (common on Windows), use module syntax:
python -m janus_labs bench --suite refactor-storm

Windows PATH Troubleshooting

On Windows, pip install --user installs executables to a location not in PATH by default: %APPDATA%\Python\Python3XX\Scripts

Option 1: Use module syntax (no PATH changes needed)

python -m janus_labs bench --suite refactor-storm

Option 2 - Add Scripts to PATH:

# Find your Scripts folder
python -c "import sysconfig; print(sysconfig.get_path('scripts'))"

# Add that path to your system PATH environment variable

Option 3 - Use a virtual environment:

python -m venv .venv
.venv\Scripts\activate
pip install janus-labs
janus-labs bench  # Works because venv Scripts is in activated PATH

Option B: Install from Source (Development)

# Clone the repository
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install in editable mode
pip install -e .

Run Your First Benchmark

# Zero-friction benchmark (detects your config, scores, shows result)
janus-labs bench

# Or with module syntax:
python -m janus_labs bench

# Full suite run with HTML report:
janus-labs run --suite refactor-storm --format both

# This creates:
#   result.json  - Machine-readable results
#   result.html  - Visual leaderboard report

3. View Results

Open result.html in your browser to see:

Headline score (0-100) with letter grade (S/A/B/C/D/F)
Per-behavior breakdown
Configuration badge showing default vs custom agent config

4. Compare Runs

# Save current result as baseline
python -m cli baseline update result.json --output baseline.json

# Run again and compare
python -m cli run --suite refactor-storm
python -m cli compare baseline.json result.json

CLI Reference

All commands can be run as janus-labs <command> or python -m janus_labs <command>.

`bench` - Zero-Friction Benchmark (Recommended)

janus-labs bench [options]

Options:
  --suite       Suite ID (default: refactor-storm)
  --behavior    Behavior ID (default: BHV-001-test-cheating)
  --submit      Submit results to public leaderboard
  --github      GitHub handle for attribution (requires --submit)
  --model       LLM model for judge scoring (default: gpt-4o)
  --no-copy     Don't copy share URL to clipboard

`run` - Execute a Benchmark Suite

janus-labs run --suite <suite-id> [options]

Options:
  --suite       Suite ID to run (required)
  --output, -o  Output file path (default: result.json)
  --format      Output format: json, html, or both (default: json)
  --judge       Use LLM-as-judge scoring (slower but differentiates)
  --model       LLM model for judge scoring (default: gpt-4o)

`submit` - Submit to Leaderboard

janus-labs submit <result.json> [options]

Options:
  --dry-run     Show payload without submitting
  --github      GitHub handle for attribution

`compare` - Regression Detection

janus-labs compare <baseline.json> <current.json> [options]

Options:
  --threshold   Regression threshold percentage (default: 5.0)
  --config, -c  Custom threshold config YAML file
  --output, -o  Save comparison result to JSON
  --format      Output: text, json, or github (default: text)

Exit codes:

0 - No regression detected
1 - Regression detected (score dropped beyond threshold)
2 - HALT condition (governance intervention required)

`export` - Convert Result Formats

janus-labs export <input.json> --format <html|json> [-o output]

`baseline` - Manage Baselines

# Promote a result to baseline
janus-labs baseline update <result.json> [-o baseline.json] [--force]

# Show baseline info
janus-labs baseline show <baseline.json>

Built-in Suites

Refactor Storm

Tests agent reliability during complex refactoring tasks:

Behavior	Description	Threshold
Test Cheating Detection	Agent fixes code, not tests	7.0
Loop Detection Compliance	Agent responds to governance signals	8.0
Context Retention	Agent maintains context across iterations	6.0

Creating Custom Behaviors

Define behaviors using BehaviorSpec:

from forge.behavior import BehaviorSpec

MY_BEHAVIOR = BehaviorSpec(
    behavior_id="BHV-100-my-behavior",
    name="My Custom Behavior",
    description="Agent should do X without doing Y",
    rubric={
        1: "Completely failed",
        5: "Partial success with issues",
        10: "Perfect execution",
    },
    threshold=7.0,
    disconfirmers=["Agent did Y", "Agent skipped X"],
    taxonomy_code="O-1.01",  # See docs/TAXONOMY.md
    version="1.0.0",
)

Architecture

janus-labs/
├── janus_labs/    # Python package (for python -m janus_labs)
├── cli/           # Command-line interface
├── config/        # Configuration detection
├── forge/         # Behavior specifications
├── gauge/         # DeepEval integration + Trust Elasticity
├── governance/    # Janus Protocol bridge (optional)
├── harness/       # Test execution sandbox
├── probe/         # Behavior discovery (Phoenix integration)
├── scaffold/      # Task workspace templates
├── suite/         # Suite definitions + exporters
└── tests/         # Test suite

Integration

GitHub Actions

- name: Run Janus Labs Benchmark
  run: |
    pip install janus-labs
    janus-labs run --suite refactor-storm
    janus-labs compare baseline.json result.json --format github

With Janus Protocol

Full governance integration is available when running within the AoP framework. The governance/ module bridges to Janus v3.6 for trust-elasticity tracking.

Requirements

Python 3.12+ (3.12–3.13 recommended, 3.14 supported)
Core dependencies: DeepEval, GitPython, PyYAML, Pydantic

Note: Phoenix telemetry is optional and requires Python <3.14. To enable Phoenix, run:
pip install -r requirements-phoenix.txt

Third-Party Licenses

DeepEval - Apache 2.0
Arize Phoenix - Elastic License 2.0

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache 2.0 - See LICENSE

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.1.6

Apr 10, 2026

1.1.5

Apr 6, 2026

1.1.4

Apr 3, 2026

1.1.3

Apr 3, 2026

1.1.2

Apr 3, 2026

1.1.1

Apr 3, 2026

1.1.0

Apr 2, 2026

1.0.0

Mar 7, 2026

0.10.0

Mar 6, 2026

0.8.4

Feb 28, 2026

0.8.3

Feb 28, 2026

0.8.2

Feb 28, 2026

0.8.1

Feb 13, 2026

0.8.0

Feb 12, 2026

0.6.8

Feb 6, 2026

0.6.7

Feb 5, 2026

0.6.6

Feb 5, 2026

0.6.5

Feb 1, 2026

0.6.4

Jan 25, 2026

0.6.3

Jan 25, 2026

0.6.2

Jan 24, 2026

0.6.1

Jan 24, 2026

0.6.0

Jan 23, 2026

0.5.0

Jan 23, 2026

0.4.1

Jan 23, 2026

0.4.0

Jan 23, 2026

0.3.6

Jan 23, 2026

0.3.5

Jan 21, 2026

0.3.4

Jan 21, 2026

0.3.3

Jan 20, 2026

0.3.2

Jan 20, 2026

0.3.1

Jan 20, 2026

This version

0.2.0

Jan 19, 2026

0.1.2

Jan 18, 2026

0.1.1

Jan 18, 2026

0.1.0

Jan 18, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

janus_labs-0.2.0.tar.gz (97.7 kB view details)

Uploaded Jan 19, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

janus_labs-0.2.0-py3-none-any.whl (88.0 kB view details)

Uploaded Jan 19, 2026 Python 3

File details

Details for the file janus_labs-0.2.0.tar.gz.

File metadata

Download URL: janus_labs-0.2.0.tar.gz
Upload date: Jan 19, 2026
Size: 97.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for janus_labs-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`fda450882471a0cb62f6aa6b52624c89a633677c48fcca27cc861fa69c8db456`
MD5	`05551e139851dc0ac64fbd6875b97d1b`
BLAKE2b-256	`f6dd3eddbc226ddacf1f77c800b6bc190cacf64310639dea0bff56eb282c9980`

See more details on using hashes here.

File details

Details for the file janus_labs-0.2.0-py3-none-any.whl.

File metadata

Download URL: janus_labs-0.2.0-py3-none-any.whl
Upload date: Jan 19, 2026
Size: 88.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for janus_labs-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2aa30f0b59fc00133aab845babde7d2c509fc1b492bb092fb5953eb29fd2b0b6`
MD5	`903170db1493222fd99c83641c6cb6aa`
BLAKE2b-256	`7ede40512a559237669ae09fe62469a30e9ac98c831e9e376bc0eaf318e69dcd`

See more details on using hashes here.

janus-labs 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Janus Labs

What is Janus Labs?

Quick Start

Option A: Install from PyPI (Recommended)

Windows PATH Troubleshooting

Option B: Install from Source (Development)

Run Your First Benchmark

3. View Results

4. Compare Runs

CLI Reference

bench - Zero-Friction Benchmark (Recommended)

run - Execute a Benchmark Suite

submit - Submit to Leaderboard

compare - Regression Detection

export - Convert Result Formats

baseline - Manage Baselines

Built-in Suites

Refactor Storm

Creating Custom Behaviors

Architecture

Integration

GitHub Actions

With Janus Protocol

Requirements

Third-Party Licenses

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`bench` - Zero-Friction Benchmark (Recommended)

`run` - Execute a Benchmark Suite

`submit` - Submit to Leaderboard

`compare` - Regression Detection

`export` - Convert Result Formats

`baseline` - Manage Baselines