3DMark for AI Agents - Profile AI coding agent capabilities across code quality, error resilience, and instruction resilience

Janus Labs

3DMark for AI Agents. Profile AI coding agents across two active axes: Code Quality and Error Resilience. Measure Instruction Resilience separately with janus-labs diagnose when you have configured and vanilla results to compare.

What Janus Labs Does

Janus Labs benchmarks real coding-agent behavior on reproducible tasks. Instead of reducing everything to one opaque number, it runs a fixed suite of coding tasks and produces a capability profile that shows where an agent is strong, weak, or uneven.

refactor-storm ships 4 built-in behaviors grouped into 2 active axes
janus-labs run is the primary workflow for generating a full suite result
Backend-hosted judging is available without a local API key, and --mock supports offline dry runs
Results can be compared against bundled baselines and submitted to a public leaderboard

Public leaderboard: https://fulfilling-courtesy-production-9c2c.up.railway.app

Quick Start

1. Install

pip install janus-labs
janus-labs doctor

doctor checks Python version, dependencies, which agent CLIs are on your PATH, and API keys. On Windows, use python -m janus_labs if janus-labs is not in PATH.
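
As a rough illustration of the PATH part of that check, the sketch below uses the standard library's shutil.which to report which of the four preset agent names resolve to an executable. It is not the actual doctor implementation, and it assumes each agent's executable has the same name as its preset, which this README does not confirm.

```python
import shutil

# Preset names listed in this README. That each CLI's executable shares
# the preset name is an assumption.
AGENT_PRESETS = ("codex", "claude", "gemini", "copilot")


def find_agent_clis(names=AGENT_PRESETS, which=shutil.which):
    """Map each agent name to its resolved executable path, or None."""
    return {name: which(name) for name in names}


for name, path in find_agent_clis().items():
    print(f"{name:8} {path or 'not found on PATH'}")
```

The `which` parameter exists only so the lookup can be stubbed out in tests.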

2. Benchmark Your Agent

The primary workflow: run the full suite with your agent, then submit.

# Benchmark Codex (or claude, gemini, copilot)
janus-labs run --full --agent codex --suite refactor-storm -o result.json

# Custom agent command (any CLI that accepts a prompt)
janus-labs run --full --agent-cmd "my-agent --prompt {prompt_file}" --suite refactor-storm -o result.json

# Submit to the public leaderboard
janus-labs submit result.json --github your-handle

run --full initializes the 4 behavior workspaces, runs your agent in each one, scores the outcomes, and writes a single result file. Built-in agent presets: codex, claude, gemini, copilot.
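
To make the custom command template above more concrete, here is a minimal sketch of one way a template with a {prompt_file} or {prompt_content} placeholder could be turned into an argument list. The function name expand_agent_cmd and the example prompt path are hypothetical, and Janus Labs may expand templates differently.

```python
import shlex


def expand_agent_cmd(template, prompt_file, prompt_content=""):
    """Split a command template into argv, then fill the placeholders.

    Splitting before substitution keeps a prompt that contains spaces
    or quotes inside a single argument.
    """
    argv = shlex.split(template)
    return [
        arg.replace("{prompt_file}", prompt_file)
           .replace("{prompt_content}", prompt_content)
        for arg in argv
    ]


# The prompt path below is a made-up example, not a path Janus Labs uses.
argv = expand_agent_cmd("my-agent --prompt {prompt_file}", "/tmp/task/PROMPT.md")
print(argv)
```

Note that shlex.split follows POSIX quoting rules, so templates containing Windows-style backslash paths would need different handling.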

3. Try It Without an Agent

Don't have an agent CLI handy? Use mock or backend-hosted scoring to explore the pipeline:

# Offline mock scoring (instant, deterministic, no API key)
janus-labs run --suite refactor-storm --mock -o result.json

# Backend-hosted judging (no local API key needed)
janus-labs run --suite refactor-storm -o result.json

# Suite alias shortcut
janus-labs refactor-storm -o result.json

These modes score the unmodified scaffold code. They are useful for testing the pipeline, setting up CI, or exploring the output format.

4. Compare and Profile

# Compare your result against an auto-selected vanilla baseline
janus-labs compare result.json --auto-baseline

# Generate capability profiles from bundled baseline results
janus-labs profile --baselines-dir data/baselines

# Measure optional instruction resilience (needs configured + vanilla results)
janus-labs diagnose result.json

Alternative: Single-Behavior Manual Workflow

Use init -> status -> score when you want to hand a single behavior workspace to an external agent and inspect the repo diff yourself.

janus-labs init --suite refactor-storm --output ./janus-task
cd janus-task/BHV-001-test-cheating
janus-labs status --workspace .  # optional: show the recommended next step
# ... let your agent work ...
janus-labs score --workspace . --output result.json
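
If you want to script the manual workflow, the sketch below assembles the init, status, and score commands for one behavior using only the flags shown in this README. The helper name manual_workflow is hypothetical. It only prints the commands, so nothing runs until you decide to execute them.

```python
import shlex
from pathlib import Path


def manual_workflow(task_dir, behavior, suite="refactor-storm"):
    """Return the init, status, and score commands for one behavior."""
    workspace = str(Path(task_dir) / behavior)
    return [
        ["janus-labs", "init", "--suite", suite, "--output", str(task_dir)],
        ["janus-labs", "status", "--workspace", workspace],
        ["janus-labs", "score", "--workspace", workspace, "--output", "result.json"],
    ]


for cmd in manual_workflow("janus-task", "BHV-001-test-cheating"):
    print(shlex.join(cmd))
```

To execute a step, pass the corresponding list to subprocess.run(cmd, check=True), and let your agent work in the workspace between the status and score steps.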

Install From Source

git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

CLI Reference

All commands can be run as:

janus-labs <command>
janus <command>
python -m janus_labs <command>
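
For scripts that must work whether or not the console scripts are on PATH, a small helper can pick the first usable form from the three above. This is a sketch: the helper name janus_launcher is hypothetical, and the module fallback assumes janus-labs is installed for the interpreter running the script.

```python
import shutil
import sys


def janus_launcher(which=shutil.which):
    """Return the argv prefix for the first usable way to start the CLI."""
    for name in ("janus-labs", "janus"):
        if which(name):
            return [name]
    # Fall back to running the package as a module with this interpreter.
    return [sys.executable, "-m", "janus_labs"]


print(janus_launcher() + ["--version"])
```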

Global

janus-labs --help
janus-labs --version
janus-labs

Running janus-labs with no arguments opens the interactive menu.

`run`

Run an end-to-end suite directly.

janus-labs run --suite refactor-storm --mock -o result.json
janus-labs run --suite refactor-storm -o result.json
janus-labs refactor-storm -o result.json

Options:

--full: run the full init -> agent -> score -> output pipeline (the primary workflow)
--agent: built-in agent preset (codex, claude, gemini, copilot). Requires --full
--agent-cmd: custom agent command template. Use {prompt_file} for file path or {prompt_content} for inline prompt. Requires --full
--timeout: agent timeout in seconds per behavior (default: 300)
--suite: suite ID, required
--output, -o: output file, default result.json
--format: json, html, or both
--mock: use deterministic offline scoring (no agent needed)
--judge: use local LLM-as-judge scoring
--model: judge model, default gpt-4o
--no-interactive: disable prompts on backend rate limits
--request-delay: delay between judge requests
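
The "Requires --full" notes above can be enforced before the CLI is ever invoked. The sketch below builds a run command from a subset of the listed options and rejects --agent or --agent-cmd without --full. The function build_run_argv is a hypothetical wrapper, not part of Janus Labs, and it covers only the flags it names.

```python
def build_run_argv(suite, *, full=False, agent=None, agent_cmd=None,
                   mock=False, timeout=None, output="result.json"):
    """Assemble a `janus-labs run` argv list from the documented options."""
    if (agent or agent_cmd) and not full:
        raise ValueError("--agent and --agent-cmd require --full")
    argv = ["janus-labs", "run", "--suite", suite, "-o", output]
    if full:
        argv.append("--full")
    if agent:
        argv += ["--agent", agent]
    if agent_cmd:
        argv += ["--agent-cmd", agent_cmd]
    if mock:
        argv.append("--mock")
    if timeout is not None:
        # Per-behavior agent timeout in seconds; the CLI default is 300.
        argv += ["--timeout", str(timeout)]
    return argv


print(build_run_argv("refactor-storm", full=True, agent="codex", timeout=600))
```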

`init`

Initialize workspaces for every behavior in a suite.

janus-labs init --suite refactor-storm --output ./janus-task

Options:

--suite: suite ID, default refactor-storm
--output, -o: output directory for generated behavior workspaces

`status`

Inspect a task workspace and get the recommended next step.

janus-labs status --workspace ./janus-task/BHV-001-test-cheating

`score`

Score one completed task workspace.

janus-labs score --workspace ./janus-task/BHV-001-test-cheating --output result.json

Options:

--workspace, -w: workspace path, default current directory
--output, -o: output file, default result.json
--judge: enable LLM-as-judge scoring
--model: judge model, default gpt-4o
--bundle: optional bundle file for judge scoring
--agent: override detected agent identifier
--agent-model: override detected agent model

`submit`

Submit a scored result to the public leaderboard.

janus-labs submit result.json --github your-handle

Options:

--dry-run: print the payload without submitting
--github: GitHub handle for attribution
--no-judge: skip backend LLM judging during submission
--no-open: skip opening the share URL in a browser (useful in CI)

`compare`

Detect regressions between two results, or compare a result against a precomputed vanilla baseline.

janus-labs compare baseline.json current.json
janus-labs compare result.json --auto-baseline

`profile`

Generate capability profiles from one baseline file or a directory of baselines.

janus-labs profile --baseline data/baselines/baseline.json
janus-labs profile --baselines-dir data/baselines
janus-labs profile --baselines-dir data/baselines --json
janus-labs profile --leaderboard
janus-labs profile --reliability

`diagnose`

Measure optional instruction resilience by comparing configured results against vanilla baselines.

janus-labs diagnose result.json

`smoke-test`

Quick validation path using mock data. This does not benchmark your real config.

janus-labs smoke-test --suite refactor-storm

`baseline`

Inspect or update baseline files.

janus-labs baseline list
janus-labs baseline show baseline.json
janus-labs baseline update result.json -o baseline.json

Built-In Suite

Refactor Storm (`v2.0.0`)

refactor-storm ships 4 built-in behaviors grouped into 2 active axes:

BHV-001 Test Cheating Detection -> Code Quality
BHV-002 Reduce Cyclomatic Complexity -> Code Quality
BHV-003 Add Comprehensive Error Handling -> Error Resilience
BHV-004 Loop Detection Compliance -> Error Resilience

Composite score = avg(Code Quality, Error Resilience).

janus-labs diagnose can compute optional Instruction Resilience separately by comparing configured runs against vanilla baselines. It is not part of the 2-axis composite.
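
A worked example of the composite formula, as a sketch under two assumptions this README does not state: each axis score is the unweighted mean of its two behaviors, and scores sit on a 0-100 scale. The behavior scores are made up for illustration. Instruction Resilience is deliberately absent because it is not part of the composite.

```python
from statistics import mean

# Behavior-to-axis mapping from the Refactor Storm list above.
AXES = {
    "Code Quality": ("BHV-001", "BHV-002"),
    "Error Resilience": ("BHV-003", "BHV-004"),
}


def composite(behavior_scores):
    """Average each axis (assumed unweighted), then average the two axes."""
    axis_scores = {
        axis: mean(behavior_scores[b] for b in behaviors)
        for axis, behaviors in AXES.items()
    }
    return axis_scores, mean(axis_scores.values())


# Made-up scores on an assumed 0-100 scale.
axes, total = composite(
    {"BHV-001": 80.0, "BHV-002": 60.0, "BHV-003": 90.0, "BHV-004": 70.0}
)
print(axes, total)
```

With these inputs, Code Quality averages to 70, Error Resilience to 80, and the composite to 75.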

GitHub Actions

- name: Install Janus Labs
  run: pip install janus-labs

- name: Run mock suite
  run: janus-labs run --suite refactor-storm --mock --no-interactive -o current.json

- name: Compare to baseline
  run: janus-labs compare baseline.json current.json --format github

Requirements

Python 3.12+
Core dependencies include DeepEval, GitPython, PyYAML, and Pydantic

Phoenix telemetry is optional and requires Python <3.14:

pip install -r requirements-phoenix.txt
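
The two version bounds above combine into a simple check. The sketch below is illustrative only, and the helper name janus_support is hypothetical.

```python
import sys


def janus_support(version=sys.version_info):
    """Check an interpreter version against the bounds stated above."""
    major_minor = tuple(version[:2])
    core = major_minor >= (3, 12)
    # Phoenix telemetry additionally needs an interpreter below 3.14.
    phoenix = core and major_minor < (3, 14)
    return {"core": core, "phoenix": phoenix}


print(janus_support())
```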

Contributing

See CONTRIBUTING.md.

License

Apache 2.0. See LICENSE.

Release history

1.1.6: Apr 10, 2026 (this version)
1.1.5: Apr 6, 2026
1.1.4: Apr 3, 2026
1.1.3: Apr 3, 2026
1.1.2: Apr 3, 2026
1.1.1: Apr 3, 2026
1.1.0: Apr 2, 2026
1.0.0: Mar 7, 2026
0.10.0: Mar 6, 2026
0.8.4: Feb 28, 2026
0.8.3: Feb 28, 2026
0.8.2: Feb 28, 2026
0.8.1: Feb 13, 2026
0.8.0: Feb 12, 2026
0.6.8: Feb 6, 2026
0.6.7: Feb 5, 2026
0.6.6: Feb 5, 2026
0.6.5: Feb 1, 2026
0.6.4: Jan 25, 2026
0.6.3: Jan 25, 2026
0.6.2: Jan 24, 2026
0.6.1: Jan 24, 2026
0.6.0: Jan 23, 2026
0.5.0: Jan 23, 2026
0.4.1: Jan 23, 2026
0.4.0: Jan 23, 2026
0.3.6: Jan 23, 2026
0.3.5: Jan 21, 2026
0.3.4: Jan 21, 2026
0.3.3: Jan 20, 2026
0.3.2: Jan 20, 2026
0.3.1: Jan 20, 2026
0.2.0: Jan 19, 2026
0.1.2: Jan 18, 2026
0.1.1: Jan 18, 2026
0.1.0: Jan 18, 2026

Download files

Source Distribution

janus_labs-1.1.6.tar.gz (218.0 kB)

Uploaded Apr 10, 2026 Source

Built Distribution

janus_labs-1.1.6-py3-none-any.whl (210.4 kB)

Uploaded Apr 10, 2026 Python 3

File details

Details for the file janus_labs-1.1.6.tar.gz.

File metadata

Download URL: janus_labs-1.1.6.tar.gz
Upload date: Apr 10, 2026
Size: 218.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for janus_labs-1.1.6.tar.gz
Algorithm	Hash digest
SHA256	`ed061b3140c7bbf7c3c38d8c61c3c74348a96814e979a0b54d187893747fb06e`
MD5	`a624e4a23edb72b24a7ae88c47f72d1e`
BLAKE2b-256	`2a5a801bb2490401dd74435a32750121671c8aa30f46f64e895df76b3423313d`

File details

Details for the file janus_labs-1.1.6-py3-none-any.whl.

File metadata

Download URL: janus_labs-1.1.6-py3-none-any.whl
Upload date: Apr 10, 2026
Size: 210.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for janus_labs-1.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ed264d19583ffe9a1b6d055ff360955bb5d3a4dee3122db0db3d5f2112cf4557`
MD5	`6bbd47c92714886aa0b2352454a0a4d4`
BLAKE2b-256	`6d616bda36a73c1e96925f0683a415fa607146246d709076d69372c5a96b3202`
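
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library. The helper names sha256_of and matches_published are hypothetical, and the example in the trailing comment assumes the wheel was saved to the current directory.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return hmac.compare_digest(sha256_of(path), published_hex.strip().lower())


# Example, assuming the wheel is in the current directory:
# matches_published(
#     "janus_labs-1.1.6-py3-none-any.whl",
#     "ed264d19583ffe9a1b6d055ff360955bb5d3a4dee3122db0db3d5f2112cf4557",
# )
```

pip can also enforce digests itself through hash-checking mode, using --require-hashes with a requirements file that pins each hash.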
