
Janus Labs


3DMark for AI Agents. Profile AI coding agents across two active axes: Code Quality and Error Resilience. Measure Instruction Resilience separately with janus-labs diagnose when you have configured and vanilla results to compare.

What Janus Labs Does

Janus Labs benchmarks real coding-agent behavior on reproducible tasks. Instead of reducing everything to one opaque number, it runs a fixed suite of coding tasks and produces a capability profile that shows where an agent is strong, weak, or uneven.

  • refactor-storm ships 4 built-in behaviors grouped into 2 active axes
  • janus-labs run is the primary workflow for generating a full suite result
  • Backend-hosted judging is available without a local API key, and --mock supports offline dry runs
  • Results can be compared against bundled baselines and submitted to a public leaderboard

Public leaderboard: https://fulfilling-courtesy-production-9c2c.up.railway.app

Quick Start

Install

pip install janus-labs
janus-labs --version

If janus-labs is not on your PATH, use:

python -m janus_labs --version

Run The Full Suite

The primary workflow is run: execute the full suite, save one result file, then submit it.

# Run the full suite offline with mock scoring
janus-labs run --suite refactor-storm --mock -o result.json

# Or use backend-hosted judging (no API key needed)
janus-labs run --suite refactor-storm -o result.json

# Suite alias for the same workflow
janus-labs refactor-storm -o result.json

# Submit to the leaderboard
janus-labs submit result.json --github your-handle

On Windows, replace janus-labs with python -m janus_labs if the launcher is not available in PATH.

Alternative: Single-Behavior Manual Workflow

Use init -> status -> score when you want to hand a single behavior workspace to an external agent and inspect the repo diff yourself.

# Generate one workspace per behavior
janus-labs init --suite refactor-storm --output ./janus-task

# Let your agent work in one behavior directory
cd janus-task/BHV-001-test-cheating

# Inspect the workspace and score the finished result
janus-labs status --workspace .
janus-labs score --workspace . --output result.json

Compare Against Baselines

# Generate capability profiles from bundled baseline results
janus-labs profile --baselines-dir data/baselines

# Compare your result against an auto-selected vanilla baseline
janus-labs compare result.json --auto-baseline

# Measure optional instruction resilience
janus-labs diagnose result.json

Install From Source

git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

CLI Reference

All commands can be run as:

  • janus-labs <command>
  • janus <command>
  • python -m janus_labs <command>

Global

janus-labs --help
janus-labs --version
janus-labs

Running janus-labs with no arguments opens the interactive menu.

run

Run an end-to-end suite directly.

janus-labs run --suite refactor-storm --mock -o result.json
janus-labs run --suite refactor-storm -o result.json
janus-labs refactor-storm -o result.json

Options:

  • --suite: suite ID, required
  • --output, -o: output file, default result.json
  • --format: json, html, or both
  • --mock: use deterministic offline scoring
  • --judge: use local LLM-as-judge scoring
  • --model: judge model, default gpt-4o
  • --no-interactive: disable prompts on backend rate limits
  • --request-delay: delay between judge requests
  • --full: run the full init -> agent -> score -> output pipeline
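After a run completes, the result file can be loaded with the standard library. A minimal sketch, assuming only that the output is valid JSON — the exact result schema is not documented here:

```python
import json
from pathlib import Path

def load_result(path: str = "result.json") -> dict:
    """Load a suite result written by `janus-labs run -o result.json`.

    The result schema is not documented in this README, so inspect the
    top-level keys before relying on any particular field.
    """
    return json.loads(Path(path).read_text())
```

Printing `sorted(load_result())` is a quick way to see which top-level fields a given janus-labs version emits.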

init

Initialize workspaces for every behavior in a suite.

janus-labs init --suite refactor-storm --output ./janus-task

Options:

  • --suite: suite ID, default refactor-storm
  • --output, -o: output directory for generated behavior workspaces

status

Inspect a task workspace and get the recommended next step.

janus-labs status --workspace ./janus-task/BHV-001-test-cheating

score

Score one completed task workspace.

janus-labs score --workspace ./janus-task/BHV-001-test-cheating --output result.json

Options:

  • --workspace, -w: workspace path, default current directory
  • --output, -o: output file, default result.json
  • --judge: enable LLM-as-judge scoring
  • --model: judge model, default gpt-4o
  • --bundle: optional bundle file for judge scoring
  • --agent: override detected agent identifier
  • --agent-model: override detected agent model

submit

Submit a scored result to the public leaderboard.

janus-labs submit result.json --github your-handle

Options:

  • --dry-run: print the payload without submitting
  • --github: GitHub handle for attribution
  • --no-judge: skip backend LLM judging during submission

compare

Detect regressions between two results, or compare a result against a precomputed vanilla baseline.

janus-labs compare baseline.json current.json
janus-labs compare result.json --auto-baseline
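The regression check can be sketched in a few lines. The per-axis score dictionaries below are an assumed shape for illustration, not janus-labs' documented format:

```python
def find_regressions(baseline: dict, current: dict, threshold: float = 0.05) -> list:
    """Return axes whose score dropped from baseline by more than `threshold`.

    `baseline` and `current` map axis names to scores in [0, 1]; this
    shape is an assumption, not the tool's actual result schema.
    """
    return sorted(
        axis for axis in baseline
        if axis in current and baseline[axis] - current[axis] > threshold
    )
```

A CI job could fail the build whenever this list is non-empty, which mirrors what `compare` is used for in the GitHub Actions snippet below.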

profile

Generate capability profiles from one baseline file or a directory of baselines.

janus-labs profile --baseline data/baselines/baseline.json
janus-labs profile --baselines-dir data/baselines
janus-labs profile --baselines-dir data/baselines --json
janus-labs profile --leaderboard
janus-labs profile --reliability
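Conceptually, a capability profile rolls per-behavior scores up into per-axis scores. A sketch using the behavior-to-axis mapping from the built-in suite described below; the plain averaging is an assumption about how janus-labs aggregates, not a documented formula:

```python
from statistics import mean

# Behavior -> axis mapping from the built-in refactor-storm suite.
AXIS_OF = {
    "BHV-001": "Code Quality",
    "BHV-002": "Code Quality",
    "BHV-003": "Error Resilience",
    "BHV-004": "Error Resilience",
}

def axis_profile(behavior_scores: dict) -> dict:
    """Average per-behavior scores into per-axis scores."""
    buckets: dict[str, list[float]] = {}
    for bhv, score in behavior_scores.items():
        buckets.setdefault(AXIS_OF[bhv], []).append(score)
    return {axis: mean(vals) for axis, vals in buckets.items()}
```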

diagnose

Measure optional instruction resilience by comparing configured results against vanilla baselines.

janus-labs diagnose result.json

smoke-test

Quick validation path using mock data. This does not benchmark your real config.

janus-labs smoke-test --suite refactor-storm

baseline

Inspect or update baseline files.

janus-labs baseline list
janus-labs baseline show baseline.json
janus-labs baseline update result.json -o baseline.json

Built-In Suite

Refactor Storm (v2.0.0)

refactor-storm ships 4 built-in behaviors grouped into 2 active axes:

  • BHV-001 Test Cheating Detection -> Code Quality
  • BHV-002 Reduce Cyclomatic Complexity -> Code Quality
  • BHV-003 Add Comprehensive Error Handling -> Error Resilience
  • BHV-004 Loop Detection Compliance -> Error Resilience

Composite score = avg(Code Quality, Error Resilience).
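That formula is small enough to state directly:

```python
def composite_score(code_quality: float, error_resilience: float) -> float:
    """Unweighted mean of the two active axes, as defined above."""
    return (code_quality + error_resilience) / 2
```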

janus-labs diagnose can compute optional Instruction Resilience separately by comparing configured runs against vanilla baselines. It is not part of the 2-axis composite.
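One way to picture what diagnose compares, assuming it works on per-axis scores from a configured run and a vanilla baseline — the format it actually reports is not specified here:

```python
def resilience_delta(configured: dict, vanilla: dict) -> dict:
    """Per-axis difference between a configured run and its vanilla baseline.

    Positive values suggest the configuration held up or helped; negative
    values suggest it degraded the agent. The dict-of-axis-scores shape
    is an illustrative assumption, not janus-labs' actual schema.
    """
    return {
        axis: configured[axis] - vanilla[axis]
        for axis in configured.keys() & vanilla.keys()
    }
```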

GitHub Actions

- name: Install Janus Labs
  run: pip install janus-labs

- name: Run mock suite
  run: janus-labs run --suite refactor-storm --mock --no-interactive -o current.json

- name: Compare to baseline
  run: janus-labs compare baseline.json current.json --format github

Requirements

  • Python 3.12+
  • Core dependencies include DeepEval, GitPython, PyYAML, and Pydantic

Phoenix telemetry is optional and requires Python <3.14:

pip install -r requirements-phoenix.txt

Contributing

See CONTRIBUTING.md.

License

Apache 2.0. See LICENSE.
