Janus Labs

3DMark for AI Agents. Profile AI coding agents across code quality, error resilience, and context retention.

What Janus Labs Does

Janus Labs benchmarks real coding-agent behavior on reproducible tasks. Instead of reducing everything to one score, it produces a capability profile so you can see where an agent is strong, weak, or uneven.

  • Multi-axis profiling across code quality, error resilience, and context retention
  • Reproducible task workspaces with real source files, tests, and git diffs
  • Public leaderboard and shareable result pages
  • Baseline comparison against precomputed agent/model runs

Public leaderboard: https://fulfilling-courtesy-production-9c2c.up.railway.app

Quick Start

Install

pip install janus-labs
janus-labs --version

If janus-labs is not on your PATH, use:

python -m janus_labs --version
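
In scripts, the same fallback can be automated. The following is a minimal sketch, not part of Janus Labs itself: the helper name janus_command is hypothetical, and it relies only on the two invocation forms shown above.

```python
import shutil
import sys


def janus_command(*args: str) -> list[str]:
    """Build an argv list for Janus Labs, preferring the console script.

    Hypothetical helper: falls back to `python -m janus_labs` when the
    `janus-labs` executable is not found on PATH.
    """
    executable = shutil.which("janus-labs")
    if executable is not None:
        return [executable, *args]
    # Reuse the interpreter running this script so the same environment is used.
    return [sys.executable, "-m", "janus_labs", *args]
```

The returned list can be passed directly to subprocess.run, for example subprocess.run(janus_command("--version"), check=True).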

Benchmark Your Agent

The init command currently generates workspaces for an entire suite at once. Each behavior gets its own git-initialized subdirectory.

# 1. Generate the suite workspaces
janus-labs init --suite refactor-storm --output ./janus-task

That creates a structure like:

janus-task/
  BHV-001-test-cheating/
  BHV-002-refactor-complexity/
  BHV-003-error-handling/
  ...

# 2. Pick one behavior workspace and let your agent work inside it
cd janus-task/BHV-001-test-cheating

# 3. Check workspace state
janus-labs status --workspace .

# 4. Score the completed task
janus-labs score --workspace . --output result.json

# 5. Submit to the leaderboard (optional)
janus-labs submit result.json --github your-handle
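
Because init creates one workspace per behavior and score handles one workspace at a time, scoring a whole suite means repeating step 4 for each subdirectory. A minimal sketch of that loop follows. The function name score_commands is hypothetical, and it assumes every immediate subdirectory of the output directory is a behavior workspace whose task the agent has already completed.

```python
from pathlib import Path


def score_commands(suite_dir: Path) -> list[list[str]]:
    """Return one `janus-labs score` argv per behavior workspace.

    Assumption: every immediate subdirectory of `suite_dir` is a behavior
    workspace created by `janus-labs init`. Plain files are ignored.
    """
    commands = []
    for workspace in sorted(p for p in suite_dir.iterdir() if p.is_dir()):
        commands.append([
            "janus-labs", "score",
            "--workspace", str(workspace),
            # Mirror the quick start: write result.json inside the workspace.
            "--output", str(workspace / "result.json"),
        ])
    return commands
```

Each returned list can be run with subprocess.run(argv, check=True). Only the --workspace and --output flags documented for score are used.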

Compare Against Baselines

# Generate capability profiles from bundled baseline results
janus-labs profile --baselines-dir data/baselines

# Compare your result against an auto-selected vanilla baseline
janus-labs compare result.json --auto-baseline

Install From Source

git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

CLI Reference

All commands can be run as:

  • janus-labs <command>
  • janus <command>
  • python -m janus_labs <command>

Global

janus-labs --help
janus-labs --version
janus-labs

Running janus-labs with no arguments opens the interactive menu.

init

Initialize workspaces for every behavior in a suite.

janus-labs init --suite refactor-storm --output ./janus-task

Options:

  • --suite: suite ID, default refactor-storm
  • --output, -o: output directory for generated behavior workspaces

status

Inspect a task workspace and get the recommended next step.

janus-labs status --workspace ./janus-task/BHV-001-test-cheating

score

Score one completed task workspace.

janus-labs score --workspace ./janus-task/BHV-001-test-cheating --output result.json

Options:

  • --workspace, -w: workspace path, default current directory
  • --output, -o: output file, default result.json
  • --judge: enable LLM-as-judge scoring
  • --model: judge model, default gpt-4o
  • --bundle: optional bundle file for judge scoring
  • --agent: override detected agent identifier
  • --agent-model: override detected agent model

submit

Submit a scored result to the public leaderboard.

janus-labs submit result.json --github your-handle

Options:

  • --dry-run: print the payload without submitting
  • --github: GitHub handle for attribution

compare

Detect regressions between two results, or compare a result against a precomputed vanilla baseline.

janus-labs compare baseline.json current.json
janus-labs compare result.json --auto-baseline

profile

Generate capability profiles from one baseline file or a directory of baselines.

janus-labs profile --baseline data/baselines/baseline.json
janus-labs profile --baselines-dir data/baselines
janus-labs profile --baselines-dir data/baselines --json
janus-labs profile --leaderboard
janus-labs profile --reliability

run

Run an end-to-end suite directly.

janus-labs run --suite refactor-storm --mock

smoke-test

A quick validation path that uses mock data. It does not benchmark your real configuration.

janus-labs smoke-test --suite refactor-storm

diagnose

Analyze instruction-file interference patterns from a scored result.

janus-labs diagnose result.json

baseline

Inspect or update baseline files.

janus-labs baseline list
janus-labs baseline show baseline.json
janus-labs baseline update result.json -o baseline.json

Built-In Suite

Refactor Storm (v1.5.0)

refactor-storm currently ships 10 built-in behaviors:

  • BHV-001 Test Cheating Detection
  • BHV-002 Refactor Complexity
  • BHV-003 Error Handling
  • BHV-004 Loop Detection
  • BHV-005 Context Retention
  • BHV-008 Error Propagation Chain
  • BHV-009 Cross-Module Extract Method
  • BHV-010 Integration Contract Change
  • O-2.01 Instruction Adherence
  • O-3.01 Code Quality

The public radar profile is still centered on the validated 3-axis view. Tier 2 multi-file behaviors are available in the suite and are being recalibrated for the next profile revision.

GitHub Actions

- name: Install Janus Labs
  run: pip install janus-labs

- name: Run mock suite
  run: janus-labs run --suite refactor-storm --mock --no-interactive -o current.json

- name: Compare to baseline
  run: janus-labs compare baseline.json current.json --format github

Requirements

  • Python 3.12+
  • Core dependencies include DeepEval, GitPython, PyYAML, and Pydantic

Phoenix telemetry is optional and requires Python <3.14:

pip install -r requirements-phoenix.txt
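
Taken together, the two version constraints leave a narrow window for Phoenix telemetry. The sketch below encodes them as a check you could run before installing the optional requirements. The function name python_support is hypothetical, and combining the two bounds into 3.12 <= version < 3.14 is an inference from the requirements stated above.

```python
import sys


def python_support(version: tuple[int, int] = sys.version_info[:2]) -> dict[str, bool]:
    """Report which Janus Labs features the given Python version supports.

    Hypothetical helper based on the stated requirements:
    core needs Python 3.12+, Phoenix telemetry additionally needs < 3.14.
    """
    return {
        "janus_labs": version >= (3, 12),
        "phoenix_telemetry": (3, 12) <= version < (3, 14),
    }
```

Called with no arguments, it reports on the running interpreter. Passing a tuple such as (3, 13) checks another version.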

Contributing

See CONTRIBUTING.md.

License

Apache 2.0. See LICENSE.
