3DMark for AI Agents - Profile AI coding agent capabilities across code quality, error resilience, and instruction resilience
Project description
Janus Labs
3DMark for AI Agents. Profile AI coding agents across two active axes: Code Quality and Error Resilience. Measure Instruction Resilience separately with janus-labs diagnose when you have configured and vanilla results to compare.
What Janus Labs Does
Janus Labs benchmarks real coding-agent behavior on reproducible tasks. Instead of reducing everything to one opaque number, it runs a fixed suite of coding tasks and produces a capability profile that shows where an agent is strong, weak, or uneven.
- refactor-storm ships 4 built-in behaviors grouped into 2 active axes
- janus-labs run is the primary workflow for generating a full suite result
- Backend-hosted judging is available without a local API key, and --mock supports offline dry runs
- Results can be compared against bundled baselines and submitted to a public leaderboard
Public leaderboard: https://fulfilling-courtesy-production-9c2c.up.railway.app
Quick Start
1. Install
pip install janus-labs
janus-labs doctor
doctor checks Python version, dependencies, which agent CLIs are on your PATH, and API keys. On Windows, use python -m janus_labs if janus-labs is not in PATH.
2. Benchmark Your Agent
The primary workflow: run the full suite with your agent, then submit.
# Benchmark Codex (or claude, gemini, copilot)
janus-labs run --full --agent codex --suite refactor-storm -o result.json
# Custom agent command (any CLI that accepts a prompt)
janus-labs run --full --agent-cmd "my-agent --prompt {prompt_file}" --suite refactor-storm -o result.json
# Submit to the public leaderboard
janus-labs submit result.json --github your-handle
This initializes 4 behavior workspaces, runs your agent on each, scores the outcomes, and produces a single result file. Built-in agent presets: codex, claude, gemini, copilot.
3. Try It Without an Agent
Don't have an agent CLI handy? Use mock or backend-hosted scoring to explore the pipeline:
# Offline mock scoring (instant, deterministic, no API key)
janus-labs run --suite refactor-storm --mock -o result.json
# Backend-hosted judging (no local API key needed)
janus-labs run --suite refactor-storm -o result.json
# Suite alias shortcut
janus-labs refactor-storm -o result.json
These modes score the unmodified scaffold code and are useful for testing the pipeline, CI setup, or exploring the output format.
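The result file is plain JSON, so it is easy to post-process in scripts or CI. The layout below is purely illustrative: the field names ("behaviors", "axis", "score") are assumptions for the sketch, not the documented Janus Labs schema.

```python
import json

# Hypothetical result payload -- field names are illustrative only,
# not the real result.json schema produced by janus-labs.
raw = json.dumps({
    "suite": "refactor-storm",
    "behaviors": [
        {"id": "BHV-001", "axis": "Code Quality", "score": 0.82},
        {"id": "BHV-003", "axis": "Error Resilience", "score": 0.64},
    ],
})

result = json.loads(raw)  # in practice: json.load(open("result.json"))
scores = {b["id"]: b["score"] for b in result["behaviors"]}
print(scores)
```

Inspect an actual result.json from a mock run to see the real field names before building tooling on top of it.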
4. Compare and Profile
# Compare your result against an auto-selected vanilla baseline
janus-labs compare result.json --auto-baseline
# Generate capability profiles from bundled baseline results
janus-labs profile --baselines-dir data/baselines
# Measure optional instruction resilience (needs configured + vanilla results)
janus-labs diagnose result.json
Alternative: Single-Behavior Manual Workflow
Use init -> status -> score when you want to hand a single behavior workspace to an external agent and inspect the repo diff yourself.
janus-labs init --suite refactor-storm --output ./janus-task
cd janus-task/BHV-001-test-cheating
# ... let your agent work ...
janus-labs score --workspace . --output result.json
Install From Source
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e .
CLI Reference
All commands can be run as:
janus-labs <command>
janus <command>
python -m janus_labs <command>
Global
janus-labs --help
janus-labs --version
janus-labs
Running janus-labs with no arguments opens the interactive menu.
run
Run an end-to-end suite directly.
janus-labs run --suite refactor-storm --mock -o result.json
janus-labs run --suite refactor-storm -o result.json
janus-labs refactor-storm -o result.json
Options:
- --full: run the full init -> agent -> score -> output pipeline (the primary workflow)
- --agent: built-in agent preset (codex, claude, gemini, copilot). Requires --full
- --agent-cmd: custom agent command template. Use {prompt_file} for a file path or {prompt_content} for an inline prompt. Requires --full
- --timeout: agent timeout in seconds per behavior (default: 300)
- --suite: suite ID, required
- --output, -o: output file, default result.json
- --format: json, html, or both
- --mock: use deterministic offline scoring (no agent needed)
- --judge: use local LLM-as-judge scoring
- --model: judge model, default gpt-4o
- --no-interactive: disable prompts on backend rate limits
- --request-delay: delay between judge requests
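As a rough sketch of how an --agent-cmd template could be expanded per behavior: the {prompt_file} and {prompt_content} placeholder names come from the option above, but the expansion logic shown here is an assumption for illustration, not Janus Labs internals.

```python
# Expand an --agent-cmd style template into a concrete command string.
# Placeholder names match the documented ones; the mechanism is illustrative.
def expand(template: str, prompt_file: str, prompt_content: str) -> str:
    # str.format ignores unused keyword arguments, so a template may use
    # {prompt_file}, {prompt_content}, or both.
    return template.format(prompt_file=prompt_file,
                           prompt_content=prompt_content)

cmd = expand("my-agent --prompt {prompt_file}",
             prompt_file="/tmp/bhv-001-prompt.txt",
             prompt_content="Refactor the module.")
print(cmd)  # my-agent --prompt /tmp/bhv-001-prompt.txt
```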
init
Initialize workspaces for every behavior in a suite.
janus-labs init --suite refactor-storm --output ./janus-task
Options:
- --suite: suite ID, default refactor-storm
- --output, -o: output directory for generated behavior workspaces
status
Inspect a task workspace and get the recommended next step.
janus-labs status --workspace ./janus-task/BHV-001-test-cheating
score
Score one completed task workspace.
janus-labs score --workspace ./janus-task/BHV-001-test-cheating --output result.json
Options:
- --workspace, -w: workspace path, default current directory
- --output, -o: output file, default result.json
- --judge: enable LLM-as-judge scoring
- --model: judge model, default gpt-4o
- --bundle: optional bundle file for judge scoring
- --agent: override detected agent identifier
- --agent-model: override detected agent model
submit
Submit a scored result to the public leaderboard.
janus-labs submit result.json --github your-handle
Options:
- --dry-run: print the payload without submitting
- --github: GitHub handle for attribution
- --no-judge: skip backend LLM judging during submission
- --no-open: skip opening the share URL in a browser (useful in CI)
compare
Detect regressions between two results, or compare a result against a precomputed vanilla baseline.
janus-labs compare baseline.json current.json
janus-labs compare result.json --auto-baseline
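Conceptually, regression detection is a per-behavior score diff. A minimal sketch under an assumed result layout with per-behavior scores (not the real compare implementation):

```python
# Flag behaviors whose score dropped by more than `threshold` between a
# baseline result and a current result. Schema and threshold are illustrative.
def regressions(baseline, current, threshold=0.05):
    base = {b["id"]: b["score"] for b in baseline["behaviors"]}
    return [
        {"id": b["id"], "before": base[b["id"]], "after": b["score"]}
        for b in current["behaviors"]
        if b["id"] in base and base[b["id"]] - b["score"] > threshold
    ]

baseline = {"behaviors": [{"id": "BHV-001", "score": 0.80},
                          {"id": "BHV-002", "score": 0.70}]}
current = {"behaviors": [{"id": "BHV-001", "score": 0.60},
                         {"id": "BHV-002", "score": 0.72}]}
flagged = regressions(baseline, current)  # only BHV-001 dropped
```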
profile
Generate capability profiles from one baseline file or a directory of baselines.
janus-labs profile --baseline data/baselines/baseline.json
janus-labs profile --baselines-dir data/baselines
janus-labs profile --baselines-dir data/baselines --json
janus-labs profile --leaderboard
janus-labs profile --reliability
diagnose
Measure optional instruction resilience by comparing configured results against vanilla baselines.
janus-labs diagnose result.json
smoke-test
Quick validation path using mock data. This does not benchmark your real config.
janus-labs smoke-test --suite refactor-storm
baseline
Inspect or update baseline files.
janus-labs baseline list
janus-labs baseline show baseline.json
janus-labs baseline update result.json -o baseline.json
Built-In Suite
Refactor Storm (v2.0.0)
refactor-storm ships 4 built-in behaviors grouped into 2 active axes:
- BHV-001 Test Cheating Detection -> Code Quality
- BHV-002 Reduce Cyclomatic Complexity -> Code Quality
- BHV-003 Add Comprehensive Error Handling -> Error Resilience
- BHV-004 Loop Detection Compliance -> Error Resilience
Composite score = avg(Code Quality, Error Resilience).
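As a worked example of the formula above, with invented behavior scores and assuming each axis is the simple mean of its two behaviors:

```python
# Composite score = avg(Code Quality, Error Resilience).
# The per-behavior scores below are invented for illustration.
def mean(xs):
    return sum(xs) / len(xs)

code_quality = mean([0.9, 0.7])       # BHV-001, BHV-002
error_resilience = mean([0.8, 0.6])   # BHV-003, BHV-004
composite = mean([code_quality, error_resilience])
```

Note that Instruction Resilience does not enter this average; it is reported separately by diagnose.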
janus-labs diagnose can compute optional Instruction Resilience separately by comparing configured runs against vanilla baselines. It is not part of the 2-axis composite.
GitHub Actions
- name: Install Janus Labs
  run: pip install janus-labs
- name: Run mock suite
  run: janus-labs run --suite refactor-storm --mock --no-interactive -o current.json
- name: Compare to baseline
  run: janus-labs compare baseline.json current.json --format github
Requirements
- Python 3.12+
- Core dependencies include DeepEval, GitPython, PyYAML, and Pydantic
Phoenix telemetry is optional and requires Python <3.14:
pip install -r requirements-phoenix.txt
Contributing
See CONTRIBUTING.md.
License
Apache 2.0. See LICENSE.
File details
Details for the file janus_labs-1.1.6.tar.gz.
File metadata
- Download URL: janus_labs-1.1.6.tar.gz
- Upload date:
- Size: 218.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed061b3140c7bbf7c3c38d8c61c3c74348a96814e979a0b54d187893747fb06e |
| MD5 | a624e4a23edb72b24a7ae88c47f72d1e |
| BLAKE2b-256 | 2a5a801bb2490401dd74435a32750121671c8aa30f46f64e895df76b3423313d |
File details
Details for the file janus_labs-1.1.6-py3-none-any.whl.
File metadata
- Download URL: janus_labs-1.1.6-py3-none-any.whl
- Upload date:
- Size: 210.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed264d19583ffe9a1b6d055ff360955bb5d3a4dee3122db0db3d5f2112cf4557 |
| MD5 | 6bbd47c92714886aa0b2352454a0a4d4 |
| BLAKE2b-256 | 6d616bda36a73c1e96925f0683a415fa607146246d709076d69372c5a96b3202 |