Janus Labs
3DMark for AI Agents. Profile AI coding agents across code quality, error resilience, and instruction resilience.
What Janus Labs Does
Janus Labs benchmarks real coding-agent behavior on reproducible tasks. Instead of reducing everything to one score, it produces a capability profile so you can see where an agent is strong, weak, or uneven.
- Multi-axis profiling across code quality, error resilience, and instruction resilience
- Reproducible task workspaces with real source files, tests, and git diffs
- Public leaderboard and shareable result pages
- Baseline comparison against precomputed agent/model runs
Public leaderboard: https://fulfilling-courtesy-production-9c2c.up.railway.app
Quick Start
Install
pip install janus-labs
janus-labs --version
If janus-labs is not on your PATH, use:
python -m janus_labs --version
Benchmark Your Agent
Janus Labs currently initializes a full suite workspace. Each behavior gets its own git-initialized subdirectory.
# 1. Generate the suite workspaces
janus-labs init --suite refactor-storm --output ./janus-task
That creates a structure like:
janus-task/
BHV-001-test-cheating/
BHV-002-refactor-complexity/
BHV-003-error-handling/
...
# 2. Pick one behavior workspace and let your agent work inside it
cd janus-task/BHV-001-test-cheating
# 3. Check workspace state
janus-labs status --workspace .
# 4. Score the completed task
janus-labs score --workspace . --output result.json
# 5. Submit to the leaderboard (optional)
janus-labs submit result.json --github your-handle
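If you want to score every behavior workspace in one pass, a minimal Python sketch follows. The directory layout is the one `init` creates above; the actual `score` invocation is shown commented out and assumes `janus-labs` is on your PATH:

```python
import tempfile
from pathlib import Path

def behavior_dirs(root: Path) -> list[Path]:
    """Return every behavior workspace inside a suite directory."""
    return sorted(p for p in root.iterdir() if p.is_dir())

# Demo against a throwaway tree mirroring the layout `init` generates.
root = Path(tempfile.mkdtemp())
for name in ["BHV-002-refactor-complexity", "BHV-001-test-cheating"]:
    (root / name).mkdir()

for ws in behavior_dirs(root):
    print(ws.name)
    # With janus-labs installed, each workspace could then be scored:
    # import subprocess
    # subprocess.run(["janus-labs", "score", "--workspace", str(ws),
    #                 "--output", str(ws / "result.json")], check=True)
```

Each workspace is an independent git repository, so scoring one does not affect the others.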
Compare Against Baselines
# Generate capability profiles from bundled baseline results
janus-labs profile --baselines-dir data/baselines
# Compare your result against an auto-selected vanilla baseline
janus-labs compare result.json --auto-baseline
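Results are plain JSON files, so a quick script-level diff of two results is also possible. The sketch below writes two hypothetical payloads and compares them; the `score` field name is an assumption for illustration, not the documented schema — inspect your own result.json for the real field names:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical payloads written to a scratch directory; the real schema is
# whatever `janus-labs score` emits.
tmp = Path(tempfile.mkdtemp())
(tmp / "baseline.json").write_text(json.dumps({"score": 0.71}))
(tmp / "result.json").write_text(json.dumps({"score": 0.64}))

baseline = json.loads((tmp / "baseline.json").read_text())
current = json.loads((tmp / "result.json").read_text())
delta = current["score"] - baseline["score"]
print(f"delta: {delta:+.2f}")  # a negative delta would suggest a regression
```

For anything beyond a quick check, prefer `janus-labs compare`, which knows the real result schema.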
Install From Source
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e .
CLI Reference
All commands can be run as:
janus-labs <command>
janus <command>
python -m janus_labs <command>
Global
janus-labs --help
janus-labs --version
janus-labs
Running janus-labs with no arguments opens the interactive menu.
init
Initialize workspaces for every behavior in a suite.
janus-labs init --suite refactor-storm --output ./janus-task
Options:
--suite: suite ID, default refactor-storm
--output, -o: output directory for generated behavior workspaces
status
Inspect a task workspace and get the recommended next step.
janus-labs status --workspace ./janus-task/BHV-001-test-cheating
score
Score one completed task workspace.
janus-labs score --workspace ./janus-task/BHV-001-test-cheating --output result.json
Options:
--workspace, -w: workspace path, default current directory
--output, -o: output file, default result.json
--judge: enable LLM-as-judge scoring
--model: judge model, default gpt-4o
--bundle: optional bundle file for judge scoring
--agent: override detected agent identifier
--agent-model: override detected agent model
submit
Submit a scored result to the public leaderboard.
janus-labs submit result.json --github your-handle
Options:
--dry-run: print the payload without submitting
--github: GitHub handle for attribution
compare
Detect regressions between two results, or compare a result against a precomputed vanilla baseline.
janus-labs compare baseline.json current.json
janus-labs compare result.json --auto-baseline
profile
Generate capability profiles from one baseline file or a directory of baselines.
janus-labs profile --baseline data/baselines/baseline.json
janus-labs profile --baselines-dir data/baselines
janus-labs profile --baselines-dir data/baselines --json
janus-labs profile --leaderboard
janus-labs profile --reliability
run
Run an end-to-end suite directly.
janus-labs run --suite refactor-storm --mock
smoke-test
Quick validation using mock data; it does not benchmark your real configuration.
janus-labs smoke-test --suite refactor-storm
diagnose
Analyze instruction resilience: compare configured scores against vanilla baselines to measure the impact of instruction files.
janus-labs diagnose result.json
baseline
Inspect or update baseline files.
janus-labs baseline list
janus-labs baseline show baseline.json
janus-labs baseline update result.json -o baseline.json
Built-In Suite
Refactor Storm (v1.6.0)
refactor-storm currently ships 10 built-in behaviors:
- BHV-001: Test Cheating Detection
- BHV-002: Refactor Complexity
- BHV-003: Error Handling
- BHV-004: Loop Detection
- BHV-005: Context Retention
- BHV-008: Error Propagation Chain
- BHV-009: Cross-Module Extract Method
- BHV-010: Integration Contract Change
- O-2.01: Instruction Adherence
- O-3.01: Code Quality
The capability profile uses a 3-axis radar (Code Quality, Error Resilience, Instruction Resilience). Tier 2 multi-file behaviors (BHV-008/009/010) ship with the suite but are excluded from the profile axes pending further calibration.
GitHub Actions
- name: Install Janus Labs
  run: pip install janus-labs
- name: Run mock suite
  run: janus-labs run --suite refactor-storm --mock --no-interactive -o current.json
- name: Compare to baseline
  run: janus-labs compare baseline.json current.json --format github
Requirements
- Python 3.12+
- Core dependencies include DeepEval, GitPython, PyYAML, and Pydantic
Phoenix telemetry is optional and requires Python <3.14:
pip install -r requirements-phoenix.txt
Contributing
See CONTRIBUTING.md.
License
Apache 2.0. See LICENSE.
File details
Details for the file janus_labs-1.0.0.tar.gz.
File metadata
- Download URL: janus_labs-1.0.0.tar.gz
- Upload date:
- Size: 194.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 616ee3b2efed3283e13c739d2bbcd9561467b7e8bd2df7408d79db784e14bdf9 |
| MD5 | db7014836169725899f1dcc383f48b19 |
| BLAKE2b-256 | b91cec9cf925be2e3cdf27ec810b1e68b0dfe18109e132544dea78287a0b2080 |
File details
Details for the file janus_labs-1.0.0-py3-none-any.whl.
File metadata
- Download URL: janus_labs-1.0.0-py3-none-any.whl
- Upload date:
- Size: 200.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 43761693bc7437c2b9b66cdd7f9dd51e6bc8369ca76d6351ee9b4041812ce47f |
| MD5 | e7b50e998581a98f60f31a2502d4e4c7 |
| BLAKE2b-256 | 871e9bdf0c284986c5064649241fc261fd1271fdb7a9a827b55ed7a78bffabc8 |