Benchmark governed vs raw Claude on your codebase
axor-benchmarks
Benchmark governed (axor) vs raw Claude on your codebase.
Measures real token savings, latency, and federation across 4 benchmark suites on any Python project.
Safety note: the raw baseline is intentionally ungoverned. It can write files and run shell commands while benchmarking. Use --no-raw outside disposable workspaces or trusted repositories.
Installation
pip install axor-benchmarks
Quick Start
cd ~/my-project
axor-bench
Runs the quick suite (1 small task, ~30s). Use --suite full for the complete benchmark.
Authentication
Priority order (highest to lowest):
1. --api-key sk-ant-... flag
2. ANTHROPIC_API_KEY env var
3. ~/.axor/config.toml (set via axor claude → /auth)
# Use env var
ANTHROPIC_API_KEY=sk-ant-... axor-bench
# Use flag (not saved)
axor-bench --api-key sk-ant-...
# Use saved key from axor-cli
axor claude # → /auth → saves to ~/.axor/config.toml
axor-bench # reads automatically
Suites
| Suite | Tasks | What it measures |
|---|---|---|
| quick | 1 task | Fast sanity check (~30s) |
| small | 3 tasks | Single-turn focused tasks |
| large | 2 tasks | Multi-tool, multi-step tasks |
| conversation | 1 × 10 turns | Context growth over long sessions |
| federation | 1 task | Child agent spawning + isolation |
| full | all | Complete benchmark (~5-10 min) |
axor-bench --suite small # fast
axor-bench --suite full # complete
axor-bench --suite conversation # test context compression
axor-bench --suite federation # test child agents
Options
axor-bench [options]
--api-key KEY Anthropic API key
--repo PATH Repo to benchmark (default: current dir)
--file PATH Specific file to use as context
--suite SUITE quick | small | large | conversation | federation | full
--model MODEL Model ID for both runners (default: claude-sonnet-4-5)
--no-raw Skip raw Claude baseline (governed only)
--delay SECONDS Pause between tasks to avoid rate limits (default: 0)
--output FORMAT table (default) | json
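For scripted runs (CI, nightly jobs), the options above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of axor-benchmarks; it only builds an argument list from the flags documented above and does not run anything.

```python
import shlex

SUITES = ("quick", "small", "large", "conversation", "federation", "full")
OUTPUT_FORMATS = ("table", "json")


def build_command(suite, repo=None, no_raw=False, delay=0, output="table"):
    """Build an axor-bench argv list using only the documented flags."""
    if suite not in SUITES:
        raise ValueError(f"unknown suite: {suite!r}")
    if output not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output!r}")

    cmd = ["axor-bench", "--suite", suite]
    if repo is not None:
        cmd += ["--repo", str(repo)]
    if no_raw:
        cmd.append("--no-raw")       # skip the ungoverned baseline
    if delay:
        cmd += ["--delay", str(delay)]
    if output != "table":            # table is the default, so omit it
        cmd += ["--output", output]
    return cmd


# Print a shell-safe command line for inspection before running it.
print(shlex.join(build_command("full", repo="my-project", no_raw=True, delay=2, output="json")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the command looks right. The API key is deliberately left to the environment variable or saved config so it does not appear in process listings.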
Results (claude-sonnet-4-5, full suite)
Benchmark ran against axor-cli/axor_cli/auth.py (~340 LOC) from the axor monorepo.
| Task | Suite | Raw tokens | Governed | Savings | Policy |
|---|---|---|---|---|---|
| write_test | small | 8,693 | 10,855 | -24.9% | focused_generative |
| explain_function | small | 3,022 | 2,901 | +4.0% | focused_readonly |
| find_bugs | small | 3,265 | 3,251 | +0.4% | focused_generative |
| refactor_module | large | 97,370 | 66,005 | +32.2% | moderate_mutative |
| add_error_handling | large | 19,663 | 19,391 | +1.4% | focused_generative |
| iterative_review | conversation (10 turns) | 117,465 | 70,298 | +40.2% | focused_generative |
| parallel_analysis | federation | — | 17,698 | — | preset:federated |
| TOTAL (excluding federation) | | 249,478 | 172,701 | +30.8% | |
Key insights:
- 30.8% total token reduction (249K → 173K tokens)
- 40.2% savings on multi-turn conversation — context compression effect grows with session length
- 32.2% savings on large refactoring — caching and dedup reduce repeated file reads
- Small single-turn tasks show near-zero or negative savings — governance overhead > compression benefit on short tasks
- Federation task (17.7K tokens) ran with the federated policy preset
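Small tasks (write_test, explain_function, find_bugs) show minimal or negative savings because the context is not yet large enough for compression to outweigh governance overhead. In this run, substantial savings appeared only on the two largest tasks (refactor_module at about 97K raw tokens and iterative_review at about 117K); add_error_handling, at about 20K raw tokens, saved just 1.4%.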
What is measured
Raw Claude — direct Anthropic API call with no governance:
- Full conversation history passed every turn
- No context compression
- No policy selection
- No tool governance
Governed (axor) — same task via GovernedSession:
- Dynamic policy based on task (focused_readonly, moderate_mutative, etc.)
- Context shaped and compressed per turn
- Waste elimination (dedup, error collapse, prose summarization)
- Session-scoped cache (no re-reading same file twice)
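To see why resending the full history matters more as a session gets longer, here is a toy model. It is not axor's compression algorithm: the fixed per-turn size and the hard `context_budget` cap are simplifying assumptions chosen only to show the shape of the growth.

```python
def cumulative_input_tokens(turns, tokens_per_turn, context_budget=None):
    """Sum the input tokens sent over a whole session.

    Without a budget, every turn resends the entire history, so the
    per-turn input grows linearly and the session total quadratically.
    With a budget, the per-turn input is capped (a stand-in for
    compression), so the total grows roughly linearly.
    """
    total = 0
    for turn in range(1, turns + 1):
        history = turn * tokens_per_turn            # everything said so far
        if context_budget is not None:
            history = min(history, context_budget)  # capped, "compressed" view
        total += history
    return total


# Hypothetical numbers: 10 turns, 1,000 new tokens per turn, 4,000-token cap.
raw = cumulative_input_tokens(10, 1000)
capped = cumulative_input_tokens(10, 1000, context_budget=4000)
print(f"raw={raw} capped={capped}")
```

In this model the gap between the two totals widens with every extra turn, which matches the direction of the conversation-suite result, though the real numbers depend on the task and policy.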
Token savings = (raw - governed) / raw × 100%
Positive = governed uses fewer tokens (expected for most tasks). Negative = governed uses more (possible for very simple tasks where overhead > savings).
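As a worked check, the formula can be applied to the raw and governed counts from the results table. The function and variable names below are illustrative; the token counts are copied from the table, and the federation row is left out because it has no raw baseline.

```python
def token_savings(raw, governed):
    """Percentage of tokens saved by the governed run relative to raw."""
    return (raw - governed) / raw * 100


# (raw tokens, governed tokens) per task, from the results table.
RESULTS = {
    "write_test": (8_693, 10_855),
    "explain_function": (3_022, 2_901),
    "find_bugs": (3_265, 3_251),
    "refactor_module": (97_370, 66_005),
    "add_error_handling": (19_663, 19_391),
    "iterative_review": (117_465, 70_298),
}

for task, (raw, governed) in RESULTS.items():
    print(f"{task:20s} {token_savings(raw, governed):+.1f}%")

total_raw = sum(raw for raw, _ in RESULTS.values())
total_governed = sum(governed for _, governed in RESULTS.values())
print(f"{'TOTAL':20s} {token_savings(total_raw, total_governed):+.1f}%")  # +30.8%
```

Note that the total is computed from summed tokens, not by averaging the per-task percentages, so the two largest tasks dominate it.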
Requirements
- Python 3.11+
- axor-core >= 0.5.0
- axor-claude >= 0.2.0
- anthropic >= 0.40.0
Ecosystem
| Package | Role |
|---|---|
| axor-core | Governance kernel (what this package benchmarks) |
| axor-claude | Claude adapter, used for both raw and governed runs |
| axor-cli | Governed terminal runtime |
| axor-langchain | LangChain governance middleware |
| axor-classifier-simple | ML task signal derivation (optional) |
| axor-memory-sqlite | Cross-session memory (SQLite) |
| axor-telemetry | Privacy-preserving governance feedback |
License
MIT