Benchmark governed vs raw Claude on your codebase
axor-benchmarks
Benchmark governed (axor) vs raw Claude on your codebase.
Measures real token savings, latency, and federation across 4 benchmark suites on any Python project.
Installation
pip install axor-benchmarks
Quick Start
cd ~/my-project
axor-bench
Output:
axor benchmark results
repo: ~/my-project
file: src/auth.py
task raw tokens governed savings bar policy
─────────────────────────────────────────────────────────────────────────────────
write_test 1,842 1,203 -34.7% ████████░░░░░░░░ focused_generative
explain_function 1,105 891 -19.4% ███░░░░░░░░░░░░░ focused_readonly
find_bugs 1,290 978 -24.2% ████░░░░░░░░░░░░ focused_readonly
─────────────────────────────────────────────────────────────────────────────────
TOTAL 4,237 3,072 -27.5% ████░░░░░░░░░░░░
insights
→ Token reduction: 27.5% (4,237 → 3,072 tokens)
→ Most used policy: focused_readonly (2 tasks)
Authentication
Priority order (highest to lowest):
1. --api-key sk-ant-... flag
2. ANTHROPIC_API_KEY env var
3. ~/.axor/config.toml (set via axor claude → /auth)
# Use env var
ANTHROPIC_API_KEY=sk-ant-... axor-bench
# Use flag (not saved)
axor-bench --api-key sk-ant-...
# Use saved key from axor-cli
axor claude # → /auth → saves to ~/.axor/config.toml
axor-bench # reads automatically
Suites
| Suite | Tasks | What it measures |
|---|---|---|
| quick | 1 task | Fast sanity check (~30s) |
| small | 3 tasks | Single-turn focused tasks |
| large | 2 tasks | Multi-tool, multi-step tasks |
| conversation | 1 × 10 turns | Context growth over long sessions |
| federation | 1 task | Child agent spawning + isolation |
| full | all | Complete benchmark (~5-10 min) |
axor-bench --suite small # fast
axor-bench --suite full # complete
axor-bench --suite conversation # test context compression
axor-bench --suite federation # test child agents
Options
axor-bench [options]
--api-key KEY Anthropic API key
--repo PATH Repo to benchmark (default: current dir)
--file PATH Specific file to use as context
--suite SUITE quick | small | large | conversation | federation | full
--no-raw Skip raw Claude baseline (governed only)
--output FORMAT table (default) | json
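For scripted runs, the options above can be assembled from Python. The sketch below uses only the flags listed in this section; the helper names `build_command` and `run_benchmark` are hypothetical, and it assumes that `--output json` writes a single JSON document to stdout. The JSON schema is not described here, so the result is returned as-is.

```python
import json
import subprocess

SUITES = {"quick", "small", "large", "conversation", "federation", "full"}
FORMATS = {"table", "json"}


def build_command(suite="small", repo=None, file=None, no_raw=False, output="json"):
    """Build the axor-bench argument list from the documented options."""
    if suite not in SUITES:
        raise ValueError(f"unknown suite: {suite!r}")
    if output not in FORMATS:
        raise ValueError(f"unknown output format: {output!r}")

    cmd = ["axor-bench", "--suite", suite, "--output", output]
    if repo is not None:
        cmd += ["--repo", str(repo)]
    if file is not None:
        cmd += ["--file", str(file)]
    if no_raw:
        cmd.append("--no-raw")
    return cmd


def run_benchmark(**options):
    """Run axor-bench and parse stdout as JSON (assumes pure JSON on stdout)."""
    options["output"] = "json"
    result = subprocess.run(
        build_command(**options), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

The API key is deliberately left out of the argument list, so it is picked up from the environment or the saved config instead of appearing in process listings.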
What is measured
Raw Claude — direct Anthropic API call with no governance:
- Full conversation history passed every turn
- No context compression
- No policy selection
- No tool governance
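To see why resending the full history matters, here is a small arithmetic sketch. The per-turn token counts are illustrative, not measurements, and each figure is assumed to cover everything a turn adds to the history (the user message plus the reply that follows).

```python
def cumulative_input_tokens(turn_tokens):
    """Total input tokens when every request resends all earlier turns."""
    total = 0
    history = 0
    for tokens in turn_tokens:
        history += tokens  # the new turn joins the history
        total += history   # the whole history is sent with this request
    return total


# Ten turns of 200 tokens each (illustrative numbers only).
print(cumulative_input_tokens([200] * 10))  # 11000
```

Ten turns that each add 200 tokens cost 11,000 input tokens in total rather than 2,000, because the total grows roughly with the square of the number of turns.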
Governed (axor) — same task via GovernedSession:
- Dynamic policy based on task (focused_readonly, moderate_mutative, etc.)
- Context shaped and compressed per turn
- Waste elimination (dedup, error collapse, prose summarization)
- Session-scoped cache (no re-reading same file twice)
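The idea behind a session-scoped cache can be sketched as a read-through cache keyed by resolved path. This is a hypothetical illustration of the concept, not axor's implementation; the class name `SessionFileCache` and its `disk_reads` counter are invented for the example.

```python
from pathlib import Path


class SessionFileCache:
    """Hypothetical read-through file cache that lives for one session."""

    def __init__(self):
        self._contents = {}
        self.disk_reads = 0  # counts how often the disk was actually hit

    def read(self, path):
        key = Path(path).resolve()
        if key not in self._contents:
            # First request for this file in the session: read and remember it.
            self._contents[key] = key.read_text(encoding="utf-8")
            self.disk_reads += 1
        return self._contents[key]
```

Because the cache object is created per session and discarded afterwards, stale contents cannot leak from one session into the next.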
Token savings = (raw - governed) / raw × 100%
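A positive value means governed used fewer tokens (expected for most tasks). A negative value means governed used more, which can happen on very simple tasks where governance overhead exceeds the savings. Note that the savings column in the sample output above prints reductions with a leading minus sign, so -34.7% there corresponds to a 34.7% saving under this formula.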
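As a worked check, the formula can be applied to the token counts from the sample output in Quick Start. The function name `token_savings` is just for illustration.

```python
def token_savings(raw, governed):
    """Percentage of tokens saved by the governed run relative to raw."""
    if raw <= 0:
        raise ValueError("raw token count must be positive")
    return (raw - governed) / raw * 100


# (raw, governed) token counts taken from the sample output.
rows = {
    "write_test": (1842, 1203),
    "explain_function": (1105, 891),
    "find_bugs": (1290, 978),
}

for task, (raw, governed) in rows.items():
    print(f"{task}: {token_savings(raw, governed):.1f}%")

raw_total = sum(raw for raw, _ in rows.values())
governed_total = sum(governed for _, governed in rows.values())
print(f"TOTAL: {token_savings(raw_total, governed_total):.1f}%")
```

This prints 34.7%, 19.4%, 24.2%, and a total of 27.5%, matching the magnitudes in the sample table.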
Requirements
- Python 3.11+
- axor-core >= 0.1.0
- axor-claude >= 0.1.0
- anthropic >= 0.40.0
License
MIT