axor-benchmarks

Benchmark governed (axor) vs raw Claude on your codebase.

Measures real token savings, latency, and federation across 4 benchmark suites on any Python project.

Safety note: the raw baseline is intentionally ungoverned. It can write files and run shell commands while benchmarking. Pass --no-raw unless you are running in a disposable workspace or a trusted repository.


Installation

pip install axor-benchmarks

Quick Start

cd ~/my-project
axor-bench

Runs the quick suite (1 small task, ~30s). Use --suite full for the complete benchmark.


Authentication

Priority order (highest to lowest):

  1. --api-key sk-ant-... flag
  2. ANTHROPIC_API_KEY env var
  3. ~/.axor/config.toml (set via axor claude → /auth)

# Use env var
ANTHROPIC_API_KEY=sk-ant-... axor-bench

# Use flag (not saved)
axor-bench --api-key sk-ant-...

# Use saved key from axor-cli
axor claude    # → /auth  →  saves to ~/.axor/config.toml
axor-bench     # reads automatically
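
The priority order above can be sketched in Python as a simple fall-through lookup. This is a minimal illustration, not the package's actual implementation: the function name `resolve_api_key` and the `api_key` field inside `config.toml` are assumptions, since the source does not describe the file's layout.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CONFIG = Path.home() / ".axor" / "config.toml"


def resolve_api_key(
    flag_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config_path: Path = DEFAULT_CONFIG,
) -> Optional[str]:
    """Return the first key found, checking flag, then env var, then config file."""
    env = os.environ if env is None else env

    # 1. An explicit --api-key value wins.
    if flag_value:
        return flag_value

    # 2. Otherwise fall back to the ANTHROPIC_API_KEY environment variable.
    env_value = env.get("ANTHROPIC_API_KEY")
    if env_value:
        return env_value

    # 3. Finally, look in the saved config file.
    if config_path.is_file():
        import tomllib  # standard library from Python 3.11 onward

        with config_path.open("rb") as fh:
            data = tomllib.load(fh)
        # "api_key" is a hypothetical field name used only for this sketch.
        return data.get("api_key")

    return None
```

Passing `env` and `config_path` explicitly keeps the lookup easy to exercise without touching real credentials.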

Suites

| Suite        | Tasks        | What it measures                  |
|--------------|--------------|-----------------------------------|
| quick        | 1 task       | Fast sanity check (~30s)          |
| small        | 3 tasks      | Single-turn focused tasks         |
| large        | 2 tasks      | Multi-tool, multi-step tasks      |
| conversation | 1 × 10 turns | Context growth over long sessions |
| federation   | 1 task       | Child agent spawning + isolation  |
| full         | all          | Complete benchmark (~5-10 min)    |

axor-bench --suite small          # fast
axor-bench --suite full           # complete
axor-bench --suite conversation   # test context compression
axor-bench --suite federation     # test child agents

Options

axor-bench [options]

  --api-key KEY       Anthropic API key
  --repo PATH         Repo to benchmark (default: current dir)
  --file PATH         Specific file to use as context
  --suite SUITE       quick | small | large | conversation | federation | full
  --model MODEL       Model ID for both runners (default: claude-sonnet-4-5)
  --no-raw            Skip raw Claude baseline (governed only)
  --delay SECONDS     Pause between tasks to avoid rate limits (default: 0)
  --output FORMAT     table (default) | json
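
For scripted runs, the flags above can be assembled programmatically. The sketch below is one possible wrapper: `build_command` and `run_benchmark` are hypothetical helper names, and the only assumption about `--output json` is that the command prints a single JSON document to stdout (the schema is not documented here, so the result is returned unparsed beyond `json.loads`).

```python
import json
import subprocess
from typing import Any, List, Optional


def build_command(
    suite: str = "quick",
    repo: Optional[str] = None,
    model: Optional[str] = None,
    no_raw: bool = False,
    delay: Optional[float] = None,
    output: str = "json",
) -> List[str]:
    """Build an axor-bench argument list from the documented options."""
    cmd = ["axor-bench", "--suite", suite, "--output", output]
    if repo is not None:
        cmd += ["--repo", repo]
    if model is not None:
        cmd += ["--model", model]
    if no_raw:
        # Skip the ungoverned baseline, e.g. in repositories you do not trust.
        cmd.append("--no-raw")
    if delay is not None:
        cmd += ["--delay", str(delay)]
    return cmd


def run_benchmark(**options: Any) -> Any:
    """Run axor-bench and parse its stdout as JSON (assumed format)."""
    proc = subprocess.run(
        build_command(**options), capture_output=True, text=True, check=True
    )
    return json.loads(proc.stdout)
```

For example, `run_benchmark(suite="small", no_raw=True, delay=2)` would invoke `axor-bench --suite small --output json --no-raw --delay 2`.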

Results (claude-sonnet-4-5, full suite)

Benchmark ran against axor-cli/axor_cli/auth.py (~340 LOC) from the axor monorepo.

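| Task               | Suite                   | Raw tokens | Governed | Savings | Policy             |
|--------------------|-------------------------|-----------:|---------:|--------:|--------------------|
| write_test         | small                   |      8,693 |   10,855 |  -24.9% | focused_generative |
| explain_function   | small                   |      3,022 |    2,901 |   +4.0% | focused_readonly   |
| find_bugs          | small                   |      3,265 |    3,251 |   +0.4% | focused_generative |
| refactor_module    | large                   |     97,370 |   66,005 |  +32.2% | moderate_mutative  |
| add_error_handling | large                   |     19,663 |   19,391 |   +1.4% | focused_generative |
| iterative_review   | conversation (10 turns) |    117,465 |   70,298 |  +40.2% | focused_generative |
| parallel_analysis  | federation              |            |   17,698 |         | preset:federated   |
| TOTAL              |                         |    249,478 |  172,701 |  +30.8% |                    |

parallel_analysis reports only one token count (listed here under Governed, since it ran with a policy preset) and is not included in the TOTAL row, which sums the six tasks that have both a raw and a governed figure.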

Key insights:

  • 30.8% total token reduction (249K → 173K tokens)
  • 40.2% savings on multi-turn conversation — context compression effect grows with session length
  • 32.2% savings on large refactoring — caching and dedup reduce repeated file reads
  • Small single-turn tasks show near-zero or negative savings — governance overhead > compression benefit on short tasks
  • Federation task (17.7K tokens) ran with federated policy preset

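Small tasks (write_test, explain_function, find_bugs) show minimal or negative savings because the context is not yet large enough for compression to outweigh governance overhead. In this run, savings became significant only on the largest workloads: refactor_module (about 97K raw tokens) and iterative_review (about 117K raw tokens). add_error_handling, at about 20K raw tokens, still saved only 1.4%.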


What is measured

Raw Claude — direct Anthropic API call with no governance:

  • Full conversation history passed every turn
  • No context compression
  • No policy selection
  • No tool governance
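
To see why passing the full history every turn becomes expensive, consider the toy model below. It is only an illustration under simplifying assumptions (every turn adds the same number of tokens, output tokens are ignored), and the `cap` parameter is a hypothetical stand-in for any scheme that bounds per-turn context; it does not describe how axor actually compresses context.

```python
from typing import Optional


def cumulative_input_tokens(
    turns: int, tokens_per_turn: int, cap: Optional[int] = None
) -> int:
    """Total input tokens sent over a session.

    cap=None models the raw baseline: the whole history is resent each turn.
    cap=N models a hypothetical bounded context of at most N tokens per turn.
    """
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn  # the conversation grows by one turn
        sent = history if cap is None else min(history, cap)
        total += sent
    return total


full = cumulative_input_tokens(turns=10, tokens_per_turn=1000)
bounded = cumulative_input_tokens(turns=10, tokens_per_turn=1000, cap=3000)
print(full, bounded)  # 55000 27000
```

With unbounded history the total grows roughly with the square of the turn count, which is why the gap between the two numbers widens as sessions get longer.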

Governed (axor) — same task via GovernedSession:

  • Dynamic policy based on task (focused_readonly, moderate_mutative, etc.)
  • Context shaped and compressed per turn
  • Waste elimination (dedup, error collapse, prose summarization)
  • Session-scoped cache (no re-reading same file twice)

Token savings = (raw - governed) / raw × 100%

Positive = governed uses fewer tokens (expected for most tasks). Negative = governed uses more (possible for very simple tasks where overhead > savings).
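
As a quick check, the formula can be applied to the figures in the results table. The helper name `token_savings` is chosen for this sketch only; the token counts are copied from the table above.

```python
def token_savings(raw: int, governed: int) -> float:
    """Percentage of tokens saved by the governed run relative to raw."""
    if raw <= 0:
        raise ValueError("raw token count must be positive")
    return (raw - governed) / raw * 100


# (raw, governed) token counts from the results table.
RESULTS = {
    "write_test": (8_693, 10_855),
    "explain_function": (3_022, 2_901),
    "find_bugs": (3_265, 3_251),
    "refactor_module": (97_370, 66_005),
    "add_error_handling": (19_663, 19_391),
    "iterative_review": (117_465, 70_298),
}

for task, (raw, governed) in RESULTS.items():
    print(f"{task:20s} {token_savings(raw, governed):+.1f}%")

total_raw = sum(raw for raw, _ in RESULTS.values())
total_governed = sum(governed for _, governed in RESULTS.values())
print(f"{'TOTAL':20s} {token_savings(total_raw, total_governed):+.1f}%")
```

The total is computed from summed token counts rather than by averaging the per-task percentages, so large tasks dominate the overall figure.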


Requirements


Ecosystem

| Package                | Role                                                  |
|------------------------|-------------------------------------------------------|
| axor-core              | Governance kernel (what this package benchmarks)      |
| axor-claude            | Claude adapter (used for both raw and governed runs)  |
| axor-cli               | Governed terminal runtime                             |
| axor-langchain         | LangChain governance middleware                       |
| axor-classifier-simple | ML task signal derivation (optional)                  |
| axor-memory-sqlite     | Cross-session memory (SQLite)                         |
| axor-telemetry         | Privacy-preserving governance feedback                |

License

MIT
