
Adaptive test-time-compute routing for LLM reasoning: cheap samples first, escalate to native thinking only on disagreement.

This project has been archived: the maintainers have marked it as archived, and no new releases are expected.


ZEN — Sample · Agree · Escalate


Adaptive test-time compute for LLMs: stop paying for thinking your model doesn't need.

ZEN routes every request through the cheapest path that can solve it. It draws two cheap (non-thinking) samples first and only escalates to expensive native thinking when they disagree — using disagreement as a free difficulty signal. On hard benchmarks it matches or beats native-thinking accuracy at lower token cost; on easy traffic it answers for a fraction of the price.

pip install zen-router

from zen import ZenGateway

gw = ZenGateway()                       # any OpenAI-compatible endpoint
resp = gw.route("What is 15% of 240?")  # classifies, routes, answers

print(resp.text)     # "36"
print(resp.path)     # "consensus"  (solved cheaply — no thinking spent)
print(resp.tokens)   # ~220

No training. No GPU. No logprobs required. Provider-agnostic.

Why

Thinking/reasoning modes are powerful and expensive, and they charge you for every hidden reasoning token. A trivial question costs roughly 10× more with thinking enabled (we measured 6 vs 66 completion tokens for 24*17, an 11× gap). Chat apps today either burn that on every message or make the user toggle thinking by hand. ZEN makes the decision per request, automatically, with an auditable token account for every call.

How it works

request ──> 2 cheap samples (parallel, thinking off)
                │
       agree? ──┴── yes ──> answer                     (~4k tokens on AIME)
                │
                no   (= this one is actually hard)
                │
                ▼
       3rd cheap sample + 1 native-thinking sample     (parallel)
                │
                ▼
       weighted vote {cheap ×1 each, native ×2} ──> answer   (~19k tokens)
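
The flow in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not ZEN's actual implementation: `sample_agree_escalate`, the `Sampler` type, and the `"escalated"` path label are hypothetical names, the samplers are assumed to return already-parsed answers, the calls run sequentially rather than in parallel, and the tie-break in favour of the native answer is an assumption the README does not state.

```python
from collections import Counter
from typing import Callable, Tuple

# Hypothetical sampler type: takes a prompt, returns a parsed answer string.
Sampler = Callable[[str], str]


def sample_agree_escalate(question: str, cheap: Sampler, native: Sampler,
                          native_weight: int = 2) -> Tuple[str, str]:
    """Return (answer, path) for one question."""
    # Stage 1: two cheap samples; agreement ends the request.
    first, second = cheap(question), cheap(question)
    if first == second:
        return first, "consensus"

    # Stage 2: disagreement signals difficulty, so add a third cheap
    # sample and one native-thinking sample.
    third = cheap(question)
    native_answer = native(question)

    # Weighted vote: each cheap sample counts 1, the native sample counts more.
    votes = Counter([first, second, third])
    votes[native_answer] += native_weight

    # Assumed tie-break: prefer the native answer when vote totals are equal.
    best = max(votes, key=lambda ans: (votes[ans], ans == native_answer))
    return best, "escalated"


if __name__ == "__main__":
    cheap_answers = iter(["70", "73", "70"])
    answer, path = sample_agree_escalate(
        "toy question",
        cheap=lambda q: next(cheap_answers),
        native=lambda q: "73",
    )
    print(answer, path)  # 73 escalated
```

In the demo the cheap samples split 2 to 1 for "70", but the native sample's weight of 2 gives "73" three votes, so the native answer wins.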

Benchmarks

DeepSeek-V4-Flash via OpenRouter, single-sample protocol, tokens counted across all calls each method makes. N=30 per benchmark — treat ±9pp as noise. Full tables and methodology: docs/RESULTS.md.
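
One plausible way to read the ±9pp figure (the README does not say how it was derived) is as the binomial standard error of an accuracy measured on 30 problems. The helper name below is hypothetical; the formula is the standard one for a proportion.

```python
import math


def binomial_standard_error(accuracy: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


if __name__ == "__main__":
    for acc in (0.40, 0.50, 0.567):
        se = binomial_standard_error(acc, 30)
        print(f"accuracy={acc:.3f}  standard error={100 * se:.1f} pp")
```

With N=30 the standard error is about 9 percentage points across the accuracy range in the tables, so differences smaller than that should not be read as wins or losses.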

AIME 2025 (hard for the model — native thinking spends ~15k tokens/problem):

| method | accuracy | mean tokens |
|---|---|---|
| single cheap call | 40.0% | 2.0k |
| self-consistency@10 | 43.3% | 18.2k |
| best-of-5 + LLM judge | 50.0% | 13.9k |
| native thinking (1 call) | 56.7% | 14.6k |
| ZEN (vote) | 56.7–66.7% (2 runs) | 12.8k |

AIME 2024 (easy for the model — native thinking self-regulates to ~9k):

| method | accuracy | mean tokens |
|---|---|---|
| native thinking (1 call) | 76.7% | 9.2k |
| ZEN (vote) | 73.3% | 9.8k |

Honest reading: ZEN wins when the task challenges the model (cuts waste, adds vote robustness). When the task is easy for the model, it is a statistical tie with slight overhead — modern thinking modes already self-regulate. Rule of thumb from the data: ZEN pays off when ≥ ~35–55% of your traffic is cheaply solvable.
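
The trade-off behind that rule of thumb can be made concrete with a two-path cost model. This is a simplified sketch under stated assumptions: every request costs either a fixed consensus price or a fixed escalated price, and the function names are hypothetical. The inputs used in the demo are the rounded figures from the diagram and tables above, so the output is illustrative and is not expected to reproduce the 35–55% range, which comes from the authors' own data.

```python
def expected_tokens(p_consensus: float, consensus_cost: float,
                    escalated_cost: float) -> float:
    """Mean tokens per request when a share p_consensus is solved cheaply."""
    return p_consensus * consensus_cost + (1.0 - p_consensus) * escalated_cost


def break_even_share(native_cost: float, consensus_cost: float,
                     escalated_cost: float) -> float:
    """Share of cheaply solvable traffic at which routing costs the same
    as always calling native thinking once."""
    return (escalated_cost - native_cost) / (escalated_cost - consensus_cost)


if __name__ == "__main__":
    consensus, escalated = 4_000, 19_000  # rounded figures from the diagram
    for label, native in (("AIME 2025", 14_600), ("AIME 2024", 9_200)):
        share = break_even_share(native, consensus, escalated)
        print(f"{label}: break-even consensus share = {share:.0%}")
```

The cheaper native thinking already is on a workload, the larger the share of traffic that has to resolve at the consensus stage before routing saves tokens.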

The three layers

| layer | what it does | use it for |
|---|---|---|
| ZenGateway | classifies each message (chat / question / task) and dispatches the right amount of compute | chat apps, AI workspaces (an automatic thinking mode) |
| ZenRouter | cheap consensus → thinking escalation, for verifiable answers | math, MCQ, facts, extraction, code-with-tests |
| ZenPlanner | plan-and-execute: decompose, run steps with threaded context, synthesize | multi-step tasks, agent pipelines |

ZenGateway — the automatic thinking mode

from zen import ZenGateway

gw = ZenGateway()
gw.route("hey, what do you think about coffee?")   # chat  -> 1 cheap call
gw.route("Which planet has the most moons?")       # question -> consensus route
gw.route("Compare SQLite and PostgreSQL and recommend one.")  # task -> planner
gw.route(msg, kind="question")                     # or force the kind yourself

Tool-safety rule (built in): pass tools_present=True on turns where the model may call tools. The gateway then makes exactly one routed call — it never samples in parallel around side-effectful tools (two samples would run your tools twice). Your agent loop handles the tool cycle around it.
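
The dispatch behaviour described above, including the tool-safety override, can be summarised as a small decision function. This is one reading of the README's description rather than the gateway's real code: `choose_strategy` and the strategy labels are hypothetical.

```python
def choose_strategy(kind: str, tools_present: bool = False) -> str:
    """Pick a compute strategy for one message (labels are illustrative)."""
    # Tool-safety rule: never sample in parallel around side-effectful tools,
    # because each extra sample could trigger the same tool call again.
    if tools_present:
        return "single_call"

    strategies = {
        "chat": "single_call",   # one cheap call
        "question": "consensus", # cheap samples, escalate on disagreement
        "task": "planner",       # plan-and-execute
    }
    if kind not in strategies:
        raise ValueError(f"unknown message kind: {kind!r}")
    return strategies[kind]


if __name__ == "__main__":
    print(choose_strategy("question"))
    print(choose_strategy("question", tools_present=True))
```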

ZenRouter — verifiable Q→A

from zen import ZenRouter

router = ZenRouter()                    # math-style prompt/parser by default
result = router.solve("If 3x + 7 = 22, what is x?")
result.answer, result.tokens, result.path   # 5, ~4k, "consensus"

ZenRouter(
    variant="vote",          # "vote" (validated best) | "eager" | "hybrid"
    temperature=0.7,         # sampling diversity for consensus
    native_weight=2,         # native sample's weight in the final vote
    think_budget=16000,      # completion budget of the native call
    parser=my_extractor,     # swap the answer parser for your domain
    raw_log="samples.jsonl", # dump every sample for offline analysis
)
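
As an illustration of what a domain-specific `parser=` might look like, here is a sketch of a multiple-choice extractor. The README does not document the parser signature, so this assumes a callable that takes the raw model text and returns a normalized answer string, or `None` when nothing can be extracted; `mcq_extractor` is a hypothetical name.

```python
import re
from typing import Optional

# Matches phrases such as "answer is (B)", "Answer: C" or "option D".
# The keywords are case-insensitive; the choice letter must be uppercase
# and must not be the start of a longer word.
_CHOICE = re.compile(
    r"(?i:answer|option)\s*(?i:is)?\s*:?\s*\(?([A-E])\)?(?![A-Za-z])"
)


def mcq_extractor(text: str) -> Optional[str]:
    """Return the last choice letter stated in the text, or None."""
    matches = _CHOICE.findall(text)
    # Take the last match: models often discuss options before committing.
    return matches[-1] if matches else None


if __name__ == "__main__":
    print(mcq_extractor("Option A looks wrong; the answer is D"))  # D
    print(mcq_extractor("I cannot decide."))                       # None
```

Consensus compares parsed answers for equality, so the parser should map equivalent outputs ("(B)", "B.", "answer is B") to the same string.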

ZenPlanner — plan-and-execute with an agent hook

from zen import ZenPlanner

def my_agent_step(step_description, context):
    # run your tool-calling agent (e.g. SIFT) on this step, return text
    return my_agent.run(step_description, context)

planner = ZenPlanner(executor=my_agent_step)   # omit executor = pure reasoning
result = planner.run("Research X, compare with Y, write a recommendation.")
result.text     # final deliverable
result.steps    # [(step, result), ...]

The planner fixes the classic plan-and-execute inefficiency: each step receives the task + plan + clipped results of prior steps — not the full reasoning history — so cost grows linearly, not quadratically. A step that fails is retried once with native thinking (per-step escalation). Plans with fewer than two steps skip orchestration entirely.
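
The context-threading idea can be sketched as a prompt builder that passes each step the task, the plan, and clipped prior results. This is a hypothetical illustration, not the planner's code: `build_step_prompt`, `clip`, and the fixed character budget shared across prior results are assumptions made here so that every step's prompt stays bounded in size.

```python
from typing import List, Tuple


def clip(text: str, limit: int) -> str:
    """Keep the head and the tail of `text`, dropping the middle."""
    if len(text) <= limit:
        return text
    half = max(limit // 2, 1)
    return text[:half] + "…" + text[-half:]


def build_step_prompt(task: str, plan: List[str],
                      done: List[Tuple[str, str]], step: str,
                      budget: int = 600) -> str:
    """Assemble one step's prompt from task, plan and clipped prior results."""
    # Assumption: prior results share one fixed budget, so the prompt does
    # not grow with the length of earlier outputs or reasoning traces.
    share = budget // max(len(done), 1)
    lines = [f"Task: {task}", "Plan:"]
    lines += [f"  {i}. {s}" for i, s in enumerate(plan, 1)]
    if done:
        lines.append("Results so far:")
        lines += [f"  - {name}: {clip(result, share)}" for name, result in done]
    lines.append(f"Current step: {step}")
    return "\n".join(lines)


if __name__ == "__main__":
    plan = ["research X", "compare with Y", "write recommendation"]
    done = [("research X", "long notes " * 500)]
    print(len(build_step_prompt("Evaluate X", plan, done, plan[1])))
```

Because each prompt is bounded regardless of how much text earlier steps produced, total prompt size grows in proportion to the number of steps rather than with the accumulated history.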

Configuration

Point ZEN at any OpenAI-compatible endpoint:

export ZEN_LLM_BASE_URL=https://openrouter.ai/api/v1
export ZEN_LLM_MODEL=deepseek/deepseek-v4-flash
export ZEN_LLM_API_KEY=sk-or-...

or use a built-in profile (reads the key from OPENROUTER_API_KEY or a git-ignored .secrets.json):

from zen import config
config.apply_profile("deepseek-v4-flash")

Escalation needs a model exposing a thinking toggle (DeepSeek, Qwen, GPT reasoning-effort, Claude extended thinking...). Models without one still work — ZEN then behaves as consensus routing.

Everything is injectable — client_factory=, classifier=, parser=, executor= — so ZEN stays provider-agnostic and fully testable offline:

python tests/test_offline.py    # 31 tests, zero API calls

Negative results we kept (so you don't rediscover them)

  • Confidence gating (accept a high-logprob first sample): logprob coverage through OpenRouter is partial (~47%) and confidence did not predict correctness on AIME — the non-thinking model is often confidently wrong. Available as gate_tail_logprob=, off by default.
  • Halving the native think budget (think_budget=8000): saved only ~10% of total tokens and cost accuracy exactly where thinking was needed.
  • Truncating candidates in judge/aggregation prompts silently destroys them — keep the head and the tail (zen.parsing.clip).
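
The last point is easy to demonstrate. The sketch below is an independent illustration of head-and-tail clipping, not the code of `zen.parsing.clip`, whose signature the README does not show; `clip_head_tail` and the marker string are hypothetical.

```python
def clip_head_tail(text: str, limit: int, marker: str = " [...] ") -> str:
    """Shorten `text` to at most `limit` characters, keeping both ends."""
    if len(text) <= limit:
        return text
    keep = limit - len(marker)
    if keep < 2:
        # Not enough room for head, marker and tail: fall back to the head.
        return text[:limit]
    head = (keep + 1) // 2
    tail = keep - head
    return text[:head] + marker + text[-tail:]


if __name__ == "__main__":
    candidate = "Step 1: set up the equation. " + "work " * 200 + "Final answer: 42"
    print(candidate[:60])                 # head-only truncation loses the answer
    print(clip_head_tail(candidate, 60))  # the final answer survives
```

Reasoning traces usually state the final answer at the end, so a head-only cut hands the judge a candidate with no answer in it.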

Honest caveats: results come from one model family and N=30 math benchmarks; consensus needs comparable short answers (v0.2 targets verifiable outputs — swap parser= for your domain, open-ended text is future work).

Repo layout

zen/           the package: gateway, router, planner, client, parsing, config
tests/         offline test suite (no API key needed)
experiments/   evaluation harnesses (AIME/MATH/GSM8K) + the RL research line
docs/          full results tables + development log

Related work

ZEN distills ideas from self-consistency (Wang et al.), adaptive-consensus and fast/slow-routing research (AdaptThink, DART), DeepConf, RSA and RL of Thoughts — the tiny-controller idea that started this project. ZEN's contribution is the disagreement-routed cheap→thinking cascade with honest, per-call token accounting.

Works well next to SIFT (tool retrieval and calling by the same author): SIFT decides what the model can do, ZEN decides how hard it should think.

License

MIT


