Skip to main content

A quality-gated local-model router for coding harnesses — local where it's proven, cloud where it isn't, verified with automatic fallback.

Project description

anvil-serving — the quality-gated local-model router for coding harnesses

anvil-serving

The quality-gated local-model router for coding harnesses.

Local where it's been proven, cloud where it hasn't — verified, with automatic fallback.

License: MIT Version Docs Marketplace Tests

Point your coding harness (Claude Code, Codex, Aider, pi, Cline, Continue — OpenClaw as the near-first-class beachhead) at one anvil-serving endpoint. Per request, the router resolves an intent to a tier — fast-local, heavy-local, or cloud — using a measured per-(model, work-class) quality profile, cheaply verifies the output, and falls back to the next tier (ultimately cloud) when the local answer fails. The harness sees one reliable endpoint and never silently eats a local-quality failure mid-run.

The router is built on a serving substrate that already exists in this repo — usage profiling, model cataloging, tuned deployment, a correctness gate, capacity benchmarking, and a single-GPU model multiplexer. Those tools (documented below) right-size and stand up the local tiers; the router decides what each one is trusted to serve.


Why a router, and not just another proxy

Transport is a commodity — LiteLLM, claude-code-router, Ollama, OpenRouter all move tokens. None of them know whether local can actually do this work. They route by static rules (model name, cost, regex). On anvil's real PRD→tasks planning prompt, the gap was measured directly:

  • Local output is structurally valid ≥92% of the time (5 of 6 outputs scored 100% — parses cleanly under the strict schema, no cycles, no dangling edges) — structural validity is not the differentiator.
  • But on dependency/ordering reasoning local collapses: blind-judge totals were frontier 24.75/25, fast 16.0, heavy 13.25 (local ≈ 55–65% of frontier), with the gap squarely in dependency correctness (frontier 5.0/5 vs local ~2/5).

A dumb proxy sends that planning request to local and silently corrupts a long agent run. The defensible asset is therefore not the transport — it's the quality profile (per model × work-class, measured on the operator's own workload) plus the verify-and-fallback loop. Competitors can copy transport in a weekend; they can't copy "we measured that gpt-oss-20b is safe for bounded edits but unsafe for dependency planning on your repos."

Two more decisions fell out of the research and shaped the design:

  • The integration point is the harness/runtime, not a ledger. An audit of the anvil state engine confirmed it is not an LLM gateway — it exposes a single custom_base_url for optional planning augmentation, with no router and no two-tier endpoint routing. So the router lives where the agent traffic actually flows: in front of the harness.
  • The model field is the one routing channel unmodified harnesses expose. It is required in both wire schemas, forwarded verbatim, and free-form — so "named presets in the model field" is the right wire surface.

How it works

Intent presets in the model field

Callers declare an intent — a closed enum of named presets — instead of a model name. The router owns intent → (model, tier, params):

planning   quick-edit   review   chat   long-context

Accepted bare (planning) or namespaced (anvil/planning). Each preset resolves internally to hard constraints (context length, privacy=local-only, tool/structured-output support, cost ceiling) that filter the candidate pool, plus a quality intent that ranks the survivors via the profile. Filter, then rank. A model:-pin escape hatch stays available for repro and debugging. /v1/models advertises the preset vocabulary so presets surface in harness model pickers, and responses stay transparent — the response reports the real tier that served.

The graceful-degradation tier ladder

Intent resolution degrades to the highest tier the originating harness can reach. The classifier is the universal floor, because most requests arrive on a single session model string with no declared intent:

Tier Mechanism Available on
0 — Infer classify work-class from the raw payload (token count, thinking flag, tool types, image content, system-prompt fingerprint) — per-request intent with no caller cooperation every harness that reaches the endpoint
1 — Named presets in model caller/config sets a preset token; router maps preset → tier Claude Code, Codex, Aider, pi, Cline, Continue — not Cursor/Amp/Devin
2 — extra_body / header dimensions optional structured hints (budget, latency, verifier policy) Codex, Continue; Aider (config)
3 — Native intent field a first-class per-request intent field none today (needs a standard/harness change)

Work-class taxonomy v0: chat/Q&A, bounded-edit, multi-file-refactor, planning/decomposition, review/critique, long-context-retrieval. Ambiguous classifications bias toward the safer/cloud tier and are logged for calibration.

Verify-and-fallback

flowchart LR
    REQ([request]) --> RES["resolve intent<br/>(preset or Tier-0 infer)"]
    RES --> POL{"quality profile<br/>model x work-class"}
    POL -->|allow| LOCAL["local tier<br/>(fast or heavy)"]
    POL -->|"allow-with-verify"| LV[local tier]
    LV --> VER{"structural verify"}
    VER -->|pass| RET([200 to harness])
    LOCAL --> RET
    VER -->|fail| MISS[all candidates exhausted]
    POL -->|deny| MISS
    MISS --> GATE{"cloud tier<br/>opt-in configured?"}
    GATE -->|"no, keyless default"| SOS["503 exhaustion — no local tokens<br/>gateway failover to native provider<br/>(flat-rate, zero metered by anvil)"]
    GATE -->|"yes, opt-in keyed"| CLOUD["cloud tier<br/>(metered, explicit opt-in only)"]
    CLOUD --> RET
    classDef gate fill:#0b3b40,stroke:#23b6c4,color:#7fe9f0;
    classDef cloud fill:#16365e,stroke:#58a6ff,color:#cfe6ff;
    classDef local fill:#0b2b0b,stroke:#23c430,color:#7ff09a;
    classDef escape fill:#3b1800,stroke:#c47823,color:#f0c887;
    class POL,VER,GATE gate;
    class CLOUD cloud;
    class LOCAL,LV local;
    class SOS escape;

Most "quality control" is routing done ahead of time (never send a deny work-class to local). Verification is a cheap safety net, tiered:

  1. Prevent — the profile's deny decisions keep risky classes (e.g. dependency planning) off local entirely. Free.
  2. Cheap structural verify (inline) — near-zero-cost checks that caught real failures in the eval: empty/truncated content (thinking-budget starvation), tool-call JSON that doesn't validate, code that doesn't parse, a diff that doesn't apply.
  3. Confidence signals — logprob/entropy thresholds, refusal markers where available.
  4. Async LLM-judge (off the hot path) — sampled cloud grading that feeds the profile, never a blocking gate.

On verify-fail / error / timeout / low-confidence the router retries up the tier chain (fast → heavy → cloud), with retry caps, circuit breakers, and a per-session cost budget. For fail-prone classes on the streaming path, a non-streamed commit window buffers and verifies before the first byte reaches the harness, so a local miss never delivers partial tokens. Every fallback is logged as a profile signal.

Default: local-only / $0 metered cloud

The default config (configs/example.toml) routes local-only — anvil holds no cloud API key and incurs $0 metered API billing. The "cloud tier" in the diagram above operates in two modes:

  • Keyless (default): no cloud tier is configured. On a local verify-failure, all candidates are exhausted and anvil returns an exhaustion_status (503 by default, configurable) with nothing local committed. A gateway like OpenClaw treats this as a transport failure and re-routes the request on its native subscription provider — flat-rate, not metered.
  • Opt-in keyed: when an operator explicitly adds a cloud tier via configs/example-with-cloud.toml, verify-failures escalate internally to the cloud tier and return 200. This is metered — every cloud-routed request bills per token against the API key.

Billing warning: the opt-in cloud tier routes through a metered API key (ANTHROPIC_API_KEY, etc.), not your flat-rate subscription. Per-token charges apply to every request the cloud tier serves. The per-intent [router].metered_cloud map controls which work-classes are ever eligible for the cloud tier — nothing is metered unless you explicitly list it there. See configs/example-with-cloud.toml and ADR-0001 for the full opt-in config and rationale.

POST /v1/route — the routing-brain endpoint. The decision is also queryable standalone, without serving the request: send a completions-shaped body, get back { tier, model, provider, work_class, reason, confidence, session_id }. The OpenClaw plugin uses this to route deny-class requests directly to the gateway's native provider — bypassing anvil entirely — and to send allow-class requests through. Status 200 (decision made, even if cloud), 400 (malformed), 503 (no suitable tier).

The quality profile (the moat)

The quality gate — work proven on a tier stays local ($0 metered); unproven work is deferred to the harness's own opt-in cloud subscription; measured per (model, work-class)

A table keyed (model, work-class) → {quality_score, sample_n, last_measured, decision} where decision ∈ {allow, allow-with-verify, deny}. Hand-seeded for the MVP; later bootstrapped from the shadow-eval harness (the planning eval generalized to arbitrary work-classes), right-sized from measured usage via profile, and continuously calibrated from sampled production traffic graded off the hot path. Keyed on a serve fingerprint (model + quant + engine + serve flags) so a quant/engine swap marks affected rows stale and triggers re-measure.

Architecture

anvil-serving routing pipeline — local-first across fast-local / heavy-local ($0 metered); cloud is the harness's own, opt-in

          ┌──────────────────── anvil-serving router ───────────────────┐
harness → │ front door → resolve intent → route → [verify] → return     │ → harness
(CC/Codex)│  (Anthropic   (preset in        (filter by    │  on fail ↘          │
          │   +OpenAI      model field, else  constraints, │  fall back to       │
          │   dialects)    infer work-class)  rank by       │  next local tier    │
          │                                   quality profile)                   │
          └───────────────────────────────┬──────────────┴────────────────────┘
                                           ▼
                fast-local :30001   heavy-local :30000    · cloud is opt-in ·
                  (multiplexer-managed, free GPU)         on a local miss anvil returns 503;
                                                          the harness fails over to its OWN
                                                          cloud subscription (anvil holds no key)

The core stays protocol-standard (Anthropic Messages + OpenAI Chat Completions) with zero OpenClaw coupling. OpenClaw is the near-first-class beachhead because its before_model_resolve hook (fires per run — plausibly one user message; exact cadence is a documented validate-first gap) unlocks client-side per-request intent the closed harnesses can't do — but it ships as a thin, swappable adapter plugin, not a dependency. Focus, not couple. Full design: docs/QUALITY-GATED-ROUTER.md. OpenClaw spec (verdict: go-with-caveats, medium risk / API churn): docs/OPENCLAW-INTEGRATION-SPEC.md.


Install

Requires Python >=3.11 (the router's config loader uses stdlib tomllib, added in 3.11). Running an older interpreter prints a clear error and exits — see anvil_serving/cli.py.

pip install anvil-serving   # stdlib-only; no required deps
anvil-serving --help

pip install anvil-serving works once the package is published to PyPI. Until then (or for development), install from a clone: pip install -e ..

macOS — if pip install fights your system/Homebrew Python (externally-managed-environment, PATH shadowing), install with pipx instead — it gives anvil-serving its own isolated venv and puts the CLI on your PATH:

brew install pipx && pipx ensurepath
pipx install anvil-serving          # or, from a clone: pipx install -e .
anvil-serving --help

30-second quickstart

# 1) install
pip install anvil-serving            # or: pip install -e .  (from a clone)
#    macOS alternative: pipx install anvil-serving

# 2) start the router front door on 127.0.0.1:8000
anvil-serving serve --config configs/example.toml

# 3a) Claude Code: point it at the router (use 127.0.0.1, never localhost)
export ANTHROPIC_BASE_URL="http://127.0.0.1:8000"
export ANTHROPIC_MODEL="planning"        # an intent preset, sent verbatim in the model field

# 3b) or any OpenAI-compatible client: point its base URL at the router
export OPENAI_API_BASE="http://127.0.0.1:8000/v1"
# then use a preset token as the model id, e.g. "planning" / "quick-edit" / "chat"

Both the router front door (anvil-serving serve) and the serving substrate commands ship today.

Pointing a harness at the router (config recipes)

Use 127.0.0.1 in local URLs — on Windows, localhost can stall on an IPv6 lookup. Secrets are referenced by env-var name only; never paste a key into config.

Claude Code — intent preset per session slot:

export ANTHROPIC_BASE_URL="http://127.0.0.1:8000"
export ANTHROPIC_AUTH_TOKEN="$ANVIL_ROUTER_TOKEN"        # -> Authorization header
export ANTHROPIC_MODEL="planning"                        # main-loop intent, sent verbatim
export ANTHROPIC_DEFAULT_HAIKU_MODEL="quick-edit"        # background/utility intent
export CLAUDE_CODE_SUBAGENT_MODEL="review"               # subagent-class intent
export CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1       # optional: enumerate router presets

Aider — the preset rides in the model string (openai/ forces compat routing):

export OPENAI_API_BASE="http://127.0.0.1:8000/v1"
export OPENAI_API_KEY="$ANVIL_ROUTER_TOKEN"
aider --model openai/planning --editor-model openai/quick-edit --weak-model openai/chat

pi (Mario Zechner's coding agent) — custom providers are first-class in ~/.pi/agent/models.json; anvil's presets ride as free-form model ids (pick via /model or pi -m planning):

{ "providers": { "anvil": {
    "baseUrl": "http://127.0.0.1:8000/v1", "api": "openai-completions",
    "apiKey": "$ANVIL_ROUTER_TOKEN",
    "models": [
      { "id": "planning",   "name": "anvil planning",   "contextWindow": 131072, "maxTokens": 16384 },
      { "id": "quick-edit", "name": "anvil quick-edit", "contextWindow": 131072, "maxTokens": 16384 },
      { "id": "review",     "name": "anvil review",     "contextWindow": 131072, "maxTokens": 16384 } ] } } }

Declare contextWindow as the largest routed tier's window — an understated value makes a harness clamp its completion budget (see docs/OPENCLAW-INTEGRATION-SPEC.md §2).

Cline / Continue.dev — select "OpenAI Compatible", Base URL http://127.0.0.1:8000/v1, Model (ID) = a preset token. Codex CLI — set base_url + model = "planning" in ~/.codex/config.toml. Cursor / Amp / Devin are out of scope — backend-mediated or backend-locked, so they can't be repointed at a self-hosted endpoint. Full recipes: docs/QUALITY-GATED-ROUTER.md (Appendix).

Run the router in Docker (supervised, auth-gated)

pip install anvil-serving (above) still works and remains the quickest way to run the router locally — Docker is an additional deployment option for running it as a supervised, network-facing service, restart-supervised alongside the local model serves. See ADR-0004 for the full design.

# 1) pick the token the router will require on every route except /healthz
export ANVIL_ROUTER_TOKEN="$(openssl rand -hex 32)"   # generate once, keep it secret

# 2) build the repo-root image (stdlib-only, non-root user, HEALTHCHECK on /healthz)
docker build -t anvil-serving .

# 3) run it, publishing ONLY 127.0.0.1 on the host
docker run -d --name anvil-router \
  -p 127.0.0.1:8000:8000 \
  -e ANVIL_ROUTER_TOKEN \
  -v "$(pwd)/configs/example.toml:/etc/anvil/config.toml:ro" \
  anvil-serving

# 4) point a harness at the authed router
export ANTHROPIC_BASE_URL="http://127.0.0.1:8000"
export ANTHROPIC_AUTH_TOKEN="$ANVIL_ROUTER_TOKEN"        # -> Authorization: Bearer header

Auth is opt-in, keyed off config, not off the container: it activates only when the config has a [server] table naming an env var to check —

# configs/example.toml (excerpt) -- turn auth on
[server]
auth_env = "ANVIL_ROUTER_TOKEN"   # env-var NAME only; never a literal token in this file

With no [server] table (the shipped configs/example.toml default), the container behaves exactly like the bare pip install path always has: no built-in auth, loopback-only by convention — full back-compat. When auth_env is set, the front door requires Authorization: Bearer <token> or x-api-key: <token> on every route, compared against os.environ[auth_env] with a constant-time check; GET /healthz stays unauthenticated always (it returns {"status":"ok"} and is what the image's HEALTHCHECK probes).

Compose, co-located with the local serves — the router reaches the GPU serves by Docker service name over a shared internal network; only the router is published beyond loopback:

services:
  router:
    build: .                                        # repo-root Dockerfile
    ports:
      - "${ROUTER_PUBLISH:-127.0.0.1}:8000:8000"
    environment:
      - ANVIL_ROUTER_TOKEN
    volumes:
      - ./configs/example.toml:/etc/anvil/config.toml:ro
    restart: unless-stopped
    depends_on: [sglang, fast]
  sglang:   # heavy tier -- internal only, not published beyond the router
    ...
  fast:     # fast tier -- internal only, not published beyond the router
    ...

With this topology the router config's tiers point at the serves by service name (http://sglang:30000/v1, http://fast:30001/v1) instead of 127.0.0.1. See examples/fakoli-dark/ for the full worked cross-box example — a gateway box's harness reaching a GPU box's containerized, token-authed router over a private network.


Security & exposure

The front door binds 127.0.0.1 by default. Auth is opt-in and off by default — with no [server] table (or no auth_env key inside it) in your config, the router behaves exactly as it always has: no built-in authentication, loopback-only by convention. Configuring [server].auth_env = "ANVIL_ROUTER_TOKEN" turns on a built-in bearer / x-api-key token check (constant-time compare against os.environ[auth_env]) on every route except GET /healthz — see ADR-0004 and the Docker section above.

A token is required before you expose the router beyond loopback. If you bind it publicly (--host 0.0.0.0) or otherwise put it on a reachable address, configure auth_env first — an open, unauthenticated endpoint lets any caller drive routing and, if you have configured an opt-in metered cloud tier, consume your cloud credentials. (The default local-only config carries no cloud key, but the risk still applies if you later add one.) Treat a mesh/tailnet ACL (e.g. Tailscale) as defense-in-depth on top of the token, never as a substitute for it. Cloud credentials are referenced by env-var name and redacted from logs; see SECURITY.md for the full threat model and how to report a vulnerability.


The serving substrate (the foundation the router builds on)

These commands ship today and are how you right-size, stand up, and validate the local tiers the router routes across. Stdlib-only; the hard-won gotchas (below) are baked in as defaults.

Ops prerequisite (substrate only): deploy and serves run the GPU model containers via Docker + Docker Compose v2 (docker compose …), on top of the NVIDIA container runtime. Model serves are Docker-Compose-defined so serves up is a drift-safe docker compose up -d (ADR-0002). This is an operational requirement for standing up local tiers — not a Python dependency: the router and the anvil-serving package stay stdlib-only (pip install pulls in nothing; the hot path adds no runtime deps).

anvil-serving profile      # parse ~/.claude logs -> usage baseline (context/gen/concurrency percentiles, role split)
anvil-serving models sync  # scan your HF caches -> card catalog + INDEX (GGUF vs safetensors, ctx, quant, license, thinking)
anvil-serving deploy       # render a tuned SGLang docker-compose for YOUR gpu + model
anvil-serving preflight    # correctness gate against any OpenAI-compatible endpoint (sm_120-aware)
anvil-serving benchmark    # replay YOUR measured request distribution (TTFT, throughput, prefix-cache hit)
anvil-serving multiplexer  # single-resident model swap on one GPU (multi-engine: SGLang + vLLM)

Substrate quickstart

# 1) understand your usage (-> the routing profile is right-sized from this)
anvil-serving profile --out-dir .

# 2) see what models you have and which a server can actually load
anvil-serving models sync --out ./model-library
#    -> ./model-library/INDEX.md  (the (yes/no) "SGLang-loadable" column is the one that saves you)

# 3) generate a deployment for a local model on GPU 1
anvil-serving deploy --model /path/to/model-dir --gpu 1 --context 131072 --served-name local-specialist
docker compose -f docker-compose.yml up -d

# 4) validate + benchmark the tier (use 127.0.0.1, never localhost, on Windows)
anvil-serving preflight --base-url http://127.0.0.1:30000/v1 --model local-specialist --needle-ctx 60000
anvil-serving benchmark --base-url http://127.0.0.1:30000/v1 --model local-specialist --burst 20

preflight is the router's correctness gate in microcosm — it's the same verify-before-trust philosophy applied to a single endpoint. multiplexer is the Backend seam that already exists: it manages the single-resident fast/heavy swap pair on one GPU behind one interface. (score and cache-prune are additional substrate helpers.)

What's baked in (the knowledge, not just code)

  • Load-time OOM fix: loading a model without mmap pulls it fully into RAM — on WSL2 the default ~50%-of-host cap OOM-kills the scheduler (SIGKILL/-9). Raise the WSL VM memory; the deploy uses --weight-loader-disable-mmap (fast sequential reads vs catastrophic mmap-over-virtiofs).
  • GGUF != SGLang. GGUF is llama.cpp-only; SGLang/vLLM need safetensors. The catalog flags this up front.
  • Thinking-by-default models (Qwen3.5 etc.) return empty content with a small max_tokens — they spend the budget reasoning. Disable per request with chat_template_kwargs:{enable_thinking:false}, or give >= 4096 tokens. See docs/MODEL-SETTINGS-EXAMPLE.md.
  • GPU pinning (device_ids) so one card serves while another stays free (gaming / second job).
  • Blackwell sm_120 caveats: some FP8 MoE paths hang on load; AWQ/compressed-tensors via Marlin works; NVFP4 large-prefill is still rough. Pre-flight before you trust throughput.

Worked example — fakoli-dark

examples/fakoli-dark/ is a real two-tier instance and the bake-off context for the router:

  • heavy :30000 — Qwen3.5-35B-A3B AWQ (served-model-name qwen35-awq-local) on SGLang, RTX PRO 6000 96GB.
  • fast :30001gpt-oss-20b on vLLM, RTX 5090 32GB.
  • gatewayFakoli Mini runs OpenClaw (already installed), the beachhead harness; the router sits between it and the serves — see examples/fakoli-dark/README.md for the cross-box exposure model (the gateway and the GPU box are separate machines) and every hardcoded value (GPU UUIDs, model paths, ports) you must replace with your own before reusing this topology.

It carries the actual compose files, the .wslconfig fix snapshot, the model index, the setup story (SETUP-STORY.md), the decisions log, and the BAKE-OFF-RUNBOOK.md (where local-planning failover now lives, after the router promotion).


Status

v0.7.2 is shipped. The relay forwards tools, tool history, and sampling parameters with full wire fidelity, streams real SSE deltas (true TTFT), and passes through real token usage — on top of the v0.6.0 containerized, token-authed service (ADR-0004) and the portable-by-default genericity work (ADR-0003); the harness-router PRD is complete — all 18 tasks built (milestones M0–M3), 993 tests green. v0.7.1/v0.7.2 are live-incident hardening passes: a caller-capped length/max_tokens stop (e.g. a harness-computed max_completion_tokens floored at 1) now passes structural verify instead of 503-exhausting and tripping the circuit breaker on every turn, and model weights mount from a named Docker volume instead of host bind mounts (9P/virtiofs bind mounts turned cold loads into 20–90 minute stalls — and this also removed the last machine-specific paths from the shipped package). v0.4.0 shipped advise-and-defer (local-only default, opt-in metered cloud) and the launch-hardening pass on top of the v0.3.0 harness-router. Both the router front door (anvil-serving serve) and the serving substrate (profile / models sync / deploy / preflight / benchmark / multiplexer) ship. See the CHANGELOG for the full release notes.

What shipped, by milestone:

  • M0 — front door + config: Anthropic + OpenAI dialects, SSE streaming, pass-through to one backend, tier-topology config schema. Drop-in for Claude Code.
  • M1 — intent + policy: Tier-0 classifier (the floor) and preset parsing from the model field, residency-aware routing policy over the multiplexer, /v1/models preset discovery, cloud-tier credentials on the Backend seam.
  • M2 — the wedge: cheap inline structural verify + verify-gated fallback (streaming commit window; cloud escalation is the opt-in keyed mode — the keyless default returns an exhaustion-503 for gateway handoff), transparent responses + decision log, the typed extension seams, the anvil-serving serve --config ... CLI verb, and the OpenClaw reference adapter plugin.
  • M3 — the moat: quality-profile bootstrap from the generalized shadow-eval, async opt-in calibration + serve-fingerprint staleness, validation on routed traffic, and the per-work-class promotion decision (planning/critic stay cloud-default, failover-only).

Known limitations

  • Keyless 503 handoff is not a safety net for local-preferred classes under OpenClaw. A live run confirmed OpenClaw's native failover does not escape a plugin-set providerOverride: the exhaustion-503 trips the failover, but every fallback re-resolves through anvil and 503s again (ADR-0005, an OpenClaw-side defect). Mitigations: move at-risk classes to cloud-preferred via ANVIL_CLOUD_CLASSES, or enable the opt-in metered cloud tier (the durable fix).
  • Most promotion verdicts are seed/expected. The shipped per-work-class promotion decisions are hand-seeded, pending real-traffic calibration; only planning rests on hard eval data (blind-judge scores: frontier 24.75/25, fast 16.0, heavy 13.25 — full data in the companion notes repo). Note the seed verdicts were measured against the models served at the time (heavy: qwen3-coder-30b); the reference heavy serve has since moved, and the 2026-07-02 review recommends re-running the shadow-eval against current-gen local models before trusting the planning → deny margin.
  • OpenClaw live validation remains a manual step. It has been run — it closed Gaps 1–3, found Gap 4 (ADR-0005) and the contextWindow misdeclaration incident — but each re-validation is a human step on the gateway box (examples/openclaw/README.md). The committed hook-fire-log.jsonl is a representative fixture, not a live capture.
  • The T017 traffic fixture is synthetic — traffic-metrics behavior is exercised against a synthetic fixture, not yet against real routed production traffic.
  • A caller-supplied max_tokens disables truncation-failure on that request. The v0.7.1 fix treats any explicit caller cap as compliance; since the Anthropic Messages dialect requires max_tokens, NotTruncated is effectively pass-through for Anthropic-dialect callers.

Reuse map

Capability Module Status
Right-size from real usage profile exists
Per-model serving facts / sane defaults models sync, analyze exists / designed
Bring up + on-demand model swap on one GPU multiplexer exists
Correctness gate preflight exists
Throughput / capacity measurement benchmark exists
Per-work-class quality measurement shadow-eval harness built + generalized
Front door + intent-resolve + route + verify + fallback router module shipped

Docs

MIT licensed.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

anvil_serving-0.7.2.tar.gz (279.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

anvil_serving-0.7.2-py3-none-any.whl (266.2 kB view details)

Uploaded Python 3

File details

Details for the file anvil_serving-0.7.2.tar.gz.

File metadata

  • Download URL: anvil_serving-0.7.2.tar.gz
  • Upload date:
  • Size: 279.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for anvil_serving-0.7.2.tar.gz
Algorithm Hash digest
SHA256 b5f3716bff455fc81fad8d402a2907416c925a5fbfa1ee637cdc80efbc20852a
MD5 85748b962a044b3a334b045227ee1248
BLAKE2b-256 aaa0f2af81a09cbb0e27b9243af5022d8f85660d9b5840980e0528f0ea4f1218

See more details on using hashes here.

Provenance

The following attestation bundles were made for anvil_serving-0.7.2.tar.gz:

Publisher: publish.yml on fakoli/anvil-serving

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file anvil_serving-0.7.2-py3-none-any.whl.

File metadata

  • Download URL: anvil_serving-0.7.2-py3-none-any.whl
  • Upload date:
  • Size: 266.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for anvil_serving-0.7.2-py3-none-any.whl
Algorithm Hash digest
SHA256 2049589d0a0f56044fc09d1aef3d85d15b38a122741d4ef6855f8e921e4d4225
MD5 735eb1582f966449fdb3506e3a702a1a
BLAKE2b-256 6308f581fa778ab3c72eb05b3a758a9259e13aa66ca1537e841d52365eca448d

See more details on using hashes here.

Provenance

The following attestation bundles were made for anvil_serving-0.7.2-py3-none-any.whl:

Publisher: publish.yml on fakoli/anvil-serving

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page