Skip to main content

model-gear — run, assess, and switch the local vLLM model.

Project description

model-gear

model-gear is the tooling that runs, assesses, and switches the local, OpenAI-compatible vLLM model the Culture mesh consumes. The binary is modelmodel switch, model assess, model serve, and so on.

The served model is what the model-gear agent connects to over the acp vllm-local provider. The tool and the deployed agent share one identity: the same model-gear runs the engine and consumes it.

Sibling to culture (the agent mesh), daria (awareness), and steward (alignment).

Install

uv tool install model-gear

Usage

model init --apply          # scaffold a deployment dir (default $HOME/.model-gear)
model serve --apply         # start the vLLM server (alias: start)
model switch nvidia/Qwen3-32B-NVFP4 --apply   # switch the served model
model switch nvidia/Qwen3-32B-NVFP4 --purpose decode-heavy --machine spark --apply  # ...in a tuned gear
model status                # current model, container state, /health
model assess                # correctness probes (markdown for a per-model doc)
model benchmark             # decode throughput + prefill latency (shape follows --purpose)
model stop --apply          # stop the server

model overview              # tool snapshot + served model + candidate list
model whoami                # tool, machine, served model, container health
model explain switch        # markdown docs for a topic
model doctor                # diagnose docker / compose / .env / health

Every command supports --json. Write verbs (switch, serve, stop, init) are dry-run by default and require --apply to commit — agents call CLIs in loops, so safe-by-default is mandatory.

Running the model locally (vLLM)

model init scaffolds a deployment directory (default $HOME/.model-gear) from the packaged templates: a docker-compose.yml that stands up the vLLM model as an OpenAI-compatible server on :8000, plus a .env. Tuned for DGX Spark (GB10 Grace Blackwell, 128 GB unified memory) per build.nvidia.com/spark/vllm.

Prerequisites: the NVIDIA Container Toolkit, and docker login nvcr.io with an NGC API key to pull the nvcr.io/nvidia/vllm image.

model init --apply          # writes $HOME/.model-gear/{docker-compose.yml,.env}
# edit $HOME/.model-gear/.env to set HF_TOKEN if the model repo is gated
model serve --apply         # first run downloads ~28 GB of weights (the 27B primary)
model status                # waits/reports until /health is up

Verify it is up:

curl -fsS http://localhost:8000/health
curl -s http://localhost:8000/v1/models   # what's WARM now (the served model), not the catalog

Tunables live in the deployment .env (VLLM_MODEL, VLLM_GPU_MEM_UTIL, VLLM_MAX_MODEL_LEN, HF_CACHE, …). VLLM_SERVED_NAME must match the part after vllm-local/ in culture.yamlmodel doctor checks this. model switch rewrites these keys for you.

Tuning the gear (purpose + machine)

model switch resolves the serve config from three layers — the machine profile (--machine, default auto-detected: GPU-memory fraction, context, attention backend), the workload profile (--purpose, default balanced: the batching knobs and the shape model benchmark exercises), and the model's catalog entry (quantization, tool parser). Explicit --max-model-len / --gpu-mem-util flags override the machine defaults.

model switch sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP --purpose decode-heavy --machine spark --apply
model benchmark --purpose decode-heavy   # shape defaults to the configured VLLM_PURPOSE
model explain tuning                     # the full layering

Purposes: balanced (≈1K in/1K out), prompt-heavy (≈8K in/1K out), decode-heavy (≈1K in/8K out). Machines: spark (load-tested), thor, blackwell, generic (configured). The throughput flags and these shapes follow shahizat's cross-machine NVFP4 benchmark — see docs/tuning-profiles.md.

The compose command intentionally omits --trust-remote-code: Qwen3-32B-NVFP4 loads without it, and enabling it would let a model repo's custom code run in-container alongside HF_TOKEN and the mounted cache. Add it back only for a model whose repo ships custom modeling code. If vLLM rejects the nvidia/ ModelOpt checkpoint, set VLLM_MODEL to the vLLM-native RedHatAI/Qwen3-32B-NVFP4 and drop --quantization from the compose command.

Expose the API from anywhere (Cloudflare Tunnel)

model tunnel publishes the local OpenAI-compatible API at an owner-chosen hostname through a Cloudflare Tunnel, so Culture/AgentCulture agents can call it from anywhere as an ordinary provider (base_url + api_key) — no inbound ports, no static IP. The hostname and the run-token never live in committed config.

⚠️ Gate it first. A tunnel makes the model reachable from the public internet. Set CULTURE_VLLM_API_KEY in $HOME/.model-gear/.env before running model tunnel — vLLM then requires Authorization: Bearer $CULTURE_VLLM_API_KEY on every request. Empty leaves the API open; that is only safe for local dev. Generate or rotate the key with python3 scripts/gen-api-key.py (writes it to the gitignored deployment .env; --show to print, --force to rotate), then model serve --apply to enforce it. You can also set VLLM_SERVED_NAME to a generic alias (e.g. default) to keep the backend checkpoint name out of the public GET /v1/models.

Two steps — the Cloudflare side once, then the local side:

# 1) Cloudflare side, ONCE — provision the tunnel + ingress + DNS and seal the
#    run-token in shushu (tunnel-only mode; the backend authenticates itself):
cultureflare remote-login setup \
  --hostname your-host.example \
  --service http://127.0.0.1:8000 \
  --no-access --shushu --apply

# 2) Local side — copy the scaffolded example, fill in hostname + token source:
cp $HOME/.model-gear/cf-tunnel.env.example $HOME/.model-gear/.cf-tunnel.env
# edit $HOME/.model-gear/.cf-tunnel.env (it is gitignored — never commit it):
#   CULTURE_VLLM_PUBLIC_HOSTNAME=your-host.example
#   CULTURE_CF_TUNNEL_TOKEN_SHUSHU=<shushu-secret-name>

model serve --apply         # serve (with CULTURE_VLLM_API_KEY set in .env)
model tunnel                # DRY RUN: prints the cloudflared command + public URL
model tunnel --apply        # start the tunnel in the background
# ... later:
model tunnel --stop --apply # tear it down

The hostname resolves from --hostname$CULTURE_VLLM_PUBLIC_HOSTNAMECULTURE_VLLM_PUBLIC_HOSTNAME in .cf-tunnel.env; the run-token from CULTURE_CF_TUNNEL_TOKEN_SHUSHU (a shushu-sealed secret name, preferred) or CULTURE_CF_TUNNEL_TOKEN (plaintext fallback). --apply preflights that cloudflared (and shushu) is on PATH and that the local server answers /health first. cloudflared + shushu are runtime deps on the serving box.

Call it from anywhere — use your hostname and the alias you served (placeholders shown):

from openai import OpenAI

client = OpenAI(base_url="https://your-host.example/v1", api_key="$CULTURE_VLLM_API_KEY")
client.chat.completions.create(model="default", messages=[{"role": "user", "content": "hi"}])
curl -s https://your-host.example/v1/chat/completions \
  -H "Authorization: Bearer $CULTURE_VLLM_API_KEY" \
  -d '{"model":"default","messages":[{"role":"user","content":"hi"}]}'

Hardening (future). The bearer key is the minimum bar. For stronger exposure, layer Cloudflare Access (SSO/service tokens), a WAF rule or IP allowlist, and/or mTLS in front of the tunnel. Bearer auth currently gates the single-model deployment; the fleet gateway is not yet auth-aware (planned). See model explain tunnel for the full flow.

Running the model behind a gateway (fleet)

model init --fleet scaffolds a two-container deployment instead of one: the always-warm Qwen primary and a single stdlib gateway that fronts it on the host port the acp vllm-local provider already expects. The gateway routes each request by its model field and defaults an unknown/missing name to the primary — so existing single-model clients keep working unchanged, the same front fans /v1/audio/* out to the --audio overlay, and a warm fallback can be wired in later (the gateway adds a second backend, with failover, only when one is configured).

model init --fleet --apply        # $HOME/.model-gear/{docker-compose.yml,.env,Dockerfile.gateway}
docker login nvcr.io              # NGC API key for the vLLM image
model fleet up --apply            # builds the gateway image + starts the backend
model fleet status                # container states + gateway /health + /v1/models
curl -s http://localhost:8000/v1/models       # the WARM backend(s) (not the full catalog — see below)
# an unknown/missing model defaults to the primary; route explicitly by name:
curl -s http://localhost:8000/v1/chat/completions -d '{"model":"sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP","messages":[...]}'

The fleet is single-backend by default, so the primary runs at its load-tested solo headroom (PRIMARY_GPU_MEM_UTIL=0.6, full 256K context) on the 128 GB unified memory. To add a warm fallback, wire a vllm-fallback service + the FALLBACK_* env and drop both utils so they sum well under 1.0 (two ~30B NVFP4 models barely co-fit a GB10). model switch is single-model only — change the fleet primary by editing the fleet .env and re-running model fleet up --apply. See model explain fleet / model explain gateway for the routing semantics, and docs/gateway-fleet.md for the full topology.

Per-model notes

Each runtime model has a doc under docs/ recording how to run it, live test results, and caveats:

  • docs/qwen3.6-27b-text-nvfp4-mtp.md — the current runtime model and fleet default primary (sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP), the 27B re-exported with its MTP draft head restored so vLLM speculative decoding (Multi-Token Prediction) works. Text-only (ViT vision tower removed), NVFP4 (modelopt); served with --tokenizer=mmangkad/Qwen3.6-27B-NVFP4. Verified 2026-05-31: ~2.4x single-stream decode (8 → ~19 tok/s) at ~72-79% MTP draft acceptance, tool calling + reasoning confirmed.
  • docs/qwen3.6-27b-nvfp4.md — the archived former primary (mmangkad/Qwen3.6-27B-NVFP4), default primary 0.10.0–0.14.0; load-tested on DGX Spark (~8 tok/s decode, ~7 min warm-up). Kept as a candidate: it is the tokenizer source the MTP primary serves with and the only vision-capable 27B (the MTP primary is text-only).
  • docs/qwen3-32b-nvfp4.md — the dense candidate (nvidia/Qwen3-32B-NVFP4), faster on decode (~9.7 tok/s); swap in via PRIMARY_MODEL / model switch when throughput matters more than context/vision.
  • docs/mistral-small-3.2-24b-nvfp4.md — the dense fallback candidate (RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4); the default fleet's warm fallback in 0.11.0–0.19.x, since removed (the fleet is single-backend by default — two ~30B NVFP4 models don't co-fit a shared GB10). Kept selectable: load-tested 2026-05-30, loads reliably (~15 GiB, ~14.9 tok/s decode), text + tool calls (serve with the mistral tokenizer + images disabled). Wire it back via the opt-in FALLBACK_* fleet config.
  • docs/qwen3.6-35b-a3b-nvfp4.md — the former MoE fallback (mmangkad/Qwen3.6-35B-A3B-NVFP4), now a candidate. It does not load reliably on a GB10 shared with other services, and two ~30B models do not co-reside there — see docs/gateway-fleet.md.

The numbers in each doc come from model switch <model> --apply then model assess (correctness) and model benchmark (throughput). model overview --list lists the catalog (these models) and flags which one is currently served.

What's loaded vs. what's supported

Two questions that look alike but aren't:

  • What's supported (what can I warm up)? — the curated catalog of "gears" model-gear knows how to serve, each tagged load-tested (proven on this box) or configured (declared, not yet proven). It's static — defined in model_gear/catalog.py, shipped in the wheel, unchanged by what's running. Read it with model overview --list or the gateway's GET /v1/models/supported.
  • What's loaded right now? — the model(s) actually in GPU memory this instant (one in single-model mode, two in the fleet). The live source is GET /v1/models (OpenAI-standard); model fleet status queries it. model status / model whoami instead report the model the deployment is configured to serve (from .env) plus container health — normally the same model, but it's configuration (which can be stale), not a live query.
Question CLI HTTP
What can I run? (catalog) model overview --list GET /v1/models/supported
What's loaded right now? model fleet status GET /v1/models
What's the deployment set to serve? model status / model whoami

Mnemonic: the catalog is what's on the menu (and which dishes we've cooked); /v1/models is what's hot now. See docs/gateway-fleet.md.

model-gear is also the deployed agent

model-gear is one identity, not two: it is the repo/tool that serves the model and the local thinking agent deployed on it. The agent's runtime identity lives in AGENTS.md (the acp system prompt) and culture.yaml (suffix: model-gear, backend: acp, model: vllm-local/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP) — the same model-gear that runs the engine consumes it over the acp vllm-local provider.

Acknowledgements

The serve tuning (the flashinfer attention backend, chunked prefill, async scheduling, --max-num-seqs / --max-num-batched-tokens, and the MoE marlin + MTP speculative-decode flags) and the prompt-heavy / decode-heavy / balanced workload shapes follow shahizat's cross-machine NVFP4 benchmark of Qwen3.6-35B-A3B-NVFP4 on DGX Spark, Jetson Thor, and Blackwell 6000 Pro: Benchmark Report: Qwen3.6-35B-A3B-NVFP4 on NVIDIA DGX Spark / Jetson Thor / Blackwell 6000 Pro (NVIDIA Developer Forums, 2026). See docs/tuning-profiles.md.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

model_gear-0.22.0.tar.gz (371.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

model_gear-0.22.0-py3-none-any.whl (134.1 kB view details)

Uploaded Python 3

File details

Details for the file model_gear-0.22.0.tar.gz.

File metadata

  • Download URL: model_gear-0.22.0.tar.gz
  • Upload date:
  • Size: 371.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.22 {"installer":{"name":"uv","version":"0.11.22","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for model_gear-0.22.0.tar.gz
Algorithm Hash digest
SHA256 71cb27bd2b4cda1eac6b6bd362cb302ee96f578524cca0e18e08abf55c77cb1a
MD5 bfc5caccc9dd9e96e545f0908c4c8eb1
BLAKE2b-256 9be44b1383f311b832071d92596b312d100fdb81d67085e232bc651d1f8f703e

See more details on using hashes here.

File details

Details for the file model_gear-0.22.0-py3-none-any.whl.

File metadata

  • Download URL: model_gear-0.22.0-py3-none-any.whl
  • Upload date:
  • Size: 134.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.22 {"installer":{"name":"uv","version":"0.11.22","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for model_gear-0.22.0-py3-none-any.whl
Algorithm Hash digest
SHA256 cabdf14cd2e19f758bf48de1d34b59a6cdf30076e78ce9b5d9547a0492b85fcf
MD5 b99fc464b9c7d15eb6c923e61fbd8826
BLAKE2b-256 854027f256faf001638264591755fd19b86eb851fe5d2ff22da57a0b87928490

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page