
Multi-tier local LLM stack: llama-swap + FastAPI auto-router + opencode wiring.

llmstack — multi-tier local LLM stack for Mac M4 Max / 64 GB

A Cursor-Auto / Claude-tier-style serving setup for local GGUF models, role-aware: coder models for agent work, chat models for planning, with an uncensored chat option for plans that need it.

Each tier can be served by either a local GGUF (default) or a hosted AWS Bedrock model — useful for the top-tier weights that don't fit on a laptop. Both backends share the same auto router, so opencode/curl/Cursor never need to know which one a tier resolves to.

Built on:

  • llama.cpp — inference engine (Metal backend)
  • llama-swap — multi-model process manager + OpenAI-compatible proxy
  • a tiny FastAPI router that adds an auto model with intent-based routing in front of llama-swap (and AWS Bedrock)
client (opencode / curl / Cursor / etc.)
        │
        ▼
  http://127.0.0.1:10101           <-- FastAPI router (llmstack.app)
        │   • model="auto" → classify → rewrite to one of 4 tiers
        │   • everything else → pass-through
        ▼
  http://127.0.0.1:10102           <-- llama-swap (binary, manages model lifecycle)
        │   • loads/unloads llama-server processes per model
        │   • matrix solver allows {code-fast + one heavy model} co-resident
        ▼
  llama-server <code-fast | code-smart | plan | plan-uncensored>
        │
        ▼
  GGUF in ~/.cache/huggingface/hub/...

The whole thing is a pure Python package distributed via standard Python tooling (pip install llmstack, or pip install -e . from this repo). Once installed you get a single llmstack console-script.

Why this design

A 64 GB unified memory M4 Max can comfortably hold one always-on tiny coder + one heavy model simultaneously. We split heavy models by role:

  • Agent work (multi-file edits, tool use, refactors) → coder models, which are trained on tool-call protocols and code edits.
  • Planning (design discussions, architecture, "what's the best approach") → chat-tuned models, which are better at high-level reasoning and don't try to start writing code in response to every message.
  • Uncensored planning → a separate plan-tier model, opted into either by agent (agent.plan-nofilter in opencode) or by an inline [nofilter] trigger in the prompt.

Routing decisions cost ~zero — they're a few regex checks in the FastAPI router, not an LLM call.

Tier mapping

| Alias | Model | Quant | Weights | Context | Temp | Role |
|---|---|---|---|---|---|---|
| code-fast | Qwen2.5-Coder-3B-Instruct | Q5_K_M | ~2.5 GB | 128k (YaRN ×4) | 0.2 | autocomplete, FIM, single-line edits, quick Q&A. Always loaded. |
| code-smart | Qwen3-Coder-Next 80B-A3B (MoE) | Q4_K_M (→ UD-Q4_K_XL) | ~45 GB | 64k | 0.5 | agent mode: multi-file edits, tool calls, refactors, debugging |
| plan | Qwopus GLM 18B Merged | Q4_K_M | ~9 GB | 64k (2× native) | 0.7 | plan mode: design, architecture, trade-off discussions |
| plan-uncensored | Mistral-Small 3.2 24B Heretic (i1) | i1-Q4_K_M (→ i1-Q6_K) | ~13 GB | 128k (native) | 0.85 | plan mode, no filter: when the topic requires it |

Temperature ladder (low → high = "doing" → "thinking"): code-fast 0.2 (deterministic) · code-smart 0.5 (balanced agent) · plan 0.7 (creative ideation) · plan-uncensored 0.85 (max exploration). opencode agent.<name>.temperature is set to match — clients can still override per request.
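
Per-request overrides ride the standard OpenAI request body. A minimal sketch against this stack's router (the endpoint and tier name come from this README; the client code itself is just generic OpenAI-compatible usage):

```python
import requests

# Ask the plan tier directly, but override its default 0.7 temperature
# for this one request. Any OpenAI-compatible client can do the same.
resp = requests.post(
    "http://127.0.0.1:10101/v1/chat/completions",
    json={
        "model": "plan",
        "temperature": 0.3,   # wins over the tier default from models.ini
        "messages": [{"role": "user", "content": "Compare two queueing designs for our ingest path."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```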

How auto decides

The router runs a step-down fidelity ladder: start at the top tier for new / short conversations, drop down as the context grows. This inverts the classic "escalate when input gets big" pattern, and it matches how these models actually behave on this stack:

  • Top-tier hosted (Claude Opus/Sonnet on Bedrock) — fastest and most accurate on short prompts, but per-request latency and dollar cost scale with input tokens, and long-context behaviour degrades faster than headline benchmarks suggest.
  • code-smart (Qwen3-Coder 80B) — 64k window. Sweet spot is the middle of that range; saturates near the top.
  • code-fast (Qwen2.5-Coder 3B + YaRN ×4) — 128k window, always-resident, free. Smaller models lean on explicit context rather than priors, so they tend to improve relative to top-tier as the conversation grows.

First match wins:

| # | Condition | → Model | Reason |
|---|---|---|---|
| 1 | last user msg contains [nofilter], [uncensored], [heretic], or starts with uncensored: / nofilter: | plan-uncensored | explicit opt-in |
| 2 | [ultra] / [opus] / ultra: trigger AND code-ultra tier configured | code-ultra | explicit top-tier opt-in |
| 3 | plan verbs (design, architect, approach, trade-off, should we, explain why, …) AND no code blocks / agent verbs / tools | plan | pure design discussion (orthogonal track) |
| 4 | estimated input ≤ 8 000 tokens | code-ultra (or code-smart if ultra unwired) | top tier — context still being built, latency/$ are best here |
| 5 | estimated input ≤ 32 000 tokens | code-smart | mid-context, local heavy coder is at its sweet spot |
| 6 | otherwise (long context) AND (tools[] OR ≥ 6 turns) | code-smart | floor: 3B model tool-calls unreliably |
| 7 | otherwise (long context) | code-fast | 128k YaRN window + always-resident + free |

Token estimates are chars / 4 over all message text + prompt. The code-ultra rungs (2 and 4) are gated on availability: when no [code-ultra] section is loaded from models.ini, both silently fall back to code-smart so vanilla installs don't 404.
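
Condensed into code, the ladder looks roughly like the sketch below. It is illustrative only: the real signal lists and thresholds live in llmstack/app.py (AGENT_SIGNALS / PLAN_SIGNALS / UNCENSORED_TRIGGERS) and the ROUTER_* env vars; the regexes, function names, and defaults here are simplified stand-ins.

```python
import re

UNCENSORED = re.compile(r"\[(nofilter|uncensored|heretic)\]|^(uncensored|no-?filter):", re.I | re.M)
PLAN_VERBS = re.compile(r"\b(design|architect|approach|trade-?offs?|should we|explain why)\b", re.I)
AGENT_LIKE = re.compile(r"```|\b(refactor|edit|patch|apply|run|test)\b", re.I)

def estimate_tokens(messages: list[dict]) -> int:
    # chars / 4 over all message text, as described above
    return sum(len(m.get("content") or "") for m in messages) // 4

def route(messages: list[dict], has_tools: bool, ultra_available: bool,
          high_ceiling: int = 12_000, mid_ceiling: int = 32_000, multi_turn: int = 6) -> str:
    last_user = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
    tokens = estimate_tokens(messages)
    if UNCENSORED.search(last_user):
        return "plan-uncensored"                                  # 1: explicit opt-in
    if ultra_available and re.search(r"\[(ultra|opus)\]|^ultra:", last_user, re.I | re.M):
        return "code-ultra"                                       # 2: explicit top-tier opt-in
    if PLAN_VERBS.search(last_user) and not AGENT_LIKE.search(last_user) and not has_tools:
        return "plan"                                             # 3: pure design discussion
    if tokens <= high_ceiling:
        return "code-ultra" if ultra_available else "code-smart"  # 4: top tier while context is small
    if tokens <= mid_ceiling:
        return "code-smart"                                       # 5: local heavy coder sweet spot
    if has_tools or len(messages) >= multi_turn:
        return "code-smart"                                       # 6: floor for tool calls / long loops
    return "code-fast"                                            # 7: long context, always-resident, free
```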

opencode integration

llmstack install generates an opencode config at <work-dir>/.llmstack/opencode.json (derived from models.ini), where <work-dir> is whatever directory you ran llmstack from (or $LLMSTACK_WORK_DIR). You can cd into any project and run llmstack install to get a project-local config there. The script also copies AGENTS.md next to the generated JSON, so the .llmstack/ folder is a self-contained opencode bundle. Your global ~/.config/opencode/opencode.json is never modified by this stack.

opencode picks up our config because llmstack start (and llmstack shell) drop you into a subshell with these env vars exported:

Env var Value
OPENCODE_CONFIG <work-dir>/.llmstack/opencode.json (overrides global, sits below project configs)
LLMSTACK_CHANNEL current, next, or external (thin client of an llmstack router, see below)
LLMSTACK_ACTIVE 1 (used to refuse recursive entry)
LLMSTACK_ROOT absolute path to the installed llmstack package

The router and llama-swap daemons are singletons on ports 10101 and 10102. The channel is pinned at install time in .llmstack/default-channel and never auto-detected at runtime — one project on the host owns the daemons (installed local), and any other project on the same host that wants to consume them is installed --external (defaulting to http://127.0.0.1:10101). This avoids the footgun where a "shared" project's stop would tear down daemons it can't bring back up.

The shell's prompt is prefixed with [llmstack:<channel>] so you always know whether you're in the env or not. Bash and zsh source your normal rc first, then add the prefix; other shells just get the env vars.

Inside the subshell, run opencode and it will pick up the wiring below. Outside the subshell (any other terminal), opencode keeps using your global setup unchanged.

| opencode agent | Local model |
|---|---|
| default model | llama.cpp/auto (router-routed) |
| small_model (titles, tasks, tab autocomplete) | llama.cpp/code-fast |
| agent.build (default builder) | llama.cpp/code-smart |
| agent.plan (read-only planner) | llama.cpp/plan |
| agent.plan-nofilter (custom uncensored planner) | llama.cpp/plan-uncensored |

Inside opencode you can switch agents with /agent or by @-mentioning a custom one (e.g. @plan-nofilter). The slash commands /review and /nofilter are also available.

Want a second terminal into the same stack? Install the activate hook once (eval "$(llmstack activate zsh)") and any new shell that cds into the project picks up OPENCODE_CONFIG automatically. Want to run opencode without the hook? OPENCODE_CONFIG=$PWD/.llmstack/opencode.json opencode from any directory you previously ran install in.

Layout

opencode/                       # repo root
├── pyproject.toml              # package metadata + `llmstack` console script
├── README.md                   # this file
├── UPGRADING.md                # how to swap any tier for a newer/better model
│                                  + how to upgrade the Python toolchain itself
├── models.ini                  # SINGLE SOURCE OF TRUTH for tiers + sampler
└── llmstack/                   # the python package (importable, installable)
    ├── __init__.py
    ├── __main__.py             # `python -m llmstack`
    ├── cli.py                  # arg dispatch (the `llmstack` console-script)
    ├── paths.py                # state / bin / work dir resolution + env overrides
    ├── shell_env.py            # spawn the env-prepared subshell + activate hooks
    ├── app.py                  # FastAPI auto-router (~280 lines)
    ├── tiers.py                # parse models.ini -> Tier dataclasses
    ├── check_models.py         # snapshot tool (HF metadata + drift check)
    ├── AGENTS.md               # opencode agent template (shipped as package data)
    ├── generators/
    │   ├── llama_swap.py       # render llama-swap.yaml from models.ini
    │   └── opencode.py         # render opencode.json from models.ini
    ├── download/
    │   ├── ggufs.py            # background GGUF downloader
    │   └── binary.py           # llama-swap release downloader
    └── commands/               # one module per CLI action
        ├── setup.py            # first-time walkthrough
        ├── install.py          # generate opencode.json (+ AGENTS.md copy)
        ├── install_llama_swap.py
        ├── download.py
        ├── start.py
        ├── shell.py
        ├── stop.py
        ├── restart.py
        ├── status.py
        ├── check.py
        └── activate.py

Per-project state (gitignored) is created lazily under <work-dir>/.llmstack/:

.llmstack/
├── opencode.json          consumed via OPENCODE_CONFIG (written by `install`)
├── AGENTS.md              copy of the package template (written by `install`)
├── llama-swap.yaml        generated runtime config (written by `start`)
├── default-channel        pinned by `llmstack install`
├── active-channel         written by `llmstack start`, removed by `stop`
├── llama-swap.pid         daemon pid files
├── router.pid
├── llmstack.bashrc        prompt-prefix rcfile (bash)
├── zdotdir/               prompt-prefix rcfile (zsh)
└── logs/
    ├── llama-swap.log
    ├── router.log
    └── dl-*.log

The llama-swap binary lives outside any project at $XDG_DATA_HOME/llmstack/bin/llama-swap (override with LLMSTACK_BIN_DIR). One download is reused across all projects.

Quick start

Everything runs through one entry point: llmstack <action>. Run llmstack help to see all actions and options.

# 0. Install the package (editable, from this repo).
python3 -m venv .venv
.venv/bin/pip install -e .

# 1. (Recommended) raise GPU-wired memory to fit code-fast + code-smart together.
sudo sysctl iogpu.wired_limit_mb=57344

# 2. Full setup: download GGUFs, wait, install the llama-swap binary, print
#    the activation hook, check opencode is on PATH. Stepwise & idempotent;
#    re-running it later is safe.
llmstack setup

# 3. Generate this project's .llmstack/opencode.json (+ AGENTS.md copy).
#    `install` does NOT touch llama-swap.yaml -- that's regenerated
#    fresh by `start` for the channel you're booting into.
llmstack install

# 4. Generate .llmstack/llama-swap.yaml for the chosen channel, bring up
#    llama-swap + router. With the activate hook installed (see below),
#    your prompt is already wired to .llmstack/opencode.json -- just run
#    `opencode`. Without the hook, `start` falls back to spawning a
#    subshell with OPENCODE_CONFIG set, prefixed with [llmstack:current].
#    Daemons keep running when you exit; stop them with `llmstack stop`.
llmstack start

# 4a. Daemons only (no fallback subshell, return immediately).
llmstack start --detach

# 4b. Want auto-activation in any new terminal you cd into? Install once:
eval "$(llmstack activate zsh)"
# add the same line to ~/.zshrc to make it stick.

# 5. Sanity check (works from any terminal)
llmstack status
curl -s http://127.0.0.1:10101/v1/models | jq '.data[].id'
curl -s http://127.0.0.1:10101/models.ini | head    # what thin clients see

To stop everything: llmstack stop.

Windows

The CLI runs the same way on Windows (PowerShell or cmd.exe); the only moving parts that differ are the binary asset and the activation hook.

# 0. Install the package (editable, from this repo).
py -3 -m venv .venv
.venv\Scripts\pip install -e .

# 1. Pull GGUFs + the windows_amd64 llama-swap binary (lives under
#    %LOCALAPPDATA%\llmstack\bin\llama-swap.exe).
.venv\Scripts\llmstack setup

# 2. Generate this project's .llmstack\opencode.json (+ AGENTS.md copy).
.venv\Scripts\llmstack install

# 3. Generate .llmstack\llama-swap.yaml for the chosen channel, bring up
#    the stack. If you've installed the activate hook (step 4) the
#    current shell is already wired to .llmstack\opencode.json; otherwise
#    `start` falls back to spawning a PowerShell subshell.
.venv\Scripts\llmstack start

# 4. Auto-activate per project from any new PowerShell window. The hook
#    file is a .ps1 (PowerShell won't dot-source it without that
#    extension) and dot-sourcing it requires script execution to be
#    allowed -- if you see "running scripts is disabled on this
#    system", run once:
#       Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned
llmstack activate powershell | Out-String | Invoke-Expression
# or persist (writes ~/.powershell_llmstack_hook.ps1 + sources it on every shell):
"llmstack activate powershell | Out-String | Invoke-Expression" | Add-Content $PROFILE

Notes:

  • Only windows_amd64 llama-swap binaries are published upstream; arm64 Windows is not supported. GPU acceleration uses whatever backend llama-server was built with (CUDA / Vulkan / CPU) -- get llama-server.exe from the llama.cpp Windows releases or a package like winget install ggml.llama-cpp and put it on PATH (or set $env:LLAMA_SERVER_BIN). The Mac-only iogpu.wired_limit_mb step does not apply.
  • The [llmstack:<channel>] prompt prefix shows up in PowerShell too; cmd.exe gets a simpler [llmstack:<channel>] prompt via doskey.
  • Stopping daemons uses taskkill /T /F under the hood, so the llama-server children get cleaned up as well.

Thin-client mode (--external)

llmstack install --external [URL] wires this project as a thin client of an llmstack router — no llama-swap, no router, no GGUFs needed locally, and no local models.ini. The thin-client install:

  1. Fetches GET URL/models.ini live from the router (this also doubles as the health check — a 200 with valid INI proves the router is up).
  2. Renders opencode.json against the fetched content so tier names + descriptions agree with what the router actually serves.
  3. Pins .llmstack/default-channel = "external <url>" so subsequent commands know they're in client mode.

There is no client-side cache: every install re-fetches. To pick up a tier edit on the router, just re-run llmstack install here.

URL precedence at install time: --external <url> arg > $LLMSTACK_REMOTE_URL env var > the local router (http://127.0.0.1:10101). You normally don't set the env var yourself — the activate hook does it for you when you cd into an external-installed project (see below).

Two flavours of the same mode:

Same host, two projects. One project owns the daemons (local install), the others are thin clients of localhost. Zero config:

# project A — owns the daemons
cd ~/projA && llmstack install && llmstack start

# project B — consumes them
cd ~/projB && llmstack install --external
                              # baseURL = http://127.0.0.1:10101/v1
                              # default-channel = "external http://127.0.0.1:10101"
                              # (no local models.ini -- fetched from project A's router)
llmstack start                # verifies /models.ini, drops into the client subshell

Different host. Point at a beefy desktop's router from a laptop:

# laptop -> desktop running llmstack on 10.0.0.5
llmstack install --external http://10.0.0.5:10101
llmstack start               # verifies http://10.0.0.5:10101/models.ini
opencode                     # talks straight to the remote router

(LLMSTACK_REMOTE_URL=http://10.0.0.5:10101 llmstack install also works — the env var is honoured as an alternative way in.)

The URL is persisted into the channel marker, so any new terminal you open with the activate hook installed (eval "$(llmstack activate zsh)") will re-export LLMSTACK_REMOTE_URL automatically when you cd into the project. The prompt is medium-purple with the URL: [llmstack:<project> http://10.0.0.5:10101]. From inside that activated shell, llmstack install re-fetches models.ini without needing the flag or URL again.

The commands that manage local resources (setup, download, install-llama-swap) refuse to run when the project is installed --external. stop is a no-op (nothing local to tear down) — to stop the daemons themselves, run llmstack stop from the project that owns them (the one installed local).

Auto-activate per project

llmstack activate <shell> writes the hook to ~/.<shell>_llmstack_hook and prints a source line to stdout, so a single eval both regenerates the file and turns the hook on in your current shell. Pasting the same eval into your rc keeps it on for every new shell:

# ~/.zshrc (zsh)
eval "$(llmstack activate zsh)"

# or ~/.bashrc (bash)
eval "$(llmstack activate bash)"

With the hook installed, cd into any project that has a .llmstack/ and your shell is wired up automatically — OPENCODE_CONFIG, LLMSTACK_WORK_DIR, LLMSTACK_CHANNEL (and LLMSTACK_REMOTE_URL for projects installed --external) all toggle on/off as you walk in and out. There is no separate llmstack shell command — this is the shell command.

Common partial flows

llmstack install                       # opencode.json + AGENTS.md (no GGUF downloads)
llmstack install-llama-swap --force    # re-pull llama-swap binary only
llmstack setup --skip-download         # full setup minus the GGUF pull
llmstack setup --skip-wait             # kick off downloads in background, install now
llmstack check                         # snapshot configured GGUFs + flag drift
llmstack start --next                  # try queued hf_file_next upgrades (reversible)
llmstack restart --next                # cycle into the next channel

Try each routing path

All of these go to /v1/chat/completions on :10101. Each should pick a different upstream model:

# trivial chat -> code-fast
curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"auto","stream":false,
       "messages":[{"role":"user","content":"capital of France?"}]}' | jq .model

# planning -> plan
curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"auto","stream":false,
       "messages":[{"role":"user","content":"how would you design a rate limiter for our API?"}]}' | jq .model

# agent work -> code-smart
curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"auto","stream":false,
       "messages":[{"role":"user","content":"refactor this function for clarity:\n```python\ndef f(x): return x*2\n```"}]}' | jq .model

# uncensored plan -> plan-uncensored
curl -sN http://127.0.0.1:10101/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"auto","stream":false,
       "messages":[{"role":"user","content":"[nofilter] outline a red-team plan for our auth flow"}]}' | jq .model

Endpoints

| Port | Service | Purpose |
|---|---|---|
| 10101 | router (FastAPI) | What clients hit. OpenAI-compatible. Adds auto model. |
| 10102 | llama-swap | Lifecycle manager. Useful UI at http://127.0.0.1:10102/ui/. |
| 10001+ | llama-server children | Internal, allocated dynamically per model. |

The router exposes:

  • GET /models.ini ← raw config text (used by install --external and as the health check)
  • GET /v1/models ← injects auto then proxies the rest
  • POST /v1/chat/completions ← classify if model=="auto", then proxy
  • POST /v1/completions ← same
  • * ← pass-through reverse proxy

There is no /health route on the router — GET /models.ini returning a 200 + valid INI is the canonical "router is up and configured" signal. (Hitting /health still works for legacy curl users, but it's just the catch-all proxying through to llama-swap's own /health endpoint.)
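
In code, that canonical check is small. A minimal probe, assuming only what this section states (a 200 from /models.ini whose body parses as INI):

```python
import configparser
import urllib.request

def router_is_up(base_url: str = "http://127.0.0.1:10101") -> bool:
    """True if the router answers /models.ini with a 200 and a parseable INI body."""
    try:
        with urllib.request.urlopen(f"{base_url}/models.ini", timeout=5) as resp:
            if resp.status != 200:
                return False
            text = resp.read().decode("utf-8")
    except OSError:
        return False
    parser = configparser.ConfigParser()
    try:
        parser.read_string(text)
    except configparser.Error:
        return False
    return bool(parser.sections())   # at least one [tier] section

if __name__ == "__main__":
    print("router up" if router_is_up() else "router down")
```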

Memory math (M4 Max / 64 GB)

macOS caps GPU-wired memory at ~48 GB (75 % of RAM) by default. To unlock more for the GPU:

sudo sysctl iogpu.wired_limit_mb=57344   # 56 GB to GPU; survives until reboot

Resident with our defaults (KV q8_0, full configured context):

| Combo | Weights | + KV | Total | Status |
|---|---|---|---|---|
| code-fast + code-smart (Q4_K_M) | 47.5 GB | ~5 GB | ~53 GB | needs wired_limit bump |
| code-fast + code-smart (UD-Q4_K_XL) | ~52 GB | ~5 GB | ~57 GB | needs wired_limit bump |
| code-fast + plan | 11.5 GB | ~4.5 GB | ~16 GB | trivial |
| code-fast + plan-uncensored | 15.5 GB | ~12.5 GB | ~28 GB | trivial |
| code-fast + plan + plan-uncensored | ~25 GB | ~14.5 GB | ~40 GB | both chats together |
| code-smart + plan-uncensored | 58 GB | - | - | matrix forbids |

KV cache only fills up as context grows — these are worst-case numbers at the configured max context. Typical usage will be far less.
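
If you want to reproduce the KV columns for a different context size, the arithmetic is mechanical. A back-of-envelope sketch: the q8_0 size per element is a known constant, but the layer / head / dimension numbers below are placeholders, not the real values for these models (read them from each GGUF's metadata, e.g. the llama-server load log):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_tokens: int, bytes_per_elem: float = 1.0625) -> float:
    """Worst-case KV cache: keys + values for every layer at full context.
    q8_0 packs 32 values into 34 bytes, i.e. ~1.0625 bytes per element."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # K and V
    return per_token * ctx_tokens / 1024**3

# Placeholder dimensions for illustration only.
print(f"{kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, ctx_tokens=64_000):.1f} GiB")
```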

The matrix declares which combinations are valid. When you ask for a model that isn't currently loadable, the solver picks the cheapest set to swap into.

Upgrading quants after downloads finish

All three pre-queued upgrades are same-model, higher-quant — drop-in replacements with no behaviour change beyond quality.

Logs are named dl-<tier>-<label>.log where <label> is current (file in models.ini hf_file) or next (file in models.ini hf_file_next).

| When this log shows EOF (download done) | …edit the -hff line in this tier of llama-swap.yaml | …to |
|---|---|---|
| logs/dl-code-smart-next.log | code-smart | Qwen3-Coder-Next-UD-Q4_K_XL.gguf |
| logs/dl-plan-next.log | plan | Qwopus-GLM-18B-Healed-Q6_K.gguf |
| logs/dl-plan-uncensored-next.log | plan-uncensored | Mistral-Small-3.2-24B-Instruct-2506-ultra-uncensored-heretic.i1-Q6_K.gguf |

The -hf <repo> lines stay the same; only the -hff <filename> line changes. After editing, also flip hf_file → hf_file_next in models.ini so llmstack check no longer reports DRIFT!.

Then llmstack restart.

For changing to a different model entirely (different family/provider) see UPGRADING.md.

Tuning the router

All knobs are env vars; defaults are picked up by llmstack start.

| Env var | Default | Meaning |
|---|---|---|
| LLAMA_SWAP_URL | http://127.0.0.1:10102 | upstream llama-swap |
| ROUTER_FAST_MODEL | code-fast | long-context (>= mid ceiling) → here |
| ROUTER_AGENT_MODEL | code-smart | mid-context + tools/loop floor → here |
| ROUTER_ULTRA_MODEL | code-ultra | short-context top tier → here (gated on availability) |
| ROUTER_PLAN_MODEL | plan | design/discussion verbs → here |
| ROUTER_UNCENSORED_MODEL | plan-uncensored | [nofilter] triggers → here |
| ROUTER_HIGH_FIDELITY_CEILING | 12000 | tokens; at or below this, route to top tier (ultra → smart fallback). Paired with code-ultra.ctx_size = 24000 (2×). |
| ROUTER_MID_FIDELITY_CEILING | 32000 | tokens; at or below this, route to code-smart; beyond, step down to code-fast. Paired with code-smart.ctx_size = 64000 (2×). |
| ROUTER_MULTI_TURN | 6 | turn count that floors the long-context rung at code-smart |
| ROUTER_HOST / ROUTER_PORT | 127.0.0.1 / 10101 | listen address |
| LOG_LEVEL | info | router log level |

To force a request to never auto-route, set model to a concrete alias (code-fast, code-smart, plan, plan-uncensored, or any of their listed aliases like agent, glm, nofilter, …).

Triggering uncensored mode

Two ways:

  1. Explicit agent in opencode: /agent plan-nofilter (or mention it).
  2. Inline trigger in any auto-routed message — anywhere in the most recent user turn:
    • [nofilter], [uncensored], [heretic]
    • or a line starting with uncensored: / nofilter: / no-filter:

Triggers are only checked on the latest user message and the system prompt, so an old [nofilter] further up the conversation won't pin the whole session.
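
That scoping rule, as a sketch (the trigger list is the one above; the helper name and message shape are illustrative):

```python
import re

TRIGGERS = re.compile(
    r"\[(nofilter|uncensored|heretic)\]"        # inline bracket triggers
    r"|^\s*(uncensored|no-?filter):",           # or a line starting with uncensored: / nofilter: / no-filter:
    re.IGNORECASE | re.MULTILINE,
)

def wants_uncensored(messages: list[dict]) -> bool:
    # Only the latest user turn and the system prompt are scanned, so an old
    # [nofilter] further up the conversation never pins the whole session.
    last_user = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    return bool(TRIGGERS.search(last_user) or TRIGGERS.search(system))
```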

Troubleshooting

llama-swap won't start → check .llmstack/logs/llama-swap.log. Most common causes: port 10102 already in use, or a typo in llama-swap.yaml.

First request hangs for ~60 s → that's the model loading from disk into Metal memory. sendLoadingState: true will surface "loading…" in the SSE stream. After it's loaded subsequent requests are instant.

OOM / unexplained slowdown → run top -o mem -stats pid,rsize,command to see what's resident. The matrix should prevent two heavy models loading together; if it somehow happens, llmstack restart.

Auto picks the wrong model → adjust the regex in llmstack/app.py (AGENT_SIGNALS / PLAN_SIGNALS / UNCENSORED_TRIGGERS) or move the ladder ceilings via ROUTER_HIGH_FIDELITY_CEILING / ROUTER_MID_FIDELITY_CEILING. To force a request to never auto-route, pass an explicit model (e.g. code-smart) instead of auto.

Want a pure pass-through (no auto routing) → change opencode's baseURL to http://127.0.0.1:10102/v1 (llama-swap directly) and only use concrete model names. (Note: this skips the bedrock dispatcher; only GGUF tiers will be reachable.)

Hosted tiers via AWS Bedrock

Any tier in models.ini that declares aws_model_id = ... is served from AWS Bedrock instead of llama-swap. The same tier names + auto-routing apply, so swapping code-smart from a local GGUF to Claude on Bedrock is a models.ini edit + llmstack install + llmstack restart away — clients don't change.

[code-smart]
role         = agent
aws_model_id = anthropic.claude-sonnet-4-5-20250929-v1:0
aws_region   = us-west-2
aws_profile  = bedrock-prod          ; named profile in ~/.aws/config
ctx_size     = 200000
sampler      = temp=0.5    ; Sonnet 4.5 accepts ONE of temp / top_p
description  = Claude Sonnet 4.5 on Bedrock - heavy coder for agent loops

Sampler is per-tier, declared in models.ini, applied per backend. opencode.json is intentionally sampler-free in both cases — clients just specify a model. How the sampler reaches the actual inference engine depends on the backend:

  • gguf tiers — the llama-swap generator bakes each tier's sampler = … keys into its llama-server startup command line as --temp / --top-p / --top-k / --min-p / --repeat-penalty flags. llama-server applies them as its defaults for every request. The router doesn't touch the body.
  • Bedrock tiers — Bedrock has no server-side defaults mechanism, so the router injects the sampler keys into each outbound request body (mapping temp → temperature, top_p → topP; the other llama.cpp-extension keys top_k / min_p / rep_pen are silently dropped because Converse doesn't accept them). Caller-supplied values in the request body still win for per-call overrides (sketched below).
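
A sketch of that Bedrock-side mapping (the sampler key names come from models.ini and the inferenceConfig field names are Converse's; the function itself is illustrative, not the router's actual code):

```python
def bedrock_inference_config(tier_sampler: dict, request_body: dict) -> dict:
    """Build a Converse inferenceConfig from the tier's models.ini sampler keys.
    Caller-supplied values in the OpenAI-style request body win over tier defaults;
    llama.cpp-only keys (top_k, min_p, rep_pen) are dropped -- Converse has no equivalent."""
    config = {}
    temperature = request_body.get("temperature", tier_sampler.get("temp"))
    top_p = request_body.get("top_p", tier_sampler.get("top_p"))
    if temperature is not None:
        config["temperature"] = float(temperature)
    if top_p is not None:
        config["topP"] = float(top_p)
    if request_body.get("max_tokens") is not None:
        config["maxTokens"] = int(request_body["max_tokens"])
    return config
```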

Per-Bedrock-family rules (declare only what your Bedrock model accepts):

| Bedrock model family | What sampler may contain |
|---|---|
| Claude Opus 4.7+ | (omit sampler = entirely — Opus 4.7 rejects all sampler params) |
| Claude Sonnet 4.5 / Haiku 4.5 | temp or top_p, never both |
| Claude Opus 4.x (4.1, 4.5, 4.6) | temp and/or top_p |
| Llama / Titan / Cohere / etc. | temp and/or top_p (check the model card) |

Local gguf tiers accept the full set (temp, top_p, top_k, min_p, rep_pen) — llama-server honours all of them as startup defaults.

models.ini is meant to be committable, so it only names a profile. Credentials, SSO, role chaining, MFA — everything boto3 normally handles — live in the standard AWS shared config:

aws configure --profile bedrock-prod        # static keys
aws configure sso --profile bedrock-prod    # SSO

# role chaining: edit ~/.aws/config, add a profile like
# [profile bedrock-planning]
# role_arn       = arn:aws:iam::123456789012:role/llmstack-bedrock
# source_profile = bedrock-prod
# region         = us-east-1

Then reference the profile by name from each tier. Different tiers can point at different profiles, so two tiers can live in different accounts/regions cleanly:

| Key (in models.ini) | Meaning |
|---|---|
| aws_model_id | Bedrock model ID (anthropic.claude-..., meta.llama3-1-..., etc.). Required. |
| aws_region | Region the tier lives in. Falls back to the profile's region / AWS_REGION / default chain. |
| aws_profile | Named profile in ~/.aws/config / ~/.aws/credentials. Omit for boto3's default chain (env vars, default profile, instance role). |
| aws_endpoint_url | Custom Bedrock endpoint (VPC endpoint, FedRAMP, etc.). |
| aws_model_id_next (+ optional aws_region_next) | Queued upgrade target. Mirrors gguf hf_file_next: llmstack start --next swaps the tier to this model id (and region, if set) until you switch back; permanent promotion is an aws_model_id edit + llmstack install. |
| backend = bedrock | Optional explicit override; auto-detected from aws_model_id. |

Banned in models.ini (parse-time error): aws_access_key_id, aws_secret_access_key, aws_session_token, aws_role_arn, aws_role_session_name. Put them in ~/.aws/credentials or ~/.aws/config under a named profile and reference the profile.

Internally the router builds one bedrock-runtime client per distinct (profile, region, endpoint) tuple, cached for the life of the process. Credential refresh (SSO token rotation, role re-assumption, IMDS) is handled by boto3 transparently.
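
A sketch of that per-tuple cache (the boto3 calls are standard; the cache shape and function name are illustrative):

```python
from functools import lru_cache

import boto3

@lru_cache(maxsize=None)
def bedrock_client(profile: str | None, region: str | None, endpoint_url: str | None):
    """One bedrock-runtime client per (profile, region, endpoint) tuple, reused for the
    life of the process. boto3 refreshes SSO tokens / assumed roles / IMDS underneath."""
    session = boto3.Session(profile_name=profile, region_name=region)
    return session.client("bedrock-runtime", endpoint_url=endpoint_url)

# e.g. the [code-smart] example tier above:
client = bedrock_client("bedrock-prod", "us-west-2", None)
```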

Install the AWS SDK (it's an opt-in extra so the local-only path stays small):

pip install -e '.[bedrock]'

The router translates OpenAI chat/completions to Bedrock Converse (text + tool calls; streaming and non-streaming both supported) and streams the response back as standard OpenAI SSE. Multimodal inputs are text-only for now.

Hosted tiers are skipped by llmstack download (nothing to fetch) and by the llama-swap.yaml matrix (nothing to load). They show up in llmstack check with the model id + region (and a next row when aws_model_id_next is set) instead of HF metadata, and in /v1/models alongside the local GGUF tiers — including a channel: current|next metadata field so clients can tell which model id they're actually talking to.

llmstack start --next flips both backends in lock-step: gguf tiers swap to hf_file_next and bedrock tiers swap to aws_model_id_next (the router subprocess is launched with LLMSTACK_USE_NEXT=1). Either backend having a queued upgrade is enough to satisfy --next.

logs/dl-*.log is multi-GB and growing → you're hitting llama.cpp issue #14802 where modern llama-cli is chat-only and ignores -no-cnv, looping > prompts forever (~1.5 MB/s). Fix: llmstack download already prefers llama-completion over llama-cli when both are present (brew install llama.cpp ships both as of 2025). If you only have legacy llama-cli, either upgrade llama.cpp or kill the runaways with pkill -9 -f llama-cli.

Replacing a model with a newer/better one

See UPGRADING.md — covers why models must be GGUF, where to find candidates, how to evaluate "better" per tier, the safe upgrade workflow, and a worked example. Run llmstack check for a snapshot of what's currently configured along with HF URLs to compare against.
