tether
Local Anthropic-compatible LLM server for Claude Code,
backed by mlx-lm / mlx-vlm on Apple Silicon.
One command launches the proxy and opens Claude Code in the same
terminal — inspired by ollama launch claude.
Requirements
- macOS on Apple Silicon
- Python 3.12+
- claude on PATH
Install
From PyPI (the distribution is named tetherd because the name tether
was already taken, but it still installs the tether command):
pipx install tetherd # recommended — isolated venv, `tether` on PATH
# or
pip install tetherd
From GitHub (tracks main):
pipx install "git+https://github.com/ryank1m/tether.git"
For development (editable checkout):
git clone https://github.com/ryank1m/tether.git
cd tether
python3.12 -m venv .venv
.venv/bin/pip install -e ".[dev]"
All forms expose the tether console script and pull in mlx-vlm
and litellm as dependencies.
Getting a model
tether never downloads weights itself. Place a model directory
anywhere under ~/.tether/models/ (any folder containing a
config.json is auto-discovered) and pass --model <name> or pick
it interactively:
# Option A: download from HuggingFace (needs network)
huggingface-cli download mlx-community/gemma-4-e2b-it-4bit \
  --local-dir ~/.tether/models/gemma-4-e2b-it-4bit
# Option B: copy an already-downloaded model directory
# into ~/.tether/models/ by any means you like
# (scp, rsync, USB drive, sneakernet — tether only
# cares that config.json is present).
Usage
Launch Claude Code against a local MLX model — this is the default path, no extra terminals or env vars needed:
tether --model mlx-community/gemma-3-4b-it-4bit
Under the hood tether loads the model, brings up a LiteLLM
Anthropic-compatible proxy on 127.0.0.1:8080, waits for it to be
ready, and then spawns claude as a child process with
ANTHROPIC_BASE_URL, ANTHROPIC_DEFAULT_{OPUS,SONNET,HAIKU}_MODEL,
CLAUDE_CODE_SUBAGENT_MODEL, and friends wired up automatically.
Ctrl-C exits Claude Code and cleans up the proxy.
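
For reference, the wiring is roughly what you would do by hand with two
terminals (an illustrative sketch; the mlx-local alias matches the curl
example further down, and tether also sets the OPUS/HAIKU defaults and
CLAUDE_CODE_SUBAGENT_MODEL the same way):

# terminal 1: proxy only
tether --serve-only --model mlx-community/gemma-3-4b-it-4bit
# terminal 2: point claude at the proxy manually
ANTHROPIC_BASE_URL=http://127.0.0.1:8080 \
ANTHROPIC_DEFAULT_SONNET_MODEL=mlx-local \
claude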
Drop models into ~/.tether/models/ and pick one interactively:
tether
# ↑/↓ or j/k · Enter to select · q/Esc to cancel
Local models (no HuggingFace involvement)
Passing an HF repo id (mlx-community/gemma-4-e2b-it-4bit) causes
mlx_vlm.load to revalidate the cached files against HF on every
start. To take HuggingFace out of the loop entirely, download once
and pass the directory directly:
huggingface-cli download mlx-community/gemma-4-e2b-it-4bit \
  --local-dir ~/.tether/models/gemma-4-e2b-it-4bit
tether --model ~/.tether/models/gemma-4-e2b-it-4bit
The TUI picker auto-discovers any subdirectory of ~/.tether/models/
that contains a config.json, so a model you download there shows up
automatically when you run tether with no --model.
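
The discovery rule is small enough to state as code; a minimal sketch
in Python (illustrative, not the actual tether source):

from pathlib import Path

def discover_models(root: Path = Path.home() / ".tether" / "models") -> list[Path]:
    # A model is any directory under root that contains a config.json.
    if not root.is_dir():
        return []
    return sorted(p.parent for p in root.rglob("config.json"))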
Forward arguments to claude with a trailing --:
tether --model mlx-community/gemma-3-4b-it-4bit -- --resume
List local models and exit:
tether --list
Config directory
tether reads every *.toml file in ~/.tether/config/ at
startup. On first run it creates
~/.tether/config/config.toml with sensible defaults (including
a 64k-token context cap) and a commented template for every field.
It will never overwrite that file again — edit it freely.
~/.tether/config/
├── config.toml # auto-created, always loaded first
├── 10-m4pro.toml # optional drop-in (per-machine tweaks)
└── 20-work.toml # optional drop-in (project-specific)
All files are merged on top of config.toml in alphabetical
order by filename. The NN-name.toml numeric-prefix convention
(same idea as systemd drop-ins) is the recommended way to control
ordering. Dotfiles, non-.toml files, and subdirectories are
ignored.
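
The merge itself fits in a few lines of Python (a sketch of the
documented behaviour; it assumes a shallow top-level merge, and
tether's own nested-table semantics may differ):

import tomllib
from pathlib import Path

def load_config(confdir: Path = Path.home() / ".tether" / "config") -> dict:
    base = confdir / "config.toml"
    merged = tomllib.loads(base.read_text()) if base.is_file() else {}
    # Drop-ins: every other *.toml, dotfiles excluded, alphabetical order.
    for p in sorted(confdir.glob("*.toml")):
        if p == base or p.name.startswith("."):
            continue
        merged.update(tomllib.loads(p.read_text()))  # later files win
    return merged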
Precedence (highest → lowest):
- CLI flags (--max-context 4096, --memory-cap 18GB, …)
- Alphabetically-latest drop-in
- Earlier drop-ins
- config.toml
- Built-in defaults
Every option can also be set from the command line for one-off
overrides — see tether --help.
Example drop-in for a 24 GB M4 Pro running the 26B MoE variant:
# ~/.tether/config/10-m4pro.toml
[model]
default = "gemma-4-26b-a4b-it-4bit"
memory_cap = "18GiB"
max_context = 32000
Flags
| flag | default | purpose |
|---|---|---|
| --model PATH_OR_REPO | — | explicit model; skips the picker |
| --host | 127.0.0.1 | bind host for the proxy |
| --port | 8080 | bind port for the proxy |
| --models-dir | ~/.tether/models | override discovery dir |
| --list | — | list local models and exit |
| --serve-only / --no-claude | — | run only the proxy, do not launch claude |
| --claude-path | which claude | explicit path to the claude binary |
| --memory-cap | from config | hard MLX memory ceiling, e.g. 18GB, 20GiB |
| --max-context | from config (64k) | truncate history so the prompt stays under N tokens |
| --log-level | warning | uvicorn log level |
| -- … | — | everything after -- is forwarded to claude |
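
For example, pinning a local model directory while tightening the
runtime limits for one session:

tether --model ~/.tether/models/gemma-4-e2b-it-4bit --memory-cap 18GB --max-context 32000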
Advanced: standalone proxy
If you want to point a non–Claude Code client at the proxy, or run it
under systemd / launchd, use --serve-only:
tether --serve-only --model mlx-community/gemma-3-4b-it-4bit
Then point any Anthropic-Messages-compatible client at
http://127.0.0.1:8080. Quick curl check:
curl -s -X POST http://127.0.0.1:8080/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'anthropic-version: 2023-06-01' \
  -d '{"model":"mlx-local","max_tokens":50,
       "messages":[{"role":"user","content":"Say pineapple."}]}'
The proxy registers a wildcard route, so any model name in the request body routes to the loaded MLX model.
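
The same check from Python, using the official anthropic SDK
(pip install anthropic; the API key only has to be non-empty, since
the proxy does no auth):

import anthropic

# Point the official SDK at the local proxy.
client = anthropic.Anthropic(base_url="http://127.0.0.1:8080", api_key="local")
msg = client.messages.create(
    model="mlx-local",  # any name works; the wildcard route maps it to the loaded model
    max_tokens=50,
    messages=[{"role": "user", "content": "Say pineapple."}],
)
print(msg.content[0].text)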
Known limitations (v0.1)
- Token usage counts are zero. mlx_vlm.generate is not yet wired up to return prompt/completion token counts through to the response.
- Single session. One shared prompt cache; two concurrent Claude Code sessions with different system prompts will thrash the cache. Fine for a single-user local server.
- No auth / TLS. Bind is 127.0.0.1 only.
- Apple Silicon + macOS only.
Architecture
See plan/plan-option-b.md (proxy + custom provider),
plan/plan-unified-launch.md (single-terminal launch), and
plan/plan-tool-use.md (Gemma 4 tool-call wiring). Request flow:
tether
├── uvicorn (daemon thread) ──► LiteLLM ──► MLXProvider ──► mlx_lm
└── subprocess: claude (foreground, owns TTY)
│
└── POST /v1/messages → proxy thread
Everything lives in one process tree. The model is loaded once at startup (eagerly, so load failures surface before Claude Code launches) and reused across every subsequent call.
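
In code, the launch sequence is roughly the following (an illustrative
sketch, not the actual source; app stands in for the LiteLLM ASGI
application):

import os
import socket
import subprocess
import threading
import time

import uvicorn

# `app` is a stand-in for the proxy's ASGI application.
server = uvicorn.Server(uvicorn.Config(app, host="127.0.0.1", port=8080, log_level="warning"))
threading.Thread(target=server.run, daemon=True).start()

# Wait until the proxy accepts connections, then hand the TTY to claude.
while True:
    try:
        socket.create_connection(("127.0.0.1", 8080), timeout=0.2).close()
        break
    except OSError:
        time.sleep(0.1)

env = {**os.environ, "ANTHROPIC_BASE_URL": "http://127.0.0.1:8080"}
subprocess.run(["claude"], env=env)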