tether

Local Anthropic-compatible LLM server for Claude Code, backed by mlx-lm / mlx-vlm on Apple Silicon.

One command launches the proxy and opens Claude Code in the same terminal — inspired by ollama launch claude.

Requirements

  • macOS on Apple Silicon
  • Python 3.12+
  • claude on PATH

Install

From PyPI (the distribution is named tetherd — the name tether was already taken — but it still installs the tether command):

pipx install tetherd         # recommended — isolated venv, `tether` on PATH
# or
pip install tetherd

From GitHub (tracks main):

pipx install "git+https://github.com/ryank1m/tether.git"

For development (editable checkout):

git clone https://github.com/ryank1m/tether.git
cd tether
python3.12 -m venv .venv
.venv/bin/pip install -e ".[dev]"

All forms expose the tether console script and pull in mlx-vlm and litellm as dependencies.

Getting a model

tether never downloads weights itself. Place a model directory anywhere under ~/.tether/models/ (any folder containing a config.json is auto-discovered) and pass --model <name> or pick it interactively:

# Option A: download from HuggingFace (needs network)
huggingface-cli download mlx-community/gemma-4-e2b-it-4bit \
  --local-dir ~/.tether/models/gemma-4-e2b-it-4bit

# Option B: copy an already-downloaded model directory
#          into ~/.tether/models/ by any means you like
#          (scp, rsync, USB drive, sneakernet — tether only
#           cares that config.json is present).

Usage

Launch Claude Code against a local MLX model — this is the default path, no extra terminals or env vars needed:

tether --model mlx-community/gemma-3-4b-it-4bit

Under the hood tether loads the model, brings up a LiteLLM Anthropic-compatible proxy on 127.0.0.1:8080, waits for it to be ready, and then spawns claude as a child process with ANTHROPIC_BASE_URL, ANTHROPIC_DEFAULT_{OPUS,SONNET,HAIKU}_MODEL, CLAUDE_CODE_SUBAGENT_MODEL, and friends wired up automatically. Ctrl-C exits Claude Code and cleans up the proxy.
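The wiring above can be sketched as plain exports. The variable names come straight from the description; the literal values are illustrative assumptions — "mlx-local" is just a placeholder model id, since the proxy's wildcard route accepts any name:

```shell
# Roughly what tether sets before spawning claude (illustrative values;
# "mlx-local" is a placeholder model id, not a fixed name):
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
export ANTHROPIC_DEFAULT_OPUS_MODEL="mlx-local"
export ANTHROPIC_DEFAULT_SONNET_MODEL="mlx-local"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="mlx-local"
export CLAUDE_CODE_SUBAGENT_MODEL="mlx-local"
# claude   # tether launches this as a child and tears down the proxy on exit
```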

Drop models into ~/.tether/models/ and pick one interactively:

tether
# ↑/↓ or j/k · Enter to select · q/Esc to cancel

Local models (no HuggingFace involvement)

Passing an HF repo id (mlx-community/gemma-4-e2b-it-4bit) causes mlx_vlm.load to revalidate the cached files against HF on every start. To take HuggingFace out of the loop entirely, download once and pass the directory directly:

huggingface-cli download mlx-community/gemma-4-e2b-it-4bit \
  --local-dir ~/.tether/models/gemma-4-e2b-it-4bit
tether --model ~/.tether/models/gemma-4-e2b-it-4bit

The TUI picker auto-discovers any subdirectory of ~/.tether/models/ that contains a config.json, so a model you download there shows up automatically when you run tether with no --model.
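The discovery rule can be mimicked in a few lines of shell. This is a sketch only — list_models is a hypothetical helper, not part of tether — and it assumes models sit one level below the models dir:

```shell
# Hypothetical helper mirroring the rule above: an immediate subdirectory
# counts as a model iff it contains a config.json
list_models() {
  local d
  for d in "${1:-$HOME/.tether/models}"/*/; do
    [ -f "${d}config.json" ] && basename "$d"
  done
}
list_models   # prints one discoverable model name per line
```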

Forward arguments to claude with a trailing --:

tether --model mlx-community/gemma-3-4b-it-4bit -- --resume

List local models and exit:

tether --list

Config directory

tether reads every *.toml file in ~/.tether/config/ at startup. On first run it creates ~/.tether/config/config.toml with sensible defaults (including a 64k-token context cap) and a commented template for every field. It will never overwrite that file again — edit it freely.

~/.tether/config/
├── config.toml         # auto-created, always loaded first
├── 10-m4pro.toml       # optional drop-in (per-machine tweaks)
└── 20-work.toml        # optional drop-in (project-specific)

All files are merged on top of config.toml in alphabetical order by filename. The NN-name.toml numeric-prefix convention (same idea as systemd drop-ins) is the recommended way to control ordering. Dotfiles, non-.toml files, and subdirectories are ignored.

Precedence (highest → lowest):

  1. CLI flags (--max-context 4096, --memory-cap 18GB, …)
  2. Alphabetically-latest drop-in
  3. Earlier drop-ins
  4. config.toml
  5. Built-in defaults

Every option can also be set from the command line for one-off overrides — see tether --help.
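To make the precedence concrete, here is a throwaway shell sketch — effective is a hypothetical helper, and the grep is a crude stand-in for real TOML parsing — that reports which file last defines a top-level key:

```shell
# Sketch only: config.toml is read first, then drop-ins in sorted
# filename order; for any key, the last definition wins.
effective() {                     # usage: effective <config-dir> <key>
  local dir=$1 key=$2 files f v last=""
  files="$dir/config.toml"
  for f in "$dir"/*.toml; do      # shell globs expand in sorted order
    [ "$f" = "$dir/config.toml" ] || files="$files $f"
  done
  for f in $files; do
    v=$(grep -E "^$key *=" "$f" 2>/dev/null | tail -n 1)
    [ -n "$v" ] && last="$v   # from $f"
  done
  printf '%s\n' "$last"
}
```

For example, effective ~/.tether/config max_context would show a 10-m4pro.toml value overriding the one in config.toml.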

Example drop-in for a 24 GB M4 Pro running the 26B MoE variant:

# ~/.tether/config/10-m4pro.toml
[model]
default = "gemma-4-26b-a4b-it-4bit"
memory_cap = "18GiB"
max_context = 32000

Flags

flag                         default             purpose
--model PATH_OR_REPO                             explicit model; skips the picker
--host                       127.0.0.1           bind host for the proxy
--port                       8080                bind port for the proxy
--models-dir                 ~/.tether/models    override discovery dir
--list                                           list local models and exit
--serve-only / --no-claude                       run only the proxy, do not launch claude
--claude-path                which claude        explicit path to the claude binary
--memory-cap                 from config         hard MLX memory ceiling, e.g. 18GB, 20GiB
--max-context                from config (64k)   truncate history so the prompt stays under N tokens
--log-level                  warning             uvicorn log level
-- …                                             everything after -- is forwarded to claude

Advanced: standalone proxy

If you want to point a non–Claude Code client at the proxy, or run it under systemd / launchd, use --serve-only:

tether --serve-only --model mlx-community/gemma-3-4b-it-4bit

Then point any Anthropic-Messages-compatible client at http://127.0.0.1:8080. Quick curl check:

curl -s -X POST http://127.0.0.1:8080/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'anthropic-version: 2023-06-01' \
  -d '{"model":"mlx-local","max_tokens":50,
       "messages":[{"role":"user","content":"Say pineapple."}]}'

The proxy registers a wildcard route, so any model name in the request body routes to the loaded MLX model.

Known limitations (v0.1)

  • Token usage counts are zero. Prompt/completion token counts from mlx_vlm.generate are not yet plumbed through to the response.
  • Single session. One shared prompt cache; two concurrent Claude Code sessions with different system prompts will thrash the cache. Fine for a single-user local server.
  • No auth / TLS. Bind is 127.0.0.1 only.
  • Apple Silicon + macOS only.

Architecture

See plan/plan-option-b.md (proxy + custom provider), plan/plan-unified-launch.md (single-terminal launch), and plan/plan-tool-use.md (Gemma 4 tool-call wiring). Request flow:

tether
   ├── uvicorn (daemon thread)  ──► LiteLLM ──► MLXProvider ──► mlx_lm
   └── subprocess: claude (foreground, owns TTY)
                      │
                      └── POST /v1/messages → proxy thread

Everything lives in one process tree. The model is loaded once at startup (eagerly, so load failures surface before Claude Code launches) and reused across every subsequent call.
