Unified MLX server & CLI (language and vision) with OpenAI-compatible endpoints

Project description

Kamiwaza-MLX 📦

A simple openai (chat.completions) compatible mlx server that:

Supports both vision models (via flag or config/model-name detection) and text-only models
Supports streaming boolean flag
Has a --strip-thinking which will remove tag (in both streaming and not) - good for backwards compat
Supports usage to the client in openai style
Prints usage on the server side output
Appears to deliver reasonably good performance across all paths (streaming/not, vision/not)
Has a terminal client that works with the server, which also support syntax like image:/path/to/image.png Describe this image in detail
Experimental direct multi-node bootstrap via mlx.distributed when PAIRED_HOST is provided; the current Kamiwaza TP=2 launcher/proxy flow is documented in MLX_TWO_NODE_README.md

Tested largely with Qwen2.5-VL and Qwen3 models.

Current MLX Stack

The current dependency floor tracks the latest MLX packages validated for the two-node work:

mlx>=0.31.2,<0.32.0
mlx-lm>=0.31.3,<0.32.0
mlx-metal>=0.31.2,<0.32.0
mlx-vlm>=0.4.4,<0.5.0
mlx-audio>=0.4.3,<0.5.0

Nemotron 3 Nano Omni currently needs mlx-vlm 0.4.5 code from upstream GitHub for the MLX-community conversions. PyPI only has mlx-vlm 0.4.4 at the time of this validation, so requirements.txt and the redeploy helpers pin:

git+https://github.com/Blaizzy/mlx-vlm.git@15dbf7265c02e4bb5deee580030e13ee8659b643

mlx-community/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-4bit works locally via mlx_vlm generate for combined text, image, and audio inputs. Native video is not yet functional in this MLX path; the current upstream Nemotron MLX model code raises NotImplementedError: Efficient video sampling is not implemented for Nemotron Omni yet.

Experimental DFlash speculative decoding

DFlash is available as an opt-in text-only generation path. Configure a draft model and the server will use a DFlash speculative loop instead of the normal mlx_lm.generate.stream_generate loop. The default backend is auto: it uses the standalone dflash-mlx implementation when installed, otherwise it falls back to ZLab's dflash.model_mlx.

The DFlash backends install from git (PyPI does not allow direct-URL dependencies in package metadata). These are the tested pins:

# ZLab backend
pip install "dflash[mlx] @ git+https://github.com/z-lab/dflash.git@6d6229eaddce58fcd7e4cc91945da632f5544c86"

# Standalone dflash-mlx backend
pip install "dflash-mlx @ git+https://github.com/bstnxbt/dflash-mlx.git@fada1eb2b75cd1c875ca6547b6518783fd3d2956"

python -m kamiwaza_mlx.server \
  -m mlx-community/Qwen3.5-27B-5bit \
  --dflash-draft-model z-lab/Qwen3.5-27B-DFlash \
  --dflash-backend auto \
  --dflash-block-size 16 \
  --dflash-default-max-tokens 2048 \
  --dflash-max-context -1

Equivalent environment variables:

export KAMIWAZA_MLX_DFLASH_DRAFT_MODEL=z-lab/Qwen3.5-27B-DFlash
export KAMIWAZA_MLX_DFLASH_BACKEND=auto
export KAMIWAZA_MLX_DFLASH_BLOCK_SIZE=16
export KAMIWAZA_MLX_DFLASH_DEFAULT_MAX_TOKENS=2048
export KAMIWAZA_MLX_DFLASH_MAX_CONTEXT=-1
export KAMIWAZA_MLX_DFLASH_VERIFY_MODE=auto
export KAMIWAZA_MLX_DFLASH_VERIFY_QMM=true
export KAMIWAZA_MLX_DFLASH_STREAM_SYNC_EVERY=0

Notes:

DFlash is only enabled when --dflash-draft-model is non-empty, and that forces a text-only mlx-lm target load even for Qwen3.5 conversions that retain a vision_config.
The Qwen3.5-27B drafter is trained for target Qwen/Qwen3.5-27B; MLX quantized conversions such as mlx-community/Qwen3.5-27B-5bit and mlx-community/Qwen3.5-27B-6bit are structurally compatible targets to try, but acceptance and quality need measurement per quant.
The z-lab/Qwen3.5-27B-DFlash draft config uses five full_attention layers. The dflash-mlx runtime has draft sink/window machinery for other draft shapes, but those knobs do not change this Qwen3.5 speculative model's draft attention.
Server KV and prefix caches are disabled while DFlash is enabled because DFlash allocates its own per-request target/draft prompt caches and mutates target model hooks during generation.
--dflash-backend auto picks dflash-mlx when the standalone dflash_mlx package is installed (see the tested git pin above); otherwise it falls back to the ZLab backend. The inline dflash-mlx adapter runs without that package's own prefix snapshot cache, so it is meant for local kernel/runtime experiments, not as a replacement for that package's full OpenAI-compatible server. It also installs a Qwen target-op compatibility patch that skips the full-vocabulary lm-head projection for large prompt-prefill chunks whose logits are discarded by the DFlash loop, and disables unused prefix-snapshot work because the inline adapter does not provide a dflash-mlx prefix cache.
For eligible 4-bit or 8-bit MLX target linears, the dflash-mlx adapter also installs that package's verifier-linear wrappers. Use --dflash-verify-mode off to disable the swap, or --no-dflash-verify-qmm to keep the wrappers while disabling the custom qmm kernels.
--dflash-max-context is a degradation guard: prompts at or above that token count bypass DFlash and use the normal MLX generation path. Omit it, or set it to -1, to use the conservative auto cap of 4000 prompt tokens. Set it to 0 only when you intentionally want to route every prompt through DFlash. Because DFlash mode disables the server-level shared KV/prefix caches, bypassed requests do not use shared prefix-cache optimizations, but they still allocate the normal bounded per-request generation cache.
DFlash generation is serialized with an internal lock. That protects the target model hidden-state hooks and Qwen3.5 gated-delta rollback patch.
Extra mx.synchronize() calls in the DFlash streaming loop are disabled by default. Set --dflash-stream-sync-every N only when diagnosing stream timing or device scheduling.

Note: Not specific to Kamiwaza (that is, you can use on any Mac, Kamiwaza not required)

pip install kamiwaza-mlx

# start the server
a) python -m kamiwaza_mlx.server -m ./path/to/model --port 18000
# or, if you enabled the optional entry-points during install
b) kamiwaza-mlx-server -m ./path/to/model --port 18000

# chat from another terminal (note: specify --host to match server port)
python -m kamiwaza_mlx.infer --host localhost:18000 -p "Say hello"

The remainder of this README documents the original features in more detail.

MLX-LM 🦙 — Drop-in OpenAI-style API for any local MLX model

A FastAPI micro-server (server.py) that speaks the OpenAI /v1/chat/completions dialect, plus a tiny CLI client (infer.py) for quick experiments. Ideal for poking at huge models like Dracarys-72B on an M4-Max/Studio, hacking on prompts, or piping the output straight into other tools that already understand the OpenAI schema.

✨ Highlight reel

Feature	Details
🔌 OpenAI compatible	Same request / response JSON (streaming too) – just change the base-URL.
📦 Zero-config	Point at a local folder or HuggingFace repo (`-m /path/to/model`).
🖼️ Vision-ready	Accepts `{"type":"image_url", …}` parts & base64 URLs – works with Qwen-VL & friends.
🎥 Video-aware	Auto-extracts N key-frames with ffmpeg and feeds them as images.
🧮 Usage metrics	Prompt / completion tokens + tokens-per-second in every response.
⚙️ CLI playground	`infer.py` gives you a REPL with reset (Ctrl-N), verbose mode, max-token flag…

🚀 Running the server

# minimal
python server.py -m /var/tmp/models/mlx-community/Dracarys2-72B-Instruct-4bit

# custom port / host
python server.py -m ./Qwen2.5-VL-72B-Instruct-6bit --host 0.0.0.0 --port 12345

Default host/port: 0.0.0.0:18000

Most useful flags:

Flag	Default	What it does
`-m / --model`	`mlx-community/Qwen2-VL-2B-Instruct-4bit`	Path or HF repo.
`--host`	`0.0.0.0`	Network interface to bind to.
`--port`	`18000`	TCP port to listen on.
`-V / --vision`	off	Force vision pipeline; otherwise auto-detect from model name/config.
`--strip-thinking`	off	Removes `<think>…</think>` blocks from model output.
`--enable-prefix-caching`	`True`	Enable automatic prompt caching for text-only models. If enabled, the server attempts to load a cache from a model-specific file in `--prompt-cache-dir`. If not found, it creates one from the first processed prompt and saves it.
`--prompt-cache-dir`	`./.cache/mlx_prompt_caches/`	Directory to store/load automatic prompt cache files. Cache filenames are derived from the model name.

KV cache flags (all KV-related CLI knobs)

System-prefix cache (system-only)

Flag	Default	What it does
`--enable-prefix-caching`	`True`	Enable system-prefix caching for text-only models.
`--prompt-cache-dir`	`./.cache/mlx_prompt_caches/`	Directory for system-prefix cache files (`.safetensors`, `.len`, `.hash`).
`--system-cache-max-tokens`	`2048`	Max tokens to cache from the system prompt (`0` = unlimited). If the system prompt exceeds this cap, system caching is skipped.
`--prefix-cache-headroom`	`64`	Extra tokens reserved beyond the system prompt length when sizing the system cache.

Conversation cache (simple global prefix-match)

Simple mode uses a single global prefix-match KV cache. If the incoming prompt shares the previous prompt as a prefix, we skip prefill for the cached portion; otherwise the cache is reset. Conversation IDs are used for logging/metadata only (they do not create separate caches).

Flag	Default	What it does
`--disable-kv-cache`	off	Disable all KV caching (prefix + conversation).
`--kv-cache-max-tokens`	`0`	Per-cache upper bound (`0` = min(model context, 128k)).
`--kv-cache-keep`	`4`	Tokens to keep when trimming rotating caches.
`--kv-cache-idle-release-seconds`	`0`	Release KV caches after idle time (0 disables).
`--kv-cache-hard-reserve`	`True`	Fully reserve KV memory up to sizing target.
`--kv-cache-warmup`	`False`	Run a warm-up pass at startup to materialize KV shapes.
`--kv-cache-warmup-tokens`	`0`	Warmup tokens (`0` = use resolved target size).
`--retain-mx-cache`	`False`	Keep MX allocator memory (disables `mx.clear_cache`).

Experimental multi-node via `mlx.distributed`

This section documents the direct server.py distributed bootstrap path.

It is distinct from the current Kamiwaza-managed two-node serving flow, which uses kamiwaza-mlx-distributed plus upstream mlx_lm.server and is documented in MLX_TWO_NODE_README.md.

The server can bootstrap a two-node mesh using mlx.distributed. Set a rendezvous host via PAIRED_HOST (optionally in a .env file) and launch each node with matching ranks/world-size. The helper will automatically read .env files passed via --distributed-env-file or located beside the server script.

# shared settings (either export or place in .env)
PAIRED_HOST=10.0.0.2
PAIRED_PORT=17863
WORLD_SIZE=2

# leader node (rank 0 hosts FastAPI)
RANK=0 python -m kamiwaza_mlx.server --distributed-env-file .env -m ./model

# worker node (rank 1 participates in mlx.distributed but does not bind HTTP)
RANK=1 python -m kamiwaza_mlx.server --distributed-env-file .env -m ./model

Useful knobs:

--distributed – force-enable/disable distributed mode (auto when PAIRED_HOST or WORLD_SIZE>1).
--distributed-rank / --distributed-world-size – override RANK/WORLD_SIZE env vars.
--distributed-host / --distributed-port – override PAIRED_HOST / PAIRED_PORT.
--distributed-server-rank – choose which rank should host the HTTP server (defaults to 0).

In this direct path, non-leader ranks simply keep the MLX runtime alive for collective ops once the model weights are synchronized. Treat it as a bootstrap/experimentation path, not as the primary Kamiwaza-managed TP=2 serving mode.

💬 Talking to it with the CLI

python kamiwaza_mlx/infer.py --host localhost:18000 --max_new_tokens 2048

Interactive keys

Ctrl-N: reset conversation
Ctrl-C: quit

🌐 HTTP API

GET /v1/models

Returns a list with the currently loaded model:

{
  "object": "list",
  "data": [
    {
      "id": "Dracarys2-72B-Instruct-4bit",
      "object": "model",
      "created": 1727389042,
      "owned_by": "kamiwaza"
    }
  ]
}

The created field is set when the server starts and mirrors the OpenAI API's timestamp.

POST /v1/chat/completions

{
  "model": "Dracarys2-72B-Instruct-4bit",
  "messages": [
    { "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,..." } }
      ]
    }
  ],
  "max_tokens": 512,
  "stream": false
}

Response (truncated):

{
  "id": "chatcmpl-d4c5…",
  "object": "chat.completion",
  "created": 1715242800,
  "model": "Dracarys2-72B-Instruct-4bit",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "The image shows…" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 143,
    "completion_tokens": 87,
    "total_tokens": 230,
    "tokens_per_second": 32.1
  }
}

Add "stream": true and you'll get Server-Sent Events chunks followed by data: [DONE].

System Prefix Caching (Text-Only Models):

Purpose: Dramatically speed up repeated queries that share the same system context (e.g., large document in role: system). The server caches only the system message(s), not the whole prompt, so subsequent turns process only new user tokens.
Flags:
- --enable-prefix-caching (default True)
- --prompt-cache-dir (default ./.cache/mlx_prompt_caches/)
- --system-cache-max-tokens (default 2048, 0 disables the cap)
- --prefix-cache-headroom (default 64)
How it works (high‑level):
1. On first request with a system message, the server builds a KV cache for just the system portion and saves three files under --prompt-cache-dir:
  - <model>.safetensors (KV), <model>.safetensors.len (token count), <model>.safetensors.hash (SHA256 over token IDs)
2. On subsequent requests with the same system text (hash matches), the server deep‑copies the cached KV and processes only new user/assistant tokens.
3. If the system message changes, the old cache is discarded and replaced automatically.
4. If the system prompt exceeds --system-cache-max-tokens, system caching is skipped and the full prompt is prefetched normally.
Example: A 10,000‑token system document is processed once; later questions only process the user tokens.
Notes: text‑only models; fully transparent to clients (no special fields needed).

Conversation KV Caching (Long chats, fast follow‑ups):

Rationale: Reuse KV across turns so only the tail of the prompt is prefetched.
Behavior:
- Provide conversation or conversation_id (or X-Conversation-Id) if you want IDs reflected in logs/metadata. If omitted, the server uses default.
- The server returns headers for every request (JSON & SSE):
  - X-Conv-Id (resolved ID), X-Conv-KV (fresh|hit|reset|none|busy|disabled), X-Conv-Cached-Tokens, X-Conv-Processing-Tokens.
- Non‑stream JSON also includes usage.input_tokens_details.cached_tokens and metadata.conversation_id.
- The global cache is capped at min(model context, 128k) unless overridden by --kv-cache-max-tokens.
- Concurrency: only one request at a time uses KV caches. When the cache is in use, other concurrent requests run with caching disabled (X-Conv-KV: busy).
Breaking change: the legacy /v1/conv_kv/* endpoints were removed in this branch.

🛠️ Internals (two-sentence tour)

server.py – loads the model with mlx-vlm, converts incoming OpenAI vision messages to the model's chat-template, handles images / video frames, and streams tokens back. For text-only models, if enabled via server flags, it automatically manages a system message cache to speed up processing when multiple queries reference the same system context.
infer.py – lightweight REPL that keeps conversation context and shows latency / TPS stats.

That's it – drop it in front of any MLX model and start chatting!

Project details

Release history Release notifications | RSS feed

This version

0.2.7

Jul 6, 2026

0.2.6

May 8, 2026

0.2.5

Feb 6, 2026

0.2.4

Jan 29, 2026

0.2.3

Jan 2, 2026

0.2.2

Dec 29, 2025

0.2.1

Dec 27, 2025

0.2.1b20251229053655 pre-release

Dec 29, 2025

0.2.1b20251226203154 pre-release

Dec 27, 2025

0.1.9

Nov 13, 2025

0.1.8

Nov 3, 2025

0.1.8b20251031234626 pre-release

Nov 3, 2025

0.1.7

Oct 9, 2025

0.1.6

Sep 21, 2025

0.1.6b20251008223113 pre-release

Oct 9, 2025

0.1.5

Jun 9, 2025

0.1.4

Jun 9, 2025

0.1.3

May 16, 2025

0.1.2

May 16, 2025

0.1.1

May 16, 2025

0.1.0

May 16, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kamiwaza_mlx-0.2.7.tar.gz (86.8 kB view details)

Uploaded Jul 6, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

kamiwaza_mlx-0.2.7-py3-none-any.whl (79.2 kB view details)

Uploaded Jul 6, 2026 Python 3

File details

Details for the file kamiwaza_mlx-0.2.7.tar.gz.

File metadata

Download URL: kamiwaza_mlx-0.2.7.tar.gz
Upload date: Jul 6, 2026
Size: 86.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for kamiwaza_mlx-0.2.7.tar.gz
Algorithm	Hash digest
SHA256	`5ad67ae56a279513a9b8984a349dd2d8f902a79b75bae64c5fe41fa21fe26290`
MD5	`4aea57bb485c2b5234c0f2fd774572ce`
BLAKE2b-256	`e8dc2b6cf17e0af833307bd0fdb92a7ec376d62d25676c0da95555728803ac7c`

See more details on using hashes here.

File details

Details for the file kamiwaza_mlx-0.2.7-py3-none-any.whl.

File metadata

Download URL: kamiwaza_mlx-0.2.7-py3-none-any.whl
Upload date: Jul 6, 2026
Size: 79.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for kamiwaza_mlx-0.2.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8b22ee49e10e54f558990dc42d81db7a279e91148e4dc4b2e2b10175d3f2b357`
MD5	`401a0061cf424db21583bb3d61bf8e6a`
BLAKE2b-256	`8af3a270b982c1b74c23c044aff5290f550b8f880f09008ba8ca9ed28f8f2eab`

See more details on using hashes here.

kamiwaza-mlx 0.2.7

Navigation

Verified details

Owner

Maintainers

Unverified details

Project links

Meta

Project description

Kamiwaza-MLX 📦

Current MLX Stack

Experimental DFlash speculative decoding

MLX-LM 🦙 — Drop-in OpenAI-style API for any local MLX model

✨ Highlight reel

🚀 Running the server

Most useful flags:

KV cache flags (all KV-related CLI knobs)

Experimental multi-node via `mlx.distributed`

💬 Talking to it with the CLI

Interactive keys

🌐 HTTP API

🛠️ Internals (two-sentence tour)

Project details

Verified details

Owner

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

kamiwaza-mlx 0.2.7

Navigation

Verified details

Owner

Maintainers

Unverified details

Project links

Meta

Project description

Kamiwaza-MLX 📦

Current MLX Stack

Experimental DFlash speculative decoding

MLX-LM 🦙 — Drop-in OpenAI-style API for any local MLX model

✨ Highlight reel

🚀 Running the server

Most useful flags:

KV cache flags (all KV-related CLI knobs)

Experimental multi-node via mlx.distributed

💬 Talking to it with the CLI

Interactive keys

🌐 HTTP API

🛠️ Internals (two-sentence tour)

Project details

Verified details

Owner

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Experimental multi-node via `mlx.distributed`