Unified MLX server & CLI (language and vision) with OpenAI-compatible endpoints
Kamiwaza-MLX 📦
A simple OpenAI (chat.completions)-compatible MLX server that:
- Supports both vision models (via a flag or model-name detection) and text-only models
- Supports streaming via a boolean flag
- Has a --strip-thinking flag that removes <think>…</think> tags (in both streaming and non-streaming modes), useful for backwards compatibility
- Reports usage to the client in OpenAI style
- Prints usage in the server-side output
- Delivers reasonably good performance across all paths (streaming/non-streaming, vision/text)
- Has a terminal client that works with the server and supports syntax like image:/Users/matt/path/to/image.png Describe this image in detail
- Offers experimental multi-node execution via mlx.distributed when PAIRED_HOST is provided
Tested largely with Qwen2.5-VL and Qwen3 models
Note: Not specific to Kamiwaza; you can use it on any Mac (Kamiwaza is not required).
pip install kamiwaza-mlx
# start the server
python -m kamiwaza_mlx.server -m ./path/to/model --port 18000
# or, if you enabled the optional entry points during install
kamiwaza-mlx-server -m ./path/to/model --port 18000
# chat from another terminal (--host must match the server's host and port)
python -m kamiwaza_mlx.infer --host localhost:18000 -p "Say hello"
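You can also call the server from Python. A minimal sketch using the official openai client (assumed installed; the server ignores API keys, but the client requires one, and the model ID should match what GET /v1/models reports):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen2-VL-2B-Instruct-4bit",  # illustrative; use the ID from /v1/models
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```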
The remainder of this README documents the original features in more detail.
MLX-LM 🦙 — Drop-in OpenAI-style API for any local MLX model
A FastAPI micro-server (server.py) that speaks the OpenAI
/v1/chat/completions dialect, plus a tiny CLI client
(infer.py) for quick experiments.
Ideal for poking at huge models like Dracarys-72B on an
M4-Max/Studio, hacking on prompts, or piping the output straight into
other tools that already understand the OpenAI schema.
✨ Highlight reel
| Feature | Details |
|---|---|
| 🔌 OpenAI compatible | Same request / response JSON (streaming too) – just change the base-URL. |
| 📦 Zero-config | Point at a local folder or HuggingFace repo (-m /path/to/model). |
| 🖼️ Vision-ready | Accepts {"type":"image_url", …} parts & base64 URLs – works with Qwen-VL & friends. |
| 🎥 Video-aware | Auto-extracts N key-frames with ffmpeg and feeds them as images. |
| 🧮 Usage metrics | Prompt / completion tokens + tokens-per-second in every response. |
| ⚙️ CLI playground | infer.py gives you a REPL with reset (Ctrl-N), verbose mode, max-token flag… |
🚀 Running the server
# minimal
python server.py -m /var/tmp/models/mlx-community/Dracarys2-72B-Instruct-4bit
# custom port / host
python server.py -m ./Qwen2.5-VL-72B-Instruct-6bit --host 0.0.0.0 --port 12345
Default host/port: 0.0.0.0:18000
Most useful flags:
| Flag | Default | What it does |
|---|---|---|
| -m / --model | mlx-community/Qwen2-VL-2B-Instruct-4bit | Path or HF repo. |
| --host | 0.0.0.0 | Network interface to bind to. |
| --port | 18000 | TCP port to listen on. |
| -V / --vision | off | Force the vision pipeline; otherwise auto-detect. |
| --strip-thinking | off | Removes <think>…</think> blocks from model output. |
| --enable-prefix-caching | True | Enable automatic prompt caching for text-only models. If enabled, the server attempts to load a cache from a model-specific file in --prompt-cache-dir; if not found, it creates one from the first processed prompt and saves it. |
| --prompt-cache-dir | ./.cache/mlx_prompt_caches/ | Directory to store/load automatic prompt cache files. Cache filenames are derived from the model name. |
KV cache flags (all KV-related CLI knobs)
System-prefix cache (system-only)
| Flag | Default | What it does |
|---|---|---|
| --enable-prefix-caching | True | Enable system-prefix caching for text-only models. |
| --prompt-cache-dir | ./.cache/mlx_prompt_caches/ | Directory for system-prefix cache files (.safetensors, .len, .hash). |
| --system-cache-max-tokens | 2048 | Max tokens to cache from the system prompt (0 = unlimited). If the system prompt exceeds this cap, system caching is skipped. |
| --prefix-cache-headroom | 64 | Extra tokens reserved beyond the system prompt length when sizing the system cache. |
Conversation cache (simple global prefix-match)
Simple mode uses a single global prefix-match KV cache. If the incoming prompt shares the previous prompt as a prefix, we skip prefill for the cached portion; otherwise the cache is reset. Conversation IDs are used for logging/metadata only (they do not create separate caches).
| Flag | Default | What it does |
|---|---|---|
| --disable-kv-cache | off | Disable all KV caching (prefix + conversation). |
| --kv-cache-max-tokens | 0 | Per-cache upper bound (0 = min(model context, 128k)). |
| --kv-cache-keep | 4 | Tokens to keep when trimming rotating caches. |
| --kv-cache-idle-release-seconds | 0 | Release KV caches after idle time (0 disables). |
| --kv-cache-hard-reserve | True | Fully reserve KV memory up to the sizing target. |
| --kv-cache-warmup | False | Run a warm-up pass at startup to materialize KV shapes. |
| --kv-cache-warmup-tokens | 0 | Warm-up tokens (0 = use the resolved target size). |
| --retain-mx-cache | False | Keep MX allocator memory (disables mx.clear_cache). |
Experimental multi-node via mlx.distributed
The server can bootstrap a two-node mesh using mlx.distributed. Set a rendezvous host via PAIRED_HOST (optionally in a .env file) and launch each node with matching ranks/world-size. The helper will automatically read .env files passed via --distributed-env-file or located beside the server script.
# shared settings (either export or place in .env)
PAIRED_HOST=10.0.0.2
PAIRED_PORT=17863
WORLD_SIZE=2
# leader node (rank 0 hosts FastAPI)
RANK=0 python -m kamiwaza_mlx.server --distributed-env-file .env -m ./model
# worker node (rank 1 participates in mlx.distributed but does not bind HTTP)
RANK=1 python -m kamiwaza_mlx.server --distributed-env-file .env -m ./model
Useful knobs:
- --distributed – force-enable/disable distributed mode (auto when PAIRED_HOST or WORLD_SIZE > 1).
- --distributed-rank / --distributed-world-size – override the RANK / WORLD_SIZE env vars.
- --distributed-host / --distributed-port – override PAIRED_HOST / PAIRED_PORT.
- --distributed-server-rank – choose which rank should host the HTTP server (defaults to 0).
Non-leader ranks simply keep the MLX runtime alive for collective ops once the model weights are synchronized.
💬 Talking to it with the CLI
python kamiwaza_mlx/infer.py --host localhost:18000 --max_new_tokens 2048
Interactive keys
- Ctrl-N: reset conversation
- Ctrl-C: quit
🌐 HTTP API
GET /v1/models
Returns a list with the currently loaded model:
{
"object": "list",
"data": [
{
"id": "Dracarys2-72B-Instruct-4bit",
"object": "model",
"created": 1727389042,
"owned_by": "kamiwaza"
}
]
}
The created field is set when the server starts and mirrors the OpenAI API's timestamp.
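For scripting, the model ID can be fetched and reused as the "model" field in chat requests. A standard-library sketch:

```python
import json, urllib.request

with urllib.request.urlopen("http://localhost:18000/v1/models") as r:
    model_id = json.load(r)["data"][0]["id"]
print(model_id)  # e.g. "Dracarys2-72B-Instruct-4bit"
```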
POST /v1/chat/completions
{
"model": "Dracarys2-72B-Instruct-4bit",
"messages": [
{ "role": "user",
"content": [
{ "type": "text", "text": "Describe this image." },
{ "type": "image_url",
"image_url": { "url": "data:image/jpeg;base64,..." } }
]
}
],
"max_tokens": 512,
"stream": false
}
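The same request can be assembled in Python. A standard-library sketch (the image path is hypothetical; the model ID comes from /v1/models):

```python
import base64, json, urllib.request

with open("/path/to/image.jpg", "rb") as f:  # hypothetical local image
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

payload = {
    "model": "Dracarys2-72B-Instruct-4bit",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://localhost:18000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.load(r)["choices"][0]["message"]["content"])
```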
Response (truncated):
{
"id": "chatcmpl-d4c5…",
"object": "chat.completion",
"created": 1715242800,
"model": "Dracarys2-72B-Instruct-4bit",
"choices": [
{
"index": 0,
"message": { "role": "assistant", "content": "The image shows…" },
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 143,
"completion_tokens": 87,
"total_tokens": 230,
"tokens_per_second": 32.1
}
}
Add "stream": true and you'll get Server-Sent Events chunks followed by
data: [DONE].
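Consuming the stream from Python is straightforward with the openai client; a sketch (model ID illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="Dracarys2-72B-Instruct-4bit",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    stream=True,
)
for chunk in stream:
    # guard against chunks without content (e.g. role-only or final chunks)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```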
System Prefix Caching (Text-Only Models):
- Purpose: dramatically speed up repeated queries that share the same system context (e.g., a large document in role: system). The server caches only the system message(s), not the whole prompt, so subsequent turns process only new user tokens.
- Flags:
  - --enable-prefix-caching (default True)
  - --prompt-cache-dir (default ./.cache/mlx_prompt_caches/)
  - --system-cache-max-tokens (default 2048; 0 disables the cap)
  - --prefix-cache-headroom (default 64)
- How it works (high-level):
  - On the first request with a system message, the server builds a KV cache for just the system portion and saves three files under --prompt-cache-dir: <model>.safetensors (KV), <model>.safetensors.len (token count), <model>.safetensors.hash (SHA-256 over token IDs).
  - On subsequent requests with the same system text (hash matches), the server deep-copies the cached KV and processes only new user/assistant tokens.
  - If the system message changes, the old cache is discarded and replaced automatically.
  - If the system prompt exceeds --system-cache-max-tokens, system caching is skipped and the full prompt is prefilled normally.
- Example: a 10,000-token system document is processed once; later questions only process the user tokens (see the sketch below).
- Notes: text-only models only; fully transparent to clients (no special fields needed).
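To illustrate the pattern, a sketch of two questions over the same large system context (the document file and model ID are hypothetical); with the default flags, the second call should skip prefill for the system portion:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18000/v1", api_key="not-needed")
document = open("big_doc.txt").read()  # hypothetical large document

for question in ["Summarize the document.", "List its key dates."]:
    resp = client.chat.completions.create(
        model="Dracarys2-72B-Instruct-4bit",  # illustrative
        messages=[
            {"role": "system", "content": document},  # identical system prefix each turn
            {"role": "user", "content": question},
        ],
        max_tokens=256,
    )
    print(resp.choices[0].message.content)
```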
Conversation KV Caching (Long chats, fast follow‑ups):
- Rationale: reuse KV across turns so only the tail of the prompt is prefilled.
- Behavior:
  - Provide conversation or conversation_id (or the X-Conversation-Id header) if you want IDs reflected in logs/metadata. If omitted, the server uses default.
  - The server returns headers on every request (JSON & SSE): X-Conv-Id (resolved ID), X-Conv-KV (fresh|hit|reset|none|busy|disabled), X-Conv-Cached-Tokens, X-Conv-Processing-Tokens (a usage sketch follows this list).
  - Non-stream JSON also includes usage.input_tokens_details.cached_tokens and metadata.conversation_id.
  - The global cache is capped at min(model context, 128k) unless overridden by --kv-cache-max-tokens.
  - Concurrency: only one request at a time uses KV caches. When the cache is in use, other concurrent requests run with caching disabled (X-Conv-KV: busy).
- Breaking change: the legacy /v1/conv_kv/* endpoints were removed in this branch.
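A sketch of tagging a request with a conversation ID and reading back the headers above, using requests (assumed installed; model ID illustrative):

```python
import requests

r = requests.post(
    "http://localhost:18000/v1/chat/completions",
    json={
        "model": "Dracarys2-72B-Instruct-4bit",  # illustrative
        "conversation_id": "demo-chat-1",
        "messages": [{"role": "user", "content": "Hi again!"}],
    },
)
# KV-cache status headers documented above
print(r.headers.get("X-Conv-Id"), r.headers.get("X-Conv-KV"),
      r.headers.get("X-Conv-Cached-Tokens"), r.headers.get("X-Conv-Processing-Tokens"))
print(r.json()["usage"])
```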
🛠️ Internals (two-sentence tour)
- server.py – loads the model with mlx-vlm, converts incoming OpenAI vision messages to the model's chat-template, handles images / video frames, and streams tokens back. For text-only models, if enabled via server flags, it automatically manages a system message cache to speed up processing when multiple queries reference the same system context.
- infer.py – lightweight REPL that keeps conversation context and shows latency / TPS stats.
That's it – drop it in front of any MLX model and start chatting!