Unified MLX server & CLI (language and vision) with OpenAI-compatible endpoints
Kamiwaza-MLX 📦
A simple OpenAI-compatible (chat.completions) MLX server that:
- Supports both vision models (via flag or model name detection) and text-only models
- Supports streaming via a boolean flag
- Has a --strip-thinking flag that removes <think>…</think> tags (in both streaming and non-streaming modes) – good for backward compatibility
- Reports usage to the client in OpenAI style
- Prints usage in the server-side output
- Appears to deliver reasonably good performance across all paths (streaming/non-streaming, vision/text)
- Has a terminal client that works with the server, which also supports syntax like
image:/Users/matt/path/to/image.png Describe this image in detail
Tested largely with Qwen2.5-VL and Qwen3 models
Note: This is not specific to Kamiwaza – you can use it on any Mac; Kamiwaza is not required.
pip install kamiwaza-mlx
# start the server
python -m kamiwaza_mlx.server -m ./path/to/model --port 18000
# or, if you enabled the optional entry points during install
kamiwaza-mlx-server -m ./path/to/model --port 18000
# chat from another terminal
python -m kamiwaza_mlx.infer -p "Say hello"
The remainder of this README documents the original features in more detail.
MLX-LM 🦙 — Drop-in OpenAI-style API for any local MLX model
A FastAPI micro-server (server.py) that speaks the OpenAI
/v1/chat/completions dialect, plus a tiny CLI client
(infer.py) for quick experiments.
Ideal for poking at huge models like Dracarys-72B on an
M4-Max/Studio, hacking on prompts, or piping the output straight into
other tools that already understand the OpenAI schema.
✨ Highlight reel
| Feature | Details |
|---|---|
| 🔌 OpenAI compatible | Same request / response JSON (streaming too) – just change the base-URL. |
| 📦 Zero-config | Point at a local folder or HuggingFace repo (-m /path/to/model). |
| 🖼️ Vision-ready | Accepts {"type":"image_url", …} parts & base64 URLs – works with Qwen-VL & friends. |
| 🎥 Video-aware | Auto-extracts N key-frames with ffmpeg and feeds them as images. |
| 🧮 Usage metrics | Prompt / completion tokens + tokens-per-second in every response. |
| ⚙️ CLI playground | infer.py gives you a REPL with reset (Ctrl-N), verbose mode, max-token flag… |
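Because the server speaks the standard dialect, existing OpenAI client libraries work unchanged. A minimal sketch with the official openai Python package (the port, model id, and placeholder API key are assumptions taken from this README's examples; whether the server validates the key is not documented):

from openai import OpenAI

# Point the client at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:18000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Dracarys2-72B-Instruct-4bit",
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)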
🚀 Running the server
# minimal
python server.py -m /var/tmp/models/mlx-community/Dracarys2-72B-Instruct-4bit
# custom port / host
python server.py -m ./Qwen2.5-VL-72B-Instruct-6bit --host 0.0.0.0 --port 12345
Default host/port: 0.0.0.0:18000
Most useful flags:
| Flag | Default | What it does |
|---|---|---|
| -m / --model | mlx-community/Qwen2-VL-2B-Instruct-4bit | Path or HF repo. |
| --host | 0.0.0.0 | Network interface to bind to. |
| --port | 18000 | TCP port to listen on. |
| -V / --vision | off | Force vision pipeline; otherwise auto-detect. |
| --strip-thinking | off | Removes <think>…</think> blocks from model output. |
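For example, combining the documented flags to force the vision pipeline and strip <think> blocks on a custom port:

python server.py -m ./Qwen2.5-VL-72B-Instruct-6bit -V --strip-thinking --port 12345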
💬 Talking to it with the CLI
python infer.py --base-url http://localhost:18000/v1 -v --max_new_tokens 2048
Interactive keys
- Ctrl-N: reset conversation
- Ctrl-C: quit
🌐 HTTP API
GET /v1/models
Returns a list with the currently loaded model:
{
"object": "list",
"data": [
{
"id": "Dracarys2-72B-Instruct-4bit",
"object": "model",
"created": 1727389042,
"owned_by": "kamiwaza"
}
]
}
The created field is set when the server starts and mirrors the OpenAI API's timestamp.
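A quick way to hit this endpoint from Python using only the standard library (assumes the default host/port):

import json, urllib.request

# Fetch the model list and pretty-print the JSON payload.
with urllib.request.urlopen("http://localhost:18000/v1/models") as r:
    print(json.dumps(json.load(r), indent=2))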
POST /v1/chat/completions
{
"model": "Dracarys2-72B-Instruct-4bit",
"messages": [
{ "role": "user",
"content": [
{ "type": "text", "text": "Describe this image." },
{ "type": "image_url",
"image_url": { "url": "data:image/jpeg;base64,..." } }
]
}
],
"max_tokens": 512,
"stream": false
}
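To produce the base64 data URL in the request above, a stdlib sketch like this works (the filename is a placeholder):

import base64

# Read the image bytes and wrap them in an OpenAI-style data URL.
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")
data_url = f"data:image/jpeg;base64,{b64}"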
Response (truncated):
{
"id": "chatcmpl-d4c5…",
"object": "chat.completion",
"created": 1715242800,
"model": "Dracarys2-72B-Instruct-4bit",
"choices": [
{
"index": 0,
"message": { "role": "assistant", "content": "The image shows…" },
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 143,
"completion_tokens": 87,
"total_tokens": 230,
"tokens_per_second": 32.1
}
}
Add "stream": true and you'll get Server-Sent Events chunks followed by
data: [DONE].
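Consuming that stream from Python with the openai client looks like this (a sketch; the model id is assumed from above, and chunks without choices are skipped defensively):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:18000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Dracarys2-72B-Instruct-4bit",
    messages=[{"role": "user", "content": "Describe the sky."}],
    stream=True,
)
for chunk in stream:
    # Each SSE chunk carries a delta; print tokens as they arrive.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()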
🛠️ Internals (two-sentence tour)
- server.py – loads the model with mlx-vlm, converts incoming OpenAI vision messages to the model's chat-template, handles images / video frames, and streams tokens back.
- infer.py – lightweight REPL that keeps conversation context and shows latency / TPS stats.
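For intuition, the vision-message conversion that server.py performs is roughly this shape – an illustrative sketch only, not the actual implementation (the function name is made up; the message structure matches the request example above):

def split_vision_message(message: dict) -> tuple[str, list[str]]:
    """Separate an OpenAI-style content list into prompt text and image URLs."""
    texts, images = [], []
    for part in message["content"]:
        if part["type"] == "text":
            texts.append(part["text"])
        elif part["type"] == "image_url":
            images.append(part["image_url"]["url"])
    return " ".join(texts), images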
That's it – drop it in front of any MLX model and start chatting!
Download files
File details
Details for the file kamiwaza_mlx-0.1.4.tar.gz.
File metadata
- Download URL: kamiwaza_mlx-0.1.4.tar.gz
- Upload date:
- Size: 15.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b19d16b6477492c0f950e59edf19fb42c65b63ee8646f5bf2fa3f39d5581672 |
| MD5 | f258c69747c67b2e0a6a9fb8a5b08f0b |
| BLAKE2b-256 | 1d35f5a463aa5fa29716908e1041ee6973e7b2865cb4747adc77747f91486ec7 |
File details
Details for the file kamiwaza_mlx-0.1.4-py3-none-any.whl.
File metadata
- Download URL: kamiwaza_mlx-0.1.4-py3-none-any.whl
- Upload date:
- Size: 14.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab73ecb1bc9ba88326a115beec216efc109960497b5f0651ef04dc2e01c6f535 |
| MD5 | 4fff673aa5c336b6fa74e5ec7065d511 |
| BLAKE2b-256 | ae43aacaf852c0271cac293aa66bb4c40be7f4ad1281483da981b74dfff5f6fa |