
Unified MLX server & CLI (language and vision) with OpenAI-compatible endpoints

Project description

Kamiwaza-MLX 📦

A simple OpenAI (chat.completions)-compatible MLX server that:

  • Supports both vision models (via a flag or model-name detection) and text-only models
  • Supports streaming via a boolean flag
  • Has a --strip-thinking flag that removes <think>…</think> tags (in both streaming and non-streaming mode) – good for backwards compatibility
  • Reports usage to the client in OpenAI style
  • Prints usage in the server-side output
  • Appears to deliver reasonably good performance across all paths (streaming or not, vision or not)
  • Has a terminal client that works with the server and also supports syntax like image:/Users/matt/path/to/image.png Describe this image in detail

Tested largely with Qwen2.5-VL and Qwen3 models

Note: Not specific to Kamiwaza (that is, you can use it on any Mac; Kamiwaza is not required).

pip install kamiwaza-mlx

# start the server
python -m kamiwaza_mlx.server -m ./path/to/model --port 18000
# or, if you enabled the optional entry points during install
kamiwaza-mlx-server -m ./path/to/model --port 18000

# chat from another terminal
python -m kamiwaza_mlx.infer -p "Say hello"
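
Because the server speaks the OpenAI dialect, you can also drive it from Python. A minimal sketch, assuming the official openai client package is installed, the server is running on the default port, and the model id is replaced with whatever GET /v1/models reports:

from openai import OpenAI

# Point the standard OpenAI client at the local server; the key is unused but required by the client.
client = OpenAI(base_url="http://localhost:18000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen2-VL-2B-Instruct-4bit",  # placeholder – use the id reported by GET /v1/models
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
print(resp.usage)  # prompt/completion token counts come back OpenAI-style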

The remainder of this README documents the original features in more detail.

MLX-LM 🦙 — Drop-in OpenAI-style API for any local MLX model

A FastAPI micro-server (server.py) that speaks the OpenAI /v1/chat/completions dialect, plus a tiny CLI client (infer.py) for quick experiments. Ideal for poking at huge models like Dracarys-72B on an M4-Max/Studio, hacking on prompts, or piping the output straight into other tools that already understand the OpenAI schema.


✨ Highlight reel

  • 🔌 OpenAI-compatible – same request/response JSON (streaming too); just change the base URL.
  • 📦 Zero-config – point at a local folder or Hugging Face repo (-m /path/to/model).
  • 🖼️ Vision-ready – accepts {"type":"image_url", …} parts and base64 URLs; works with Qwen-VL and friends.
  • 🎥 Video-aware – auto-extracts N key frames with ffmpeg and feeds them as images.
  • 🧮 Usage metrics – prompt/completion tokens plus tokens-per-second in every response.
  • ⚙️ CLI playground – infer.py gives you a REPL with reset (Ctrl-N), verbose mode, a max-token flag…

🚀 Running the server

# minimal
python server.py -m /var/tmp/models/mlx-community/Dracarys2-72B-Instruct-4bit

# custom port / host
python server.py -m ./Qwen2.5-VL-72B-Instruct-6bit --host 0.0.0.0 --port 12345

Default host/port: 0.0.0.0:18000

Most useful flags:

  • -m / --model (default: mlx-community/Qwen2-VL-2B-Instruct-4bit) – path or HF repo.
  • --host (default: 0.0.0.0) – network interface to bind to.
  • --port (default: 18000) – TCP port to listen on.
  • -V / --vision (default: off) – force the vision pipeline; otherwise auto-detected.
  • --strip-thinking (default: off) – removes <think>…</think> blocks from model output.
  • --enable-prefix-caching (default: True) – enables automatic prompt caching for text-only models. If enabled, the server attempts to load a cache from a model-specific file in --prompt-cache-dir; if none is found, it creates one from the first processed prompt and saves it.
  • --prompt-cache-dir (default: ./.cache/mlx_prompt_caches/) – directory to store/load automatic prompt cache files. Cache filenames are derived from the model name.

💬 Talking to it with the CLI

python infer.py --base-url http://localhost:18000/v1 -v --max_new_tokens 2048

Interactive keys

  • Ctrl-N: reset conversation
  • Ctrl-C: quit

🌐 HTTP API

GET /v1/models

Returns a list with the currently loaded model:

{
  "object": "list",
  "data": [
    {
      "id": "Dracarys2-72B-Instruct-4bit",
      "object": "model",
      "created": 1727389042,
      "owned_by": "kamiwaza"
    }
  ]
}

The created field is set when the server starts and mirrors the OpenAI API's timestamp.
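
A quick way to confirm which model is loaded, sketched with the requests library (any HTTP client works; assumes the default host/port):

import requests

# List the model currently being served.
models = requests.get("http://localhost:18000/v1/models", timeout=10).json()
for m in models["data"]:
    print(m["id"], m["owned_by"])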

POST /v1/chat/completions

{
  "model": "Dracarys2-72B-Instruct-4bit",
  "messages": [
    { "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,..." } }
      ]
    }
  ],
  "max_tokens": 512,
  "stream": false
}

Response (truncated):

{
  "id": "chatcmpl-d4c5…",
  "object": "chat.completion",
  "created": 1715242800,
  "model": "Dracarys2-72B-Instruct-4bit",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "The image shows…" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 143,
    "completion_tokens": 87,
    "total_tokens": 230,
    "tokens_per_second": 32.1
  }
}

Add "stream": true and you'll get Server-Sent Events chunks followed by data: [DONE].
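
As an illustration, the sketch below base64-encodes a local image into a data: URL, sends the same kind of request with "stream": true via the openai client, and prints tokens as they arrive. The file path and model id are placeholders, and the openai package plus the default server address are assumed:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18000/v1", api_key="not-needed")

# Build a data: URL from a local image (path is illustrative).
with open("photo.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

stream = client.chat.completions.create(
    model="Qwen2.5-VL-72B-Instruct-6bit",  # placeholder – use the id from GET /v1/models
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g. the final one) may carry no choices or an empty delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()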

Prompt Caching (Text-Only Models):

  • Automatic prompt caching is controlled by server startup flags:
    • --enable-prefix-caching (defaults to True): When enabled, the server will cache system messages for reuse across requests.
    • --prompt-cache-dir (defaults to ./.cache/mlx_prompt_caches/): This directory is used to store and load cache files. Cache filenames are automatically generated based on the model name (e.g., Qwen3-8B-4bit.safetensors).
  • Behavior:
    1. The server caches only the system message portion of conversations, not the entire prompt.
    2. When a request contains a system message, the server:
      • Creates a cache of the system message on first use
      • Reuses this cache for subsequent requests with the same system message
      • Only processes the new user messages, dramatically improving performance
    3. The cache is automatically discarded and recreated if the system message changes.
    4. This is ideal for scenarios like:
      • Chatbots with fixed system prompts
      • Question-answering over long documents (document in system message)
      • Any use case where the system context remains constant across requests
  • Example: if your system message contains a 10,000-token document, only the first request processes all of those tokens; subsequent questions about the document only process the new user-message tokens (see the sketch after this list).
  • This process is transparent to the API client; no special parameters are needed.
  • This feature is only applicable to text-only models.
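
To see the effect, one way is to send two requests that share the same long system message and compare latency and the reported usage. A rough sketch (the document path and model id are placeholders; the openai client and default port are assumed):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18000/v1", api_key="not-needed")

# A long, fixed system message – e.g. a document you want to ask questions about.
with open("long_document.txt") as f:
    system_doc = "Answer questions about this document:\n" + f.read()

def ask(question: str) -> None:
    t0 = time.time()
    resp = client.chat.completions.create(
        model="Qwen3-8B-4bit",  # placeholder – use the id from GET /v1/models
        messages=[
            {"role": "system", "content": system_doc},
            {"role": "user", "content": question},
        ],
        max_tokens=256,
    )
    print(f"{time.time() - t0:.1f}s  usage={resp.usage}")

ask("Summarize the document.")      # first call processes the full system message
ask("List its three main points.")  # later calls reuse the cached system prefix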

🛠️ Internals (two-sentence tour)

  • server.py – loads the model with mlx-vlm, converts incoming OpenAI vision messages to the model's chat-template, handles images / video frames, and streams tokens back. For text-only models, if enabled via server flags, it automatically manages a system message cache to speed up processing when multiple queries reference the same system context.
  • infer.py – lightweight REPL that keeps conversation context and shows latency / TPS stats.

That's it – drop it in front of any MLX model and start chatting!



Download files

Download the file for your platform.

Source Distribution

kamiwaza_mlx-0.1.5.tar.gz (21.7 kB)


Built Distribution


kamiwaza_mlx-0.1.5-py3-none-any.whl (19.8 kB)


File details

Details for the file kamiwaza_mlx-0.1.5.tar.gz.

File metadata

  • Download URL: kamiwaza_mlx-0.1.5.tar.gz
  • Upload date:
  • Size: 21.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for kamiwaza_mlx-0.1.5.tar.gz
  • SHA256: 208b7cc4848b7d47db88121aee091f6591d896e62432e05389628c6008e4ade9
  • MD5: 314fc9d8c9389d16883ef122caaeed68
  • BLAKE2b-256: dcdd2d623bd925a67d0e11822b75ce5b2e06b0c5c482ec3de318a5c17eba9660


File details

Details for the file kamiwaza_mlx-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: kamiwaza_mlx-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 19.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for kamiwaza_mlx-0.1.5-py3-none-any.whl
  • SHA256: 6e944684696f1de3268ed006d77a67a44c7a8769cb9791d4f55f385a758134e6
  • MD5: 7223b341962bc83ef6bbfb12e1c69a3f
  • BLAKE2b-256: 66b42a0c837d5d8846d219eea808e286c033840ada0d05dd7156a6c259f181e7

