Unified MLX server & CLI (language and vision) with OpenAI-compatible endpoints

Kamiwaza-MLX 📦

A simple OpenAI-compatible (chat.completions) MLX server that:

  • Supports both vision models (via flag or model-name detection) and text-only models
  • Supports the stream boolean flag
  • Has a --strip-thinking flag that removes <think>…</think> blocks (in both streaming and non-streaming responses), which is useful for backwards compatibility
  • Reports usage to the client in OpenAI style
  • Prints usage in the server-side output
  • Appears to deliver reasonably good performance across all paths (streaming or not, vision or not)
  • Has a terminal client that works with the server and also supports syntax like image:/Users/matt/path/to/image.png Describe this image in detail
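
To make the image: syntax concrete, here is a minimal sketch of how a client could split such a line into an image path and a prompt. The function name parse_image_prompt is hypothetical, and the rule that the path ends at the first space is an assumption of this sketch; the bundled client may parse its input differently.

```python
from typing import Optional, Tuple

IMAGE_PREFIX = "image:"


def parse_image_prompt(line: str) -> Tuple[Optional[str], str]:
    """Split 'image:<path> <prompt>' into (path, prompt).

    Returns (None, line) when the line does not start with the prefix.
    Assumption for this sketch: the path contains no spaces.
    """
    stripped = line.strip()
    if not stripped.startswith(IMAGE_PREFIX):
        return None, stripped
    rest = stripped[len(IMAGE_PREFIX):]
    # Everything up to the first space is the path; the remainder is the prompt.
    path, _, prompt = rest.partition(" ")
    return path, prompt.strip()
```

A line without the prefix is passed through unchanged as a plain text prompt.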

Tested largely with Qwen2.5-VL and Qwen3 models

Note: Not specific to Kamiwaza (that is, you can use it on any Mac; Kamiwaza is not required)

pip install kamiwaza-mlx

# start the server (option a)
python -m kamiwaza_mlx.server -m ./path/to/model --port 18000
# or (option b), if you enabled the optional entry-points during install
kamiwaza-mlx-server -m ./path/to/model --port 18000

# chat from another terminal
python -m kamiwaza_mlx.infer -p "Say hello"
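
The same quick check can be made from Python without the bundled client. The sketch below uses only the standard library and the request shape documented later in this README. The helper names build_chat_request and chat are hypothetical, and "my-model" in the usage comment is a placeholder; this README does not say whether the server validates the model field.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for /chat/completions."""
    body = {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
        "stream": False,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Usage, with the server from the step above listening on port 18000:
# print(chat("http://localhost:18000/v1", "my-model", "Say hello"))
```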

The remainder of this README documents the original features in more detail.

MLX-LM 🦙 — Drop-in OpenAI-style API for any local MLX model

A FastAPI micro-server (server.py) that speaks the OpenAI /v1/chat/completions dialect, plus a tiny CLI client (infer.py) for quick experiments. Ideal for poking at huge models like Dracarys-72B on an M4-Max/Studio, hacking on prompts, or piping the output straight into other tools that already understand the OpenAI schema.


✨ Highlight reel

| Feature | Details |
|---|---|
| 🔌 OpenAI compatible | Same request / response JSON (streaming too) – just change the base-URL. |
| 📦 Zero-config | Point at a local folder or HuggingFace repo (-m /path/to/model). |
| 🖼️ Vision-ready | Accepts {"type":"image_url", …} parts & base64 URLs – works with Qwen-VL & friends. |
| 🎥 Video-aware | Auto-extracts N key-frames with ffmpeg and feeds them as images. |
| 🧮 Usage metrics | Prompt / completion tokens + tokens-per-second in every response. |
| ⚙️ CLI playground | infer.py gives you a REPL with reset (Ctrl-N), verbose mode, max-token flag… |
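
As a rough illustration of the video path, the sketch below picks N evenly spaced timestamps and builds one ffmpeg command per frame. This README does not describe how the server actually selects frames, so the even spacing, the helper names keyframe_timestamps and ffmpeg_frame_command, and the file names are all assumptions. Only the ffmpeg options -y, -ss, -i and -frames:v are real.

```python
from typing import List


def keyframe_timestamps(duration_s: float, n_frames: int) -> List[float]:
    """Return n_frames timestamps at the midpoints of n equal segments."""
    if duration_s <= 0 or n_frames <= 0:
        return []
    step = duration_s / n_frames
    return [round(step * (i + 0.5), 3) for i in range(n_frames)]


def ffmpeg_frame_command(video_path: str, timestamp_s: float, out_path: str) -> List[str]:
    """Build an ffmpeg argument list that writes one frame at timestamp_s."""
    return [
        "ffmpeg", "-y",               # overwrite the output file if it exists
        "-ss", f"{timestamp_s:.3f}",  # seek to the timestamp before decoding
        "-i", video_path,
        "-frames:v", "1",             # stop after a single video frame
        out_path,
    ]
```

Each argument list can be passed to subprocess.run, and the resulting images can then be sent as ordinary image_url parts.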

🚀 Running the server

# minimal
python server.py -m /var/tmp/models/mlx-community/Dracarys2-72B-Instruct-4bit

# custom port / host
python server.py -m ./Qwen2.5-VL-72B-Instruct-6bit --host 0.0.0.0 --port 12345

Default host/port: 0.0.0.0:18000

Most useful flags:

| Flag | Default | What it does |
|---|---|---|
| -m / --model | mlx-community/Qwen2-VL-2B-Instruct-4bit | Path or HF repo. |
| --host | 0.0.0.0 | Network interface to bind to. |
| --port | 18000 | TCP port to listen on. |
| -V / --vision | off | Force vision pipeline; otherwise auto-detect. |
| --strip-thinking | off | Removes <think>…</think> blocks from model output. |
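
To show what --strip-thinking does to a finished response, here is a minimal sketch using a regular expression. It is an illustration rather than the server's own code, and the function name strip_thinking is hypothetical. It handles complete text only; in streaming mode a tag can be split across chunks, which needs buffering that this sketch leaves out.

```python
import re

# Non-greedy match so that several <think> blocks are removed independently;
# DOTALL lets the block span multiple lines.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every <think>…</think> block plus the whitespace after it."""
    return _THINK_BLOCK.sub("", text)
```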

💬 Talking to it with the CLI

python infer.py --base-url http://localhost:18000/v1 -v --max_new_tokens 2048

Interactive keys

  • Ctrl-N: reset conversation
  • Ctrl-C: quit

🌐 HTTP API

GET /v1/models

Returns a list with the currently loaded model:

{
  "object": "list",
  "data": [
    {
      "id": "Dracarys2-72B-Instruct-4bit",
      "object": "model",
      "created": 1727389042,
      "owned_by": "kamiwaza"
    }
  ]
}

The created field is a Unix timestamp (in seconds) recorded when the server starts, in the same format as the created field in the OpenAI API.
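
A client can use this endpoint to discover the model name to put into later requests. The sketch below assumes the response shape shown above. The helper names model_ids and fetch_model_ids are hypothetical, and the default base URL simply matches the server's default port.

```python
import json
import urllib.request
from typing import Any, Dict, List


def model_ids(models_response: Dict[str, Any]) -> List[str]:
    """Extract the id of every entry in a /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:18000/v1") -> List[str]:
    """Query a running server and return the ids of its loaded models."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/models") as response:
        return model_ids(json.load(response))
```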

POST /v1/chat/completions

{
  "model": "Dracarys2-72B-Instruct-4bit",
  "messages": [
    { "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,..." } }
      ]
    }
  ],
  "max_tokens": 512,
  "stream": false
}
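
The request body above can be assembled in Python roughly as follows. This is a sketch: the helper name build_vision_payload is hypothetical, and it only builds the dictionary, leaving the HTTP call to whichever client you prefer.

```python
import base64
from typing import Any, Dict


def build_vision_payload(
    model: str,
    prompt: str,
    image_bytes: bytes,
    mime_type: str = "image/jpeg",
    max_tokens: int = 512,
) -> Dict[str, Any]:
    """Build a chat.completions body with one text part and one image part."""
    # Encode the raw image bytes as a base64 data URL.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }
```

Read the image with open(path, "rb").read() and pass the bytes in; set mime_type to match the file format.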

Response (truncated):

{
  "id": "chatcmpl-d4c5…",
  "object": "chat.completion",
  "created": 1715242800,
  "model": "Dracarys2-72B-Instruct-4bit",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "The image shows…" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 143,
    "completion_tokens": 87,
    "total_tokens": 230,
    "tokens_per_second": 32.1
  }
}
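
A short sketch of pulling the reply and the usage numbers out of such a response. It assumes the fields shown above; the helper name summarize_response is hypothetical.

```python
from typing import Any, Dict


def summarize_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the assistant text and usage metrics from a completion response."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
        # tokens_per_second is this server's extension to the OpenAI usage block.
        "tokens_per_second": usage.get("tokens_per_second"),
    }
```

In the example above, total_tokens is the sum of prompt_tokens and completion_tokens (143 + 87 = 230).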

Add "stream": true and you'll get Server-Sent Events chunks followed by data: [DONE].
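
A minimal sketch of consuming that stream. It assumes each data: line carries an OpenAI-style chunk with the text under choices[0].delta.content; that layout is inferred from the OpenAI schema, not confirmed by this README. The helper name iter_stream_text is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from Server-Sent Events lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece
```

Feed it the decoded lines of the HTTP response body and join or print the fragments as they arrive.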


🛠️ Internals (two-sentence tour)

  • server.py – loads the model with mlx-vlm, converts incoming OpenAI vision messages to the model's chat-template, handles images / video frames, and streams tokens back.
  • infer.py – lightweight REPL that keeps conversation context and shows latency / TPS stats.

That's it – drop it in front of any MLX model and start chatting!
