Skip to main content

Unified MLX server & CLI (language and vision) with OpenAI-compatible endpoints

Project description

Kamiwaza-MLX 📦

Easily installable PyPI wrapper around the original MLX server / client scripts. Once published you can simply do:

pip install kamiwaza-mlx

# start the server
a) python -m kamiwaza_mlx.server -m ./path/to/model --port 18000
# or, if you enabled the optional entry-points during install
b) kamiwaza-mlx-server -m ./path/to/model --port 18000

# chat from another terminal
python -m kamiwaza_mlx.infer -p "Say hello"

The remainder of this README documents the original features in more detail.

MLX-LM 🦙 — Drop-in OpenAI-style API for any local MLX model

A FastAPI micro-server (server.py) that speaks the OpenAI /v1/chat/completions dialect, plus a tiny CLI client (infer.py) for quick experiments. Ideal for poking at huge models like Dracarys-72B on an M4-Max/Studio, hacking on prompts, or piping the output straight into other tools that already understand the OpenAI schema.


✨ Highlight reel

Feature Details
🔌 OpenAI compatible Same request / response JSON (streaming too) – just change the base-URL.
📦 Zero-config Point at a local folder or HuggingFace repo (-m /path/to/model).
🖼️ Vision-ready Accepts {"type":"image_url", …} parts & base64 URLs – works with Qwen-VL & friends.
🎥 Video-aware Auto-extracts N key-frames with ffmpeg and feeds them as images.
🧮 Usage metrics Prompt / completion tokens + tokens-per-second in every response.
⚙️ CLI playground infer.py gives you a REPL with reset (Ctrl-N), verbose mode, max-token flag…

🚀 Running the server

# minimal
python server.py -m /var/tmp/models/mlx-community/Dracarys2-72B-Instruct-4bit

# custom port / host
python server.py -m ./Qwen2.5-VL-72B-Instruct-6bit --host 0.0.0.0 --port 12345

Default host/port: 0.0.0.0:18000

Most useful flags:

Flag Default What it does
-m / --model mlx-community/Qwen2-VL-2B-Instruct-4bit Path or HF repo.
--strip-thinking off Removes <think>…</think> blocks from model output.
--default-frames 4 How many frames to grab per video URL.
--col-a / --col-b qwen / vl Tokens that trigger the special Qwen-VL vision pipeline.

💬 Talking to it with the CLI

python infer.py --base-url http://localhost:18000/v1 -v --max_new_tokens 2048

Interactive keys

  • Ctrl-N: reset conversation
  • Ctrl-C: quit

🌐 HTTP API

POST /v1/chat/completions

{
  "model": "Dracarys2-72B-Instruct-4bit",
  "messages": [
    { "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,..." } }
      ]
    }
  ],
  "max_tokens": 512,
  "stream": false
}

Response (truncated):

{
  "id": "chatcmpl-d4c5…",
  "object": "chat.completion",
  "created": 1715242800,
  "model": "Dracarys2-72B-Instruct-4bit",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "The image shows…" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 143,
    "completion_tokens": 87,
    "total_tokens": 230,
    "tokens_per_second": 32.1
  }
}

Add "stream": true and you'll get Server-Sent Events chunks followed by data: [DONE].


🛠️ Internals (two-sentence tour)

  • server.py – loads the model with mlx-vlm, converts incoming OpenAI vision messages to the model's chat-template, handles images / video frames, and streams tokens back.
  • infer.py – lightweight REPL that keeps conversation context and shows latency / TPS stats.

That's it – drop it in front of any MLX model and start chatting!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kamiwaza_mlx-0.1.0.tar.gz (15.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kamiwaza_mlx-0.1.0-py3-none-any.whl (14.4 kB view details)

Uploaded Python 3

File details

Details for the file kamiwaza_mlx-0.1.0.tar.gz.

File metadata

  • Download URL: kamiwaza_mlx-0.1.0.tar.gz
  • Upload date:
  • Size: 15.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for kamiwaza_mlx-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d3008d9ab3a41da21a10151c63253b7125309ba58969fed7f5f80951467cbde9
MD5 a9fbbd0cd33e826456646789c3825db2
BLAKE2b-256 35cfdff531850dc4b8280c706e059bc4626834b1d1e197d8a2a4a9bdd2d15584

See more details on using hashes here.

File details

Details for the file kamiwaza_mlx-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: kamiwaza_mlx-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 14.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for kamiwaza_mlx-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3a1ead0c04bc4eec5ec67928c6640fd8452e6ab0fc43e781579f24c4c130152b
MD5 89a29260dfd3a1e8710745edf9e0bed2
BLAKE2b-256 a4aec165d36cd6362430ae7050a43e796cb8671853d37420295740711f56c1e0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page