mlx-openai-server
A high-performance, OpenAI-compatible API server for local MLX models on Apple Silicon. It serves text, multimodal, image, embedding, and Whisper models through familiar OpenAI SDK endpoints.
Requires macOS on Apple Silicon and Python 3.11+.
Contents
- Feature Launch
- Install
- Start a Server
- API Usage
- Server Options
- Multi-Model Config
- Long Context and Metal OOM
- Advanced LM Options
- Troubleshooting
- Examples and Demos
Feature Launch
Darwin 36B Opus is now available in MLX format for local text inference:
- Original model: FINAL-Bench/Darwin-36B-Opus
- MLX text-only 8-bit conversion: GiaHuy/Darwin-36B-Opus-mlx-text-only-8bit
Launch it with the required reasoning and tool-call parsers:
mlx-openai-server launch --model-path Darwin-36B-Opus-mlx-text-only-8bit --reasoning-parser qwen3_moe --tool-call-parser qwen3_coder --debug --served-model-name Darwin-36B-Opus
Without both --reasoning-parser qwen3_moe and --tool-call-parser qwen3_coder, Darwin 36B Opus's reasoning and tool-call output will not be parsed correctly.
Install
python3.11 -m venv .venv
source .venv/bin/activate
uv pip install mlx-openai-server
Install from GitHub instead:
uv pip install git+https://github.com/cubist38/mlx-openai-server.git
Whisper transcription also needs ffmpeg:
brew install ffmpeg
Start a Server
Text model:
mlx-openai-server launch \
--model-type lm \
--model-path mlx-community/Qwen3-Coder-Next-4bit \
--reasoning-parser qwen3_moe \
--tool-call-parser qwen3_coder
Point OpenAI-compatible clients to:
http://localhost:8000/v1
Use any non-empty API key, for example not-needed.
Common launch modes:
# Multimodal text/image/audio
mlx-openai-server launch \
--model-type multimodal \
--model-path <mlx-vlm-model>
# Image generation
mlx-openai-server launch \
--model-type image-generation \
--model-path <flux-or-qwen-image-model> \
--config-name flux-dev \
--quantize 8
# Image editing
mlx-openai-server launch \
--model-type image-edit \
--model-path <flux-or-qwen-image-edit-model> \
--config-name flux-kontext-dev \
--quantize 8
# Embeddings
mlx-openai-server launch \
--model-type embeddings \
--model-path <embedding-model>
# Whisper transcription
mlx-openai-server launch \
--model-type whisper \
--model-path mlx-community/whisper-large-v3-mlx
Supported model types:
| Type | Backend | Endpoint family |
|---|---|---|
| lm | mlx-lm | chat, responses |
| multimodal | mlx-vlm | chat, responses |
| image-generation | mflux | image generation |
| image-edit | mflux | image editing |
| embeddings | mlx-embeddings | embeddings |
| whisper | mlx-whisper | audio transcription |
Image --config-name values:
- Generation: flux-schnell, flux-dev, flux-krea-dev, flux2-klein-4b, flux2-klein-9b, qwen-image, z-image-turbo, fibo
- Editing: flux-kontext-dev, flux2-klein-edit-4b, flux2-klein-edit-9b, qwen-image-edit
API Usage
Chat
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="mlx-community/Qwen3-Coder-Next-4bit",
messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
Vision
import base64
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
with open("image.jpg", "rb") as f:
image = base64.b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="local-multimodal",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}},
],
}
],
)
print(response.choices[0].message.content)
Images
import base64
from io import BytesIO
import openai
from PIL import Image
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.images.generate(
model="local-image-generation-model",
prompt="A mountain lake at sunset",
size="1024x1024",
)
image = Image.open(BytesIO(base64.b64decode(response.data[0].b64_json)))
image.show()
Embeddings
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.embeddings.create(
model="local-embedding-model",
input=["The quick brown fox jumps over the lazy dog"],
)
print(len(response.data[0].embedding))
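A common next step is comparing embeddings by cosine similarity. The sketch below uses short placeholder vectors; in practice you would plug in the full response.data[i].embedding lists returned by the server:

```python
import math

# Cosine similarity between two embedding vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for server-returned embeddings.
a = [0.1, 0.3, 0.5]
b = [0.2, 0.1, 0.4]
print(round(cosine(a, b), 4))
```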
Responses API
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.responses.create(
model="local-model",
input="Write a three sentence story.",
)
for item in response.output:
if item.type == "message":
for part in item.content:
if getattr(part, "text", None):
print(part.text)
Supported endpoints:
| Endpoint | Model types |
|---|---|
| GET /v1/models | all |
| POST /v1/chat/completions | lm, multimodal |
| POST /v1/responses | lm, multimodal |
| POST /v1/images/generations | image-generation |
| POST /v1/images/edits | image-edit |
| POST /v1/embeddings | embeddings |
| POST /v1/audio/transcriptions | whisper |
The model field in API requests should be the model path, the --served-model-name value, or the YAML served_model_name.
Server Options
| Option | Default | Notes |
|---|---|---|
| --model-path | required | Local path or Hugging Face repo |
| --model-type | lm | lm, multimodal, image-generation, image-edit, embeddings, whisper |
| --served-model-name | model path | Alias accepted in API requests |
| --host | 0.0.0.0 | Bind host |
| --port | 8000 | Bind port |
| --context-length | model default | LM and multimodal context/cache length |
| --max-tokens | 100000 | Default generated tokens when request omits max_tokens |
| --temperature | 1.0 | Default sampling temperature |
| --top-p | 1.0 | Default nucleus sampling |
| --top-k | 20 | Default top-k sampling |
| --repetition-penalty | 1.0 | Default repetition penalty |
| --config-name | model-dependent | Image model preset |
| --quantize | unset | Image model quantization: 4, 8, or 16 |
| --log-level | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --no-log-file | false | Disable file logging |
LM-specific memory and batching options:
| Option | Default | Notes |
|---|---|---|
| --decode-concurrency | 32 | Max concurrent batch decode sequences |
| --prompt-concurrency | 8 | Max prompts prefilled together |
| --prefill-step-size | 2048 | Tokens per prefill step |
| --prompt-cache-size | 10 | Retained prompt KV cache entries |
| --max-bytes | unbounded | Prompt KV cache byte budget |
| --prompt-cache-dir | temp dir | Directory for disk-backed prompt KV cache payloads |
| --kv-bits | unset | KV cache quantization bits, usually 4 or 8 |
| --kv-group-size | 64 | KV quantization group size |
| --quantized-kv-start | 0 | Token step where KV quantization starts |
| --draft-model-path | unset | Smaller draft model for speculative decoding |
| --num-draft-tokens | 2 | Draft tokens proposed per step |
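As a rough guide to how these options interact with memory, here is a back-of-envelope KV-cache estimate. The formula and the model dimensions below are illustrative assumptions, not the server's actual accounting:

```python
# Rough KV-cache size: one K and one V tensor per layer, per concurrent
# sequence. Real MLX cache layouts and overheads differ; this is a sketch.
def kv_cache_bytes(layers, kv_heads, head_dim, context, bits, concurrency):
    return int(2 * layers * kv_heads * head_dim * context * (bits / 8) * concurrency)

# Hypothetical 32-layer model with 8 KV heads of dimension 128.
fp16 = kv_cache_bytes(32, 8, 128, context=8192, bits=16, concurrency=4)
q4 = kv_cache_bytes(32, 8, 128, context=8192, bits=4, concurrency=4)
print(f"fp16: {fp16 / 2**30:.1f} GiB, 4-bit: {q4 / 2**30:.1f} GiB")
```

The estimate scales linearly with --context-length and --decode-concurrency and drops by 4x with --kv-bits 4, which is why those are the main knobs for Metal OOM.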
Multi-Model Config
Use YAML when you want several models behind one server:
mlx-openai-server launch --config config.yaml
Example:
server:
host: "0.0.0.0"
port: 8000
log_level: INFO
models:
- model_path: mlx-community/MiniMax-M2.5-4bit
model_type: lm
served_model_name: minimax
tool_call_parser: minimax_m2
reasoning_parser: minimax_m2
- model_path: black-forest-labs/FLUX.2-klein-4B
model_type: image-generation
served_model_name: flux2-klein-4b
config_name: flux2-klein-4b
quantize: 4
on_demand: true
on_demand_idle_timeout: 120
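To sketch how requests are routed under a config like this, the model field of an incoming request is matched against each entry's served_model_name. The lookup below mirrors the YAML example as plain Python data and is illustrative, not the server's internals:

```python
# The YAML models list, expressed as Python data for illustration.
models = [
    {"model_path": "mlx-community/MiniMax-M2.5-4bit",
     "model_type": "lm", "served_model_name": "minimax"},
    {"model_path": "black-forest-labs/FLUX.2-klein-4B",
     "model_type": "image-generation", "served_model_name": "flux2-klein-4b"},
]

# Requests route by the served_model_name alias.
routes = {m["served_model_name"]: m for m in models}
print(routes["minimax"]["model_path"])
```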
Important YAML keys:
| Key | Notes |
|---|---|
| model_path, model_type, served_model_name | Model identity and routing |
| context_length | LM/multimodal context length |
| prompt_cache_size, prompt_cache_max_bytes, prompt_cache_dir | Prompt KV cache limits and disk location |
| batch_completion_size, batch_prefill_size, batch_prefill_step_size | Continuous batching limits |
| kv_bits, kv_group_size, quantized_kv_start | KV cache quantization |
| default_max_tokens | Default generated tokens |
| on_demand, on_demand_idle_timeout | Load large models only when requested |
In multi-model mode, each model runs in a spawned subprocess. This isolates MLX/Metal runtime state and avoids process-fork semaphore issues on macOS.
Long Context and Metal OOM
Large prompts, high concurrency, and long generations all increase KV-cache memory. If Metal runs out of memory, macOS may terminate the process with:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Internal Error (0000000e:Internal Error)
Start with conservative settings:
mlx-openai-server launch \
--model-type lm \
--model-path <model-path> \
--context-length 8192 \
--decode-concurrency 4 \
--prompt-concurrency 1 \
--prefill-step-size 512 \
--max-tokens 2048 \
--prompt-cache-size 1 \
--max-bytes 2147483648 \
--kv-bits 4 \
--kv-group-size 64 \
--quantized-kv-start 0
Tune in this order:
- Lower --prompt-concurrency and --prefill-step-size to reduce prefill spikes.
- Lower --decode-concurrency to reduce active KV caches.
- Lower --max-tokens to bound generation growth.
- Lower --prompt-cache-size and set --max-bytes to limit retained caches.
- Use --kv-bits 4 for supported LM/multimodal models.
- Reduce --context-length if the model still does not fit.
YAML equivalent:
models:
- model_path: <model-path>
model_type: lm
context_length: 8192
batch_completion_size: 4
batch_prefill_size: 1
batch_prefill_step_size: 512
default_max_tokens: 2048
prompt_cache_size: 1
prompt_cache_max_bytes: 2147483648
kv_bits: 4
kv_group_size: 64
quantized_kv_start: 0
For large models on macOS 15+, you can also raise the wired memory limit:
bash configure_mlx.sh
Advanced LM Options
Tool and Reasoning Parsers
Some models need parser flags for tool calls or reasoning blocks:
mlx-openai-server launch \
--model-type lm \
--model-path <model> \
--tool-call-parser qwen3 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice
Common parser names include qwen3, qwen3_5, glm4_moe, qwen3_coder, qwen3_moe, qwen3_next, qwen3_vl, harmony, and minimax_m2.
Message converters are auto-detected from parser selection when a compatible converter exists.
Custom Chat Templates
mlx-openai-server launch \
--model-type lm \
--model-path <model> \
--chat-template-file /path/to/template.jinja
Speculative Decoding
mlx-openai-server launch \
--model-type lm \
--model-path <main-model> \
--draft-model-path <smaller-draft-model> \
--num-draft-tokens 4
Speculative decoding is only available for the lm model type and is not used by the continuous batching path.
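To give an intuition for what the draft model does, here is a toy draft-and-verify step in plain Python. The token "models" are stand-in functions; this shows the general idea, not the server's implementation:

```python
# Toy speculative decoding step: the cheap draft model proposes
# num_draft_tokens tokens; the main model accepts the longest agreeing prefix
# and contributes one corrected token on the first mismatch.
def speculative_step(main_next, draft_next, prefix, num_draft_tokens):
    ctx = list(prefix)
    proposed = []
    for _ in range(num_draft_tokens):  # cheap draft pass
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:  # main-model verification
        m = main_next(ctx)
        accepted.append(m)
        if m != t:  # draft diverged; keep the main model's token and stop
            break
        ctx.append(t)
    return accepted

# Stand-in "models": main continues +1; the draft sometimes guesses wrong.
main = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + (1 if len(ctx) % 3 else 2)
print(speculative_step(main, draft, [0, 1], 4))
```

Each step emits at least one main-model token, so output quality matches the main model alone; --num-draft-tokens controls how much work is speculated per step.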
Structured Outputs
Chat completions accept OpenAI-style response_format JSON schema. The Responses API also supports client.responses.parse() with Pydantic models. See examples/structured_outputs_examples.ipynb and examples/responses_api.ipynb.
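For reference, an OpenAI-style response_format payload has the shape below; the city_info schema is a made-up example:

```python
# A JSON-schema response_format payload in the OpenAI chat-completions style.
# Pass this dict as response_format=... to client.chat.completions.create().
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "city_info",  # illustrative schema name
        "schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "population": {"type": "integer"},
            },
            "required": ["city", "population"],
        },
    },
}
print(response_format["type"])
```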
Troubleshooting
| Issue | Fix |
|---|---|
| Model does not fit in memory | Use a smaller or pre-quantized model, lower --context-length, and see Long Context and Metal OOM. |
| Metal OOM during batching | Lower --prompt-concurrency, --prefill-step-size, --decode-concurrency, and --max-tokens. |
| Port already in use | Pass --port 8001 or another free port. |
| Image model memory is too high | Use --quantize 4 or --quantize 8. |
| Model loading says parameters are missing or unexpected | Upgrade the backend package from source. |
| Hugging Face download fails | Check network access and Hugging Face authentication for gated models. |
Upgrade backend packages for newly released model architectures:
uv pip install git+https://github.com/ml-explore/mlx-lm.git
uv pip install git+https://github.com/Blaizzy/mlx-vlm.git
uv pip install git+https://github.com/Blaizzy/mlx-embeddings.git
Examples and Demos
Example notebooks live in examples/:
| Area | Notebooks |
|---|---|
| Text and Responses API | responses_api.ipynb, simple_rag_demo.ipynb |
| Vision | vision_examples.ipynb |
| Audio | audio_examples.ipynb, transcription_examples.ipynb |
| Embeddings | embedding_examples.ipynb, lm_embeddings_examples.ipynb, vlm_embeddings_examples.ipynb |
| Images | image_generations.ipynb, image_edit.ipynb |
| Structured outputs | structured_outputs_examples.ipynb |
Contributing
- Fork the repository.
- Create a feature branch.
- Make changes with tests.
- Submit a pull request.
Use conventional commit prefixes when possible, for example fix:, feat:, docs:, or test:.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- License: MIT
Acknowledgments
Built on MLX, mlx-lm, mlx-vlm, mlx-embeddings, mflux, mlx-whisper, and mlx-community.