mlx-openai-server
A high-performance OpenAI-compatible API server for MLX models. Run text, vision, audio, and image generation models locally on Apple Silicon with a drop-in OpenAI replacement.
Note: Requires macOS with M-series chips (MLX is optimized for Apple Silicon).
Key Features
- 🚀 OpenAI-compatible API - Drop-in replacement for OpenAI services
- 🖼️ Multimodal support - Text, vision, audio, and image generation/editing
- 🎨 Flux-series models - Image generation (schnell, dev, krea-dev, flux-2-klein) and editing (kontext, qwen-image-edit)
- 🔌 Easy integration - Works with existing OpenAI client libraries
- ⚡ Performance - Configurable quantization (4/8/16-bit) and context length
- 🎛️ LoRA adapters - Fine-tuned image generation and editing
- 📈 Queue management - Built-in request queuing and monitoring
Installation
Prerequisites
- macOS with Apple Silicon (M-series)
- Python 3.11+
Quick Install
# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate
# Install from PyPI
pip install mlx-openai-server
# Or install from GitHub
pip install git+https://github.com/cubist38/mlx-openai-server.git
Optional: Whisper Support
For audio transcription models, install ffmpeg:
brew install ffmpeg
Quick Start
Start the Server
# Text-only or multimodal models
mlx-openai-server launch \
--model-path <path-to-mlx-model> \
--model-type <lm|multimodal>
# Image generation (Flux-series)
mlx-openai-server launch \
--model-type image-generation \
--model-path <path-to-flux-model> \
--config-name flux-dev \
--quantize 8
# Image editing
mlx-openai-server launch \
--model-type image-edit \
--model-path <path-to-flux-model> \
--config-name flux-kontext-dev \
--quantize 8
# Embeddings
mlx-openai-server launch \
--model-type embeddings \
--model-path <embeddings-model-path>
# Whisper (audio transcription)
mlx-openai-server launch \
--model-type whisper \
--model-path mlx-community/whisper-large-v3-mlx
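Once a Whisper server is running, a transcription can be requested over HTTP. This sketch assumes the standard OpenAI-style `/v1/audio/transcriptions` route and uses a placeholder file name (`speech.mp3`):

```shell
# Transcribe a local audio file (speech.mp3 is a placeholder)
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@speech.mp3 \
  -F model=local-whisper
```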
Server Parameters
- `--model-path`: Path to MLX model (local or HuggingFace repo)
- `--model-type`: `lm`, `multimodal`, `image-generation`, `image-edit`, `embeddings`, or `whisper`
- `--config-name`: For image models - `flux-schnell`, `flux-dev`, `flux-krea-dev`, `flux-kontext-dev`, `flux2-klein-4b`, `flux2-klein-9b`, `qwen-image`, `qwen-image-edit`, `z-image-turbo`, `fibo`
- `--quantize`: Quantization level - `4`, `8`, or `16` (image models)
- `--context-length`: Max sequence length for memory optimization
- `--max-concurrency`: Concurrent requests (default: 1)
- `--queue-timeout`: Request timeout in seconds (default: 300)
- `--lora-paths`: Comma-separated LoRA adapter paths (image models)
- `--lora-scales`: Comma-separated LoRA scales (must match paths)
- `--log-level`: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` (default: `INFO`)
- `--no-log-file`: Disable file logging (console only)
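Combining the flags above, a launch with two LoRA adapters might look like the following (the adapter paths and scales are placeholders, not shipped defaults):

```
mlx-openai-server launch \
  --model-type image-generation \
  --model-path <path-to-flux-model> \
  --config-name flux-dev \
  --quantize 8 \
  --lora-paths /path/to/lora1,/path/to/lora2 \
  --lora-scales 0.8,0.6
```

Each entry in `--lora-scales` applies to the adapter at the same position in `--lora-paths`, so the two lists must have the same length.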
Supported Model Types
- Text-only (`lm`) - Language models via `mlx-lm`
- Multimodal (`multimodal`) - Text, images, audio via `mlx-vlm`
- Image generation (`image-generation`) - Flux-series, Qwen Image, Z-Image Turbo, Fibo
- Image editing (`image-edit`) - Flux Kontext, Qwen Image Edit
- Embeddings (`embeddings`) - Text embeddings via `mlx-embeddings`
- Whisper (`whisper`) - Audio transcription (requires ffmpeg)
Image Model Configurations
Generation:
- `flux-schnell` - Fast (4 steps, no guidance)
- `flux-dev` - Balanced (25 steps, 3.5 guidance)
- `flux-krea-dev` - High quality (28 steps, 4.5 guidance)
- `flux2-klein-4b` / `flux2-klein-9b` - Flux 2 Klein models
- `qwen-image` - Qwen image generation (50 steps, 4.0 guidance)
- `z-image-turbo` - Z-Image Turbo
- `fibo` - Fibo model
Editing:
- `flux-kontext-dev` - Context-aware editing (28 steps, 2.5 guidance)
- `flux2-klein-edit-4b` / `flux2-klein-edit-9b` - Flux 2 Klein editing
- `qwen-image-edit` - Qwen image editing (50 steps, 4.0 guidance)
Using the API
The server provides OpenAI-compatible endpoints. Use standard OpenAI client libraries:
Text Completion
import openai
client = openai.OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)
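For incremental output, the standard OpenAI streaming interface should also work against the same endpoint. A sketch, assuming the server streams chunked completions:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# stream=True yields chunks as the model generates tokens
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```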
Vision (Multimodal)
import openai
import base64
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
with open("image.jpg", "rb") as f:
base64_image = base64.b64encode(f.read()).decode('utf-8')
response = client.chat.completions.create(
model="local-multimodal",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
]
}]
)
print(response.choices[0].message.content)
Image Generation
import openai
import base64
from io import BytesIO
from PIL import Image
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.images.generate(
prompt="A serene landscape with mountains and a lake at sunset",
model="local-image-generation-model",
size="1024x1024"
)
image_data = base64.b64decode(response.data[0].b64_json)
image = Image.open(BytesIO(image_data))
image.show()
Image Editing
import openai
import base64
from io import BytesIO
from PIL import Image
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
with open("image.png", "rb") as f:
result = client.images.edit(
image=f,
prompt="make it look like a photo from the 1800s",
model="flux-kontext-dev"
)
image_data = base64.b64decode(result.data[0].b64_json)
image = Image.open(BytesIO(image_data))
image.show()
Function Calling
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
messages = [{"role": "user", "content": "What is the weather in Tokyo?"}]
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the weather in a given city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "The city name"}
}
}
}
}]
completion = client.chat.completions.create(
model="local-model",
messages=messages,
tools=tools,
tool_choice="auto"
)
if completion.choices[0].message.tool_calls:
tool_call = completion.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")
Embeddings
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.embeddings.create(
model="local-model",
input=["The quick brown fox jumps over the lazy dog"]
)
print(f"Embedding dimension: {len(response.data[0].embedding)}")
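Downstream, embeddings are typically compared with cosine similarity. A minimal, dependency-free helper (illustrative only, not part of the server):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Pass the vectors returned in `response.data[i].embedding` to rank documents by similarity to a query.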
Structured Outputs (JSON Schema)
import openai
import json
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response_format = {
"type": "json_schema",
"json_schema": {
"name": "Address",
"schema": {
"type": "object",
"properties": {
"street": {"type": "string"},
"city": {"type": "string"},
"state": {"type": "string"},
"zip": {"type": "string"}
},
"required": ["street", "city", "state", "zip"]
}
}
}
completion = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Format: 1 Hacker Wy Menlo Park CA 94025"}],
response_format=response_format
)
address = json.loads(completion.choices[0].message.content)
print(json.dumps(address, indent=2))
Advanced Configuration
Parser Configuration
For models requiring custom parsing (tool calls, reasoning):
mlx-openai-server launch \
--model-path <path-to-model> \
--model-type lm \
--tool-call-parser qwen3 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice
Available parsers: `qwen3`, `glm4_moe`, `qwen3_coder`, `qwen3_moe`, `qwen3_next`, `qwen3_vl`, `harmony`, `minimax_m2`
Message Converters
For models requiring message format conversion:
mlx-openai-server launch \
--model-path <path-to-model> \
--model-type lm \
--message-converter glm4_moe
Available converters: `glm4_moe`, `minimax_m2`, `nemotron3_nano`, `qwen3_coder`
Custom Chat Templates
mlx-openai-server launch \
--model-path <path-to-model> \
--model-type lm \
--chat-template-file /path/to/template.jinja
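A chat template is a Jinja file that maps the messages list to the model's prompt format. As an illustration only (ChatML-style markers, not tied to any particular model), such a template might look like:

```jinja
{% for message in messages %}<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}
```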
Request Queue System
The server includes a request queue system with monitoring:
# Check queue status
curl http://localhost:8000/v1/queue/stats
Response:
{
"status": "ok",
"queue_stats": {
"running": true,
"queue_size": 3,
"max_queue_size": 100,
"active_requests": 1,
"max_concurrency": 1
}
}
Example Notebooks
Check the examples/ directory for comprehensive guides:
- `audio_examples.ipynb` - Audio processing
- `embedding_examples.ipynb` - Text embeddings
- `lm_embeddings_examples.ipynb` - Language model embeddings
- `vlm_embeddings_examples.ipynb` - Vision-language embeddings
- `vision_examples.ipynb` - Vision capabilities
- `image_generations.ipynb` - Image generation
- `image_edit.ipynb` - Image editing
- `structured_outputs_examples.ipynb` - JSON schema outputs
- `simple_rag_demo.ipynb` - RAG pipeline demo
Large Models
For models that push against available RAM, performance can be improved on macOS 15.0+ by running:
bash configure_mlx.sh
This raises the system's wired memory limit for better performance.
Contributing
We welcome contributions! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
Follow Conventional Commits for commit messages.
Support
- Documentation: This README and example notebooks
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Video Tutorials: Setup Demo, RAG Demo
License
MIT License - see LICENSE file for details.
Acknowledgments
Built on top of:
- MLX - Apple's ML framework
- mlx-lm - Language models
- mlx-vlm - Multimodal models
- mlx-embeddings - Embeddings
- mflux - Flux image models
- mlx-whisper - Audio transcription
- mlx-community - Model repository
Project details
File details
Details for the file mlx_openai_server-1.5.2.tar.gz.
File metadata
- Download URL: mlx_openai_server-1.5.2.tar.gz
- Size: 88.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `b0aecbc80646f7ee3837891313712259f22d3d1c95fa3c8ce73cb74bd596a976` |
| MD5 | `25debd6c69b113244afd5b9d59363c7d` |
| BLAKE2b-256 | `2458ac30499cfbed3be8e08b9b6316cbb02a8b2e84c1ebdd11105ac1b7a2fd5e` |
File details
Details for the file mlx_openai_server-1.5.2-py3-none-any.whl.
File metadata
- Download URL: mlx_openai_server-1.5.2-py3-none-any.whl
- Size: 104.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f2ee90ab25b4db5973d04f2dd0642fd40d57483eb738f2a86e5a46b30c6f8bb3` |
| MD5 | `a17ed509f86a2d1f3c7d5f90fa097283` |
| BLAKE2b-256 | `35711d68e7b6f8c7b5bb38c8578f417df3fc227f3824bd29251f6690b8f39820` |