
mlx-openai-server


A high-performance OpenAI-compatible API server for MLX models. Run text, vision, audio, and image generation models locally on Apple Silicon with a drop-in OpenAI replacement.

Note: Requires macOS with M-series chips (MLX is optimized for Apple Silicon).

Key Features

  • 🚀 OpenAI-compatible API - Drop-in replacement for OpenAI services
  • 🖼️ Multimodal support - Text, vision, audio, and image generation/editing
  • 🎨 Flux-series models - Image generation (schnell, dev, krea-dev, flux-2-klein) and editing (kontext, qwen-image-edit)
  • 🔌 Easy integration - Works with existing OpenAI client libraries
  • ⚡ Performance - Configurable quantization (4/8/16-bit) and context length
  • 🎛️ LoRA adapters - Fine-tuned image generation and editing
  • 📈 Queue management - Built-in request queuing and monitoring

Installation

Prerequisites

  • macOS with Apple Silicon (M-series)
  • Python 3.11+

Quick Install

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install from PyPI
pip install mlx-openai-server

# Or install from GitHub
pip install git+https://github.com/cubist38/mlx-openai-server.git

Optional: Whisper Support

For audio transcription models, install ffmpeg:

brew install ffmpeg

Quick Start

Start the Server

# Text-only or multimodal models
mlx-openai-server launch \
  --model-path <path-to-mlx-model> \
  --model-type <lm|multimodal>

# Image generation (Flux-series)
mlx-openai-server launch \
  --model-type image-generation \
  --model-path <path-to-flux-model> \
  --config-name flux-dev \
  --quantize 8

# Image editing
mlx-openai-server launch \
  --model-type image-edit \
  --model-path <path-to-flux-model> \
  --config-name flux-kontext-dev \
  --quantize 8

# Embeddings
mlx-openai-server launch \
  --model-type embeddings \
  --model-path <embeddings-model-path>

# Whisper (audio transcription)
mlx-openai-server launch \
  --model-type whisper \
  --model-path mlx-community/whisper-large-v3-mlx

Server Parameters

  • --model-path: Path to MLX model (local or HuggingFace repo)
  • --model-type: lm, multimodal, image-generation, image-edit, embeddings, or whisper
  • --config-name: For image models - flux-schnell, flux-dev, flux-krea-dev, flux-kontext-dev, flux2-klein-4b, flux2-klein-9b, qwen-image, qwen-image-edit, z-image-turbo, fibo
  • --quantize: Quantization level - 4, 8, or 16 (image models)
  • --context-length: Max sequence length for memory optimization
  • --max-concurrency: Concurrent requests (default: 1)
  • --queue-timeout: Request timeout in seconds (default: 300)
  • --lora-paths: Comma-separated LoRA adapter paths (image models)
  • --lora-scales: Comma-separated LoRA scales (must match paths)
  • --log-level: DEBUG, INFO, WARNING, ERROR, CRITICAL (default: INFO)
  • --no-log-file: Disable file logging (console only)
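
Since --lora-scales must have exactly as many entries as --lora-paths, a quick sanity check before launching can catch mismatches early. A minimal sketch (the parse_lora_args helper is illustrative, not part of the package):

```python
def parse_lora_args(paths_arg: str, scales_arg: str):
    """Split comma-separated LoRA paths/scales and check they pair up."""
    paths = [p.strip() for p in paths_arg.split(",") if p.strip()]
    scales = [float(s) for s in scales_arg.split(",") if s.strip()]
    if len(paths) != len(scales):
        raise ValueError(
            f"--lora-scales has {len(scales)} entries but --lora-paths has {len(paths)}"
        )
    return list(zip(paths, scales))

pairs = parse_lora_args("adapters/style.safetensors, adapters/detail.safetensors", "0.8, 0.6")
print(pairs)
```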

Supported Model Types

  1. Text-only (lm) - Language models via mlx-lm
  2. Multimodal (multimodal) - Text, images, audio via mlx-vlm
  3. Image generation (image-generation) - Flux-series, Qwen Image, Z-Image Turbo, Fibo
  4. Image editing (image-edit) - Flux kontext, Qwen Image Edit
  5. Embeddings (embeddings) - Text embeddings via mlx-embeddings
  6. Whisper (whisper) - Audio transcription (requires ffmpeg)

Image Model Configurations

Generation:

  • flux-schnell - Fast (4 steps, no guidance)
  • flux-dev - Balanced (25 steps, 3.5 guidance)
  • flux-krea-dev - High quality (28 steps, 4.5 guidance)
  • flux2-klein-4b / flux2-klein-9b - Flux 2 Klein models
  • qwen-image - Qwen image generation (50 steps, 4.0 guidance)
  • z-image-turbo - Z-Image Turbo
  • fibo - Fibo model

Editing:

  • flux-kontext-dev - Context-aware editing (28 steps, 2.5 guidance)
  • flux2-klein-edit-4b / flux2-klein-edit-9b - Flux 2 Klein editing
  • qwen-image-edit - Qwen image editing (50 steps, 4.0 guidance)
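
When selecting a config programmatically, the defaults above can be kept in a small lookup table. The numbers mirror the lists above; None marks values the list leaves unspecified:

```python
# Default (steps, guidance) per image config, as documented above.
IMAGE_CONFIG_DEFAULTS = {
    "flux-schnell":     (4,  None),  # fast, no guidance
    "flux-dev":         (25, 3.5),
    "flux-krea-dev":    (28, 4.5),
    "qwen-image":       (50, 4.0),
    "flux-kontext-dev": (28, 2.5),
    "qwen-image-edit":  (50, 4.0),
}

steps, guidance = IMAGE_CONFIG_DEFAULTS["flux-dev"]
print(steps, guidance)  # 25 3.5
```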

Using the API

The server provides OpenAI-compatible endpoints. Use standard OpenAI client libraries:

Text Completion

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)

Vision (Multimodal)

import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("image.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode('utf-8')

response = client.chat.completions.create(
    model="local-multimodal",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        ]
    }]
)
print(response.choices[0].message.content)
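
The base64 step above works for any image format; a small helper that infers the MIME type from the filename keeps the data URL correct for PNGs as well as JPEGs (this helper is illustrative, not part of the package):

```python
import base64
import mimetypes

def to_data_url(path: str, raw: bytes) -> str:
    """Build a data: URL for an image, inferring the MIME type from the extension."""
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    return f"data:{mime};base64,{base64.b64encode(raw).decode('utf-8')}"

url = to_data_url("photo.png", b"\x89PNG...")
print(url[:22])  # data:image/png;base64,
```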

Image Generation

import openai
import base64
from io import BytesIO
from PIL import Image

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.images.generate(
    prompt="A serene landscape with mountains and a lake at sunset",
    model="local-image-generation-model",
    size="1024x1024"
)

image_data = base64.b64decode(response.data[0].b64_json)
image = Image.open(BytesIO(image_data))
image.show()

Image Editing

import openai
import base64
from io import BytesIO
from PIL import Image

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("image.png", "rb") as f:
    result = client.images.edit(
        image=f,
        prompt="make it like a photo in 1800s",
        model="flux-kontext-dev"
    )

image_data = base64.b64decode(result.data[0].b64_json)
image = Image.open(BytesIO(image_data))
image.show()

Function Calling

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

messages = [{"role": "user", "content": "What is the weather in Tokyo?"}]
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather in a given city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"}
            }
        }
    }
}]

completion = client.chat.completions.create(
    model="local-model",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

if completion.choices[0].message.tool_calls:
    tool_call = completion.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
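
The model returns the tool call's arguments as a JSON string, so executing it locally is a json.loads plus a lookup. A sketch with a stand-in weather function (get_weather here is a local stub, not a real service):

```python
import json

def get_weather(city: str) -> str:
    """Stub: a real implementation would call a weather API."""
    return f"Sunny in {city}"

TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch(name: str, arguments: str) -> str:
    """Look up the tool by name and call it with the decoded JSON arguments."""
    return TOOL_REGISTRY[name](**json.loads(arguments))

# tool_call.function.name / tool_call.function.arguments from the completion above feed in here
result = dispatch("get_weather", '{"city": "Tokyo"}')
print(result)  # Sunny in Tokyo
```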

Embeddings

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.embeddings.create(
    model="local-model",
    input=["The quick brown fox jumps over the lazy dog"]
)

print(f"Embedding dimension: {len(response.data[0].embedding)}")
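
Embeddings are typically compared by cosine similarity; with the response above, that is a few lines of stdlib math (the vectors below are toy stand-ins for real embedding output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# In practice: a = response.data[0].embedding, b = another embedding
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(round(cosine_similarity(a, b), 4))  # 1.0 (parallel vectors)
```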

Structured Outputs (JSON Schema)

import openai
import json

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "Address",
        "schema": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "state": {"type": "string"},
                "zip": {"type": "string"}
            },
            "required": ["street", "city", "state", "zip"]
        }
    }
}

completion = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Format: 1 Hacker Wy Menlo Park CA 94025"}],
    response_format=response_format
)

address = json.loads(completion.choices[0].message.content)
print(json.dumps(address, indent=2))
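
Even with a JSON-schema response format, a defensive check that the decoded object carries the required keys is cheap (a minimal sketch; a full validator such as jsonschema would go further):

```python
REQUIRED = ["street", "city", "state", "zip"]

def missing_required(obj: dict, required=REQUIRED):
    """Return the required keys absent from a decoded response."""
    return [key for key in required if key not in obj]

address = {"street": "1 Hacker Wy", "city": "Menlo Park", "state": "CA", "zip": "94025"}
print(missing_required(address))                # []
print(missing_required({"city": "Menlo Park"}))  # ['street', 'state', 'zip']
```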

Advanced Configuration

Parser Configuration

For models requiring custom parsing (tool calls, reasoning):

mlx-openai-server launch \
  --model-path <path-to-model> \
  --model-type lm \
  --tool-call-parser qwen3 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice

Available parsers: qwen3, glm4_moe, qwen3_coder, qwen3_moe, qwen3_next, qwen3_vl, harmony, minimax_m2

Message Converters

For models requiring message format conversion:

mlx-openai-server launch \
  --model-path <path-to-model> \
  --model-type lm \
  --message-converter glm4_moe

Available converters: glm4_moe, minimax_m2, nemotron3_nano, qwen3_coder

Custom Chat Templates

mlx-openai-server launch \
  --model-path <path-to-model> \
  --model-type lm \
  --chat-template-file /path/to/template.jinja

Request Queue System

The server includes a request queue system with monitoring:

# Check queue status
curl http://localhost:8000/v1/queue/stats

Response:

{
  "status": "ok",
  "queue_stats": {
    "running": true,
    "queue_size": 3,
    "max_queue_size": 100,
    "active_requests": 1,
    "max_concurrency": 1
  }
}
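
The stats payload makes it easy to compute remaining capacity before submitting a batch; a sketch that parses the sample response above:

```python
import json

sample = '''{
  "status": "ok",
  "queue_stats": {
    "running": true,
    "queue_size": 3,
    "max_queue_size": 100,
    "active_requests": 1,
    "max_concurrency": 1
  }
}'''

stats = json.loads(sample)["queue_stats"]
free_slots = stats["max_queue_size"] - stats["queue_size"]
print(free_slots)  # 97
```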

Example Notebooks

Check the examples/ directory for comprehensive guides:

  • audio_examples.ipynb - Audio processing
  • embedding_examples.ipynb - Text embeddings
  • lm_embeddings_examples.ipynb - Language model embeddings
  • vlm_embeddings_examples.ipynb - Vision-language embeddings
  • vision_examples.ipynb - Vision capabilities
  • image_generations.ipynb - Image generation
  • image_edit.ipynb - Image editing
  • structured_outputs_examples.ipynb - JSON schema outputs
  • simple_rag_demo.ipynb - RAG pipeline demo

Large Models

For models that don't fit comfortably in RAM, performance on macOS 15.0+ can be improved by running:

bash configure_mlx.sh

This raises the system's wired memory limit for better performance.

Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

Follow Conventional Commits for commit messages.


License

MIT License - see LICENSE file for details.

Acknowledgments

Built on top of Apple's MLX framework and the mlx-lm, mlx-vlm, and mlx-embeddings libraries.
