
mlx-serve


Local inference server for Apple Silicon that hot-swaps MLX models on demand — text, vision, embeddings, TTS, and STT — loading exactly one at a time to stay within unified memory limits.

Client / LiteLLM  -->  mlx-serve (port 8095)  -->  MLX model (one at a time)

Install

pip install mlx-serve[all]

# or pick only what you need:
pip install mlx-serve[text,vision]
pip install mlx-serve[embeddings,tts,stt]

Requires: Apple Silicon Mac (M1+), macOS 13+, Python 3.11+


Quick Start

# 1. Generate a default config
mlx-serve init

# 2. Edit models.yaml to list your models (see docs/configuration.md)

# 3. Start the server
mlx-serve start

# 4. Verify
curl http://localhost:8095/v1/models
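
Because the API is OpenAI-compatible, any OpenAI client can talk to the server as well. A minimal sketch using the official openai Python package (installed separately; not bundled with mlx-serve), assuming a model named mlx-qwen2.5-7b is listed in your models.yaml:

from openai import OpenAI

# Point the SDK at the local server; the key is a placeholder since auth is optional
client = OpenAI(base_url="http://localhost:8095/v1", api_key="none")

# List the models configured in models.yaml
for model in client.models.list():
    print(model.id)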

Why mlx-serve?

                        | mlx-serve                         | Ollama                | LM Studio          | mlx-openai-server
Runtime                 | MLX (native Apple)                | llama.cpp (Metal)     | Mixed              | MLX
Memory model            | One model, subprocess-isolated    | One model, in-process | GUI-managed        | In-process
Auto-unload             | Configurable timeout              | Yes                   | Manual             | No
Model types             | 5 (text, vision, embed, TTS, STT) | 1 (text)              | ~2                 | ~3
API                     | OpenAI-compatible                 | OpenAI-compatible     | OpenAI-compatible  | OpenAI-compatible
Headless / scriptable   | Yes                               | Yes                   | No (GUI)           | Yes
Open source             | MIT                               | MIT                   | No                 | MIT

Key differences:

  • vs Ollama — Ollama uses llama.cpp. mlx-serve uses Apple's native MLX framework, which typically achieves better throughput and memory efficiency on Apple Silicon. mlx-serve is what Ollama would be if it were built natively on MLX.
  • vs LM Studio — Closed source, requires a GUI, cannot be embedded in headless pipelines.
  • vs mlx-openai-server — Runs all models in-process, causing memory fragmentation over long sessions. mlx-serve isolates text/vision models as subprocesses so the OS reclaims all memory cleanly on unload.
  • vs Docker — MLX requires direct Metal GPU access. Docker on Mac runs a Linux VM without Metal. The correct topology: stateless services in Docker, mlx-serve on the Mac host via host.docker.internal.

Features

  • Hot-swap by model name — send a request to any configured model; the server loads it and unloads the previous one automatically
  • OpenAI-compatible API — drop-in with LiteLLM, any OpenAI SDK, or direct HTTP
  • All five MLX model types — text (mlx-lm), vision (mlx-vlm), embeddings (mlx-embeddings), TTS (mlx-audio), STT (mlx-whisper)
  • Subprocess isolation — text/vision models run as isolated subprocesses; embeddings/TTS/STT run in-process
  • Auto-unload on inactivity — configurable timeout (default 10 min) frees memory when idle
  • Per-request keep_alive — override the idle timeout per request ("keep_alive": "30m", "-1" for permanent, 0 to unload immediately); see the example after this list
  • Prompt caching — max_kv_cache_size per model caps KV cache token capacity for efficient prefix reuse
  • Model management API — preload, force-unload, delete from disk, show detail, pull from HuggingFace
  • Observability — request metrics (TTFT, TPS, latency), memory monitoring, lifecycle event log, dashboard endpoint
  • Optional auth — set MLX_API_KEY to protect all /v1/* endpoints
  • YAML config — add models by editing models.yaml, no code changes needed
  • CLI — mlx-serve init, start, stop, status, logs
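
The keep_alive override travels in the request body next to the normal OpenAI fields. A minimal sketch using the openai Python SDK's extra_body parameter, assuming the mlx-qwen2.5-7b model from the Quick Start:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8095/v1", api_key="none")

# extra_body merges additional JSON fields into the request,
# so keep_alive reaches the server alongside model and messages.
client.chat.completions.create(
    model="mlx-qwen2.5-7b",
    messages=[{"role": "user", "content": "Warm up and stay loaded"}],
    extra_body={"keep_alive": "30m"},  # keep the model in memory for 30 minutes after this request
)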

Supported Model Types

Type       | Backend                     | Endpoint                   | Capabilities
text       | mlx_lm.server subprocess    | /v1/chat/completions       | ["completion"]
vision     | mlx_vlm.server subprocess   | /v1/chat/completions       | ["completion", "vision"]
embedding  | mlx-embeddings in-process   | /v1/embeddings             | ["embedding"]
tts        | mlx-audio in-process        | /v1/audio/speech           | ["audio_speech"]
stt        | mlx-whisper in-process      | /v1/audio/transcriptions   | ["audio_transcription"]

Usage

Chat completion

curl http://localhost:8095/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-qwen2.5-7b",
    "messages": [{"role": "user", "content": "What is Apple Silicon?"}]
  }'
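
The same call through the openai Python SDK (a sketch; the package is a separate install and the model name comes from your models.yaml):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8095/v1", api_key="none")

response = client.chat.completions.create(
    model="mlx-qwen2.5-7b",
    messages=[{"role": "user", "content": "What is Apple Silicon?"}],
)
print(response.choices[0].message.content)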

Streaming

curl http://localhost:8095/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-qwen2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
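
And the streaming variant in Python, iterating over the server-sent chunks (same assumptions as above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8095/v1", api_key="none")

stream = client.chat.completions.create(
    model="mlx-qwen2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    # Guard against final or keep-alive chunks that carry no delta text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()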

Embeddings

curl http://localhost:8095/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-qwen3-embedding", "input": "Hello world"}'

Text-to-speech

curl http://localhost:8095/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-chatterbox", "input": "Hello from Apple Silicon."}' \
  --output speech.wav
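
A Python equivalent that mirrors the curl call and writes the returned audio to disk (a sketch using the requests package, which is not part of mlx-serve):

import requests

resp = requests.post(
    "http://localhost:8095/v1/audio/speech",
    json={"model": "mlx-chatterbox", "input": "Hello from Apple Silicon."},
)
resp.raise_for_status()
with open("speech.wav", "wb") as f:
    f.write(resp.content)  # raw audio bytes, same as curl --output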

Speech-to-text

curl http://localhost:8095/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=mlx-whisper-turbo"

LiteLLM Integration

mlx-serve is designed to sit behind LiteLLM in a Docker-on-Mac stack.

# litellm/config.yaml
model_list:
  - model_name: mlx-qwen2.5-7b
    litellm_params:
      model: openai/mlx-qwen2.5-7b
      api_base: http://host.docker.internal:8095/v1
      api_key: none

  - model_name: mlx-qwen3-embedding
    litellm_params:
      model: openai/mlx-qwen3-embedding
      api_base: http://host.docker.internal:8095/v1
      api_key: none
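
With that config loaded, clients talk to the LiteLLM proxy rather than to mlx-serve directly. A sketch assuming the proxy is running on its default port 4000:

from openai import OpenAI

# The LiteLLM proxy terminates the OpenAI API and forwards to mlx-serve on the Mac host.
# Any non-empty key works unless the proxy enforces a master key.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-anything")

resp = client.chat.completions.create(
    model="mlx-qwen2.5-7b",  # the model_name from litellm/config.yaml
    messages=[{"role": "user", "content": "Hello through LiteLLM"}],
)
print(resp.choices[0].message.content)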

Development

git clone https://github.com/raspoli/mlx-serve.git
cd mlx-serve
make install    # uv sync with all extras
make dev        # start with auto-reload
make test       # run test suite
make lint       # ruff check + format check

See docs/development.md for the full guide.


Documentation

Document               | Contents
docs/architecture.md   | System design, module map, state machines, request flows
docs/configuration.md  | Complete models.yaml reference, all settings
docs/api.md            | All endpoints, request/response schemas, curl examples
docs/development.md    | Setup, debugging, adding models, contributing

License

MIT
