mlx-serve
Local inference server for Apple Silicon that hot-swaps MLX models on demand — text, vision, embeddings, TTS, and STT — loading exactly one at a time to stay within unified memory limits.
Client / LiteLLM --> mlx-serve (port 8095) --> MLX model (one at a time)
Install
pip install mlx-serve[all]
# or pick only what you need:
pip install mlx-serve[text,vision]
pip install mlx-serve[embeddings,tts,stt]
Requires: Apple Silicon Mac (M1+), macOS 13+, Python 3.11+
Quick Start
# 1. Generate a default config
mlx-serve init
# 2. Edit models.yaml to list your models (see docs/configuration.md; a sketch follows below)
# 3. Start the server
mlx-serve start
# 4. Verify
curl http://localhost:8095/v1/models
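Step 2 asks you to list your models in models.yaml. The snippet below is only an illustrative sketch: the key names are assumptions, not the real schema, which is documented in docs/configuration.md.
# models.yaml (illustrative sketch; key names are assumed, see docs/configuration.md)
models:
  - name: mlx-qwen2.5-7b                           # name clients send as "model"
    type: text                                     # text | vision | embedding | tts | stt
    repo: mlx-community/Qwen2.5-7B-Instruct-4bit   # HuggingFace repo to pull from
  - name: mlx-whisper-turbo
    type: stt
    repo: mlx-community/whisper-large-v3-turbo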
Why mlx-serve?
| | mlx-serve | Ollama | LM Studio | mlx-openai-server |
|---|---|---|---|---|
| Runtime | MLX (native Apple) | llama.cpp (Metal) | Mixed | MLX |
| Memory model | One model, subprocess-isolated | One model, in-process | GUI-managed | In-process |
| Auto-unload | Configurable timeout | Yes | Manual | No |
| Model types | 5 (text, vision, embed, TTS, STT) | 1 (text) | ~2 | ~3 |
| API | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible |
| Headless / scriptable | Yes | Yes | No (GUI) | Yes |
| Open source | MIT | MIT | No | MIT |
Key differences:
- vs Ollama — Ollama uses llama.cpp. mlx-serve uses Apple's native MLX framework, which typically achieves better throughput and memory efficiency on Apple Silicon. mlx-serve is what Ollama would be if it were built natively on MLX.
- vs LM Studio — Closed source, requires a GUI, cannot be embedded in headless pipelines.
- vs mlx-openai-server — Runs all models in-process, causing memory fragmentation over long sessions. mlx-serve isolates text/vision models as subprocesses so the OS reclaims all memory cleanly on unload.
- vs Docker — MLX requires direct Metal GPU access. Docker on Mac runs a Linux VM without Metal. The correct topology: stateless services in Docker, mlx-serve on the Mac host via host.docker.internal.
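A minimal sketch of that topology with Docker Compose, where the image tag and flags are illustrative rather than a verified setup; mlx-serve itself stays on the Mac host.
# docker-compose.yml (illustrative sketch)
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest   # assumed image tag
    ports:
      - "4000:4000"
    volumes:
      - ./litellm/config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml"]
# mlx-serve is not listed here: it runs directly on the Mac host (mlx-serve start)
# and containers reach it at http://host.docker.internal:8095/v1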
Features
- Hot-swap by model name — send a request to any configured model; the server loads it and unloads the previous one automatically
- OpenAI-compatible API — drop-in with LiteLLM, any OpenAI SDK, or direct HTTP
- All five MLX model types — text (mlx-lm), vision (mlx-vlm), embeddings (mlx-embeddings), TTS (mlx-audio), STT (mlx-whisper)
- Subprocess isolation — text/vision models run as isolated subprocesses; embeddings/TTS/STT run in-process
- Auto-unload on inactivity — configurable timeout (default 10 min) frees memory when idle
- Per-request keep_alive — override the idle timeout per request ("keep_alive": "30m", "-1" for permanent, 0 to unload immediately); see the example after this list
- Prompt caching — max_kv_cache_size per model caps KV cache token capacity for efficient prefix reuse
- Model management API — preload, force-unload, delete from disk, show detail, pull from HuggingFace
- Observability — request metrics (TTFT, TPS, latency), memory monitoring, lifecycle event log, dashboard endpoint
- Optional auth — set MLX_API_KEY to protect all /v1/* endpoints
- YAML config — add models by editing models.yaml, no code changes needed
- CLI — mlx-serve init, start, stop, status, logs
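For example, a request can hold a model in memory past the default timeout. A minimal sketch using the keep_alive field described above, with a model name from your models.yaml:
curl http://localhost:8095/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-qwen2.5-7b",
    "messages": [{"role": "user", "content": "Warm up"}],
    "keep_alive": "30m"
  }'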
Supported Model Types
| Type | Backend | Endpoint | Capabilities |
|---|---|---|---|
| text | mlx_lm.server subprocess | /v1/chat/completions | ["completion"] |
| vision | mlx_vlm.server subprocess | /v1/chat/completions | ["completion", "vision"] |
| embedding | mlx-embeddings in-process | /v1/embeddings | ["embedding"] |
| tts | mlx-audio in-process | /v1/audio/speech | ["audio_speech"] |
| stt | mlx-whisper in-process | /v1/audio/transcriptions | ["audio_transcription"] |
Usage
Chat completion
curl http://localhost:8095/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-qwen2.5-7b",
"messages": [{"role": "user", "content": "What is Apple Silicon?"}]
}'
Streaming
curl http://localhost:8095/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-qwen2.5-7b",
"messages": [{"role": "user", "content": "Count to 5"}],
"stream": true
}'
Embeddings
curl http://localhost:8095/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "mlx-qwen3-embedding", "input": "Hello world"}'
Text-to-speech
curl http://localhost:8095/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "mlx-chatterbox", "input": "Hello from Apple Silicon."}' \
--output speech.wav
Speech-to-text
curl http://localhost:8095/v1/audio/transcriptions \
-F "file=@recording.wav" \
-F "model=mlx-whisper-turbo"
LiteLLM Integration
mlx-serve is designed to sit behind LiteLLM in a Docker-on-Mac stack.
# litellm/config.yaml
model_list:
  - model_name: mlx-qwen2.5-7b
    litellm_params:
      model: openai/mlx-qwen2.5-7b
      api_base: http://host.docker.internal:8095/v1
      api_key: none
  - model_name: mlx-qwen3-embedding
    litellm_params:
      model: openai/mlx-qwen3-embedding
      api_base: http://host.docker.internal:8095/v1
      api_key: none
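Containers then send OpenAI-format requests to LiteLLM, which forwards them to mlx-serve on the host. A sketch assuming the LiteLLM proxy is exposed on its default port 4000; add an Authorization header only if you configured a LiteLLM master key.
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-qwen2.5-7b",
    "messages": [{"role": "user", "content": "Routed through LiteLLM"}]
  }'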
Development
git clone https://github.com/raspoli/mlx-serve.git
cd mlx-serve
make install # uv sync with all extras
make dev # start with auto-reload
make test # run test suite
make lint # ruff check + format check
See docs/development.md for the full guide.
Documentation
| Document | Contents |
|---|---|
| docs/architecture.md | System design, module map, state machines, request flows |
| docs/configuration.md | models.yaml complete reference, all settings |
| docs/api.md | All endpoints, request/response schemas, curl examples |
| docs/development.md | Setup, debugging, adding models, contributing |
License
MIT