
mlx-serve


Local inference server for Apple Silicon that hot-swaps MLX models on demand — text, vision, embeddings, TTS, and STT — loading exactly one at a time to stay within unified memory limits.

Client / LiteLLM  -->  mlx-serve (port 8095)  -->  MLX model (one at a time)

Install

pip install mlx-serve[all]

# or pick only what you need:
pip install mlx-serve[text,vision]
pip install mlx-serve[embeddings,tts,stt]

Requires: Apple Silicon Mac (M1+), macOS 13+, Python 3.11+


Quick Start

# 1. Generate a default config
mlx-serve init

# 2. Edit models.yaml to list your models (see docs/configuration.md)

# 3. Start the server
mlx-serve start

# 4. Verify
curl http://localhost:8095/v1/models
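
Because the API is OpenAI-compatible, any OpenAI client can talk to the server as well. A minimal sketch using the official openai Python package (installed separately; not bundled with mlx-serve), assuming a model named mlx-qwen2.5-7b is listed in your models.yaml:

from openai import OpenAI

# Point the SDK at the local server; the key is a placeholder since auth is optional
client = OpenAI(base_url="http://localhost:8095/v1", api_key="none")

# List the models configured in models.yaml
for model in client.models.list():
    print(model.id)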

Why mlx-serve?

                        | mlx-serve                         | Ollama                | LM Studio          | mlx-openai-server
Runtime                 | MLX (native Apple)                | llama.cpp (Metal)     | Mixed              | MLX
Memory model            | One model, subprocess-isolated    | One model, in-process | GUI-managed        | In-process
Auto-unload             | Configurable timeout              | Yes                   | Manual             | No
Model types             | 5 (text, vision, embed, TTS, STT) | 1 (text)              | ~2                 | ~3
API                     | OpenAI-compatible                 | OpenAI-compatible     | OpenAI-compatible  | OpenAI-compatible
Headless / scriptable   | Yes                               | Yes                   | No (GUI)           | Yes
Open source             | MIT                               | MIT                   | No                 | MIT

Key differences:

  • vs Ollama — Ollama uses llama.cpp. mlx-serve uses Apple's native MLX framework, which typically achieves better throughput and memory efficiency on Apple Silicon. mlx-serve is what Ollama would be if it were built natively on MLX.
  • vs LM Studio — Closed source, requires a GUI, cannot be embedded in headless pipelines.
  • vs mlx-openai-server — Runs all models in-process, causing memory fragmentation over long sessions. mlx-serve isolates text/vision models as subprocesses so the OS reclaims all memory cleanly on unload.
  • vs Docker — MLX requires direct Metal GPU access. Docker on Mac runs a Linux VM without Metal. The correct topology: stateless services in Docker, mlx-serve on the Mac host via host.docker.internal.

Features

  • Hot-swap by model name — send a request to any configured model; the server loads it and unloads the previous one automatically
  • OpenAI-compatible API — drop-in with LiteLLM, any OpenAI SDK, or direct HTTP
  • All five MLX model types — text (mlx-lm), vision (mlx-vlm), embeddings (mlx-embeddings), TTS (mlx-audio), STT (mlx-whisper)
  • Subprocess isolation — text/vision models run as isolated subprocesses; embeddings/TTS/STT run in-process
  • Auto-unload on inactivity — configurable timeout (default 10 min) frees memory when idle
  • Per-request keep_alive — override the idle timeout per request ("keep_alive": "30m", "-1" for permanent, 0 to unload immediately); see the example after this list
  • Prompt caching — max_kv_cache_size per model caps KV cache token capacity for efficient prefix reuse
  • Model management API — preload, force-unload, delete from disk, show detail, pull from HuggingFace
  • Observability — request metrics (TTFT, TPS, latency), memory monitoring, lifecycle event log, dashboard endpoint
  • Optional auth — set MLX_API_KEY to protect all /v1/* endpoints
  • YAML config — add models by editing models.yaml, no code changes needed
  • CLI — mlx-serve init, start, stop, status, logs
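
The keep_alive override travels in the request body next to the normal OpenAI fields. A minimal sketch using the openai Python SDK's extra_body parameter, assuming the mlx-qwen2.5-7b model from the Quick Start:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8095/v1", api_key="none")

# extra_body merges additional JSON fields into the request,
# so keep_alive reaches the server alongside model and messages.
client.chat.completions.create(
    model="mlx-qwen2.5-7b",
    messages=[{"role": "user", "content": "Warm up and stay loaded"}],
    extra_body={"keep_alive": "30m"},  # keep the model in memory for 30 minutes after this request
)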

Supported Model Types

Type       | Backend                     | Endpoint                   | Capabilities
text       | mlx_lm.server subprocess    | /v1/chat/completions       | ["completion"]
vision     | mlx_vlm.server subprocess   | /v1/chat/completions       | ["completion", "vision"]
embedding  | mlx-embeddings in-process   | /v1/embeddings             | ["embedding"]
tts        | mlx-audio in-process        | /v1/audio/speech           | ["audio_speech"]
stt        | mlx-whisper in-process      | /v1/audio/transcriptions   | ["audio_transcription"]

Usage

Chat completion

curl http://localhost:8095/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-qwen2.5-7b",
    "messages": [{"role": "user", "content": "What is Apple Silicon?"}]
  }'
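
The same call through the openai Python SDK (a sketch; the package is a separate install and the model name comes from your models.yaml):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8095/v1", api_key="none")

response = client.chat.completions.create(
    model="mlx-qwen2.5-7b",
    messages=[{"role": "user", "content": "What is Apple Silicon?"}],
)
print(response.choices[0].message.content)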

Streaming

curl http://localhost:8095/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-qwen2.5-7b",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
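
And the streaming variant in Python, iterating over the server-sent chunks (same assumptions as above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8095/v1", api_key="none")

stream = client.chat.completions.create(
    model="mlx-qwen2.5-7b",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    # Guard against final or keep-alive chunks that carry no delta text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()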

Embeddings

curl http://localhost:8095/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-qwen3-embedding", "input": "Hello world"}'

Text-to-speech

curl http://localhost:8095/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-chatterbox", "input": "Hello from Apple Silicon."}' \
  --output speech.wav
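
A Python equivalent that mirrors the curl call and writes the returned audio to disk (a sketch using the requests package, which is not part of mlx-serve):

import requests

resp = requests.post(
    "http://localhost:8095/v1/audio/speech",
    json={"model": "mlx-chatterbox", "input": "Hello from Apple Silicon."},
)
resp.raise_for_status()
with open("speech.wav", "wb") as f:
    f.write(resp.content)  # raw audio bytes, same as curl --output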

Speech-to-text

curl http://localhost:8095/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=mlx-whisper-turbo"

LiteLLM Integration

mlx-serve is designed to sit behind LiteLLM in a Docker-on-Mac stack.

# litellm/config.yaml
model_list:
  - model_name: mlx-qwen2.5-7b
    litellm_params:
      model: openai/mlx-qwen2.5-7b
      api_base: http://host.docker.internal:8095/v1
      api_key: none

  - model_name: mlx-qwen3-embedding
    litellm_params:
      model: openai/mlx-qwen3-embedding
      api_base: http://host.docker.internal:8095/v1
      api_key: none
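
With that config loaded, clients talk to the LiteLLM proxy rather than to mlx-serve directly. A sketch assuming the proxy is running on its default port 4000:

from openai import OpenAI

# The LiteLLM proxy terminates the OpenAI API and forwards to mlx-serve on the Mac host.
# Any non-empty key works unless the proxy enforces a master key.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-anything")

resp = client.chat.completions.create(
    model="mlx-qwen2.5-7b",  # the model_name from litellm/config.yaml
    messages=[{"role": "user", "content": "Hello through LiteLLM"}],
)
print(resp.choices[0].message.content)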

Development

git clone https://github.com/raspoli/mlx-serve.git
cd mlx-serve
make install    # uv sync with all extras
make dev        # start with auto-reload
make test       # run test suite
make lint       # ruff check + format check

See docs/development.md for the full guide.


Documentation

Document               | Contents
docs/architecture.md   | System design, module map, state machines, request flows
docs/configuration.md  | Complete models.yaml reference, all settings
docs/api.md            | All endpoints, request/response schemas, curl examples
docs/development.md    | Setup, debugging, adding models, contributing

License

MIT
