vLLM-MLX
vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac
Overview
vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:
- MLX: Apple's ML framework with unified memory and Metal kernels
- mlx-lm: Optimized LLM inference with KV cache and quantization
- mlx-vlm: Vision-language models for multimodal inference
- mlx-audio: Speech-to-Text and Text-to-Speech with native voices
- mlx-embeddings: Text embeddings for semantic search and RAG
Features
- Multimodal - Text, Image, Video & Audio in one platform
- Native GPU acceleration on Apple Silicon (M1, M2, M3, M4)
- Native TTS voices - Spanish, French, Chinese, Japanese + 5 more languages
- OpenAI API compatible - drop-in replacement for OpenAI client
- Anthropic Messages API - native /v1/messages endpoint for Claude Code and OpenCode
- Embeddings - OpenAI-compatible /v1/embeddings endpoint with mlx-embeddings
- Reasoning Models - extract the thinking process from Qwen3 and DeepSeek-R1
- MCP Tool Calling - integrate external tools via Model Context Protocol
- Paged KV Cache - memory-efficient caching with prefix sharing
- Continuous Batching - high throughput for multiple concurrent users
Quick Start
Installation
Using uv (recommended):
# Install as CLI tool (system-wide)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git
# Or install in a project/virtual environment
uv pip install git+https://github.com/waybarrios/vllm-mlx.git
Using pip:
# Install from GitHub
pip install git+https://github.com/waybarrios/vllm-mlx.git
# Or clone and install in development mode
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .
Start Server
# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000
# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key
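Before wiring up a full client, you can sanity-check that the server is responding. A minimal sketch using the OpenAI SDK's model listing (this assumes the standard /v1/models route is exposed, as OpenAI-compatible servers normally do):
from openai import OpenAI

# List the models currently served at the local endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
for model in client.models.list():
    print(model.id)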
Use with OpenAI SDK
from openai import OpenAI
# Without API key (local development)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
# With API key (production)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
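Streaming follows the standard OpenAI pattern; a minimal sketch, assuming the server honors stream=True as OpenAI-compatible servers typically do:
# Print tokens as they arrive instead of waiting for the full reply
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()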
Use with Anthropic SDK
vllm-mlx exposes an Anthropic-compatible /v1/messages endpoint, so tools like Claude Code and OpenCode can connect directly.
from anthropic import Anthropic
client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")
response = client.messages.create(
model="default",
max_tokens=256,
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)
To use with Claude Code:
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
See Anthropic Messages API docs for streaming, tool calling, system messages, and token counting.
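For example, streaming uses the Anthropic SDK's usual helper; a sketch (the exact feature set is covered in the Messages API docs referenced above):
# Stream text deltas from the /v1/messages endpoint
with client.messages.stream(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Tell me a short joke."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()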
Multimodal (Images & Video)
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000
response = client.chat.completions.create(
model="default",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
}]
)
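Local files can also be sent as base64 data URLs, the standard OpenAI content format; a sketch with an illustrative file name:
import base64

# Encode a local image and pass it as a data URL
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this photo."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)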
Audio (TTS/STT)
# Install audio dependencies
pip install vllm-mlx[audio]
python -m spacy download en_core_web_sm
brew install espeak-ng # macOS, for non-English languages
# Text-to-Speech (English)
python examples/tts_example.py "Hello, how are you?" --play
# Text-to-Speech (Spanish)
python examples/tts_multilingual.py "Hola mundo" --lang es --play
# List available models and languages
python examples/tts_multilingual.py --list-models
python examples/tts_multilingual.py --list-languages
Supported TTS Models:
| Model | Languages | Description |
|---|---|---|
| Kokoro | EN, ES, FR, JA, ZH, IT, PT, HI | Fast, 82M params, 11 voices |
| Chatterbox | 15+ languages | Expressive, voice cloning |
| VibeVoice | EN | Realtime, low latency |
| VoxCPM | ZH, EN | High quality Chinese/English |
Reasoning Models
Extract the thinking process from reasoning models like Qwen3 and DeepSeek-R1:
# Start server with reasoning parser
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "What is 17 × 23?"}]
)
# Access reasoning separately from the answer
print("Thinking:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)
Supported Parsers:
| Parser | Models | Description |
|---|---|---|
| qwen3 | Qwen3 series | Requires both <think> and </think> tags |
| deepseek_r1 | DeepSeek-R1 | Handles implicit <think> tag |
Embeddings
Generate text embeddings for semantic search, RAG, and similarity:
# Start server with an embedding model pre-loaded
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --embedding-model mlx-community/all-MiniLM-L6-v2-4bit
# Generate embeddings using the OpenAI SDK
embeddings = client.embeddings.create(
model="mlx-community/all-MiniLM-L6-v2-4bit",
input=["Hello world", "How are you?"]
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")
See Embeddings Guide for details on supported models and lazy loading.
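The returned vectors can be used directly for similarity ranking; a small sketch with numpy (not part of vllm-mlx):
import numpy as np

# Rank candidate documents against a query by cosine similarity
docs = ["MLX runs on Apple Silicon GPUs.", "Bananas are rich in potassium."]
query = "How do I run models on a Mac GPU?"

model_name = "mlx-community/all-MiniLM-L6-v2-4bit"
doc_vecs = [d.embedding for d in client.embeddings.create(model=model_name, input=docs).data]
query_vec = client.embeddings.create(model=model_name, input=[query]).data[0].embedding

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ranked = sorted(zip(docs, (cosine(query_vec, v) for v in doc_vecs)),
                key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")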
Documentation
For full documentation, see the docs directory:
- Getting Started
- User Guides
- Reference
- Benchmarks
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ vLLM API Layer │
│ (OpenAI-compatible interface) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MLXPlatform │
│ (vLLM platform plugin for Apple Silicon) │
└─────────────────────────────────────────────────────────────────────────┘
│
┌─────────────┬────────────┴────────────┬─────────────┐
▼ ▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ mlx-lm │ │ mlx-vlm │ │ mlx-audio │ │mlx-embeddings │
│(LLM inference)│ │ (Vision+LLM) │ │ (TTS + STT) │ │ (Embeddings) │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
│ │ │ │
└─────────────┴─────────────────────────┴─────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MLX │
│ (Apple ML Framework - Metal kernels) │
└─────────────────────────────────────────────────────────────────────────┘
Performance
LLM Performance (M4 Max, 128GB):
| Model | Speed | Memory |
|---|---|---|
| Qwen3-0.6B-8bit | 402 tok/s | 0.7 GB |
| Llama-3.2-1B-4bit | 464 tok/s | 0.7 GB |
| Llama-3.2-3B-4bit | 200 tok/s | 1.8 GB |
Continuous Batching (5 concurrent requests):
| Model | Single | Batched | Speedup |
|---|---|---|---|
| Qwen3-0.6B-8bit | 328 tok/s | 1112 tok/s | 3.4x |
| Llama-3.2-1B-4bit | 299 tok/s | 613 tok/s | 2.0x |
Audio - Speech-to-Text (M4 Max, 128GB):
| Model | RTF* | Use Case |
|---|---|---|
| whisper-tiny | 197x | Real-time, low latency |
| whisper-large-v3-turbo | 55x | Best quality/speed balance |
| whisper-large-v3 | 24x | Highest accuracy |
*RTF = Real-Time Factor. RTF of 100x means 1 minute transcribes in ~0.6 seconds.
See benchmarks for detailed results.
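To get a rough feel for batched throughput on your own machine, here is an illustrative concurrency sketch against the OpenAI-compatible endpoint (not the project's benchmark harness; numbers depend on model and hardware):
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Fire several identical requests at once and report aggregate tokens/sec
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
CONCURRENCY = 5

def one_request(_):
    r = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": "Write a short poem about the sea."}],
        max_tokens=128,
    )
    return r.usage.completion_tokens if r.usage else 0

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    total_tokens = sum(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.time() - start
print(f"{total_tokens} completion tokens in {elapsed:.1f}s "
      f"(~{total_tokens / elapsed:.0f} tok/s aggregate)")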
Gemma 3 Support
vllm-mlx includes native support for Gemma 3 vision models; Gemma 3 is automatically detected as an MLLM (multimodal LLM).
Usage
# Start server with Gemma 3
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000
# Verify it loaded as MLLM (not LLM)
curl http://localhost:8000/health
# Should show: "model_type": "mllm"
Long Context Patch (mlx-vlm)
Gemma 3's default sliding_window=1024 limits context to ~10K tokens on Apple Silicon (the Metal GPU times out at longer contexts). To enable longer context (up to ~50K tokens), patch mlx-vlm:
Location: ~/.../site-packages/mlx_vlm/models/gemma3/language.py
Find the make_cache method and replace with:
def make_cache(self):
    import os
    # Set GEMMA3_SLIDING_WINDOW=8192 for ~40K context
    # Set GEMMA3_SLIDING_WINDOW=0 for ~50K context (full KVCache)
    sliding_window = int(os.environ.get('GEMMA3_SLIDING_WINDOW', self.config.sliding_window))
    caches = []
    for i in range(self.config.num_hidden_layers):
        if (
            i % self.config.sliding_window_pattern
            == self.config.sliding_window_pattern - 1
        ):
            caches.append(KVCache())
        elif sliding_window == 0:
            caches.append(KVCache())  # Full context for all layers
        else:
            caches.append(RotatingKVCache(max_size=sliding_window, keep=0))
    return caches
Usage:
# Default (~10K max context)
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000
# Extended context (~40K max)
GEMMA3_SLIDING_WINDOW=8192 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000
# Maximum context (~50K max)
GEMMA3_SLIDING_WINDOW=0 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000
Benchmark Results (M4 Max 128GB):
| Setting | Max Context | Memory |
|---|---|---|
| Default (1024) | ~10K tokens | ~16GB |
| GEMMA3_SLIDING_WINDOW=8192 | ~40K tokens | ~25GB |
| GEMMA3_SLIDING_WINDOW=0 | ~50K tokens | ~35GB |
Contributing
We welcome contributions! See Contributing Guide for details.
- Bug fixes and improvements
- Performance optimizations
- Documentation improvements
- Benchmarks on different Apple Silicon chips
Submit PRs to: https://github.com/waybarrios/vllm-mlx
License
Apache 2.0 - see LICENSE for details.
Citation
If you use vLLM-MLX in your research or project, please cite:
@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}
Acknowledgments