vLLM-MLX

vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac

Requirements: Python 3.10+, Apple Silicon.

Overview

vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:

  • MLX: Apple's ML framework with unified memory and Metal kernels
  • mlx-lm: Optimized LLM inference with KV cache and quantization
  • mlx-vlm: Vision-language models for multimodal inference
  • mlx-audio: Speech-to-Text and Text-to-Speech with native voices
  • mlx-embeddings: Text embeddings for semantic search and RAG

Features

  • Multimodal - Text, Image, Video & Audio in one platform
  • Native GPU acceleration on Apple Silicon (M1, M2, M3, M4)
  • Native TTS voices - Spanish, French, Chinese, Japanese + 5 more languages
  • OpenAI API compatible - drop-in replacement for the OpenAI API, usable from the standard OpenAI client
  • Anthropic Messages API - native /v1/messages endpoint for Claude Code and OpenCode
  • Embeddings - OpenAI-compatible /v1/embeddings endpoint with mlx-embeddings
  • Reasoning Models - extract thinking process from Qwen3, DeepSeek-R1
  • MCP Tool Calling - integrate external tools via Model Context Protocol
  • Paged KV Cache - memory-efficient caching with prefix sharing
  • Continuous Batching - high throughput for multiple concurrent users

Quick Start

Installation

Using uv (recommended):

# Install as CLI tool (system-wide)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git

# Or install in a project/virtual environment
uv pip install git+https://github.com/waybarrios/vllm-mlx.git

Using pip:

# Install from GitHub
pip install git+https://github.com/waybarrios/vllm-mlx.git

# Or clone and install in development mode
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .

Start Server

# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key

Use with OpenAI SDK

from openai import OpenAI

# Without API key (local development); the SDK still needs a placeholder value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# With API key: pass the same value the server was started with via --api-key
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
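
For clients that cannot use the OpenAI SDK, the same call can be made over plain HTTP. The sketch below assumes the server follows the standard OpenAI wire format (a POST to /v1/chat/completions with the API key sent as a Bearer token). `build_chat_request` and `send` are hypothetical helpers written for this example and are not part of vllm-mlx.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, api_key="not-needed", model="default"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The OpenAI SDK sends the API key as a Bearer token.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   req = build_chat_request("http://localhost:8000/v1", "Hello!")
#   print(send(req))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything goes over the network.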

Use with Anthropic SDK

vllm-mlx exposes an Anthropic-compatible /v1/messages endpoint, so tools like Claude Code and OpenCode can connect directly.

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

To use with Claude Code:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude

See Anthropic Messages API docs for streaming, tool calling, system messages, and token counting.
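
The Messages format differs from the OpenAI one in a few ways that matter when building requests by hand: `max_tokens` is required, the system prompt is a top-level field rather than a message role, and the Anthropic SDK sends the key in an `x-api-key` header together with an `anthropic-version` header. The sketch below builds such a request with the standard library. `build_messages_request` is a hypothetical helper, and whether vllm-mlx inspects these headers is an assumption, not something stated here.

```python
import json
import urllib.request


def build_messages_request(base_url, prompt, system=None,
                           api_key="not-needed", max_tokens=256):
    """Build (but do not send) an Anthropic-style /v1/messages request."""
    payload = {
        "model": "default",
        "max_tokens": max_tokens,  # required by the Messages API
        "messages": [{"role": "user", "content": prompt}],
    }
    if system is not None:
        # The system prompt is a top-level field, not a message role.
        payload["system"] = system
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


# Usage (requires a running server):
#   req = build_messages_request("http://localhost:8000", "Hello!",
#                                system="Answer briefly.")
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["content"][0]["text"])
```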

Multimodal (Images & Video)

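# Start server with a vision-language model
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000

# Send an image with the OpenAI SDK (client created as shown above)
response = client.chat.completions.create(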
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
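
For images stored on disk rather than at a URL, OpenAI-style clients commonly embed the bytes as a base64 data URL. The sketch below builds such a message. It assumes the server accepts data URLs in `image_url`, which is not confirmed here, and `image_content_part` and `vision_message` are hypothetical helper names.

```python
import base64
import mimetypes


def image_content_part(filename, data):
    """Wrap raw image bytes as an OpenAI-style image_url content part."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}"},
    }


def vision_message(question, filename, data):
    """Build a user message that pairs a text question with one image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            image_content_part(filename, data),
        ],
    }


# Usage (requires a running server and a local file):
#   with open("photo.jpg", "rb") as f:
#       msg = vision_message("What's in this image?", "photo.jpg", f.read())
#   response = client.chat.completions.create(model="default", messages=[msg])
```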

Audio (TTS/STT)

# Install audio dependencies (quotes keep zsh from expanding the brackets)
pip install "vllm-mlx[audio]"
python -m spacy download en_core_web_sm
brew install espeak-ng  # macOS, for non-English languages

# Text-to-Speech (English)
python examples/tts_example.py "Hello, how are you?" --play

# Text-to-Speech (Spanish)
python examples/tts_multilingual.py "Hola mundo" --lang es --play

# List available models and languages
python examples/tts_multilingual.py --list-models
python examples/tts_multilingual.py --list-languages

Supported TTS Models:

Model | Languages | Description
Kokoro | EN, ES, FR, JA, ZH, IT, PT, HI | Fast, 82M params, 11 voices
Chatterbox | 15+ languages | Expressive, voice cloning
VibeVoice | EN | Realtime, low latency
VoxCPM | ZH, EN | High quality Chinese/English

Reasoning Models

Extract the thinking process from reasoning models like Qwen3 and DeepSeek-R1:

# Start server with reasoning parser
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# Query the model with the OpenAI SDK (client created as shown above)
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}]
)

# Access reasoning separately from the answer
print("Thinking:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)

Supported Parsers:

Parser | Models | Description
qwen3 | Qwen3 series | Requires both <think> and </think> tags
deepseek_r1 | DeepSeek-R1 | Handles implicit <think> tag
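
The difference between the two parsers can be illustrated with a small standalone function. This is only a sketch of the behaviour described in the table, not the project's actual parser code, and `split_reasoning` and its `implicit_open` flag are hypothetical names.

```python
def split_reasoning(text, implicit_open=False):
    """Split raw model output into (reasoning, answer).

    implicit_open=False mimics the qwen3 row: both <think> and </think>
    must be present, otherwise nothing is extracted.
    implicit_open=True mimics the deepseek_r1 row: a missing opening tag
    is tolerated and everything before </think> counts as reasoning.
    """
    open_tag, close_tag = "<think>", "</think>"
    end = text.find(close_tag)
    if end == -1:
        # No closing tag: treat the whole output as the answer.
        return None, text.strip()
    start = text.find(open_tag)
    if start != -1 and start < end:
        reasoning = text[start + len(open_tag):end]
    elif implicit_open:
        reasoning = text[:end]
    else:
        return None, text.strip()
    answer = text[end + len(close_tag):]
    return reasoning.strip(), answer.strip()


# Both tags present: works in either mode.
print(split_reasoning("<think>17 * 23 = 391</think>391"))
# Opening tag missing: only the implicit mode extracts the reasoning.
print(split_reasoning("17 * 23 = 391</think>391", implicit_open=True))
```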

Embeddings

Generate text embeddings for semantic search, RAG, and similarity:

# Start server with an embedding model pre-loaded
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --embedding-model mlx-community/all-MiniLM-L6-v2-4bit

# Generate embeddings using the OpenAI SDK
embeddings = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input=["Hello world", "How are you?"]
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")

See Embeddings Guide for details on supported models and lazy loading.
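
Once the vectors are available, semantic search reduces to comparing them. A minimal sketch using cosine similarity in pure Python follows; `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and for larger collections a vectorised library would be the more practical choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Usage with the response from client.embeddings.create(...):
#   vectors = [item.embedding for item in embeddings.data]
#   print(cosine_similarity(vectors[0], vectors[1]))
```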

Documentation

For full documentation, see the docs directory.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           vLLM API Layer                                │
│                    (OpenAI-compatible interface)                         │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            MLXPlatform                                  │
│               (vLLM platform plugin for Apple Silicon)                  │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
        ┌─────────────┬────────────┴────────────┬─────────────┐
        ▼             ▼                         ▼             ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    mlx-lm     │ │   mlx-vlm     │ │   mlx-audio   │ │mlx-embeddings │
│(LLM inference)│ │ (Vision+LLM)  │ │  (TTS + STT)  │ │ (Embeddings)  │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
        │             │                         │             │
        └─────────────┴─────────────────────────┴─────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                              MLX                                        │
│                (Apple ML Framework - Metal kernels)                      │
└─────────────────────────────────────────────────────────────────────────┘

Performance

LLM Performance (M4 Max, 128GB):

Model | Speed | Memory
Qwen3-0.6B-8bit | 402 tok/s | 0.7 GB
Llama-3.2-1B-4bit | 464 tok/s | 0.7 GB
Llama-3.2-3B-4bit | 200 tok/s | 1.8 GB

Continuous Batching (5 concurrent requests):

Model | Single | Batched | Speedup
Qwen3-0.6B-8bit | 328 tok/s | 1112 tok/s | 3.4x
Llama-3.2-1B-4bit | 299 tok/s | 613 tok/s | 2.0x
Audio - Speech-to-Text (M4 Max, 128GB):

Model | RTF* | Use Case
whisper-tiny | 197x | Real-time, low latency
whisper-large-v3-turbo | 55x | Best quality/speed balance
whisper-large-v3 | 24x | Highest accuracy

*RTF = Real-Time Factor, reported here as a speedup over real time. An RTF of 100x means 1 minute of audio transcribes in ~0.6 seconds.
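
The footnote's arithmetic generalises to any clip length: expected transcription time is the audio duration divided by the RTF. A small sketch, with `transcription_seconds` as a hypothetical helper name:

```python
def transcription_seconds(audio_seconds, rtf_speedup):
    """Estimate transcription time from audio length and an RTF speedup."""
    if rtf_speedup <= 0:
        raise ValueError("rtf_speedup must be positive")
    return audio_seconds / rtf_speedup


# The footnote's example: one minute of audio at 100x.
print(transcription_seconds(60, 100))    # 0.6
# One hour of audio with whisper-large-v3 at the 24x listed above.
print(transcription_seconds(3600, 24))   # 150.0
```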

See benchmarks for detailed results.

Gemma 3 Support

vllm-mlx includes native support for Gemma 3 vision models. Gemma 3 is automatically detected as a multimodal LLM (MLLM).

Usage

# Start server with Gemma 3
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Verify it loaded as MLLM (not LLM)
curl http://localhost:8000/health
# Should show: "model_type": "mllm"
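
The same check can be scripted. The sketch below assumes /health returns a JSON object containing the `model_type` key shown above; any other fields in the response are not relied on, and `fetch_health` and `is_multimodal` are hypothetical helper names.

```python
import json
import urllib.request


def fetch_health(base_url="http://localhost:8000"):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/health") as resp:
        return json.load(resp)


def is_multimodal(health):
    """True if the health payload reports the model as an MLLM."""
    return health.get("model_type") == "mllm"


# Usage (requires a running server):
#   if not is_multimodal(fetch_health()):
#       raise SystemExit("model loaded as a text-only LLM; image input will not work")
```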

Long Context Patch (mlx-vlm)

Gemma 3's default sliding_window=1024 limits context to ~10K tokens on Apple Silicon, because the Metal GPU times out at larger contexts. To enable longer context (up to ~50K tokens), patch mlx-vlm:

Location: ~/.../site-packages/mlx_vlm/models/gemma3/language.py

Find the make_cache method and replace it with:

def make_cache(self):
    import os
    # Set GEMMA3_SLIDING_WINDOW=8192 for ~40K context
    # Set GEMMA3_SLIDING_WINDOW=0 for ~50K context (full KVCache)
    sliding_window = int(os.environ.get('GEMMA3_SLIDING_WINDOW', self.config.sliding_window))

    caches = []
    for i in range(self.config.num_hidden_layers):
        if (
            i % self.config.sliding_window_pattern
            == self.config.sliding_window_pattern - 1
        ):
            caches.append(KVCache())
        elif sliding_window == 0:
            caches.append(KVCache())  # Full context for all layers
        else:
            caches.append(RotatingKVCache(max_size=sliding_window, keep=0))
    return caches
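
To see which layers end up with which cache type, the selection logic of the patch can be mirrored in a standalone function that returns labels instead of cache objects. This is an illustrative sketch: `cache_layout` is a hypothetical name, and the layer count and pattern in the example are made-up values, not Gemma 3's real configuration.

```python
def cache_layout(num_hidden_layers, sliding_window_pattern, sliding_window):
    """Return one label per layer: 'full' or 'rotating(N)'."""
    layout = []
    for i in range(num_hidden_layers):
        # Every pattern-th layer is a global-attention layer with a full cache.
        is_global = i % sliding_window_pattern == sliding_window_pattern - 1
        if is_global or sliding_window == 0:
            layout.append("full")
        else:
            layout.append(f"rotating({sliding_window})")
    return layout


# Illustrative values only: 6 layers, every 6th layer global.
print(cache_layout(6, 6, 1024))
print(cache_layout(6, 6, 0))  # a window of 0 gives a full cache on every layer
```

With a window of 0 every layer keeps its full history, which is consistent with that setting having the highest memory use of the three.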

Usage:

# Default (~10K max context)
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Extended context (~40K max)
GEMMA3_SLIDING_WINDOW=8192 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Maximum context (~50K max)
GEMMA3_SLIDING_WINDOW=0 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

Benchmark Results (M4 Max 128GB):

Setting | Max Context | Memory
Default (1024) | ~10K tokens | ~16 GB
GEMMA3_SLIDING_WINDOW=8192 | ~40K tokens | ~25 GB
GEMMA3_SLIDING_WINDOW=0 | ~50K tokens | ~35 GB

Contributing

We welcome contributions! See Contributing Guide for details.

  • Bug fixes and improvements
  • Performance optimizations
  • Documentation improvements
  • Benchmarks on different Apple Silicon chips

Submit PRs to: https://github.com/waybarrios/vllm-mlx

License

Apache 2.0 - see LICENSE for details.

Citation

If you use vLLM-MLX in your research or project, please cite:

@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}
