vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac

vLLM-MLX

Overview

vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:

MLX: Apple's ML framework with unified memory and Metal kernels
mlx-lm: Optimized LLM inference with KV cache and quantization
mlx-vlm: Vision-language models for multimodal inference
mlx-audio: Speech-to-Text and Text-to-Speech with native voices
mlx-embeddings: Text embeddings for semantic search and RAG

Features

Multimodal - Text, Image, Video & Audio in one platform
Native GPU acceleration on Apple Silicon (M1, M2, M3, M4)
Native TTS voices - Spanish, French, Chinese, Japanese + 5 more languages
OpenAI API compatible - works with the standard OpenAI client as a drop-in server
Anthropic Messages API - native /v1/messages endpoint for Claude Code and OpenCode
Embeddings - OpenAI-compatible /v1/embeddings endpoint with mlx-embeddings
Reasoning Models - extract thinking process from Qwen3, DeepSeek-R1
MCP Tool Calling - integrate external tools via Model Context Protocol
Paged KV Cache - memory-efficient caching with prefix sharing
Continuous Batching - high throughput for multiple concurrent users

Quick Start

Installation

Using uv (recommended):

# Install as CLI tool (system-wide)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git

# Or install in a project/virtual environment
uv pip install git+https://github.com/waybarrios/vllm-mlx.git

Using pip:

# Install from GitHub
pip install git+https://github.com/waybarrios/vllm-mlx.git

# Or clone and install in development mode
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .

Start Server

# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key
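
For clients that do not use an SDK, a request can be assembled by hand. The sketch below assumes the server accepts the key as an `Authorization: Bearer` header (the way the OpenAI SDK transmits its `api_key`) and exposes the usual `/v1/chat/completions` path. The helper names `build_chat_request` and `send` are illustrative and not part of vllm-mlx.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, api_key=None, model="default"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the key travels as a Bearer token, as the OpenAI SDK sends it.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=60):
    """Send the request to a running server and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `send(build_chat_request("http://localhost:8000", "Hello!", api_key="your-secret-key"))` would return the parsed response body.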

Use with OpenAI SDK

from openai import OpenAI

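# Without API key (local development). The SDK expects an api_key argument
# (or OPENAI_API_KEY in the environment), so pass a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# With API key (production, server started with --api-key): use this client instead.
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")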

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Use with Anthropic SDK

vllm-mlx exposes an Anthropic-compatible /v1/messages endpoint, so tools like Claude Code and OpenCode can connect directly.

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

To use with Claude Code:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude

See Anthropic Messages API docs for streaming, tool calling, system messages, and token counting.
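
When porting code from the OpenAI chat format, two differences in the Anthropic Messages format tend to matter: `max_tokens` is required, and the system prompt is a top-level `system` field rather than a message with role `system`. A minimal sketch of a payload builder follows. The helper name `build_messages_payload` is hypothetical, and whether vllm-mlx honors every field is not confirmed here.

```python
def build_messages_payload(prompt, system=None, max_tokens=256, model="default"):
    """Return a dict shaped like an Anthropic Messages API request body."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    payload = {
        "model": model,
        "max_tokens": max_tokens,  # required by the Messages API
        "messages": [{"role": "user", "content": prompt}],
    }
    if system is not None:
        # The system prompt is a top-level field, not a message role.
        payload["system"] = system
    return payload


payload = build_messages_payload("Hello!", system="Answer briefly.")
print(payload)
```

The resulting dict can be unpacked into the SDK call, for example `client.messages.create(**payload)`.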

Multimodal (Images & Video)

vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000

# Reuses the OpenAI client created in "Use with OpenAI SDK" above
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
print(response.choices[0].message.content)
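
For images on local disk, the OpenAI chat format allows a base64 `data:` URL in place of an HTTP URL. The sketch below builds such a message. It assumes vllm-mlx accepts data URLs in `image_url` the same way, which the text above does not confirm, and the helper names are illustrative.

```python
import base64
import mimetypes


def image_to_data_url(image_bytes, filename):
    """Encode raw image bytes as a data: URL; the MIME type comes from the file name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt, image_bytes, filename):
    """Build a user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": image_to_data_url(image_bytes, filename)}},
        ],
    }
```

A message built this way can be passed as `messages=[image_message("What's in this image?", open("photo.jpg", "rb").read(), "photo.jpg")]`.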

Audio (TTS/STT)

# Install audio dependencies
pip install "vllm-mlx[audio]"  # quotes keep zsh from expanding the brackets
python -m spacy download en_core_web_sm
brew install espeak-ng  # macOS, for non-English languages

# Text-to-Speech (English)
python examples/tts_example.py "Hello, how are you?" --play

# Text-to-Speech (Spanish)
python examples/tts_multilingual.py "Hola mundo" --lang es --play

# List available models and languages
python examples/tts_multilingual.py --list-models
python examples/tts_multilingual.py --list-languages

Supported TTS Models:

Model	Languages	Description
Kokoro	EN, ES, FR, JA, ZH, IT, PT, HI	Fast, 82M params, 11 voices
Chatterbox	15+ languages	Expressive, voice cloning
VibeVoice	EN	Realtime, low latency
VoxCPM	ZH, EN	High quality Chinese/English

Reasoning Models

Extract the thinking process from reasoning models like Qwen3 and DeepSeek-R1:

# Start server with reasoning parser
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}]
)

# Access reasoning separately from the answer
print("Thinking:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)

Supported Parsers:

Parser	Models	Description
`qwen3`	Qwen3 series	Requires both `<think>` and `</think>` tags
`deepseek_r1`	DeepSeek-R1	Handles implicit `<think>` tag
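
The difference between the two parsers can be illustrated with a toy splitter. This is not vllm-mlx's parser, only a sketch of the behavior the table describes: the strict mode needs both `<think>` and `</think>`, while the lenient mode also accepts output where the opening tag is implicit. The function name and the `implicit_open` flag are hypothetical.

```python
import re

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text, implicit_open=False):
    """Return (reasoning, answer) from raw model output.

    implicit_open=False: both tags must be present.
    implicit_open=True: a missing opening <think> is tolerated, so everything
    before </think> counts as reasoning.
    """
    match = _THINK.search(text)
    if match:
        reasoning = match.group(1).strip()
        answer = (text[:match.start()] + text[match.end():]).strip()
        return reasoning, answer
    if implicit_open and "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        return reasoning.strip(), answer.strip()
    # No usable tags: treat the whole output as the answer.
    return None, text.strip()


print(split_reasoning("<think>17 * 23 = 391</think>The answer is 391."))
print(split_reasoning("17 * 23 = 391</think>The answer is 391.", implicit_open=True))
```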

Embeddings

Generate text embeddings for semantic search, RAG, and similarity:

# Start server with an embedding model pre-loaded
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --embedding-model mlx-community/all-MiniLM-L6-v2-4bit

# Generate embeddings using the OpenAI SDK
embeddings = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input=["Hello world", "How are you?"]
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")

See Embeddings Guide for details on supported models and lazy loading.
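
Once vectors are returned, semantic search usually comes down to ranking documents by cosine similarity against a query vector. A minimal pure-Python sketch follows; the function names are illustrative, and the inputs stand in for the `embedding` lists in the response's `data` entries.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

For example, `rank_by_similarity(query, [d.embedding for d in embeddings.data])` orders the embedded inputs by relevance to `query`.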

Documentation

For full documentation, see the docs directory:

Getting Started
- Installation
- Quick Start
User Guides
Reference
Benchmarks

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           vLLM API Layer                                │
│                    (OpenAI-compatible interface)                        │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            MLXPlatform                                  │
│               (vLLM platform plugin for Apple Silicon)                  │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
        ┌─────────────┬────────────┴────────────┬─────────────┐
        ▼             ▼                         ▼             ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    mlx-lm     │ │   mlx-vlm     │ │   mlx-audio   │ │mlx-embeddings │
│(LLM inference)│ │ (Vision+LLM)  │ │  (TTS + STT)  │ │ (Embeddings)  │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
        │             │                         │             │
        └─────────────┴─────────────────────────┴─────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                              MLX                                        │
│                (Apple ML Framework - Metal kernels)                     │
└─────────────────────────────────────────────────────────────────────────┘

Performance

LLM Performance (M4 Max, 128GB):

Model	Speed	Memory
Qwen3-0.6B-8bit	402 tok/s	0.7 GB
Llama-3.2-1B-4bit	464 tok/s	0.7 GB
Llama-3.2-3B-4bit	200 tok/s	1.8 GB

Continuous Batching (5 concurrent requests):

Model	Single	Batched	Speedup
Qwen3-0.6B-8bit	328 tok/s	1112 tok/s	3.4x
Llama-3.2-1B-4bit	299 tok/s	613 tok/s	2.0x

Audio - Speech-to-Text (M4 Max, 128GB):

Model	RTF*	Use Case
whisper-tiny	197x	Real-time, low latency
whisper-large-v3-turbo	55x	Best quality/speed balance
whisper-large-v3	24x	Highest accuracy

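*RTF = Real-Time Factor, reported here as a speed multiple (audio duration divided by processing time), so higher is faster. An RTF of 100x means 1 minute of audio transcribes in ~0.6 seconds.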
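
The footnote's arithmetic generalizes directly: processing time is audio duration divided by the speed factor. A small sketch using the factors from the table (the function name is illustrative):

```python
def transcription_seconds(audio_seconds, speed_factor):
    """Estimated processing time for audio of the given length."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


# Speed factors copied from the table above.
for model, factor in [
    ("whisper-tiny", 197),
    ("whisper-large-v3-turbo", 55),
    ("whisper-large-v3", 24),
]:
    seconds = transcription_seconds(3600, factor)
    print(f"{model}: 1 hour of audio in about {seconds:.0f} s")
```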

See benchmarks for detailed results.

Gemma 3 Support

vllm-mlx includes native support for Gemma 3 vision models. Gemma 3 checkpoints are automatically detected as multimodal LLMs (MLLM) rather than text-only LLMs.

Usage

# Start server with Gemma 3
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Verify it loaded as MLLM (not LLM)
curl http://localhost:8000/health
# Should show: "model_type": "mllm"
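
The same check can be scripted. The sketch below assumes `/health` returns a JSON object containing a `model_type` field, as the expected output above suggests. Any other fields in the response are unknown here, and the function names are illustrative.

```python
import json
import urllib.request


def is_multimodal(health_body):
    """Return True if a /health JSON body reports model_type == "mllm"."""
    info = json.loads(health_body)
    return info.get("model_type") == "mllm"


def fetch_health(base_url="http://localhost:8000", timeout=10):
    """Fetch the raw /health body from a running server."""
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Offline demonstration with a hand-written sample body (not real server output).
print(is_multimodal('{"model_type": "mllm"}'))
```

With the server running, `is_multimodal(fetch_health())` performs the live check.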

Long Context Patch (mlx-vlm)

Gemma 3's default sliding_window=1024 limits context to ~10K tokens on Apple Silicon; beyond that, the Metal GPU times out. To enable longer context (up to ~50K tokens), patch mlx-vlm:

Location: ~/.../site-packages/mlx_vlm/models/gemma3/language.py

Find the make_cache method and replace it with:

def make_cache(self):
    import os
    # Set GEMMA3_SLIDING_WINDOW=8192 for ~40K context
    # Set GEMMA3_SLIDING_WINDOW=0 for ~50K context (full KVCache)
    sliding_window = int(os.environ.get('GEMMA3_SLIDING_WINDOW', self.config.sliding_window))

    caches = []
    for i in range(self.config.num_hidden_layers):
        if (
            i % self.config.sliding_window_pattern
            == self.config.sliding_window_pattern - 1
        ):
            caches.append(KVCache())
        elif sliding_window == 0:
            caches.append(KVCache())  # Full context for all layers
        else:
            caches.append(RotatingKVCache(max_size=sliding_window, keep=0))
    return caches
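
To see which layers the patch affects, the branch logic can be mirrored with plain strings instead of cache objects. This is a standalone sketch, not mlx-vlm code. The layer count and pattern value used in the demonstration are illustrative assumptions; the real values come from the model's config.

```python
def resolve_sliding_window(environ, default):
    """Same lookup as the patch: the env var wins, the config value is the fallback."""
    return int(environ.get("GEMMA3_SLIDING_WINDOW", default))


def cache_layout(num_layers, pattern, sliding_window):
    """Mirror the patched make_cache branches, returning labels per layer."""
    layout = []
    for i in range(num_layers):
        if i % pattern == pattern - 1:
            layout.append("full")  # every pattern-th layer always gets a full KVCache
        elif sliding_window == 0:
            layout.append("full")  # override: full cache on all layers
        else:
            layout.append(f"rotating({sliding_window})")
    return layout


# Illustrative values only: 12 layers, pattern of 6.
window = resolve_sliding_window({"GEMMA3_SLIDING_WINDOW": "8192"}, default=1024)
print(cache_layout(12, 6, window))
print(cache_layout(12, 6, 0))
```

Setting the window to 0 turns every layer into a full cache, which is why memory use grows with context length in that mode.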

Usage:

# Default (~10K max context)
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Extended context (~40K max)
GEMMA3_SLIDING_WINDOW=8192 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Maximum context (~50K max)
GEMMA3_SLIDING_WINDOW=0 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

Benchmark Results (M4 Max 128GB):

Setting	Max Context	Memory
Default (1024)	~10K tokens	~16GB
`GEMMA3_SLIDING_WINDOW=8192`	~40K tokens	~25GB
`GEMMA3_SLIDING_WINDOW=0`	~50K tokens	~35GB

Contributing

We welcome contributions! See Contributing Guide for details.

Bug fixes and improvements
Performance optimizations
Documentation improvements
Benchmarks on different Apple Silicon chips

Submit PRs to: https://github.com/waybarrios/vllm-mlx

License

Apache 2.0 - see LICENSE for details.

Citation

If you use vLLM-MLX in your research or project, please cite:

@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}

Acknowledgments

MLX - Apple's ML framework
mlx-lm - LLM inference library
mlx-vlm - Vision-language models
mlx-audio - Text-to-Speech and Speech-to-Text
mlx-embeddings - Text embeddings
Rapid-MLX - Community fork of vllm-mlx
vLLM - High-throughput LLM serving

Release history

Version	Released
0.3.0	May 9, 2026
0.2.9	Apr 22, 2026 (this version)
0.2.8	Apr 12, 2026
0.2.7	Mar 31, 2026
0.2.6	Feb 13, 2026
0.2.5	Jan 26, 2026
0.2.3	Jan 22, 2026
0.2.1	Jan 16, 2026
0.2.0	Jan 6, 2026

Download files

Source Distribution

vllm_mlx-0.2.9.tar.gz (598.1 kB)

Uploaded Apr 22, 2026 Source

Built Distribution

vllm_mlx-0.2.9-py3-none-any.whl (474.2 kB)

Uploaded Apr 22, 2026 Python 3

File details

Details for the file vllm_mlx-0.2.9.tar.gz.

File metadata

Download URL: vllm_mlx-0.2.9.tar.gz
Upload date: Apr 22, 2026
Size: 598.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for vllm_mlx-0.2.9.tar.gz
Algorithm	Hash digest
SHA256	`9fbd81ce56f54be83ec99860c7c87405e5c5a31a0aadd55cc0f192141212740c`
MD5	`034bdde8c84ab1450fcd4c25da0bd241`
BLAKE2b-256	`33040b9da3d36bfa2ba2fd38dbca7827a6c5a14de0e3d1d65d7cac1925d994c3`

File details

Details for the file vllm_mlx-0.2.9-py3-none-any.whl.

File metadata

Download URL: vllm_mlx-0.2.9-py3-none-any.whl
Upload date: Apr 22, 2026
Size: 474.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for vllm_mlx-0.2.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0894e0c25ea1e850b7ca99dee5c82748bd49b44c1315a9c277bf90a662db5ef3`
MD5	`83074ab9d91a490c7397ce711d6694c8`
BLAKE2b-256	`ecb4a6b1f97ad8bd77f03b47e8fd867228f32b369df950963ede891ecf627dca`
