vLLM Metal plugin powered by mlx-swift — high-performance LLM inference on Apple Silicon

A native Swift/Metal backend for vLLM on Apple Silicon. The inference hot path runs entirely in Swift/Metal, with no Python in the loop. OpenAI-compatible API. Up to 2.6× faster short-context decode.
Quick Start
1. Install
Homebrew (recommended for Mac power users):
brew tap TheTom/tap && brew install vllm-swift
pip (everyone else, including dev containers and non-brew Macs):
pip install vllm-swift
The pip wheel bundles the prebuilt Swift bridge dylib + Metal kernel library, so no compile or brew step is required. Apple Silicon, Python 3.10+, macOS 11+.
From source:
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh # sets DYLD_LIBRARY_PATH (generated by install.sh)
2. Run
vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096 # increase as needed, max 40960
Homebrew users don't need `activate.sh`; `vllm-swift serve` handles everything.
Server running at http://localhost:8000 (OpenAI-compatible API).
Drop-in replacement for vLLM on Apple Silicon. All `vllm serve` flags work unchanged.
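For a quick end-to-end check from Python, here is a minimal sketch using the official `openai` client against the local server (the model name is an assumption; use whatever `--served-model-name` you passed, or the served model path's basename):

```python
# Minimal sketch: query the local vllm-swift server with the OpenAI Python client.
# Assumes `pip install openai` and a server started as shown above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")  # key is unused locally

resp = client.chat.completions.create(
    model="Qwen3-4B-4bit",  # adjust to your served model name
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```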
Performance (M5 Max 128GB)
Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM's offline LLM API.
Qwen3-0.6B
| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 364 | 1,527 | 2,859 | 3,425 |
| vllm-metal (Python/MLX) | 111 | 652 | 2,047 | 2,620 |
Qwen3-4B
| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 147 | 477 | 1,194 | 1,518 |
| vllm-metal (Python/MLX) | 104 | 396 | 1,065 | 1,375 |
Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
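To reproduce a rough version of these numbers, a sketch using vLLM's offline `LLM` API (the interface the vllm-metal baseline was measured with; the prompt and batch size below are illustrative, not the benchmark's exact inputs):

```python
# Rough throughput sketch with vLLM's offline LLM API (no HTTP overhead).
# Mirrors the table's setup: greedy decoding (temperature=0), 50 generated tokens.
import os
import time
from vllm import LLM, SamplingParams

llm = LLM(model=os.path.expanduser("~/models/Qwen3-4B-4bit"), max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=50)

prompts = ["Explain KV caching in one short paragraph."] * 8  # "8 concurrent" column
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} tok/s across {len(prompts)} requests")
```

Note this times prefill and decode together, so expect slightly lower numbers than the decode-only cells above.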
TurboQuant+ KV Cache Compression
TurboQuant+ compresses the KV cache so longer contexts fit in memory, at a modest throughput cost.
Qwen3.5 2B (4-bit weights)
| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |
|---|---|---|---|---|---|
| FP16 | 1.0× | 1,252 tok/s | 259 tok/s | 1,215 tok/s | 249 tok/s |
| turbo4v2 | 3.0× | 1,331 tok/s | 245 tok/s | 1,245 tok/s | 240 tok/s |
| turbo3 | 4.6× | 1,346 tok/s | 174 tok/s | 1,276 tok/s | 241 tok/s |
Architecture
The entire forward pass runs in Swift/Metal. Python is used only for orchestration.
Python (vLLM API, tokenization, scheduling) ← github.com/vllm-project/vllm
↓ ctypes FFI
C bridge (bridge.h)
↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
↓
Metal GPU
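Conceptually, the ctypes layer is a thin binding over the bridge's C API. A hypothetical sketch of what that boundary looks like from Python (the real exported names and signatures live in bridge.h; `vb_decode_step` and its arguments are invented for illustration):

```python
# Hypothetical ctypes sketch of the Python -> Swift bridge boundary.
# `vb_decode_step` is NOT a real symbol; see bridge.h for the actual C API.
import ctypes

lib = ctypes.CDLL("libVLLMBridge.dylib")  # resolved via DYLD_LIBRARY_PATH

# Declare the C signature once so ctypes marshals arguments correctly.
lib.vb_decode_step.argtypes = [
    ctypes.c_void_p,                 # opaque engine/model handle
    ctypes.POINTER(ctypes.c_int32),  # input token ids
    ctypes.c_size_t,                 # number of tokens
]
lib.vb_decode_step.restype = ctypes.c_int32  # next token id

# Usage would look roughly like:
#   next_tok = lib.vb_decode_step(handle, token_buf, len(tokens))
```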
Features
- OpenAI-compatible API (`/v1/completions`, `/v1/chat/completions`)
- Streaming (SSE) responses (see the sketch after this list)
- Chat templates (applied by vLLM, model-specific)
- Batched concurrent decode with `BatchedKVCache` (fully batched projections + attention)
- Per-request temperature sampling in batched path
- Auto model download from HuggingFace Hub
- TurboQuant+ KV cache compression (`turbo3`, `turbo4v2`) via mlx-swift-lm
- Decode and prompt logprobs
- Greedy and temperature sampling
- EOS / stop token detection (vLLM scheduler)
- VLM (vision-language model) support (experimental)
- Works with Hermes, OpenCode, and any OpenAI-compatible client
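The streaming path is plain OpenAI-style SSE, so the standard client loop works unchanged. A minimal sketch (model name assumed from the examples below):

```python
# Minimal SSE streaming sketch against the local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

stream = client.chat.completions.create(
    model="qwen3-4b",  # your --served-model-name
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```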
Use with AI tools
# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
--served-model-name qwen3-4b \
--enable-auto-tool-choice --tool-call-parser hermes
Then point your tool at it:
# Hermes — set in ~/.hermes/config.yaml:
# base_url: http://localhost:8000/v1
# model: qwen3-4b
# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode
# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'
Configuration
vllm-swift serve is a thin wrapper around vllm serve — all standard vLLM flags work. Here are the common setups:
Basic serving
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960
Agent / tool calling (Hermes, OpenCode, etc.)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--enable-auto-tool-choice --tool-call-parser hermes
Chain-of-thought models (strip <think> tags)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--enable-reasoning --reasoning-parser deepseek_r1
Long context with TurboQuant+
Compress KV cache 3-5× to fit longer context with modest throughput cost:
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
| Scheme | Compression | Best for |
|---|---|---|
| `turbo4v2` | ~3× | Recommended: best quality/compression balance |
| `turbo3` | ~4.6× | Maximum compression, higher PPL trade-off |
Full setup (agent + reasoning + TurboQuant+)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--enable-auto-tool-choice --tool-call-parser hermes \
--enable-reasoning --reasoning-parser deepseek_r1 \
--additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
All flags
vllm-swift serve <model> [options]
--served-model-name NAME Clean model name for API clients (recommended)
--max-model-len N Max sequence length (default: model config)
--port PORT API server port (default: 8000)
--gpu-memory-utilization F Memory fraction 0.0-1.0 (default: 0.9)
--dtype float16 Model dtype (default: float16)
--enable-auto-tool-choice Enable tool/function calling
--tool-call-parser NAME Tool call format (hermes, llama3, mistral, etc.)
--enable-reasoning Enable chain-of-thought parsing
--reasoning-parser NAME Reasoning format (deepseek_r1, etc.)
--additional-config JSON Extra config (kv_scheme, kv_bits)
All standard vLLM flags work — these are just the most common ones.
Documentation
| Doc | What's in it |
|---|---|
| docs/PERFORMANCE.md | Full perf matrix vs vllm-metal, methodology, long-context cells |
| docs/MODEL_COMPATIBILITY.md | Empirical pass / soft-fail / hard-fail across local MLX models with root-cause classification (model intrinsic, vLLM upstream, env-missing) |
| docs/TROUBLESHOOTING.md | Symptom → diagnostic → fix for known failure patterns (parser mismatch, reasoning consuming the turn, Gemma-4 boot failure, etc.) |
| CHANGELOG.md | Release history |
Changelog
See CHANGELOG.md for release history.
Known Limitations (early development)
- LoRA not supported (Swift engine limitation)
- Chunked prefill disabled (Swift engine handles full sequences)
- `top_p` sampling not supported in batched decode path (temperature works)
- Only Qwen3 models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
- Requires macOS on Apple Silicon (no Linux/CUDA)
Install
Homebrew
brew tap TheTom/tap && brew install vllm-swift
Prebuilt bottle — no Swift toolchain needed. First run of vllm-swift serve sets up a managed Python environment automatically.
To update to the latest version:
vllm-swift update
# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift
From source
git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh # builds Swift, installs plugin, creates activate.sh
source activate.sh # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
Manual (full control)
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
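Before starting the server, you can sanity-check that the bridge dylib is resolvable from Python (a one-line sketch; the dylib name comes from the Swift build output):

```python
# Sanity check: can the Swift bridge be loaded via DYLD_LIBRARY_PATH?
import ctypes
ctypes.CDLL("libVLLMBridge.dylib")  # raises OSError if the path isn't set
print("bridge dylib loaded OK")
```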
Troubleshooting
Homebrew checksum error on reinstall:
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift
"No module named vllm" or plugin not loading after brew install:
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup
vLLM build error (Apple Clang parentheses): Our install script and brew wrapper handle this automatically. If you're on an older bottle or installing vLLM manually:
# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup
# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllm
activate.sh not found: Make sure you run ./install.sh (or ./scripts/install.sh) first — it generates activate.sh in the project root.
Metal kernel not found (GDN/TurboFlash models): The mlx.metallib file must be in the same directory as libVLLMBridge.dylib. For manual installs, copy it:
cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
  "$(echo $DYLD_LIBRARY_PATH | cut -d: -f1)/"
Download a model
vllm-swift download mlx-community/Qwen3-4B-4bit
# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit
# Already have models in HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/latest
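If you'd rather script the download, `huggingface_hub` does the same thing programmatically (a sketch; the target directory is your choice):

```python
# Programmatic alternative to `huggingface-cli download`.
import os
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="mlx-community/Qwen3-4B-4bit",
    local_dir=os.path.expanduser("~/models/Qwen3-4B-4bit"),
)
print(path)  # point `vllm-swift serve` at this directory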
Project Structure
vllm_swift/                  Python plugin (vLLM WorkerBase)
swift/
  Sources/VLLMBridge/        C bridge (@_cdecl exports)
  bridge.h                   C API (prefill, decode, batched decode)
scripts/
  install.sh                 One-step build + install
  build_bottle.sh            Build + upload Homebrew bottle
  integration_test.sh        End-to-end smoke test
homebrew/
  vllm-swift.rb              Homebrew formula
tests/                       84 tests, 97% coverage
Requirements
- macOS 14+ on Apple Silicon
- Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
- Python 3.10+
- vLLM 0.19+
- mlx-swift-lm (pulled automatically by Swift Package Manager)
License
Apache-2.0