vLLM Metal plugin powered by mlx-swift — high-performance LLM inference on Apple Silicon

vllm-swift

A native Swift/Metal backend for vLLM on Apple Silicon.
No Python in the inference hot path.

Run vLLM workloads on Apple Silicon with a native Swift/Metal hot path.
OpenAI-compatible API. Up to 2.6× faster short-context decode.

Quick Start

1. Install

Homebrew (recommended for Mac power users):

brew tap TheTom/tap && brew install vllm-swift

pip (everyone else, including dev containers and non-brew Macs):

pip install vllm-swift

The pip wheel bundles the prebuilt Swift bridge dylib + Metal kernel library, so no compile or brew step is required. Requires Apple Silicon, Python 3.10+, and macOS 11+.

From source:

git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh       # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH (generated by install.sh)

2. Run

vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096  # increase as needed, max 40960

Homebrew users don't need activate.sh; vllm-swift serve handles everything.

Server running at http://localhost:8000 (OpenAI-compatible API).

Drop-in replacement for vLLM on Apple Silicon. All vllm serve flags work unchanged.
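
The API is the standard OpenAI surface, so the official Python client works against it unchanged. A minimal sketch (assumes the openai package is installed and that the model name matches what the server lists at /v1/models, e.g. the path basename or the value passed to --served-model-name):

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")  # any non-empty key works

response = client.chat.completions.create(
    model="Qwen3-4B-4bit",  # use whatever /v1/models reports for your server
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(response.choices[0].message.content)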

Performance (M5 Max 128GB)

Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM's offline LLM API.

Qwen3-0.6B

Engine                    Single   8 concurrent   32 concurrent   64 concurrent
vllm-swift                   364          1,527           2,859           3,425
vllm-metal (Python/MLX)      111            652           2,047           2,620

Qwen3-4B

Engine                    Single   8 concurrent   32 concurrent   64 concurrent
vllm-swift                   147            477           1,194           1,518
vllm-metal (Python/MLX)      104            396           1,065           1,375

Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
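
For readers who want to reproduce the shape of these measurements, the sketch below uses vLLM's offline LLM API in the way the methodology describes (offline, greedy, fixed generation length). The model path, prompt, and request count are placeholders; docs/PERFORMANCE.md remains the authoritative description:

import os
import time
from vllm import LLM, SamplingParams

llm = LLM(model=os.path.expanduser("~/models/Qwen3-4B-4bit"), max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=50)  # greedy, 50 generated tokens

prompts = ["Explain KV caching in one sentence."] * 8    # 8 concurrent requests
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} tok/s (prefill + decode combined)")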

TurboQuant+ KV Cache Compression

TurboQuant+ compresses the KV cache so longer contexts fit in memory, at a modest throughput cost.

Qwen3.5 2B (4-bit weights)

KV Cache   Compression   Prefill @1K    Decode @1K   Prefill @4K    Decode @4K
FP16       1.0×          1,252 tok/s    259 tok/s    1,215 tok/s    249 tok/s
turbo4v2   3.0×          1,331 tok/s    245 tok/s    1,245 tok/s    240 tok/s
turbo3     4.6×          1,346 tok/s    174 tok/s    1,276 tok/s    241 tok/s

Architecture

The entire forward pass runs in Swift/Metal. Python is used only for orchestration.

Python (vLLM API, tokenization, scheduling)  ← github.com/vllm-project/vllm
  ↓ ctypes FFI
C bridge (bridge.h)
  ↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
  ↓
Metal GPU
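
The Python-to-Swift boundary is plain C over ctypes. The sketch below shows the mechanics of that hop; the symbol name and signature are hypothetical (the real exports are declared in bridge.h):

import ctypes

# The plugin locates libVLLMBridge.dylib via DYLD_LIBRARY_PATH (source builds)
# or the copy bundled inside the pip wheel.
bridge = ctypes.CDLL("libVLLMBridge.dylib")

# A Swift function exported with @_cdecl is called like any C function.
# vb_decode_step is a made-up name, used here only for illustration.
bridge.vb_decode_step.argtypes = [
    ctypes.c_void_p,                 # opaque engine handle
    ctypes.POINTER(ctypes.c_int32),  # token ids
    ctypes.c_int32,                  # token count
]
bridge.vb_decode_step.restype = ctypes.c_int32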

Features

  • OpenAI-compatible API (/v1/completions, /v1/chat/completions)
  • Streaming (SSE) responses (client sketch after this list)
  • Chat templates (applied by vLLM, model-specific)
  • Batched concurrent decode with BatchedKVCache (fully batched projections + attention)
  • Per-request temperature sampling in batched path
  • Auto model download from HuggingFace Hub
  • TurboQuant+ KV cache compression (turbo3, turbo4v2) via mlx-swift-lm
  • Decode and prompt logprobs
  • Greedy and temperature sampling
  • EOS / stop token detection (vLLM scheduler)
  • VLM (vision-language model) support (experimental)
  • Works with Hermes, OpenCode, and any OpenAI-compatible client
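
Streaming, logprobs, and per-request temperature are all reachable through the same OpenAI client. A hedged sketch (model name assumes --served-model-name qwen3-4b, as in the serving examples below):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

stream = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Write a haiku about Metal."}],
    temperature=0.7,   # per-request temperature sampling
    logprobs=True,     # decode logprobs
    stream=True,       # SSE streaming
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()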

Use with AI tools

# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
  --served-model-name qwen3-4b \
  --enable-auto-tool-choice --tool-call-parser hermes

Then point your tool at it:

# Hermes — set in ~/.hermes/config.yaml:
#   base_url: http://localhost:8000/v1
#   model: qwen3-4b

# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode

# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'

Configuration

vllm-swift serve is a thin wrapper around vllm serve — all standard vLLM flags work. Here are the common setups:

Basic serving

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960

Agent / tool calling (Hermes, OpenCode, etc.)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes

Chain-of-thought models (strip <think> tags)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-reasoning --reasoning-parser deepseek_r1

Long context with TurboQuant+

Compress the KV cache 3-5× to fit longer contexts at a modest throughput cost:

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'

Scheme     Compression   Best for
turbo4v2   ~3×           Recommended — best quality/compression balance
turbo3     ~4.6×         Maximum compression, higher PPL trade-off

Full setup (agent + reasoning + TurboQuant+)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'

All flags

vllm-swift serve <model> [options]

  --served-model-name NAME   Clean model name for API clients (recommended)
  --max-model-len N          Max sequence length (default: model config)
  --port PORT                API server port (default: 8000)
  --gpu-memory-utilization F Memory fraction 0.0-1.0 (default: 0.9)
  --dtype float16            Model dtype (default: float16)
  --enable-auto-tool-choice  Enable tool/function calling
  --tool-call-parser NAME    Tool call format (hermes, llama3, mistral, etc.)
  --enable-reasoning         Enable chain-of-thought parsing
  --reasoning-parser NAME    Reasoning format (deepseek_r1, etc.)
  --additional-config JSON   Extra config (kv_scheme, kv_bits)

All standard vLLM flags work — these are just the most common ones.

Documentation

Doc                           What's in it
docs/PERFORMANCE.md           Full perf matrix vs vllm-metal, methodology, long-context cells
docs/MODEL_COMPATIBILITY.md   Empirical pass / soft-fail / hard-fail across local MLX models with root-cause classification (model intrinsic, vLLM upstream, env-missing)
docs/TROUBLESHOOTING.md       Symptom → diagnostic → fix for known failure patterns (parser mismatch, reasoning consuming the turn, Gemma-4 boot failure, etc.)
CHANGELOG.md                  Release history

Changelog

See CHANGELOG.md for release history.

Known Limitations (early development)

  • LoRA not supported (Swift engine limitation)
  • Chunked prefill disabled (Swift engine handles full sequences)
  • top_p sampling not supported in batched decode path (temperature works)
  • Only Qwen3 models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
  • Requires macOS on Apple Silicon (no Linux/CUDA)

Install

Homebrew

brew tap TheTom/tap && brew install vllm-swift

Prebuilt bottle — no Swift toolchain needed. First run of vllm-swift serve sets up a managed Python environment automatically.

To update to the latest version:

vllm-swift update

# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift

From source

git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh       # builds Swift, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096

Manual (full control)

git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
  vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
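
A quick way to confirm the bridge library is visible before starting the server is to load it from Python. This is only a sanity check, under the assumption that DYLD_LIBRARY_PATH points at the release build directory shown above:

import ctypes
import os

print(os.environ.get("DYLD_LIBRARY_PATH"))  # should include swift/.build/arm64-apple-macosx/release
ctypes.CDLL("libVLLMBridge.dylib")          # raises OSError if the dylib can't be found
print("libVLLMBridge.dylib loaded")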

Troubleshooting

Homebrew checksum error on reinstall:

brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift

"No module named vllm" or plugin not loading after brew install:

brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup

vLLM build error (Apple Clang parentheses): Our install script and brew wrapper handle this automatically. If you're on an older bottle or installing vLLM manually:

# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup

# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllm

activate.sh not found: Make sure you run ./install.sh (or ./scripts/install.sh) first — it generates activate.sh in the project root.

Metal kernel not found (GDN/TurboFlash models): The mlx.metallib file must be in the same directory as libVLLMBridge.dylib. For manual installs, copy it:

cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
   "$(echo $DYLD_LIBRARY_PATH | cut -d: -f1)/"

Download a model

vllm-swift download mlx-community/Qwen3-4B-4bit

# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit

# Already have models in HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/latest
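
If you prefer doing the download from Python, huggingface_hub (pulled in with vLLM's dependencies) gives the same result; a small sketch:

import os
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="mlx-community/Qwen3-4B-4bit",
    local_dir=os.path.expanduser("~/models/Qwen3-4B-4bit"),
)
print(path)  # pass this path to vllm-swift serve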

Project Structure

vllm_swift/           Python plugin (vLLM WorkerBase)
swift/
  Sources/VLLMBridge/       C bridge (@_cdecl exports)
  bridge.h                  C API (prefill, decode, batched decode)
scripts/
  install.sh                One-step build + install
  build_bottle.sh           Build + upload Homebrew bottle
  integration_test.sh       End-to-end smoke test
homebrew/
  vllm-swift.rb             Homebrew formula
tests/                      84 tests, 97% coverage

Requirements

  • macOS 14+ on Apple Silicon
  • Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
  • Python 3.10+
  • vLLM 0.19+
  • mlx-swift-lm (pulled automatically by Swift Package Manager)

License

Apache-2.0
