Rapid-MLX — AI inference for Apple Silicon. Drop-in OpenAI API, 2-4x faster than Ollama.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

raullenchai

These details have not been verified by PyPI

Project description

Rapid-MLX

The fastest way to run AI models on Apple Silicon.

Rapid-MLX is a local LLM inference engine for Macs. It speaks the OpenAI Chat / Responses / Embeddings APIs, so anything that talks to ChatGPT — Cursor, Claude Code, Aider, Codex CLI, LangChain, PydanticAI, your own scripts — just works against http://localhost:8000. No cloud, no API key, no per-token cost.

_{rapidmlx.com ·
Desktop app ·
Community benchmarks ·
Model mirror}

Rapid-MLX demo — install, serve Gemma 4, chat, tool calling
pip install → serve Gemma 4 → chat + tool calling → works with PydanticAI, LangChain, Aider, and more.

What runs on your Mac

	Your Mac	Model	Speed (tok/s)
16 GB MacBook Air	Qwen3.5-4B	147 tok/s	Chat, coding, tools
24 GB MacBook Pro	Qwen3.5-9B	101 tok/s	Great all-rounder
32+ GB Mac Mini / Studio	Gemma 4 12B	64 tok/s	Vision + tools
32+ GB Mac Mini / Studio	GPT-OSS 20B	119 tok/s	Harmony-native, 100% tools
32+ GB Mac Mini / Studio	Qwen3.6-35B-A3B	93 tok/s	256 MoE experts, 262K context
48+ GB Mac Mini / Studio	Qwen3.5-35B-A3B 8bit	80 tok/s	Best balance of smart + fast
96+ GB Mac Studio / Pro	Qwen3.5-122B	57 tok/s¹	Frontier-level intelligence
128+ GB Mac Studio Ultra	DeepSeek V4 Flash 158B-A13B	31–56 tok/s¹	Day-0 frontier MoE, 1M context

_{Single-user end-to-end throughput (B=1: one request at a time, 256 max output tokens, output_tokens / wall-clock incl. first-token latency), median of 3 rounds. chat_template_kwargs.enable_thinking=False passed where the engine honours it. Tested on M3 Ultra 256 GB / rapid-mlx v0.6.83 (fused top-p sampler). ¹ carried over from 2026-04 bench — disk-constrained on this refresh. The full sizing table with HuggingFace links is in Choose Your Model.}

New to local AI? Quick glossary

tok/s (tokens per second) — roughly how many words the AI generates per second. Higher = faster.
4bit / 8bit — compression levels for models. 4bit uses less memory (recommended); 8bit is higher quality.
TTFT (Time To First Token) — how long before the AI starts responding.
Tool calling — the AI can call functions in your code. Used by Cursor, Claude Code, and coding assistants.
OpenAI-compatible — Rapid-MLX speaks the same HTTP API as ChatGPT, so any app that works with ChatGPT can work with Rapid-MLX by pointing at http://localhost:8000/v1.
Ollama / llama.cpp — other popular local-AI runtimes. The only apples-to-apples row in our benchmark is GPT-OSS 20B (identical weights both sides) — Rapid-MLX runs it 2.3× faster than Ollama under B=4 concurrent load. On the Qwen3 closest-tag rows (Qwen3.5/3.6 DeltaNet isn't on llama.cpp yet) Rapid-MLX leads 1.7–2.4×. The Gemma 4 row ties Ollama's Gemma 3 (different architectures, 1.0×). Against mlx-lm serve (same MLX weights) Rapid-MLX is 1.2–1.5× faster. Full caveats in Benchmarks.

Quick Start

1. Install (pick one):

# uv (recommended — one command, isolated env, auto-manages Python)
uv tool install rapid-mlx@latest
# Don't have uv yet? curl -LsSf https://astral.sh/uv/install.sh | sh

# Or one-liner with auto-setup (installs Python if needed)
curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash

# Homebrew (Mac-native — needs tap + trust before install on Homebrew 4.x)
brew tap raullenchai/rapid-mlx
brew trust raullenchai/rapid-mlx
brew install rapid-mlx

# pip (requires Python 3.10+ — macOS ships 3.9, so install Python first if needed)
pip install rapid-mlx

Upgrade later: uv tool upgrade rapid-mlx / brew upgrade rapid-mlx / pip install -U rapid-mlx.

2. Chat with a model right now:

rapid-mlx chat

Defaults to qwen3.5-4b-4bit. First run downloads the model (~2.5 GB) — you'll see a progress bar. Drops you into a REPL when it's ready. Type /help for slash commands, /exit to quit. Pass --think to surface chain-of-thought, --no-think to skip it for faster replies.

3. Or serve it for use from other apps:

rapid-mlx serve qwen3.5-4b-4bit

Starts an OpenAI-compatible HTTP server. Wait for Ready: http://localhost:8000/v1, then point Cursor, Claude Code, Aider, LangChain, the OpenAI SDK — anything — at http://localhost:8000/v1.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Say hello"}]}'

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello"}],
).choices[0].message.content)

Vision/multimodal models (Gemma 4, Qwen-VL) need extras: pip install 'rapid-mlx[vision]'. Text-only install is ~460 MB; vision adds ~322 MB. See Optional Extras.

"No matching distribution" from pip? Your Python is too old. Run python3 --version — if it says 3.9, run brew install python@3.12 then python3.12 -m pip install rapid-mlx.

Refusing to load formula ... from untrusted tap? Homebrew 4.x requires third-party taps to be explicitly trusted before install. Run brew trust raullenchai/rapid-mlx (per-machine, persists across upgrades) then retry brew install rapid-mlx.

brew install stalls on Tapping homebrew/core? Brew 5.x's install sandbox can't auto-tap homebrew/core mid-install. Pre-tap it once, then retry:
brew tap homebrew/core --force   # ~1.3 GB, one-time
brew tap raullenchai/rapid-mlx
brew trust raullenchai/rapid-mlx
brew install rapid-mlx

Not into the terminal? Rapid-MLX Desktop bundles the same engine inside a one-click Mac app — drag to Applications, pick a model, chat. The CLI here is still the source of truth for serving and scripting; the desktop app is the friendlier on-ramp.

Why Rapid-MLX

If you're comparing against Ollama, LM Studio, mlx-lm serve, or llama.cpp:

Fastest on Apple Silicon, measured. Under concurrent load (B=4), Rapid-MLX is 2.3× faster than Ollama on the apples-to-apples GPT-OSS 20B row (identical weights both sides), leads by 1.7–2.4× across the Qwen3.5 / Qwen3.6 closest-tag Ollama rows, and runs 1.2–1.5× faster than mlx-lm serve on shared MLX weights. Full table in Benchmarks.
Drop-in OpenAI / Anthropic API. Same /v1/chat/completions and /v1/responses as OpenAI, plus /v1/messages for the Anthropic SDK and Claude Code. No client-side adapter required.
Tool calling that actually works under 4-bit quantization. 17 parser formats with automatic recovery — when a quantized model emits a malformed tool call as text, Rapid-MLX detects it and rebuilds the structured tool_calls object instead of failing silently. Verified end-to-end against Cursor, Codex CLI, Claude Code, Aider, LangChain, PydanticAI, smolagents.
Prompt cache, even on hybrid RNN models, sharable across tenants. Standard transformers get radix-tree prefix cache — N clients hitting the same 2k-token system prompt converge on the same trie node (O(prefix_len) lookup vs. O(log N + LCP_scan) on hash caches). Hybrid models (Qwen3.5 DeltaNet) get RNN state snapshots restored in ~0.1 ms instead of re-running hundreds of tokens through the recurrent layers. 2–5× faster TTFT on every architecture, always on.
Long-prompt prefill acceleration. PFlash scores 32K+ prompts and only prefills the sink + recent tail + query-relevant middle — 3.87–8.5× faster cold-start TTFT with full needle-in-a-haystack recall. Default-on for verified aliases; --pflash always forces it for any model.
TurboQuant K8V4 KV codec, default-on for verified MoE. 9 Qwen3.5/3.6 hero aliases ship with K8V4 (K 8-bit + V 4-bit after Walsh-Hadamard + Lloyd-Max) turned on — codec compresses KV to ~1/2.4 (~58% savings), lossless across the verified matrix. Set --kv-cache-turboquant none to force off.
Day-0 frontier models. DeepSeek V4 Flash 158B-A13B, Qwen3.6 with 262K context, DiffusionGemma 26B (non-autoregressive), Gemma 4 vision — all behind the same Chat Completions API.

Use Cases

The same rapid-mlx binary covers four very different workflows. Pick the one closest to what you want to do:

1. Chat in the terminal

rapid-mlx chat qwen3.5-9b-4bit

A streaming REPL with slash commands (/help, /system, /save, /clear). No second process needed. Use --think / --no-think to control chain-of-thought. Aliased as rapid-mlx run for Ollama muscle memory.

2. OpenAI-compatible server for your apps

rapid-mlx serve qwen3.5-9b-4bit

Point Cursor, Continue.dev, Open WebUI, LibreChat, the OpenAI Python SDK, LangChain, PydanticAI, smolagents at http://localhost:8000/v1. Tool calling, streaming, reasoning separation, structured JSON output, and embeddings are all on the standard endpoints. See Connect Your Client for one-shot configs.

3. Agent backends: Codex CLI, Claude Code, Aider, OpenCode

Rapid-MLX is the backend — pair it with an open-source agent CLI for the full Claude Code-like experience.

rapid-mlx serve qwen3.6-27b-8bit          # or any tool-capable alias
rapid-mlx agents codex --setup            # writes ~/.codex/config.toml
codex                                      # now talks to your local model

Codex CLI (OpenAI's official Rust agent, 0.136.0+) hits /v1/responses. Verified end-to-end on Qwen3.5-9B / Qwen3.6-27B for chat, file read/write, shell, multi-step, and source analysis. Release gate G7b probes the codex-shape SSE on every tag. (guide)
Claude Code hits /v1/messages (Anthropic SDK). ANTHROPIC_BASE_URL=http://localhost:8000 claude.
Aider / OpenCode / OpenClaude / Goose / Hermes / Claw Code — rapid-mlx agents <name> --setup writes the right config file.

4. Benchmark your hardware and contribute to the community

rapid-mlx bench qwen3.5-9b-4bit --submit

Runs the standardized B=1 community benchmark (greedy, 128 + 512 token buckets, 5 rounds each), shows you the JSON payload, asks for consent, and opens the PR via gh if you have it (or prints a compare-page URL if you don't). Submitted rows land in community-benchmarks/submissions/ and show up on rapidmlx.com once merged. Your bench helps everyone else pick the right model for their Mac.

Bonus: share a public URL

rapid-mlx share qwen3.6-27b-8bit

Tunnels your local server through rapidserver.quicksilverpro.io over a WebSocket. Prints a public OpenAI-compatible endpoint plus a bearer key — point any chat UI or OpenAI SDK at it. Bearer auth, CORS allowlist, and a 120 RPM rate-limit are wired automatically; closing the terminal tears the tunnel down. The default chat surface is our hosted Big-AGI fork (tool calling, personas, voice — no signup); any OpenAI-compatible client also works. Pick a 27B-class model or larger for a usable share experience.

Connect Your Client

Tested integrations

Agent harnesses (auto-setup with rapid-mlx agents <name> --setup):

Harness	Type	Notes
Codex CLI	Agent	OpenAI's official Rust agent — `/v1/responses` shim, verified end-to-end against codex 0.136.0 (guide)
Claude Code	Agent	Anthropic SDK via `/v1/messages` — `ANTHROPIC_BASE_URL=http://localhost:8000 claude`
Aider	Agent	CLI edit-and-commit, architect mode (test)
OpenCode	TUI Agent	Claude Code-like terminal UX, OpenAI-compat provider
Hermes Agent	Agent	62 tools, multi-turn (test)
Goose	Agent	Ollama provider via `OLLAMA_HOST`
OpenClaude	Agent	Anthropic SDK, `CLAUDE_CODE_USE_OPENAI=1` (test)
Claw Code	Agent	OpenAI & Anthropic endpoints
PydanticAI	Framework	Typed agents, structured output (test)
LangChain	Framework	`ChatOpenAI`, tools, streaming (test)
smolagents	Framework	CodeAgent + ToolCallingAgent (test)

UI / IDE clients:

Client	Status	Setup
Cursor	Compatible	Settings → OpenAI Base URL
Continue.dev	Compatible	VS Code / JetBrains extension
LibreChat	Tested	Docker (test)
Open WebUI	Tested	Docker (test)
Any OpenAI-compatible app	Compatible	Point at `http://localhost:8000/v1`

Quick setup snippets

Cursor — Settings → Models → Add Model:

OpenAI API Base:  http://localhost:8000/v1
API Key:          not-needed
Model name:       default          (or qwen3.5-9b-4bit — either works)

Cursor's agent/composer mode uses tool calls automatically — Rapid-MLX handles them natively with Qwen3.5 models, no extra flags needed.

Codex CLI — rapid-mlx agents codex --setup writes ~/.codex/config.toml for you. Or by hand:

model = "default"
model_provider = "rapid-mlx"

[model_providers.rapid-mlx]
name = "Rapid-MLX (local)"
base_url = "http://localhost:8000/v1"
# If rapid-mlx was started with --api-key, add env_key = "RAPID_MLX_API_KEY"
# and `export RAPID_MLX_API_KEY=...`. Don't use api_key = "..." — Codex
# CLI's --strict-config rejects inline literals.

Then codex (or codex exec '<query>') routes through /v1/responses. See the Codex CLI guide.

Claude Code:

ANTHROPIC_BASE_URL=http://localhost:8000 ANTHROPIC_API_KEY=not-needed claude

Aider:

aider --openai-api-base http://localhost:8000/v1 --openai-api-key not-needed

Goose:

GOOSE_PROVIDER=ollama OLLAMA_HOST=http://localhost:8000 \
GOOSE_MODEL=default goose run --text "hello"

More client setup snippets (Continue.dev, OpenCode, LibreChat, Open WebUI, Hermes, PydanticAI, smolagents, Anthropic SDK)

Continue.dev (~/.continue/config.yaml):

models:
  - name: rapid-mlx
    provider: openai
    model: default
    apiBase: http://localhost:8000/v1
    apiKey: not-needed

OpenCode (opencode.json in your project root):

{
  "provider": {
    "openai": {
      "api": "http://localhost:8000/v1",
      "models": {
        "default": {
          "name": "rapid-mlx local",
          "limit": { "context": 32768, "output": 8192 }
        }
      },
      "options": { "apiKey": "not-needed" }
    }
  }
}

Open WebUI (Docker one-liner):

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e ENABLE_OLLAMA_API=False \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=not-needed \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

LibreChat (librechat.yaml, under endpoints.custom):

- name: "Rapid-MLX"
  apiKey: "rapid-mlx"
  baseURL: "http://localhost:8000/v1/"
  models:
    default: ["default"]
    fetch: true
  titleConvo: true
  titleModel: "current_model"
  modelDisplayLabel: "Rapid-MLX"

Hermes Agent (~/.hermes/config.yaml):

model:
  provider: "custom"
  default: "default"
  base_url: "http://localhost:8000/v1"
  context_length: 32768

OpenClaude:

CLAUDE_CODE_USE_OPENAI=1 OPENAI_BASE_URL=http://localhost:8000/v1 \
OPENAI_API_KEY=not-needed OPENAI_MODEL=default openclaude -p "hello"

Claw Code:

export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
claw --model "openai/default" prompt "summarize this repo"

PydanticAI (pip install pydantic-ai):

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIChatModel(
    model_name="default",
    provider=OpenAIProvider(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",
    ),
)
print(Agent(model).run_sync("What is 2+2?").output)

smolagents (pip install smolagents):

from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(
    model_id="default",
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
)
CodeAgent(tools=[], model=model).run("What is 5 multiplied by 7?")

Anthropic SDK (pip install anthropic):

from anthropic import Anthropic
client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

message = client.messages.create(
    model="default",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Say hello"}],
)
print(message.content[0].text)

Model-Harness Index (MHI)

MHI measures how well a model works with a specific agent harness. It combines three dimensions:

Dimension	Weight	What it measures	Source
Tool Calling	50%	Can the model+harness execute function calls correctly?	`rapid-mlx agents --test`
HumanEval	30%	Can the model generate correct code?	HumanEval (10 tasks)
MMLU	20%	Does the harness degrade base knowledge?	tinyMMLU (10 tasks)

MHI = 0.50 × ToolCalling + 0.30 × HumanEval + 0.20 × MMLU (scale 0-100)

Model	Best MHI	Best Harness	Tool Calling
Qwopus 27B	92	Hermes / PydanticAI / LangChain / smolagents	100%
Qwen3.5 27B	82	Hermes / PydanticAI / LangChain	100%
Llama 3.3 70B	83	smolagents (text-based)	100%
Nemotron Nano 30B	59	PydanticAI / LangChain	91–93%
Gemma 4 26B	62	Hermes / smolagents	100%

Run rapid-mlx agents to list supported agents and python3 scripts/mhi_eval.py to compute MHI on your own setup.

Command Reference

Command	What it does
`rapid-mlx chat [model]`	Interactive chat REPL (alias: `rapid-mlx run`)
`rapid-mlx serve <model>`	Start an OpenAI-compatible HTTP server
`rapid-mlx share <model>`	Same as `serve`, but tunneled to a public URL
`rapid-mlx agents [name]`	List agent integrations; `--setup` / `--test` per integration
`rapid-mlx bench <model>`	Throughput benchmark; `--submit` runs the standardized B=1 community bench and opens a PR
`rapid-mlx models`	List all aliases (`--cached` shows only locally-downloaded ones; `ls` is a top-level alias)
`rapid-mlx pull <model>`	Download a model to the HF cache without starting a server
`rapid-mlx rm <model>`	Remove a cached model
`rapid-mlx ps`	List running `rapid-mlx` servers
`rapid-mlx info <model>`	Show the per-model profile (parsers, hybrid / MoE flags, DFlash eligibility)
`rapid-mlx doctor`	Self-diagnostic — Metal, imports, CLI, inference pipeline
`rapid-mlx upgrade`	Detect install method (brew / pip / install.sh) and upgrade
`rapid-mlx telemetry <status\|enable\|disable\|preview\|reset>`	Manage anonymous usage telemetry (off by default; see Telemetry)
`rapid-mlx version`	Print the installed version

Every subcommand supports --help. Tab completion is wired automatically via argcomplete — see shell completion if it doesn't fire.

Choose Your Model

What fits my Mac?

The model has to fit in your Mac's RAM. If your Mac slows down or Activity Monitor shows red memory pressure, pick a smaller model from the table below.

Your Mac	Best Model	RAM Used	Speed (B=1)	Quality
16 GB MacBook Air/Pro	Qwen3.5-4B 4bit	2.4 GB	147 tok/s	Good for chat and simple tasks
24 GB MacBook Pro	Qwen3.5-9B 4bit	5.1 GB	101 tok/s	Great all-rounder
32 GB Mac Mini / Studio	Qwen3.5-27B 4bit	15.3 GB	37 tok/s	Solid coding model
32 GB Mac Mini / Studio	Gemma 4 12B 4bit	7 GB	64 tok/s	Vision-capable + tool calling
32 GB Mac Mini / Studio	GPT-OSS 20B MXFP4	11 GB	119 tok/s	Harmony-native, 100% tools
32 GB Mac Mini / Studio	Qwen3.6-35B-A3B 4bit	20 GB	93 tok/s	256 MoE experts, 262K context
36 GB MacBook Pro M3/M4 Pro	Qwen3.5-27B 4bit	15.3 GB	37 tok/s	Same as 32 GB — extra headroom for long contexts
48 GB Mac Mini / Studio	Qwen3.5-35B-A3B 8bit	37 GB	80 tok/s	Sweet spot — smart + fast
64 GB Mac Mini / Studio	Qwen3.5-35B-A3B 8bit	37 GB	80 tok/s	Same model, more room for KV cache
96 GB Mac Studio / Pro	Qwen3.5-122B mxfp4	65 GB	57 tok/s¹	Best model, fits comfortably
128 GB Mac Studio / Pro	DeepSeek V4 Flash 2-bit DQ	91 GB	56 tok/s¹	158B-A13B frontier MoE, day-0 (chat only)
192 GB Mac Studio / Pro	Qwen3.5-122B 8bit	130 GB	44 tok/s¹	Maximum quality
256 GB Mac Studio Ultra	DeepSeek V4 Flash 8-bit	136 GB	31 tok/s¹	158B-A13B frontier MoE, 1M context (chat only)

_{Speed = single-user end-to-end throughput (B=1: one request, 256 max output tokens, output_tokens / wall-clock including first-token latency), median of 3 rounds. rapid-mlx v0.6.83 (fused top-p sampler) on M3 Ultra 256 GB, 2026-06-09. ¹ Carried over from prior bench (disk-constrained on this refresh).}

Not getting the speed in the table? Most likely the model is reasoning before answering — add --no-think to skip chain-of-thought. See Troubleshooting.

4bit vs 8bit: 4bit models are compressed to use less memory (recommended for most users). 8bit models are higher quality but need more RAM. "mxfp4" is a high-quality 4-bit format.

Alias naming convention

Every alias follows the same template:

<family>-<version>-<params>-<modality?>-<technique?>-<quant>

Segment	Meaning	Examples
family	Model family	`gemma`, `qwen`, `llama`, `mistral`, `deepseek`, `phi`
version	Major version	`-4`, `3.5`, `3.6`, `-r1`, `-v4-flash`
params	Parameter count (MoE includes active count)	`12b`, `27b`, `35b-a3b` (35B total / 3B active)
modality (optional)	Non-text variants	`-vl` (vision), `-coder` (code)
technique (optional)	Training-time modifier	`-qat` (Quantization-Aware Training), `-distill`, `-thinking`
quant (mandatory)	Quantization tier	`-4bit`, `-8bit`, `-mxfp4`, `-qat-8bit`, …

The quantization suffix is mandatory on every alias — qwen3.5-4b-4bit not qwen3.5-4b. This mirrors LM Studio's …-MLX-4bit HuggingFace convention so you never guess the bit width. -qat is a technique suffix, not a quant — it stacks before the quant (so a QAT-trained Gemma 4 12B in 4-bit is gemma-4-12b-qat-4bit).

Quant suffix	Meaning
`-4bit`	Standard MLX 4-bit (most common)
`-8bit`	Standard MLX 8-bit (higher quality, ~2× RAM)
`-2bit`, `-3bit`, `-6bit`	Other bit widths
`-mxfp4`	Microscaling FP4 (high-quality 4-bit)
`-mxfp4-q8`	MXFP4 weights + Q8 head (GPT-OSS style)
`-dwq`	Dynamic Weight Quantization (mlx-community)
`-ud`	Unsloth Dynamic (mixed-precision per-layer)
`-unpacked`	Original FP16 / BF16 weights, no quantization

Full model lineup

128+ explicit aliases across 30+ families ship today (rapid-mlx models), including audio (13 TTS + 13 STT) and embedding aliases. rapid-mlx info <alias> shows the per-alias profile — parser, hybrid / MoE flags, K8V4 eligibility, DFlash / MTP eligibility.

Text families at a glance

Family	Aliases	Notable
Qwen3.5	`qwen3.5-4b-4bit`, `-4b-8bit`, `-9b-4bit`, `-9b-8bit`, `-9b-6bit`, `-27b-4bit`, `-27b-6bit`, `-27b-8bit`, `-35b-4bit`, `-35b-6bit`, `-35b-8bit`, `-122b-mxfp4`, `-122b-8bit`	DeltaNet hybrid; 27b-8bit DFlash-eligible; K8V4-verified: `-9b-4bit`, `-9b-8bit`, `-27b-4bit`, `-27b-8bit`, `-35b-6bit`
Qwen3.6	`qwen3.6-27b-4bit`, `-27b-8bit`, `-27b-ud`, `-35b-4bit`, `-35b-6bit`, `-35b-8bit`, `-35b-dwq`, `-35b-ud`	262K ctx, 256 MoE experts; 27b-8bit DFlash-eligible; K8V4-verified: `-35b-4bit`, `-35b-6bit`, `-35b-8bit`, `-35b-dwq`
Qwen3	`qwen3-0.6b-8bit`, `-4b-8bit`, `-8b-8bit`, `qwen3-coder-4bit`, `qwen3-coder-30b-4bit`, `qwen3-vl-4b-4bit`, `-8b-4bit`, `-30b-4bit`	Coding + vision
Qwopus	`qwopus-9b-4bit`, `qwopus-27b-4bit`, `qwopus-27b-8bit`	92 MHI on tool calling
DeepSeek	`deepseek-r1-8b-4bit`, `-32b-4bit`, `deepseek-coder-v2-lite-16b-4bit`, `deepseek-v4-flash-2bit`, `-4bit`, `-8bit`	R1 reasoning + V4 Flash 158B-A13B day-0
Gemma	`gemma-3n-e4b-4bit`, `gemma-4-12b-4bit`, `-12b-8bit`, `-12b-qat-4bit`, `-12b-qat-8bit`, `-26b-4bit`, `-26b-qat-4bit`, `-31b-4bit`, `-31b-8bit`, `-31b-qat-4bit`, `-31b-qat-8bit`, `gemma3-1b-4bit`, `-12b-4bit`, `-27b-4bit`	Vision-capable; QAT variants
Llama / Hermes	`llama3-1b-4bit`, `-3b-4bit`, `llama-3.1-8b-4bit`, `-8b-8bit`, `hermes3-8b-4bit`, `hermes4-70b-4bit`
GLM	`glm4.5-air-4bit`, `glm4.7-9b-4bit`
GPT-OSS	`gpt-oss-20b-mxfp4-q8`	Harmony native
MiniMax / Kimi	`minimax-m2.5-4bit`, `minimax-m2.7-mxfp4`, `kimi-48b-4bit`, `kimi-k2.5-3bit`
Mistral / Devstral	`mistral-24b-4bit`, `devstral-24b-4bit`, `devstral-v2-24b-4bit`, `ministral-3b-4bit`
Phi / Granite / Nemotron / Bonsai / SmolLM	`phi-4-14b-4bit`, `phi-4-mini-4bit`, `smollm3-3b-4bit`, `nemotron-30b-4bit`, `bonsai-1.7b-unpacked`, `-4b-unpacked`, `-8b-unpacked`, `granite4-tiny-4bit`
Text-Diffusion	`diffusion-gemma-26b-4bit`, `diffusion-gemma-26b-8bit`	Non-autoregressive (block denoising); same `/v1/chat/completions` API
Tmax / VibeThinker / Nanbeige / Holo3.1 / Bonsai / SmolLM / EmbeddingGemma / UI	`tmax-9b-4bit`, `tmax-27b-4bit`, `vibethinker-`, `nanbeige4.1-`, `holo3.1-a3b-`, `ui-`, `embeddinggemma-*`	Community aliases (MLX-first-mover, agentic, GUI, embeddings)

TurboQuant K8V4 KV codec is default-on for 9 verified Qwen3.5/3.6 MoE aliases (see Features); DFlash and MTP speculative drafters are opt-in per alias (rapid-mlx info <alias> shows what's eligible).

Copy-paste serve commands

# 16 GB — lightweight, fast
rapid-mlx serve qwen3.5-4b-4bit --port 8000

# 24 GB — best small model
rapid-mlx serve qwen3.5-9b-4bit --port 8000

# 32 GB — Gemma 4 12B (vision-capable, 64 tok/s)
rapid-mlx serve gemma-4-12b-4bit --port 8000

# 32 GB — GPT-OSS 20B (harmony-native, 100% tool calling, 119 tok/s)
rapid-mlx serve gpt-oss-20b-mxfp4-q8 --port 8000

# 32+ GB — Qwen 3.6 35B-A3B (256 experts, 262K context, 93 tok/s)
rapid-mlx serve qwen3.6-35b-4bit --port 8000

# 48+ GB — sweet spot (Qwen3.5-35B-A3B 8bit, 80 tok/s)
rapid-mlx serve qwen3.5-35b-8bit --prefill-step-size 8192 --port 8000

# 96+ GB — frontier (Qwen3.5-122B mxfp4)
rapid-mlx serve qwen3.5-122b-mxfp4 --prefill-step-size 8192 --port 8000

# Coding agent — fast MoE
rapid-mlx serve qwen3-coder-4bit --prefill-step-size 8192 --port 8000

# Vision — image understanding (needs [vision] extras)
rapid-mlx serve qwen3-vl-4b-4bit --mllm --port 8000

# Text-diffusion — DiffusionGemma 26B-A4B (block denoising, needs [vision] for mlx-vlm 0.6.3+)
rapid-mlx serve diffusion-gemma-26b-4bit --port 8000

Vision deps: Install into the same environment where rapid-mlx lives:

pip users: pip install 'rapid-mlx[vision]' (in the same venv)

install.sh users: ~/.rapid-mlx/bin/pip install 'rapid-mlx[vision]'

brew users: $(brew --prefix)/opt/rapid-mlx/libexec/bin/pip install 'rapid-mlx[vision]'

Parser auto-detection & manual overrides

Parsers are auto-detected from the model name — you don't need to specify --tool-call-parser or --reasoning-parser for supported families. Explicit flags always override auto-detection.

Model Family	`--tool-call-parser`	`--reasoning-parser`	Notes
Qwen3.5 (all sizes)	`hermes`	`qwen3`	Recommended — 100% tool calling
Qwen3.6	`qwen3_coder_xml`	`qwen3`	XML tool format, 262K context
Qwen3-Coder-Next	`hermes`	(none)	Fast coding, non-thinking mode
DeepSeek R1-0528 / V3.1	`deepseek_v31`	`deepseek_r1`	Dedicated V3.1 parser
DeepSeek R1 (older)	`deepseek`	`deepseek_r1`	With reasoning
DeepSeek V3 / V2.5	`deepseek`	(none)	No reasoning parser
GLM-4.7	`glm47`	(none)	100% tool calling
MiniMax-M2.5	`minimax`	`minimax`	XML tool format
GPT-OSS	`harmony`	`harmony`	Native format
Kimi-Linear	`kimi`	(none)	Kimi tool format
Llama 3.x	`llama`	(none)	JSON tool format
Mistral / Devstral	`hermes`	(none)	Hermes-compatible
Gemma	`hermes`	(none)	Hermes-compatible
Phi-3/4	`hermes`	(none)	Hermes-compatible

All 17 parsers include automatic recovery — if a quantized model outputs broken tool calls as text, they're auto-converted back to structured format.

Text-Diffusion (DiffusionGemma 26B-A4B)

DiffusionGemma is a non-autoregressive language model — instead of emitting one token at a time, it denoises whole blocks of tokens in parallel via a diffusion process. Rapid-MLX wraps it behind the standard OpenAI Chat Completions API.

pip install 'rapid-mlx[vision]'       # mlx-vlm 0.6.3+ provides the diffusion runtime
rapid-mlx serve diffusion-gemma-26b-4bit --port 8000

B=1 single-user benchmark (M3 Ultra 256 GB, mlx-community/diffusiongemma-26B-A4B-it-4bit, median of 3 runs + 1 warmup):

`max_tokens`	TTFT	E2E	Aggregate tok/s
64	1.47s	1.47s	43
256	6.00s	6.00s	43
1024	5.71s	19.58s	37

Diffusion models emit tokens in whole denoising blocks, so the conventional decode_tok/s = tokens / (e2e − ttft) metric isn't meaningful here (ttft ≈ e2e for short outputs). The table reports aggregate throughput — tokens / total_wall_time — i.e. how many tokens actually land in the chat window per second. Throughput climbs with output length because the per-step denoising cost amortizes across more emitted tokens.

Reproduce: python3.12 scripts/bench_diffusion_gemma.py --port 8000.

Benchmarks

Tested on Mac Studio M3 Ultra (256 GB), 2026-06-06. Workload is B=4 sustained concurrent streaming (four parallel chat requests, 256 max output tokens each), median of 3 measured rounds after one warmup discard. Engines were swapped sequentially with an 8 s Metal cooldown so contention never crossed engine boundaries.

chat_template_kwargs.enable_thinking=False is passed to all engines that honour it (rapid-mlx, mlx-lm, mlx-vlm). Ollama 0.24 ignores that hook for Qwen3 and keeps streaming reasoning chunks — those decode at the same model rate as content tokens, so we count them, and the Qwen3 Ollama numbers reflect chain-of-thought-on throughput in practice. Token counts come from the streaming usage chunk (authoritative), not from counting SSE frames.

Versions: rapid-mlx v0.6.80, mlx-lm 0.31.3, Ollama 0.24.0 (latest stable).

Aggregate throughput = sum of output tokens across all four streams ÷ wall-clock seconds — the metric that matters for a server fronting multiple users or a TUI firing parallel sub-agents. Per-user decode is roughly aggregate ÷ 4 on a true batching engine; on Ollama 0.24 (no in-flight batching) the four streams effectively serialize.

Model (rapid-mlx alias)	rapid-mlx (B=4)	mlx-lm serve	Ollama tag (closest)	Ollama (B=4)	vs mlx-lm	vs Ollama
Qwen3.5-4B	261 tok/s	173	`qwen3:4b`¹	120	1.51x	2.18x
Qwen3.5-9B	180 tok/s	136	`qwen3:8b`¹	84	1.32x	2.14x
Qwen3.5-27B	66 tok/s	55	`qwen3:32b`²	27	1.20x	2.43x
Gemma 4 12B	55 tok/s	crash³	`gemma3:12b`⁴	56	—	1.00x
GPT-OSS 20B	221 tok/s	162	`gpt-oss:20b` ✅	97	1.36x	2.29x
Qwen3.6-35B-A3B (4-bit)	176 tok/s	129	`qwen3:30b-a3b`⁵	87	1.37x	2.02x
Qwen3.5-35B-A3B (8-bit)	151 tok/s	112	`qwen3:30b-a3b`⁵	87	1.35x	1.74x

✅ Direct apples-to-apples: identical weights both sides.

_{¹ Ollama Qwen3 base, not Qwen3.5 — DeltaNet hybrid arch isn't on llama.cpp yet. ² Closest dense Qwen3; Unsloth Qwen3.6-27B GGUF fails to load on Ollama 0.24. ³ mlx-lm 0.31.3 has no Gemma 4 loader (it lives in mlx-vlm). ⁴ Gemma 4 not yet on llama.cpp — Gemma 3 is the closest. ⁵ Closest MoE A3B available; Qwen3.5/3.6-35B-A3B don't have a llama.cpp build yet.}

Reproduce the throughput table:

python3.12 scripts/bench_readme_refresh.py \
  --models qwen3.5-4b-4bit,qwen3.5-9b-4bit,qwen3.5-27b-4bit,gemma-4-12b-4bit,gpt-oss-20b-mxfp4-q8,qwen3.6-35b-4bit,qwen3.5-35b-8bit \
  --engines rapid-mlx,mlx-lm,ollama

Raw JSON per round + per-stream tok/s land in reports/benchmarks/readme-refresh/. Want to add your hardware? Run rapid-mlx bench <alias> --submit — it opens a PR for you and your numbers show up on rapidmlx.com.

TTFT — Prompt Cache Advantage

Prompt cache keeps multi-turn conversations fast. For standard transformers, KV cache trimming gives sub-100ms TTFT. For hybrid RNN models (Qwen3.5 DeltaNet), we use state snapshots — the first technique to bring prompt cache to non-trimmable architectures on MLX.

_{Numbers below were last verified 2026-04 — the prefix-cache code path has not changed since.}

Pure KV cache (transformers):

Model	Rapid-MLX (cached)	mlx-lm serve	Speedup
Kimi-Linear-48B	0.08s	—	—
Llama 3.2 3B	0.10s	—	—
Hermes-3-Llama 8B	0.10s	0.18s	1.8x
Phi-4 Mini 14B	0.13s	0.15s	1.2x
Devstral-Small-2 24B	0.13s	0.38s	2.9x
Mistral Small 24B	0.13s	0.38s	2.9x
GLM-4.7-Flash 9B	0.13s	0.23s	1.8x
GLM-4.5-Air	0.14s	0.47s	3.4x
Qwen3-Coder-Next 80B	0.16s	0.27s	1.7x
GPT-OSS 20B	0.16s	0.27s	1.7x
Qwen3.5-9B	0.22s	0.26s	1.2x
Gemma 4 E4B	0.25s	— (day-0)	—
Gemma 4 26B-A4B	0.25s	— (day-0)	—
Gemma 4 31B	0.34s	0.57s (mlx-vlm bf16)	1.7x

DeltaNet state snapshots (hybrid RNN + attention):

Qwen3.5 uses Gated DeltaNet (75% RNN) + full attention (25% KV). Other engines recreate the entire cache from scratch every request — we snapshot the RNN state at the system prompt boundary, restoring in ~0.1ms instead of re-running hundreds of tokens through the recurrent layers.

Model	Cold TTFT	Snapshot TTFT	Speedup
Qwen3-Coder-Next 6bit (48L)	0.66s	0.16s	4.3x
Qwen3.5-35B-A3B 8bit (40L)	0.49s	0.19s	2.6x
Qwen3.5-27B 4bit (40L)	0.58s	0.27s	2.1x
Qwen3.5-9B 4bit (40L)	0.27s	0.22s	1.2x
Qwen3.5-4B 4bit (32L)	0.24s	0.16s	1.5x

Capability comparison vs other Apple-Silicon runtimes

Feature	Rapid-MLX	oMLX	Ollama	llama.cpp	mlx-lm serve
Tool calling	100% (Qwen/GLM/GPT-OSS/Kimi)	N/A	100% (Qwen)	80% (Phi-4)	N/A
Tool call recovery	100%	N/A	100%	100%	N/A
Tool injection fallback	Yes	No	No	No	No
Think-tag leak	0%	N/A	0%	0%	N/A
Prompt cache	KV + DeltaNet	No	No	No	No
Vision	Yes	Yes	Yes	No	No
Audio (STT/TTS)	Yes	No	No	No	No
17 tool parsers	Yes	No	No	No	No
Cloud routing	Yes	No	No	No	No
Streaming	Yes	Yes	Yes	Yes	Yes
OpenAI API	Yes	Yes	Yes	Yes	Yes

Optimization techniques per model

Technique	What it does	Models
Radix prefix cache	O(prefix_len) trie lookup + copy-on-write nodes for multi-tenant shared system prompts	All transformer models (default `--prefix-cache-index radix`)
DeltaNet state snapshots	Deep-copy RNN state at prefix boundary, restore in ~0.1 ms	Qwen3.5 (4B, 9B, 27B, 35B, 122B), Qwen3-Coder-Next
Hybrid cache sync	Keep trimmable KV + non-trimmable RNN layers in sync	Qwen3.5 (Gated DeltaNet + attention)
PFlash sparse prefill	Score long prompt, prefill only sink + tail + relevant middle	Default-on for verified aliases; `--pflash always`
Tool logits bias	Jump-forward decoding — bias logits toward structured tokens	All models with `--enable-tool-logits-bias`
Auto tool recovery	Detect broken text-format tool calls, convert to structured	All 17 parser formats (incl. Gemma 4)
TurboQuant K8V4 codec	K 8-bit + V 4-bit after Walsh-Hadamard + Lloyd-Max (~1/2.4 compression, lossless across verified matrix)	Default-on for 9 verified Qwen3.5/3.6 aliases; `--kv-cache-turboquant k8v4\|v4\|none`
KV cache quantization	Quantize prefix cache entries to reduce memory	All models with `--kv-cache-quantization`
DFlash speculative decoding	Block-diffusion drafter, parallel draft + verify (single-user)	`qwen3.5-27b-8bit`, `qwen3.6-27b-8bit` with `--enable-dflash`
MTP speculative decoding	Multi-token prediction head shipped with the model	MTP-trained checkpoints with `--enable-mtp`
SuffixDecoding	Drafter-free, statistical n-gram lookup speculative decoding	All BatchedEngine models with `--suffix-decoding`
Prefill chunking	Configurable step size for large-prompt throughput	All models
Cloud routing	Offload high-token requests to cloud LLM when local is slow	All models with `--cloud-model`

Eval benchmarks (20 models, 4 suites)

Tool calling (30 scenarios), coding (HumanEval+), reasoning (MATH-500), general knowledge (MMLU-Pro). Top models:

Model	Decode (B=1)	Tools	Code	Reason	General	Avg
Qwen3.5-122B 8bit	44 t/s¹	87%	90%	90%	90%	89%
Qwen3.5-35B 8bit	59 t/s	90%	90%	80%	80%	85%
Qwen3-Coder-Next 4bit	74 t/s¹	90%	90%	70%	70%	80%
Qwen3.5-27B 4bit	33 t/s	83%	90%	50%	80%	76%
Qwen3.5-9B 4bit	100 t/s	83%	70%	60%	70%	71%

_{Decode = single-user end-to-end throughput refreshed 2026-06-06 against rapid-mlx v0.6.80. ¹ Carried over from the 2026-04 bench (not re-measured this round).}

Run your own: bash evals/run_all_models.sh runs the full quality suite (tool calling, coding, reasoning, general) across every alias and emits a fresh evals/SCORECARD.md.

Features

Feature	What it does
Tool calling	OpenAI-compatible, 17 parser formats, automatic recovery when quantized models break (4-bit Qwen, Gemma, Llama, etc.).
Reasoning separation	Models with chain-of-thought (Qwen3, DeepSeek-R1, MiniMax, GPT-OSS) emit reasoning in a separate `reasoning_content` field, cleanly streamed alongside `content`. Casual chat + tool-call requests auto-disable thinking for lower TTFT (0.8.x+).
Radix prefix cache	Multi-tenant shared system prompts converge on the same radix node — O(prefix_len) lookup vs. O(log N + LCP_scan) on hash-keyed caches. Default index is `radix`; `--prefix-cache-index hash` for regressions. Always on; disable with `--disable-prefix-cache`.
RNN prompt cache	For hybrid models (Qwen3.5 DeltaNet), RNN state snapshots restore non-trimmable layers in ~0.1 ms instead of re-running the recurrent stack. 2–5× faster TTFT vs. mlx-lm on the DeltaNet aliases.
PFlash prefill acceleration	Sparse prefill on 32K+ prompts (attention sink + recent tail + query-relevant middle) — 3.87–8.5× TTFT on cold long prompts with full needle-in-a-haystack recall. Default-on for verified aliases; `--pflash always` forces it.
TurboQuant K8V4 KV codec	K is 8-bit + V is 4-bit after Walsh-Hadamard rotation + Lloyd-Max quantization — compresses KV to ~1/2.4 (~58% savings), lossless across the verified matrix. Default-on for 9 verified Qwen3.5/3.6 aliases (see below); `--kv-cache-turboquant none` to force off.
Smart cloud routing	Large-context requests auto-route to a cloud LLM (`--cloud-model openai/gpt-5 --cloud-threshold 20000`) when local prefill would be slow.
Multimodal	Vision, audio (STT/TTS with bundled Silero VAD silence pre-trim), video understanding, text embeddings — all through the standard OpenAI endpoints.
Speculative decoding	DFlash (block-diffusion drafter, `--enable-dflash`), MTP (multi-token prediction head, `--enable-mtp`), SuffixDecoding (drafter-free n-gram, `--suffix-decoding`). Opt-in per alias — see below.
Structured output, logprobs, continuous batching, KV quantization	Standard, no flags or with one flag each.
3300+ tests	Across reasoning, tool parsing, streaming, agent harnesses, and engine integration.

K8V4 default-on aliases (9 as of 0.9.9): qwen3.5-9b-4bit, qwen3.5-9b-8bit, qwen3.5-27b-4bit, qwen3.5-27b-8bit, qwen3.5-35b-6bit, qwen3.6-35b-4bit, qwen3.6-35b-6bit, qwen3.6-35b-8bit, qwen3.6-35b-dwq. Radix prefix cache is independent and saves additional bytes proportional to inter-request prefix overlap — the two are orthogonal.

Speculative decoding — DFlash, MTP, SuffixDecoding

Three drafters ship in-tree. All are opt-in per alias — rapid-mlx info <alias> shows what's eligible on the model you're serving.

DFlash — block-diffusion drafter (via mlx-vlm), single-user path. Opt in with --enable-dflash; requires pip install 'rapid-mlx[dflash]'.

Alias	Drafter	Avg speedup	Min / Max
`qwen3.6-27b-8bit`	`z-lab/Qwen3.6-27B-DFlash`	1.49×	1.06× / 2.07×
`qwen3.5-27b-8bit`	`z-lab/Qwen3.5-27B-DFlash`	1.31×	0.59× / 2.15×

pip install 'rapid-mlx[dflash]'
rapid-mlx info qwen3.5-27b-8bit       # check per-gate eligibility
rapid-mlx serve qwen3.5-27b-8bit --enable-dflash

Workload sensitivity: coding / math / summarization typically see 1.5–2.7×; high-entropy creative writing and long-form chat can dip to 0.6–0.9× because the drafter's training distribution diverges from open-ended generation — a known spec-decode literature pattern (AdaEDL), not a bug. DFlash mode runs a dedicated single-user server (no batched kernel yet); tool calling, MCP, and embeddings aren't available in that mode — restart without --enable-dflash for those.

MTP (Multi-Token Prediction) — draft head baked into the model. Opt in with --enable-mtp on MTP-trained checkpoints. Auto-tune of --mtp-num-draft-tokens per workload is roadmapped for the 0.9.1x line.

SuffixDecoding — drafter-free, statistical n-gram lookup over the running suffix. Opt in with --suffix-decoding; works on any BatchedEngine alias.

Mutex: DFlash cannot combine with MTP or SuffixDecoding (single-user path). MTP and SuffixDecoding cannot combine with each other (both consume the drafter slot).

Server flags reference

You don't need any flags to get started — the defaults work for most setups. These are for advanced tuning.

Core

Flag	Description	Default
`<model>`	HuggingFace model name, local path, or alias (positional arg)	(required)
`--host`	Host to bind to (loopback-only by default; pass `0.0.0.0` to expose on LAN)	`127.0.0.1`
`--port`	Port to bind to	`8000`
`--max-tokens`	Default max tokens for generation	`32768`

Tool calling & reasoning

Flag	Description	Default
`--tool-call-parser`	Parser: `hermes`, `minimax`, `qwen`, `llama`, `deepseek`, etc.	(auto-detected)
`--reasoning-parser`	Parser: `qwen3`, `deepseek_r1`, `minimax`, `gpt_oss`, `harmony`, `glm4`, `gemma4`	(auto-detected)
`--no-thinking` / `--no-think`	Disable reasoning even if auto-detected; faster, no `<think>` traces	off
`--enable-tool-logits-bias`	Jump-forward decoding for faster tool calls	off

Performance

Flag	Description	Default
`--prefill-step-size`	Tokens per prefill chunk	`2048`
`--kv-cache-turboquant`	TurboQuant KV-cache codec (`k8v4` = K 8-bit + V 4-bit, ~1/2.4 compression; `v4` = legacy V-only; `none` = off). Default-on for 9 verified Qwen3.5/3.6 aliases.	auto (`k8v4` on verified aliases, else off)
`--kv-cache-quantization`	Quantize prefix cache entries for memory savings	off
`--enable-prefix-cache` / `--disable-prefix-cache`	Cache common prefixes across requests	on
`--prefix-cache-index`	Prefix-cache lookup index: `radix` (default) or `hash`	`radix`
`--pflash`	PFlash sparse prefill: `auto` / `always` / `off`	`auto` (on for verified aliases)
`--enable-dflash`	DFlash speculative decoding (single-user; `qwen3.5-27b-8bit` / `qwen3.6-27b-8bit`)	off
`--suffix-decoding`	Drafter-free n-gram speculative decoding (BatchedEngine path)	off
`--enable-mtp`	MTP head speculative decoding (requires MTP-trained model)	off
`--gpu-memory-utilization`	Fraction of device memory to use (0.0-1.0)	`0.90`

Cloud routing

Flag	Description	Default
`--cloud-model`	litellm model string (e.g. `openai/gpt-5`)	(disabled)
`--cloud-threshold`	New token threshold to trigger cloud routing	`20000`

Security & other

Flag	Description	Default
`--api-key`	API key for authentication	(no auth)
`--rate-limit`	Requests per minute per client	(unlimited)
`--timeout`	Request timeout in seconds	`1800`
`--mllm` / `--no-mllm`	Force / disable multimodal (vision) mode	auto-detect
`--force-openai-harmony-streaming`	Force the openai-harmony streaming router on (debug-only)	auto-detect
`--no-openai-harmony-streaming`	Disable the openai-harmony streaming router; fall back to the legacy state machine	auto-detect
`--mcp-config`	MCP configuration file for tool integration	(none)
`--embedding-model`	Pre-load embedding model at startup	(none)

Optional Extras

The base pip install rapid-mlx is ~460 MB and covers all text-only models. Vision, audio, and other features ship as opt-in extras:

Extra	Install	Adds	What it unlocks
`vision`	`pip install 'rapid-mlx[vision]'`	~322 MB	Gemma 4, Qwen-VL, DiffusionGemma, video understanding (mlx-vlm + opencv + torch)
`audio`	`pip install 'rapid-mlx[audio]'`	~600 MB	26 TTS/STT aliases (Kokoro, Chatterbox, VibeVoice, VoxCPM, Dia, Whisper, Parakeet) via `/v1/audio/speech` and `/v1/audio/transcriptions`; bundles Silero VAD for silence pre-trim (guards Whisper against silence hallucination)
`embeddings`	`pip install 'rapid-mlx[embeddings]'`	~50 MB	`/v1/embeddings` endpoint (mlx-embeddings)
`chat`	`pip install 'rapid-mlx[chat]'`	~150 MB	Built-in Gradio chat UI
`guided`	`pip install 'rapid-mlx[guided]'`	~80 MB	Schema-constrained JSON generation (outlines)
`dflash`	`pip install 'rapid-mlx[dflash]'`	~50 MB	DFlash speculative decoding runtime (text-only, no torch)
`all`	`pip install 'rapid-mlx[all]'`	~1.1 GB	Vision + audio + chat + embeddings

If you installed via Homebrew, install extras into the brew-managed Python so the rapid-mlx binary picks them up:

$(brew --prefix)/opt/rapid-mlx/libexec/bin/pip install 'rapid-mlx[vision]'   # or [audio], [embeddings], ...

(A separate venv-local pip install 'rapid-mlx[vision]' also works, but only affects the rapid-mlx inside that venv — the brew-installed binary keeps its own libexec Python.)

Troubleshooting

Run the built-in self-diagnostic (works from pip install, no dev tools needed):

rapid-mlx doctor

Rapid-MLX Doctor
============================================================
  [metal] OK        # Apple Silicon Metal GPU available
  [imports] OK      # Core modules import cleanly
  [cli] OK          # CLI commands respond
  [model_load] OK   # Inference pipeline works
Result: PASS

Common gotchas:

Getting much lower tok/s than the speed table? Most often the model is reasoning before answering. Qwen3.5 and Qwen3.6 default to thinking-on, which doubles the work. Add --no-think to chat or serve to skip chain-of-thought. (This is the cause of the most common bug report — see issue #567.)
Out of memory or very slow (<5 tok/s). Model too big for your RAM. Check What fits my Mac? and pick a smaller quant (4-bit) or smaller model.
Empty responses. Remove --reasoning-parser for non-thinking models.
Tool calls coming through as plain text. Set the right --tool-call-parser for your model, or rely on auto-detection. Even without it, Rapid-MLX auto-recovers most cases.
Slow first response. Two causes: (1) Qwen3.5/3.6 think before answering — add --no-think; (2) cold prefill on long prompts — add --prefill-step-size 8192. Subsequent turns hit prompt cache and are 10–30× faster.
"parameters not found in model" warnings at startup. Normal for VLMs — vision weights are auto-skipped.

Shell completion

Tab completion works automatically when installed via Homebrew. For pip / install.sh, activate it once:

# zsh / bash
eval "$(register-python-argcomplete rapid-mlx)"

# Or system-wide:
activate-global-python-argcomplete

Telemetry

Rapid-MLX can send anonymous usage data to help us prioritise the right models and catch regressions. It is off by default and never starts collecting without your explicit opt-in.

What we collect (only if you opt in)

Subcommand names (serve / chat / agents / bench / doctor)
Model alias names (qwen3.5-9b-4bit) or canonical HF repo IDs (mlx-community/...) — local paths are redacted to <local>
Bucketed counts: prompt/completion tokens, TTFT, tokens/sec — never exact values
Error categories + a hash fingerprint of the failure site (exception class name + per-frame file:function:lineno only — never the message text or absolute paths)
OS, arch, Apple chip name, RAM (rounded to GB), Python major.minor

What we never collect

Prompts, completions, tool-call arguments, file contents, or any user-generated text
Local file paths, working directory, or model paths beyond their HF repo ID
IPs or hostnames (Phase 2 routes through a Cloudflare Worker that strips IPs before forwarding; Phase 1 ships no transport at all)
API keys, environment variable values, auth headers
Stack trace messages or argument values

Manage it

rapid-mlx telemetry status     # show current state and why
rapid-mlx telemetry preview    # print the exact JSON payload that would be sent
rapid-mlx telemetry enable     # opt in
rapid-mlx telemetry disable    # opt out
rapid-mlx telemetry reset      # delete consent + client-id files (re-prompts on next run)

Force-disable in scripts / CI

Either of these always wins, regardless of stored consent:

RAPID_MLX_TELEMETRY=0 rapid-mlx serve qwen3.5-9b-4bit
rapid-mlx --no-telemetry serve qwen3.5-9b-4bit

There is intentionally no env-var equivalent for force-on — opting in must be an explicit one-time rapid-mlx telemetry enable. CI agents will never silently contribute. Code lives in vllm_mlx/telemetry/; tracking issue #236.

Development

git clone https://github.com/raullenchai/Rapid-MLX.git
cd Rapid-MLX
pip install -e ".[dev]"

Test commands:

Command	What	Time	Needs server?
`make lint`	ruff lint	~10s	No
`make test`	pytest unit suite (3300+ tests)	~30s	No
`make smoke`	lint + unit	~1 min	No
`make stress`	8-scenario stress test	~5 min	Yes
`make soak`	10-min agent soak test	10 min	Yes
`make check`	1-model regression harness (auto starts server)	~10 min	auto
`make full`	3 models + 12 agent profiles	~1 hr	auto

For stress/soak, start a server first:

rapid-mlx serve qwen3.5-4b-4bit
# In another terminal:
make stress

Or use the script directly for more options: python scripts/dev_test.py {smoke,stress,full}.

Architecture overview

vllm_mlx/
  server.py              # App factory + model loading + CLI entry
  config/                # ServerConfig singleton
  service/
    helpers.py           # Shared request helpers
    postprocessor.py     # Streaming pipeline (100% test coverage)
  routes/
    chat.py              # /v1/chat/completions
    completions.py       # /v1/completions
    responses.py         # /v1/responses (Codex CLI)
    anthropic.py         # /v1/messages (Anthropic API)
    health.py, models.py, embeddings.py, audio.py, mcp_routes.py
  engine/                # BatchedEngine (continuous batching)
  reasoning/             # 7 reasoning parsers (Qwen3, DeepSeek, MiniMax, ...)
  tool_parsers/          # 17 tool call parsers
  speculative/           # DFlash, SuffixDecoding, MTP drafters
  agents/                # 12 agent profiles (YAML)
  runtime/               # Model registry, cache persistence
  doctor/                # User self-diagnostic
scripts/                 # Dev-only (NOT shipped with pip)
tests/                   # pytest unit tests (3300+)
harness/                 # Regression baselines + thresholds

Roadmap

Technique	Expected Gain	Status
DFlash — block-diffusion drafter, single-user	1.3–2× decode (workload-dependent)	Shipping, opt-in (`--enable-dflash`, `[dflash]` extra; qwen3.5-27b-8bit, qwen3.6-27b-8bit)
SuffixDecoding — drafter-free n-gram speculative	1.1–1.5× decode	Shipping (`--suffix-decoding`, per-model tier sweep ongoing)
MTP — Multi-Token Prediction head	1.4–1.7× decode	Shipping, opt-in (`--enable-mtp`, MTP-trained checkpoints); per-workload auto-tune of `--mtp-num-draft-tokens` landing in the 0.9.1x line
EAGLE-3 — feature-level draft on Metal	3–6.5× decode	Not started
ReDrafter — Apple's RNN draft head	1.4–1.5× decode	Not started

See ROADMAP.md for the full backlog.

Contributing

We welcome contributions of all sizes! See CONTRIBUTING.md for setup and guidelines.

Easy first contributions (no model download needed):

Add a model alias — map a short name to a HuggingFace model ID
Request model support — tell us which model you want

Testing contributions (needs a Mac with Apple Silicon):

Share your hardware's benchmark numbers — one command:
```
rapid-mlx bench qwen3.5-9b-4bit --submit
```
Runs the standardized B=1 bench (greedy, 128 + 512 token buckets, 5 rounds each), shows you the JSON payload, asks for consent, and opens the PR for you via gh. If you don't have gh, it prints the JSON path + a deep-link to GitHub's compare page so you can open the PR in your browser. Submitted rows land in community-benchmarks/submissions/ and show up on https://rapidmlx.com once merged.
Test with your favorite AI client (Cursor, Aider, LangChain, etc.)
Report a bug

Contributors

Star History

License

Apache 2.0 — see LICENSE.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

raullenchai

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.9.12

Jul 3, 2026

0.9.11

Jul 3, 2026

0.9.10

Jul 1, 2026

0.9.9

Jun 30, 2026

0.9.8

Jun 28, 2026

0.9.7

Jun 26, 2026

0.9.6

Jun 26, 2026

0.9.5

Jun 26, 2026

0.9.4

Jun 26, 2026

0.9.3

Jun 26, 2026

0.9.2

Jun 26, 2026

0.9.1

Jun 26, 2026

0.9.0

Jun 26, 2026

0.8.19

Jun 25, 2026

0.8.18

Jun 24, 2026

0.8.16

Jun 24, 2026

0.8.15

Jun 24, 2026

0.8.14

Jun 24, 2026

0.8.13

Jun 24, 2026

0.8.12

Jun 23, 2026

0.8.11

Jun 23, 2026

0.8.10

Jun 22, 2026

0.8.9

Jun 22, 2026

0.8.8

Jun 22, 2026

0.8.7

Jun 22, 2026

0.8.6

Jun 22, 2026

0.8.5

Jun 22, 2026

0.8.4

Jun 21, 2026

0.8.3

Jun 21, 2026

0.8.2

Jun 21, 2026

0.8.1

Jun 21, 2026

0.8.0

Jun 20, 2026

0.7.41

Jun 19, 2026

0.7.37

Jun 19, 2026

0.7.32

Jun 19, 2026

0.7.31

Jun 18, 2026

0.7.29

Jun 18, 2026

0.7.28

Jun 18, 2026

0.7.27

Jun 17, 2026

0.7.26

Jun 17, 2026

0.7.20

Jun 16, 2026

0.7.15

Jun 16, 2026

0.7.13

Jun 16, 2026

0.7.12

Jun 16, 2026

0.7.11

Jun 16, 2026

0.7.9

Jun 15, 2026

0.7.7

Jun 15, 2026

0.7.6

Jun 15, 2026

0.7.3

Jun 12, 2026

0.7.2

Jun 11, 2026

0.7.1

Jun 11, 2026

0.7.0

Jun 10, 2026

0.6.83

Jun 9, 2026

0.6.82

Jun 7, 2026

0.6.81

Jun 7, 2026

0.6.80

Jun 6, 2026

0.6.79

Jun 6, 2026

0.6.78

Jun 6, 2026

0.6.77

Jun 6, 2026

0.6.76

Jun 5, 2026

0.6.75

Jun 5, 2026

0.6.74

Jun 4, 2026

0.6.73

Jun 4, 2026

0.6.72

Jun 3, 2026

0.6.71

Jun 1, 2026

0.6.70

Jun 1, 2026

0.6.69

May 30, 2026

0.6.68

May 28, 2026

0.6.66

May 23, 2026

0.6.65

May 22, 2026

0.6.64

May 21, 2026

0.6.63

May 21, 2026

0.6.62

May 21, 2026

0.6.61

May 20, 2026

0.6.60

May 20, 2026

0.6.59

May 19, 2026

0.6.58

May 19, 2026

0.6.57

May 19, 2026

0.6.56

May 18, 2026

0.6.55

May 18, 2026

0.6.54

May 18, 2026

0.6.53

May 18, 2026

0.6.52

May 17, 2026

0.6.51

May 16, 2026

0.6.50

May 15, 2026

0.6.49

May 14, 2026

0.6.48

May 14, 2026

0.6.47

May 14, 2026

0.6.46

May 13, 2026

0.6.45

May 13, 2026

0.6.44

May 13, 2026

0.6.43

May 13, 2026

0.6.42

May 13, 2026

0.6.41

May 13, 2026

0.6.40

May 13, 2026

0.6.39

May 13, 2026

0.6.38

May 13, 2026

0.6.37

May 12, 2026

0.6.36

May 12, 2026

0.6.35

May 11, 2026

0.6.34

May 11, 2026

0.6.33

May 10, 2026

0.6.32

May 10, 2026

0.6.31

May 10, 2026

0.6.30

May 9, 2026

0.6.29

May 9, 2026

0.6.28

May 8, 2026

0.6.27

May 8, 2026

0.6.26

May 8, 2026

0.6.23

May 7, 2026

0.6.20

May 7, 2026

0.6.19

May 7, 2026

0.6.18

May 7, 2026

0.6.16

May 7, 2026

0.6.15

May 6, 2026

0.6.14

May 5, 2026

0.6.13

May 5, 2026

0.6.12

May 5, 2026

0.6.11

May 4, 2026

0.6.10

May 4, 2026

0.6.9

May 4, 2026

0.6.8

May 3, 2026

0.6.7

May 3, 2026

0.6.6

May 1, 2026

0.6.5

May 1, 2026

0.6.4

Apr 29, 2026

0.6.3

Apr 29, 2026

0.6.2

Apr 28, 2026

0.6.1

Apr 22, 2026

0.6.0

Apr 20, 2026

0.5.10

Apr 17, 2026

0.5.9

Apr 17, 2026

0.5.8

Apr 17, 2026

0.5.7

Apr 17, 2026

0.5.6

Apr 17, 2026

0.5.5

Apr 17, 2026

0.5.4

Apr 17, 2026

0.5.3

Apr 17, 2026

0.5.2

Apr 17, 2026

0.5.1

Apr 15, 2026

0.5.0

Apr 14, 2026

0.4.3

Apr 12, 2026

0.4.2

Apr 7, 2026

0.4.1

Apr 7, 2026

0.4.0

Apr 6, 2026

0.3.12

Mar 22, 2026

0.3.11

Mar 21, 2026

0.3.10

Mar 21, 2026

0.3.9

Mar 21, 2026

0.3.8

Mar 21, 2026

0.3.7

Mar 21, 2026

0.3.6

Mar 21, 2026

0.3.5

Mar 21, 2026

0.3.4

Mar 21, 2026

0.3.3

Mar 21, 2026

0.3.2

Mar 21, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rapid_mlx-0.9.12.tar.gz (3.7 MB view details)

Uploaded Jul 3, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rapid_mlx-0.9.12-py3-none-any.whl (1.9 MB view details)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file rapid_mlx-0.9.12.tar.gz.

File metadata

Download URL: rapid_mlx-0.9.12.tar.gz
Upload date: Jul 3, 2026
Size: 3.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rapid_mlx-0.9.12.tar.gz
Algorithm	Hash digest
SHA256	`310ef026538217df792c30f6638a4665a58f45d274e2d13b64174a5fa9ae5c80`
MD5	`0e8dc1fcbfe289f4f69c6ea10d91a4d7`
BLAKE2b-256	`f72b2dcc8e17a0fb5625e4abd7cf110b536475a0b792d649f5ef8026d08f85d6`

See more details on using hashes here.

Provenance

The following attestation bundles were made for rapid_mlx-0.9.12.tar.gz:

Publisher: publish.yml on raullenchai/Rapid-MLX

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rapid_mlx-0.9.12.tar.gz
- Subject digest: 310ef026538217df792c30f6638a4665a58f45d274e2d13b64174a5fa9ae5c80
- Sigstore transparency entry: 2055605916
- Sigstore integration time: Jul 3, 2026
Source repository:
- Permalink: raullenchai/Rapid-MLX@82ab6d2453a574b50201ff64c15cd602d96b0e20
- Branch / Tag: refs/tags/v0.9.12
- Owner: https://github.com/raullenchai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@82ab6d2453a574b50201ff64c15cd602d96b0e20
- Trigger Event: release

File details

Details for the file rapid_mlx-0.9.12-py3-none-any.whl.

File metadata

Download URL: rapid_mlx-0.9.12-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 1.9 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rapid_mlx-0.9.12-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9fe35760ea5c465920c90288700b514fedce602777f95dc3dc04d36ceec104b1`
MD5	`431860ff2ff2a78f7260f0a598ad2bdd`
BLAKE2b-256	`c4cecbf7179589e857e4653c7f8267dd8bb83c4233e3b559e7f064573fce9c5f`

See more details on using hashes here.

Provenance

The following attestation bundles were made for rapid_mlx-0.9.12-py3-none-any.whl:

Publisher: publish.yml on raullenchai/Rapid-MLX

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rapid_mlx-0.9.12-py3-none-any.whl
- Subject digest: 9fe35760ea5c465920c90288700b514fedce602777f95dc3dc04d36ceec104b1
- Sigstore transparency entry: 2055606349
- Sigstore integration time: Jul 3, 2026
Source repository:
- Permalink: raullenchai/Rapid-MLX@82ab6d2453a574b50201ff64c15cd602d96b0e20
- Branch / Tag: refs/tags/v0.9.12
- Owner: https://github.com/raullenchai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@82ab6d2453a574b50201ff64c15cd602d96b0e20
- Trigger Event: release

rapid-mlx 0.9.12

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Rapid-MLX

What runs on your Mac

Quick Start

Why Rapid-MLX

Use Cases

1. Chat in the terminal

2. OpenAI-compatible server for your apps

3. Agent backends: Codex CLI, Claude Code, Aider, OpenCode

4. Benchmark your hardware and contribute to the community

Bonus: share a public URL

Connect Your Client

Tested integrations

Quick setup snippets

Model-Harness Index (MHI)

Command Reference

Choose Your Model

What fits my Mac?

Alias naming convention

Full model lineup

Copy-paste serve commands

Benchmarks

Features

Optional Extras

Troubleshooting

Shell completion

Telemetry

What we collect (only if you opt in)

What we never collect

Manage it

Force-disable in scripts / CI

Development

Roadmap

Contributing

Contributors

Star History

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance