Local AI inference for Apple Silicon — Text, Image, Video & Audio generation on Mac
Native macOS AI inference — local models, remote endpoints, zero config
Website · Panel Changelog · Engine Changelog · Documentation
What is vMLX?
vMLX is a native macOS application for running AI models on Apple Silicon. It bundles a custom inference engine with a full-featured desktop interface — manage sessions, chat with models, download from HuggingFace, connect to remote APIs, and use agentic tool-calling workflows.
- Local inference with GPU acceleration via MLX
- Remote endpoints — connect to any OpenAI-compatible API
- HuggingFace downloader — search, download, and serve models in-app
- Built-in tools — file I/O, shell, search, image reading, ask_user interrupt
- MCP integration — Model Context Protocol tool servers (local sessions)
Key Features
Inference Engine (v0.2.18)
| Feature | Description |
|---|---|
| Paged KV Cache | Memory-efficient caching with prefix sharing and block-level reuse |
| KV Cache Quantization | Q4/Q8 quantized cache storage (2–4× memory savings) |
| Prefix Cache | Token-level prefix matching for fast prompt reuse across requests |
| Continuous Batching | Concurrent request handling with slot management |
| VLM Caching | Full KV cache pipeline for vision-language models (Qwen-VL, Gemma 3, etc.) |
| Mamba Hybrid Support | Auto-detects mixed KVCache + MambaCache models (Qwen3.5-VL, Qwen3-Coder-Next, Nemotron) |
| Streaming Detokenizer | Per-request UTF-8 buffering — emoji, CJK, Arabic render correctly |
| Request Cancellation | Stop inference mid-stream via API or connection close |
| OpenAI-Compatible API | Chat Completions + Responses API with full streaming support |
| Speculative Decoding | Draft model acceleration (20-90% speedup, zero quality loss) |
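Most of these engine features are transparent to API clients. As a rough sketch of what the prefix cache buys you, the snippet below (assuming a server already running on port 8000, as in the Quick Start) sends two requests that share a long system prompt; the second request should see a noticeably lower time-to-first-token because the shared prefix is served from cache.

```python
# Illustration: prefix caching in practice. Two requests that share a long
# system prompt should reuse cached KV state for the shared prefix, so the
# second request starts generating sooner. Timings here are illustrative.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
system = "You are a meticulous assistant. " * 100  # long shared prefix

for question in ["What is MLX?", "What is a KV cache?"]:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    answer = (response.choices[0].message.content or "")[:60]
    print(f"{time.perf_counter() - start:.2f}s  {answer}")
```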
Desktop App (Panel v1.2.1)
| Feature | Description |
|---|---|
| Multi-session | Run multiple models simultaneously on different ports |
| Remote endpoints | Connect to OpenAI, Groq, local vLLM, or any compatible API |
| HuggingFace browser | Search, download, and install MLX models with progress tracking |
| Agentic tools | File I/O, shell, search, image reading with auto-continue loops (up to 10 iterations) |
| Per-chat settings | Temperature, Top P/K, Min P, Repeat Penalty, Stop Sequences, Max Tokens |
| Reasoning display | Collapsible thinking sections for Qwen3, DeepSeek-R1, GLM-4.7 |
| Tool parsers | hermes, pythonic, llama3, mistral, minimax, qwen3, nemotron, step3p5, and more |
| Auto-detection | Reads model config JSON for automatic parser and cache type selection |
| Persistent history | SQLite-backed chat history with metrics, tool calls, and reasoning content |
| Live metrics | TTFT, tokens/sec, prompt processing speed, prefix cache hits |
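For reference, the per-chat sampling settings map onto ordinary request parameters when you talk to the engine directly. In the sketch below, `temperature`, `top_p`, `max_tokens`, and `stop` are standard OpenAI fields; the `top_k`, `min_p`, and `repetition_penalty` names passed via `extra_body` are assumptions about the engine's accepted extras, not confirmed parameter names.

```python
# Sketch: the per-chat sampling settings expressed as API parameters.
# The extra_body field names below are assumed, not documented here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about unified memory."}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
    stop=["\n\n"],
    extra_body={"top_k": 40, "min_p": 0.05, "repetition_penalty": 1.1},
)
print(response.choices[0].message.content)
```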
Quick Start
Desktop App (recommended)
```bash
# Clone and build
git clone https://github.com/vmlxllm/vmlx.git
cd vmlx/panel

# Install dependencies
npm install

# Development mode
npm run dev

# Build and install to /Applications
bash scripts/build-and-install.sh
```
Engine Only (CLI)
```bash
# Install
uv tool install git+https://github.com/vmlxllm/vmlx.git
# or
pip install git+https://github.com/vmlxllm/vmlx.git

# Start server
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# With continuous batching
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-key

# With speculative decoding (20-90% faster)
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 \
  --speculative-model mlx-community/Llama-3.2-1B-Instruct-4bit \
  --num-draft-tokens 3
```
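Once the server is up, a quick way to confirm the model loaded is the documented `/health` endpoint. The exact response schema isn't specified here, so this sketch just prints whatever JSON the server returns.

```python
# Sanity check: /health returns server status and model info.
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
resp.raise_for_status()
print(resp.json())
```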
Use with OpenAI SDK
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
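Streaming works the same way; continuing with the `client` from the snippet above, tokens can be printed as they arrive.

```python
# Streaming variant: deltas arrive as server-sent events and can be
# rendered live, which is how the desktop app displays output.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```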
API Endpoints
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat Completions API (streaming) |
| `POST /v1/responses` | Responses API (streaming) |
| `GET /v1/models` | List loaded models |
| `GET /health` | Server health + model info |
| `POST /v1/mcp/execute` | Execute MCP tool |
| `GET /v1/cache/stats` | Prefix cache statistics |
| `POST /v1/cache/warm` | Pre-warm cache with prompt |
| `DELETE /v1/cache` | Clear prefix cache |
| `POST /v1/chat/completions/{id}/cancel` | Cancel inference (save GPU) |
| `POST /v1/embeddings` | Text embeddings (mlx-embeddings) |
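The cache endpoints can be exercised with plain HTTP. In the sketch below, the GET and DELETE calls follow directly from the table; the `{"prompt": ...}` payload for `/v1/cache/warm` is an assumed request shape, not a documented schema.

```python
# Sketch of the cache management endpoints over plain HTTP.
import requests

BASE = "http://localhost:8000"

# Inspect prefix cache hit statistics.
stats = requests.get(f"{BASE}/v1/cache/stats", timeout=5).json()
print("cache stats:", stats)

# Pre-warm the cache with a system prompt you plan to reuse
# (payload shape is an assumption).
requests.post(
    f"{BASE}/v1/cache/warm",
    json={"prompt": "You are a helpful assistant."},
    timeout=30,
)

# Clear the prefix cache when switching workloads.
requests.delete(f"{BASE}/v1/cache", timeout=5)
```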
Reasoning Models
Extract thinking process from reasoning-capable models:
```bash
vmlx-engine serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
```
| Parser | Models | Format |
|---|---|---|
| `qwen3` | Qwen3, QwQ, MiniMax M2/M2.5, StepFun | `<think>` / `</think>` tags |
| `deepseek_r1` | DeepSeek-R1, Gemma 3, Phi-4 Reasoning, GLM-4.7, GLM-Z1 | Lenient `<think>` (handles missing open tag) |
| `openai_gptoss` | GLM-4.7 Flash, GPT-OSS | Harmony `<\|channel\|>analysis/final` protocol |
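On the client side, a common convention among OpenAI-compatible servers is to return the extracted thinking in a `reasoning_content` field alongside `content`; whether vMLX uses that exact field name is an assumption, so the sketch below falls back gracefully if it is absent.

```python
# Sketch: reading parsed reasoning from a response. The reasoning_content
# field name is an assumption borrowed from other OpenAI-compatible servers.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Is 127 prime? Think it through."}],
)
message = response.choices[0].message
reasoning = getattr(message, "reasoning_content", None)
if reasoning:
    print("thinking:", reasoning)
print("answer:", message.content)
```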
Tool Calling
Built-in agentic tools available in the desktop app:
| Category | Tools |
|---|---|
| File | read_file, write_file, edit_file, patch_file, batch_edit, copy, move, delete, create_directory, list_directory, read_image |
| Search | search_files, find_files, file_info, get_diagnostics, get_tree, diff_files |
| Shell | run_command, spawn_process, get_process_output |
| Web | fetchUrl, brave_search |
| Utility | ask_user (interactive interrupt) |
Plus MCP tool server passthrough for local sessions.
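The desktop app registers these tools automatically, but a raw API client declares its own tool schemas in the standard OpenAI format. The `get_weather` function below is a made-up example for illustration, not one of the built-ins.

```python
# Sketch of tool calling over the raw API with a hypothetical tool.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Cupertino?"}],
    tools=tools,
)
# Print any tool calls the model emitted (parsed by the server's tool parser).
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```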
Architecture
```
┌─────────────────────────────────────────────────────────┐
│                    vMLX Desktop App                     │
│             (Electron + React + TypeScript)             │
└─────────────────────────────────────────────────────────┘
                            │
               ┌────────────┴────────────┐
               ▼                         ▼
  ┌──────────────────────┐    ┌──────────────────────┐
  │  Local vmlx-engine   │    │   Remote Endpoints   │
  │  (spawned process)   │    │ (OpenAI, Groq, etc.) │
  └──────────────────────┘    └──────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────────┐
│                       vMLX Engine                       │
│            (FastAPI + MLX inference + caching)          │
└─────────────────────────────────────────────────────────┘
               │
     ┌─────────┼──────────┬──────────┐
     ▼         ▼          ▼          ▼
┌────────┐┌────────┐┌────────┐┌────────────┐
│ mlx-lm ││mlx-vlm ││mlx-aud ││ mlx-embed  │
│ (LLMs) ││(Vision)││(Audio) ││(Embeddings)│
└────────┘└────────┘└────────┘└────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────────┐
│                        Apple MLX                        │
│               (Metal GPU + Unified Memory)              │
└─────────────────────────────────────────────────────────┘
```
Tech Stack
| Layer | Technology |
|---|---|
| Desktop app | Electron 28 + React 18 + TypeScript |
| Styling | Tailwind CSS |
| Database | SQLite (WAL mode, better-sqlite3) |
| Inference engine | vMLX Engine v0.2.18 (Python, FastAPI) |
| ML framework | Apple MLX (Metal GPU acceleration) |
| Build | electron-vite + electron-builder |
| Tests | Vitest (panel: 542 tests), pytest (engine: 1595 tests) |
| Python | Bundled relocatable Python 3.12 |
Recent Changes
Panel v1.2.1 / Engine v0.2.18 (2026-03-09)
- Tool calling fix: `enableAutoToolChoice` default changed from `false` to `undefined` (auto-detect) — MCP and built-in tools now work out of the box without manual enablement
- MCP tool result truncation: MCP tool results are now capped at the same limit as built-in tools (50KB default) to prevent context overflow
- Command preview parity: `buildCommandPreview` in SessionSettings now matches the actual `buildArgs` logic for auto-tool-choice flags
- Old config migration: stored sessions with `enableAutoToolChoice: false` auto-migrate to `undefined` on load
- 2137 total tests: 1595 engine + 542 panel (12 new regression tests for tool calling and MCP)
Panel v1.2.0 / Engine v0.2.18 (2026-03-09)
- HuggingFace download fix: download progress no longer stuck at 0% — handles tqdm `\r` chunk splitting, ANSI stripping, and highest-percent extraction
- HF browser NaN/Unknown fix: model ages and authors display correctly (uses `createdAt` fallback, extracts author from `modelId`)
- macOS 15 launch fix: `minimumSystemVersion` corrected from 26.0.0 to 14.0.0 (fixes GitHub #10)
- Deep stability audit: 14 fixes across paged cache block lifecycle, KV dequantize safety, reasoning marker detection, tool fallback, and Mistral JSON validation
- CancelledError SSE hang: engine cancellation now unblocks all waiting SSE consumers
- 2125 total tests: 1595 engine + 530 panel with full regression coverage
Panel v1.1.4 / Engine v0.2.12 (2026-03-07)
- `tool_choice="none"` fix: content no longer swallowed when tool markers are detected while tools are suppressed
- `suppress_reasoning`: reasoning leaks plugged in both API paths
- First-launch UX: Auto-creates initial chat, dynamic About page version
- 1571 engine tests, 530 panel tests across 6 vitest suites
See Panel Changelog and Engine Changelog for full history.
Current Version
Engine v0.2.18 / Panel v1.2.1 — requires macOS 26+ (Tahoe) for local inference, macOS 14+ for remote endpoints, and Apple Silicon (M1, M2, M3, or M4).
Links
- Website: vmlx.net
- Contact: admin@vmlx.net
License
Apache 2.0 — see LICENSE for details.