
Local AI inference for Apple Silicon — Text, Image, Video & Audio generation on Mac

Project description

vMLX

Native macOS AI inference — local models, remote endpoints, zero config

Website · Panel Changelog · Engine Changelog · Documentation


What is vMLX?

vMLX is a native macOS application for running AI models on Apple Silicon. It bundles a custom inference engine with a full-featured desktop interface — manage sessions, chat with models, download from HuggingFace, connect to remote APIs, and use agentic tool-calling workflows.

  • Local inference with GPU acceleration via MLX
  • Remote endpoints — connect to any OpenAI-compatible API
  • HuggingFace downloader — search, download, and serve models in-app
  • Built-in tools — file I/O, shell, search, image reading, ask_user interrupt
  • MCP integration — Model Context Protocol tool servers (local sessions)

Key Features

Inference Engine (v0.2.18)

Feature | Description
------- | -----------
Paged KV Cache | Memory-efficient caching with prefix sharing and block-level reuse
KV Cache Quantization | Q4/Q8 quantized cache storage (2–4× memory savings)
Prefix Cache | Token-level prefix matching for fast prompt reuse across requests
Continuous Batching | Concurrent request handling with slot management
VLM Caching | Full KV cache pipeline for vision-language models (Qwen-VL, Gemma 3, etc.)
Mamba Hybrid Support | Auto-detects mixed KVCache + MambaCache models (Qwen3.5-VL, Qwen3-Coder-Next, Nemotron)
Streaming Detokenizer | Per-request UTF-8 buffering so emoji, CJK, and Arabic render correctly
Request Cancellation | Stop inference mid-stream via API or connection close
OpenAI-Compatible API | Chat Completions + Responses API with full streaming support
Speculative Decoding | Draft model acceleration (20–90% speedup, zero quality loss)
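
To see the prefix cache in action, here is a minimal client-side sketch (assuming an engine already serving on localhost:8000 with the model alias "default", as in the Quick Start below). Two requests share a long system prompt; the second should show a much lower time-to-first-token because the shared prefix is reused from cache:

# Sketch: measure TTFT for two requests sharing a long prompt prefix.
# Assumes the server from the Quick Start; "default" is the served model alias.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
system = "You are a helpful assistant. " * 200  # long shared prefix

def ttft(question: str) -> float:
    start = time.monotonic()
    stream = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        stream=True,
    )
    for _ in stream:  # the first streamed chunk marks time-to-first-token
        break
    return time.monotonic() - start

print(f"cold TTFT: {ttft('What is MLX?'):.2f}s")
print(f"warm TTFT: {ttft('What is Metal?'):.2f}s")  # prefix now served from cache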

Desktop App (Panel v1.2.1)

Feature | Description
------- | -----------
Multi-session | Run multiple models simultaneously on different ports
Remote endpoints | Connect to OpenAI, Groq, local vLLM, or any compatible API
HuggingFace browser | Search, download, and install MLX models with progress tracking
Agentic tools | File I/O, shell, search, image reading with auto-continue loops (up to 10 iterations)
Per-chat settings | Temperature, Top P/K, Min P, Repeat Penalty, Stop Sequences, Max Tokens
Reasoning display | Collapsible thinking sections for Qwen3, DeepSeek-R1, GLM-4.7
Tool parsers | hermes, pythonic, llama3, mistral, minimax, qwen3, nemotron, step3p5, and more
Auto-detection | Reads model config JSON for automatic parser and cache type selection
Persistent history | SQLite-backed chat history with metrics, tool calls, and reasoning content
Live metrics | TTFT, tokens/sec, prompt processing speed, prefix cache hits

Quick Start

Desktop App (recommended)

# Clone and build
git clone https://github.com/vmlxllm/vmlx.git
cd vmlx/panel

# Install dependencies
npm install

# Development mode
npm run dev

# Build and install to /Applications
bash scripts/build-and-install.sh

Engine Only (CLI)

# Install
uv tool install git+https://github.com/vmlxllm/vmlx.git
# or
pip install git+https://github.com/vmlxllm/vmlx.git

# Start server
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# With continuous batching
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-key

# With speculative decoding (20-90% faster)
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 \
  --speculative-model mlx-community/Llama-3.2-1B-Instruct-4bit \
  --num-draft-tokens 3

Use with OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
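
Streaming uses the standard SDK mechanism (stream=True); deltas print as they are generated:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()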

API Endpoints

Endpoint | Description
-------- | -----------
POST /v1/chat/completions | Chat Completions API (streaming)
POST /v1/responses | Responses API (streaming)
GET /v1/models | List loaded models
GET /health | Server health + model info
POST /v1/mcp/execute | Execute MCP tool
GET /v1/cache/stats | Prefix cache statistics
POST /v1/cache/warm | Pre-warm cache with prompt
DELETE /v1/cache | Clear prefix cache
POST /v1/chat/completions/{id}/cancel | Cancel inference (save GPU)
POST /v1/embeddings | Text embeddings (mlx-embeddings)
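
A quick smoke test of the management endpoints with requests (a sketch: the endpoints are the ones listed above, but the cache-warm request body is an assumption; check the Documentation for exact schemas). Embeddings go through the standard OpenAI SDK:

import requests
from openai import OpenAI

BASE = "http://localhost:8000"

# Read-only endpoints: print whatever the server reports.
print(requests.get(f"{BASE}/health").json())          # server health + model info
print(requests.get(f"{BASE}/v1/models").json())       # loaded models
print(requests.get(f"{BASE}/v1/cache/stats").json())  # prefix cache statistics

# Pre-warm the prefix cache, then clear it.
# NOTE: the JSON body here is illustrative, not a documented schema.
requests.post(f"{BASE}/v1/cache/warm", json={"prompt": "You are a helpful assistant."})
requests.delete(f"{BASE}/v1/cache")

# Text embeddings via the standard OpenAI SDK.
client = OpenAI(base_url=f"{BASE}/v1", api_key="not-needed")
embedding = client.embeddings.create(model="default", input="Hello, vMLX!")
print(len(embedding.data[0].embedding), "dimensions")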

Reasoning Models

Extract thinking process from reasoning-capable models:

vmlx-engine serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

Parser | Models | Format
------ | ------ | ------
qwen3 | Qwen3, QwQ, MiniMax M2/M2.5, StepFun | <think> / </think> tags
deepseek_r1 | DeepSeek-R1, Gemma 3, Phi-4 Reasoning, GLM-4.7, GLM-Z1 | Lenient <think> (handles missing open tag)
openai_gptoss | GLM-4.7 Flash, GPT-OSS | Harmony <|channel|>analysis/final protocol
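
On the client side, extracted reasoning typically arrives alongside the normal message content. A sketch, assuming the reasoning_content field convention used by many OpenAI-compatible servers (the exact field name for vMLX may differ; see the Documentation):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Is 97 prime? Think it through."}],
)
message = response.choices[0].message
# "reasoning_content" is an assumption: it is the field name many
# OpenAI-compatible servers use for extracted thinking.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)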

Tool Calling

Built-in agentic tools available in the desktop app:

Category | Tools
-------- | -----
File | read_file, write_file, edit_file, patch_file, batch_edit, copy, move, delete, create_directory, list_directory, read_image
Search | search_files, find_files, file_info, get_diagnostics, get_tree, diff_files
Shell | run_command, spawn_process, get_process_output
Web | fetchUrl, brave_search
Utility | ask_user (interactive interrupt)

Plus MCP tool server passthrough for local sessions.
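
Because the engine speaks the OpenAI Chat Completions API, tool calling from your own code uses the standard tools parameter, with the configured tool parser turning model output into structured tool_calls. A minimal sketch (the get_weather tool is illustrative, not one of the built-ins above):

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Illustrative tool definition in the standard OpenAI function schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))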


Architecture

┌─────────────────────────────────────────────────────────┐
│                    vMLX Desktop App                     │
│             (Electron + React + TypeScript)             │
└─────────────────────────────────────────────────────────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
┌──────────────────────┐  ┌──────────────────────┐
│  Local vmlx-engine   │  │   Remote Endpoints   │
│  (spawned process)   │  │ (OpenAI, Groq, etc.) │
└──────────────────────┘  └──────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│                       vMLX Engine                       │
│           (FastAPI + MLX inference + caching)           │
└─────────────────────────────────────────────────────────┘
              │
    ┌─────────┼──────────┬──────────┐
    ▼         ▼          ▼          ▼
┌────────┐┌────────┐┌────────┐┌────────────┐
│ mlx-lm ││mlx-vlm ││mlx-aud ││mlx-embed   │
│ (LLMs) ││(Vision)││(Audio) ││(Embeddings)│
└────────┘└────────┘└────────┘└────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│                        Apple MLX                        │
│             (Metal GPU + Unified Memory)                │
└─────────────────────────────────────────────────────────┘

Tech Stack

Layer | Technology
----- | ----------
Desktop app | Electron 28 + React 18 + TypeScript
Styling | Tailwind CSS
Database | SQLite (WAL mode, better-sqlite3)
Inference engine | vMLX Engine v0.2.18 (Python, FastAPI)
ML framework | Apple MLX (Metal GPU acceleration)
Build | electron-vite + electron-builder
Tests | Vitest (panel: 542 tests), pytest (engine: 1595 tests)
Python | Bundled relocatable Python 3.12

Recent Changes

Panel v1.2.1 / Engine v0.2.18 (2026-03-09)

  • Tool calling fix: enableAutoToolChoice default changed from false to undefined (auto-detect), so MCP and built-in tools now work out of the box without manually enabling them
  • MCP tool result truncation: MCP tool results now capped at same limit as built-in tools (50KB default) to prevent context overflow
  • Command preview parity: buildCommandPreview in SessionSettings now matches actual buildArgs logic for auto-tool-choice flags
  • Old config migration: Stored sessions with enableAutoToolChoice: false auto-migrate to undefined on load
  • 2137 total tests: 1595 engine + 542 panel (12 new regression tests for tool calling and MCP)

Panel v1.2.0 / Engine v0.2.18 (2026-03-09)

  • HuggingFace download fix: Download progress is no longer stuck at 0% (handles tqdm \r chunk splitting, strips ANSI escape codes, and extracts the highest percentage seen)
  • HF browser NaN/Unknown fix: Model ages and authors display correctly (uses createdAt fallback, extracts author from modelId)
  • macOS 15 launch fix: minimumSystemVersion corrected from 26.0.0 to 14.0.0 (fixes GitHub #10)
  • Deep stability audit: 14 fixes across paged cache block lifecycle, KV dequantize safety, reasoning marker detection, tool fallback, Mistral JSON validation
  • CancelledError SSE hang: Engine cancellation now unblocks all waiting SSE consumers
  • 2125 total tests: 1595 engine + 530 panel with full regression coverage

Panel v1.1.4 / Engine v0.2.12 (2026-03-07)

  • tool_choice="none" fix: Content is no longer swallowed when tool markers are detected while tools are suppressed
  • suppress_reasoning: Reasoning leaks plugged in both API paths
  • First-launch UX: Auto-creates initial chat, dynamic About page version
  • 1571 engine tests, 530 panel tests across 6 vitest suites

See Panel Changelog and Engine Changelog for full history.


Current Version

Engine v0.2.18 / Panel v1.2.1. Apple Silicon (M1, M2, M3, M4); macOS 26+ (Tahoe) for local inference, macOS 14+ for remote endpoints.

License

Apache 2.0 — see LICENSE for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vmlx-1.0.0.tar.gz (572.2 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vmlx-1.0.0-py3-none-any.whl (404.5 kB)


File details

Details for the file vmlx-1.0.0.tar.gz.

File metadata

  • Download URL: vmlx-1.0.0.tar.gz
  • Upload date:
  • Size: 572.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for vmlx-1.0.0.tar.gz
Algorithm | Hash digest
--------- | -----------
SHA256 | cfb71b7b310a8f78842aad51044111da3bdf142a524048d4633da74941504dc4
MD5 | 3d5dca0ef7b6da820f8d951bdd5513ee
BLAKE2b-256 | aa29bbd1313278786673fd31db6b758ff8388042762e8fcce63cf931ebeec103

See more details on using hashes here.

File details

Details for the file vmlx-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: vmlx-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 404.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for vmlx-1.0.0-py3-none-any.whl
Algorithm | Hash digest
--------- | -----------
SHA256 | b4c0a39f4b849dce40a3ce8ea4973f4fe50f5c14c826d8902b671524a0f24158
MD5 | 56d9470e00896e46ac952cd704bc96b3
BLAKE2b-256 | 82bb249e9eaa5f15c6a0018daf90c24c6b1fc37bb0b537a0b2748a5a72513feb

See more details on using hashes here.
