
Local AI inference for Apple Silicon — Text, Image, Video & Audio generation on Mac

Project description

vMLX

Native macOS AI inference — local models, remote endpoints, zero config

Website · Panel Changelog · Engine Changelog · Documentation


What is vMLX?

vMLX is a native macOS application for running AI models on Apple Silicon. It bundles a custom inference engine with a full-featured desktop interface — manage sessions, chat with models, download from HuggingFace, connect to remote APIs, and use agentic tool-calling workflows.

  • Local inference with GPU acceleration via MLX
  • Remote endpoints — connect to any OpenAI-compatible API
  • HuggingFace downloader — search, download, and serve models in-app
  • Built-in tools — file I/O, shell, search, image reading, ask_user interrupt
  • MCP integration — Model Context Protocol tool servers (local sessions)

Key Features

Inference Engine (v0.2.18)

Feature | Description
------- | -----------
Paged KV Cache | Memory-efficient caching with prefix sharing and block-level reuse
KV Cache Quantization | Q4/Q8 quantized cache storage (2–4× memory savings)
Prefix Cache | Token-level prefix matching for fast prompt reuse across requests
Continuous Batching | Concurrent request handling with slot management
VLM Caching | Full KV cache pipeline for vision-language models (Qwen-VL, Gemma 3, etc.)
Mamba Hybrid Support | Auto-detects mixed KVCache + MambaCache models (Qwen3.5-VL, Qwen3-Coder-Next, Nemotron)
Streaming Detokenizer | Per-request UTF-8 buffering so emoji, CJK, and Arabic render correctly
Request Cancellation | Stop inference mid-stream via API or connection close
OpenAI-Compatible API | Chat Completions + Responses API with full streaming support
Speculative Decoding | Draft model acceleration (20–90% speedup, zero quality loss)
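
To see the prefix cache in action, here is a minimal client-side sketch (assuming an engine already serving on localhost:8000 with the model alias "default", as in the Quick Start below). Two requests share a long system prompt; the second should show a much lower time-to-first-token because the shared prefix is reused from cache:

# Sketch: measure TTFT for two requests sharing a long prompt prefix.
# Assumes the server from the Quick Start; "default" is the served model alias.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
system = "You are a helpful assistant. " * 200  # long shared prefix

def ttft(question: str) -> float:
    start = time.monotonic()
    stream = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        stream=True,
    )
    for _ in stream:  # the first streamed chunk marks time-to-first-token
        break
    return time.monotonic() - start

print(f"cold TTFT: {ttft('What is MLX?'):.2f}s")
print(f"warm TTFT: {ttft('What is Metal?'):.2f}s")  # prefix now served from cache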

Desktop App (Panel v1.2.1)

Feature | Description
------- | -----------
Multi-session | Run multiple models simultaneously on different ports
Remote endpoints | Connect to OpenAI, Groq, local vLLM, or any compatible API
HuggingFace browser | Search, download, and install MLX models with progress tracking
Agentic tools | File I/O, shell, search, image reading with auto-continue loops (up to 10 iterations)
Per-chat settings | Temperature, Top P/K, Min P, Repeat Penalty, Stop Sequences, Max Tokens
Reasoning display | Collapsible thinking sections for Qwen3, DeepSeek-R1, GLM-4.7
Tool parsers | hermes, pythonic, llama3, mistral, minimax, qwen3, nemotron, step3p5, and more
Auto-detection | Reads model config JSON for automatic parser and cache type selection
Persistent history | SQLite-backed chat history with metrics, tool calls, and reasoning content
Live metrics | TTFT, tokens/sec, prompt processing speed, prefix cache hits

Quick Start

Desktop App (recommended)

# Clone and build
git clone https://github.com/vmlxllm/vmlx.git
cd vmlx/panel

# Install dependencies
npm install

# Development mode
npm run dev

# Build and install to /Applications
bash scripts/build-and-install.sh

Engine Only (CLI)

# Install
uv tool install git+https://github.com/vmlxllm/vmlx.git
# or
pip install git+https://github.com/vmlxllm/vmlx.git

# Start server
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# With continuous batching
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-key

# With speculative decoding (20-90% faster)
vmlx-engine serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 \
  --speculative-model mlx-community/Llama-3.2-1B-Instruct-4bit \
  --num-draft-tokens 3

Use with OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
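
Streaming uses the standard SDK mechanism (stream=True); deltas print as they are generated:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()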

API Endpoints

Endpoint | Description
-------- | -----------
POST /v1/chat/completions | Chat Completions API (streaming)
POST /v1/responses | Responses API (streaming)
GET /v1/models | List loaded models
GET /health | Server health + model info
POST /v1/mcp/execute | Execute MCP tool
GET /v1/cache/stats | Prefix cache statistics
POST /v1/cache/warm | Pre-warm cache with prompt
DELETE /v1/cache | Clear prefix cache
POST /v1/chat/completions/{id}/cancel | Cancel inference (save GPU)
POST /v1/embeddings | Text embeddings (mlx-embeddings)
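
A quick smoke test of the management endpoints with requests (a sketch: the endpoints are the ones listed above, but the cache-warm request body is an assumption; check the Documentation for exact schemas). Embeddings go through the standard OpenAI SDK:

import requests
from openai import OpenAI

BASE = "http://localhost:8000"

# Read-only endpoints: print whatever the server reports.
print(requests.get(f"{BASE}/health").json())          # server health + model info
print(requests.get(f"{BASE}/v1/models").json())       # loaded models
print(requests.get(f"{BASE}/v1/cache/stats").json())  # prefix cache statistics

# Pre-warm the prefix cache, then clear it.
# NOTE: the JSON body here is illustrative, not a documented schema.
requests.post(f"{BASE}/v1/cache/warm", json={"prompt": "You are a helpful assistant."})
requests.delete(f"{BASE}/v1/cache")

# Text embeddings via the standard OpenAI SDK.
client = OpenAI(base_url=f"{BASE}/v1", api_key="not-needed")
embedding = client.embeddings.create(model="default", input="Hello, vMLX!")
print(len(embedding.data[0].embedding), "dimensions")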

Reasoning Models

Extract thinking process from reasoning-capable models:

vmlx-engine serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

Parser | Models | Format
------ | ------ | ------
qwen3 | Qwen3, QwQ, MiniMax M2/M2.5, StepFun | <think> / </think> tags
deepseek_r1 | DeepSeek-R1, Gemma 3, Phi-4 Reasoning, GLM-4.7, GLM-Z1 | Lenient <think> (handles missing open tag)
openai_gptoss | GLM-4.7 Flash, GPT-OSS | Harmony <|channel|>analysis/final protocol
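
On the client side, extracted reasoning typically arrives alongside the normal message content. A sketch, assuming the reasoning_content field convention used by many OpenAI-compatible servers (the exact field name for vMLX may differ; see the Documentation):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Is 97 prime? Think it through."}],
)
message = response.choices[0].message
# "reasoning_content" is an assumption: it is the field name many
# OpenAI-compatible servers use for extracted thinking.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)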

Tool Calling

Built-in agentic tools available in the desktop app:

Category | Tools
-------- | -----
File | read_file, write_file, edit_file, patch_file, batch_edit, copy, move, delete, create_directory, list_directory, read_image
Search | search_files, find_files, file_info, get_diagnostics, get_tree, diff_files
Shell | run_command, spawn_process, get_process_output
Web | fetchUrl, brave_search
Utility | ask_user (interactive interrupt)

Plus MCP tool server passthrough for local sessions.
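
Because the engine speaks the OpenAI Chat Completions API, tool calling from your own code uses the standard tools parameter, with the configured tool parser turning model output into structured tool_calls. A minimal sketch (the get_weather tool is illustrative, not one of the built-ins above):

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Illustrative tool definition in the standard OpenAI function schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))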


Architecture

┌─────────────────────────────────────────────────────────┐
│                    vMLX Desktop App                     │
│             (Electron + React + TypeScript)             │
└─────────────────────────────────────────────────────────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
┌──────────────────────┐  ┌──────────────────────┐
│  Local vmlx-engine   │  │   Remote Endpoints   │
│  (spawned process)   │  │ (OpenAI, Groq, etc.) │
└──────────────────────┘  └──────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│                       vMLX Engine                       │
│           (FastAPI + MLX inference + caching)           │
└─────────────────────────────────────────────────────────┘
              │
    ┌─────────┼──────────┬──────────┐
    ▼         ▼          ▼          ▼
┌────────┐┌────────┐┌────────┐┌────────────┐
│ mlx-lm ││mlx-vlm ││mlx-aud ││mlx-embed   │
│ (LLMs) ││(Vision)││(Audio) ││(Embeddings)│
└────────┘└────────┘└────────┘└────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│                        Apple MLX                        │
│             (Metal GPU + Unified Memory)                │
└─────────────────────────────────────────────────────────┘

Tech Stack

Layer | Technology
----- | ----------
Desktop app | Electron 28 + React 18 + TypeScript
Styling | Tailwind CSS
Database | SQLite (WAL mode, better-sqlite3)
Inference engine | vMLX Engine v0.2.18 (Python, FastAPI)
ML framework | Apple MLX (Metal GPU acceleration)
Build | electron-vite + electron-builder
Tests | Vitest (panel: 542 tests), pytest (engine: 1595 tests)
Python | Bundled relocatable Python 3.12

Recent Changes

Panel v1.2.1 / Engine v0.2.18 (2026-03-09)

  • Tool calling fix: enableAutoToolChoice default changed from false to undefined (auto-detect), so MCP and built-in tools now work out of the box without manually enabling them
  • MCP tool result truncation: MCP tool results now capped at same limit as built-in tools (50KB default) to prevent context overflow
  • Command preview parity: buildCommandPreview in SessionSettings now matches actual buildArgs logic for auto-tool-choice flags
  • Old config migration: Stored sessions with enableAutoToolChoice: false auto-migrate to undefined on load
  • 2137 total tests: 1595 engine + 542 panel (12 new regression tests for tool calling and MCP)

Panel v1.2.0 / Engine v0.2.18 (2026-03-09)

  • HuggingFace download fix: Download progress is no longer stuck at 0% (handles tqdm \r chunk splitting, strips ANSI escape codes, and extracts the highest percentage seen)
  • HF browser NaN/Unknown fix: Model ages and authors display correctly (uses createdAt fallback, extracts author from modelId)
  • macOS 15 launch fix: minimumSystemVersion corrected from 26.0.0 to 14.0.0 (fixes GitHub #10)
  • Deep stability audit: 14 fixes across paged cache block lifecycle, KV dequantize safety, reasoning marker detection, tool fallback, Mistral JSON validation
  • CancelledError SSE hang: Engine cancellation now unblocks all waiting SSE consumers
  • 2125 total tests: 1595 engine + 530 panel with full regression coverage

Panel v1.1.4 / Engine v0.2.12 (2026-03-07)

  • tool_choice="none" fix: Content is no longer swallowed when tool markers are detected while tools are suppressed
  • suppress_reasoning: Reasoning leaks plugged in both API paths
  • First-launch UX: Auto-creates initial chat, dynamic About page version
  • 1571 engine tests, 530 panel tests across 6 vitest suites

See Panel Changelog and Engine Changelog for full history.


Current Version

Engine v0.2.18 / Panel v1.2.1. Apple Silicon (M1, M2, M3, M4); macOS 26+ (Tahoe) for local inference, macOS 14+ for remote endpoints.

License

Apache 2.0 — see LICENSE for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vmlx-1.0.0.tar.gz (572.2 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vmlx-1.0.0-py3-none-any.whl (404.5 kB)


File details

Details for the file vmlx-1.0.0.tar.gz.

File metadata

  • Download URL: vmlx-1.0.0.tar.gz
  • Upload date:
  • Size: 572.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for vmlx-1.0.0.tar.gz
Algorithm | Hash digest
--------- | -----------
SHA256 | cfb71b7b310a8f78842aad51044111da3bdf142a524048d4633da74941504dc4
MD5 | 3d5dca0ef7b6da820f8d951bdd5513ee
BLAKE2b-256 | aa29bbd1313278786673fd31db6b758ff8388042762e8fcce63cf931ebeec103

See more details on using hashes here.

File details

Details for the file vmlx-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: vmlx-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 404.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for vmlx-1.0.0-py3-none-any.whl
Algorithm | Hash digest
--------- | -----------
SHA256 | b4c0a39f4b849dce40a3ce8ea4973f4fe50f5c14c826d8902b671524a0f24158
MD5 | 56d9470e00896e46ac952cd704bc96b3
BLAKE2b-256 | 82bb249e9eaa5f15c6a0018daf90c24c6b1fc37bb0b537a0b2748a5a72513feb

See more details on using hashes here.
