# vMLX
The most complete MLX inference engine for Apple Silicon.
Run local LLMs, VLMs, and diffusion models with full GPU acceleration via MLX -- continuous batching, a 5-layer cache stack, 14 tool-call parsers, Anthropic + OpenAI API compatibility, vision/video/audio multimodal input, image generation, and JANG adaptive quantization.
```
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```
Desktop app: Download the full GUI experience from MLX Studio -- no terminal required.
## Stack layout (all-in-one, vendored)
vmlx now ships both the Swift and Python implementations under one roof. The Swift side controls the entire stack from Metal kernels up -- one `Package.swift`, one `.build`, one `swift test` run:
```
/Users/eric/vmlx/
├── swift/                  ← Swift stack — SwiftPM, 21 local targets, 225 tests
│   ├── Package.swift       ← 21 targets, 5 external deps only
│   ├── Sources/
│   │   │ ─── MLX runtime (merged from mlx-swift @ vmlx-0.31.3) ───
│   │   ├── Cmlx/           ← 23 MB mlx + mlx-c submodule w/ Metal kernels
│   │   ├── MLX/            ← core tensor API
│   │   ├── MLXNN/          ← nn.Module + layers
│   │   ├── MLXFast/        ← SDPA, layer norm, rope
│   │   ├── MLXFFT/ MLXLinalg/ MLXOptimizers/ MLXRandom/
│   │   │
│   │   │ ─── vMLX layer (our code) ───
│   │   ├── vMLXLMCommon/   ← cache, batch, FlashMoE, TurboQuant
│   │   ├── vMLXLLM/        ← ~50 LLM models
│   │   ├── vMLXVLM/        ← ~15 VLM models
│   │   ├── vMLXEmbedders/  ← embedding models
│   │   ├── vMLXFlux*/      ← image/video diffusion
│   │   ├── vMLXEngine/     ← Engine, Settings, Stream, Cache, MCP, FlashMoE
│   │   ├── vMLXServer/     ← Hummingbird routes
│   │   ├── vMLXApp/        ← SwiftUI 5-mode app
│   │   ├── vMLXTheme/
│   │   └── vMLXCLI/        ← `vmlxctl` binary
│   └── PROGRESS.md         ← full multi-session changelog
├── engine/vmlx_engine → /Users/eric/mlx/vllm-mlx/vmlx_engine (Python engine)
├── app/panel → /Users/eric/mlx/vllm-mlx/panel (Electron UI)
├── inference/              ← benchmarks + configs
├── docs/                   ← architecture docs
├── tests/                  ← cross-matrix regression tests
└── PROGRESS-2026-04-13.md  ← top-level multi-session summary
```
External Swift deps (5 only): swift-numerics, hummingbird,
swift-argument-parser, swift-transformers, Jinja. Everything
else — including the MLX runtime — is vendored in-tree.
Build the Swift stack:
```
cd /Users/eric/vmlx/swift
swift build   # ~1 min clean, 21 targets (8 MLX + 13 vMLX)
swift test    # 225 tests, ~15s
swift run vmlxctl serve --model /path/to/model
```
Build the Python stack:
```
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```
See `PROGRESS-2026-04-13.md` for the full state of the Swift rewrite, `swift/APP-SURFACE-AUDIT-2026-04-13.md` for the per-surface REAL/STUB/MISSING inventory, and `swift/SWIFT-ENGINE-ISSUES-AUDIT.md` for the GitHub issue cross-reference against the Swift engine.
## Features
### Model Support (65+ Families)
- Text LLMs -- Qwen 2/2.5/3/3.5, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, Gemma 2/3, Phi-3/4, DeepSeek V2/V3/R1, GLM-4/4.7, Nemotron, MiniMax, Kimi, Step, and any mlx-lm model
- Vision LLMs -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n, Phi-3-Vision
- Mixture-of-Experts -- Qwen 3.5 MoE, Mixtral, DeepSeek V2/V3, MiniMax M2.5, Llama 4
- Hybrid SSM -- Nemotron-H, Jamba, GatedDeltaNet (Mamba + Attention)
- Image Generation -- Flux Schnell/Dev/Kontext/Krea, Z-Image Turbo, Flux Klein (via mflux)
- Audio -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
- JANG -- Adaptive mixed-precision quantized models that stay quantized in GPU memory via native `QuantizedLinear`
### API Endpoints
OpenAI + Anthropic compatible -- point any SDK at your local server:
| Method | Path | Description |
|---|---|---|
| POST | `/v1/chat/completions` | OpenAI Chat Completions (streaming, tools, vision, structured output) |
| POST | `/v1/messages` | Anthropic Messages API -- drop-in Claude replacement |
| POST | `/v1/responses` | OpenAI Responses API (agentic format) |
| POST | `/v1/completions` | Text completions |
| POST | `/v1/images/generations` | Image generation (Flux/Z-Image, OpenAI format) |
| POST | `/v1/embeddings` | Text embeddings with dimension control |
| POST | `/v1/rerank` | Document reranking |
| POST | `/v1/audio/speech` | Text-to-speech (Kokoro) |
| POST | `/v1/audio/transcriptions` | Speech-to-text (Whisper) |
| GET | `/v1/models` | List loaded models |
| GET | `/health` | Server health, VRAM, queue length |
| GET | `/v1/cache/stats` | Cache hit rates and memory usage |
| POST | `/v1/cache/warm` | Pre-warm cache with prompts |
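Any OpenAI SDK can talk to these routes directly. A minimal sketch with the official `openai` Python package -- the `local` model alias is an assumption; use whatever name `/v1/models` reports:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vmlx server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="local",  # assumption: substitute the name listed by /v1/models
    messages=[{"role": "user", "content": "Summarize MLX in one sentence."}],
)
print(response.choices[0].message.content)
```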
### Anthropic API Compatibility
Use the Anthropic Python/TypeScript SDK -- just change `base_url`:
```python
from anthropic import Anthropic

# The SDK appends /v1/messages itself, so point base_url at the server root.
client = Anthropic(base_url="http://localhost:8000", api_key="none")
response = client.messages.create(
    model="local",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
```
- Full `/v1/messages` endpoint with streaming (see the sketch below)
- Anthropic tool calling format (auto-translated)
- Vision/multimodal via Anthropic content blocks
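Streaming should go through the SDK's standard helper as well; a minimal sketch that relies only on the documented `/v1/messages` streaming support:

```python
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="none")

# Stream tokens via the Anthropic SDK's standard streaming helper.
with client.messages.stream(
    model="local",
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a haiku about Metal."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```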
### Tool Calling (14 Parsers)
Auto-detected from model config -- no manual setup:
| Parser | Models |
|---|---|
| `qwen` | Qwen3, Qwen2.5, QwQ |
| `llama3` | Llama 3/3.1/3.2/3.3/4 |
| `mistral` | Mistral, Mixtral, Codestral |
| `hermes` | Hermes, NousResearch |
| `deepseek` | DeepSeek V2/V3 |
| `glm47` | GLM-4.7, ChatGLM4 |
| `minimax` | MiniMax M2.5 |
| `nemotron` | Nemotron, Llama-Nemotron |
| `granite` | IBM Granite |
| `functionary` | Functionary v3 |
| `xlam` | Salesforce xLAM |
| `kimi` | Moonshot Kimi |
| `step3p5` | StepFun Step-3.5 |
| `auto` | Auto-detect from config.json |
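Because parsing happens server-side, clients just send standard OpenAI-format tools and get back standard `tool_calls` objects. A minimal sketch -- the `get_weather` function and `local` model name are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "What's the weather in Seoul?"}],
    tools=tools,
)
# The active parser converts the model's native tool syntax into
# standard OpenAI tool_calls objects.
print(response.choices[0].message.tool_calls)
```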
### Reasoning Models (4 Parsers)
- Qwen3 / Qwen3.5 -- `<think>...</think>` blocks
- DeepSeek-R1 -- DeepSeek reasoning format
- GPT-OSS / GLM-4.7 -- thinking format
- Phi-4-reasoning -- reasoning content
- Enable/disable per request, reasoning effort control (low/medium/high) -- see the sketch below
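A sketch of per-request control. The field names here are assumptions: `reasoning_effort` mirrors the OpenAI parameter of the same name, and `enable_thinking` is modeled on common MLX/vLLM servers -- check vmlx's docs for the exact request schema.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    # Both fields below are assumptions about vmlx's request schema:
    extra_body={
        "reasoning_effort": "high",   # low / medium / high, per the list above
        "enable_thinking": True,      # per-request thinking toggle
    },
)
print(response.choices[0].message.content)
```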
### Vision & Multimodal
- Images -- PNG, JPEG, WebP via base64 or URL (up to 50 MB), detail levels (auto/low/high)
- Video -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames)
- Audio -- base64 or URL input (Qwen3-Audio)
- Dedicated MLLM cache for image/video embeddings
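Images ride in standard OpenAI content blocks. A minimal sketch sending a local PNG as base64 -- the model name is illustrative and assumes a VLM (e.g. a Qwen2.5-VL checkpoint) is being served:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("photo.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="local",  # assumption: a loaded vision model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{b64}",
                "detail": "high",  # auto / low / high, per the list above
            }},
        ],
    }],
)
print(response.choices[0].message.content)
```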
### Continuous Batching
- Handle 32+ concurrent requests with dynamic slot allocation
- Configurable prefill and completion batch sizes
- Stream interval control
- Request pooling for shared GPU memory
- Rate limiting and API key authentication
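From the client side batching is transparent: fire concurrent requests and the scheduler shares the GPU. A sketch with the async OpenAI client (the prompt set and model name are illustrative):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main():
    prompts = [f"Give me one fact about Apple Silicon (fact #{i})." for i in range(32)]
    # All 32 requests run concurrently; the server batches them into
    # shared GPU forward passes rather than serving them one by one.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(len(answers), "completions")

asyncio.run(main())
```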
### 5-Layer Cache Stack
- Prefix Cache -- token-level semantic caching with LRU eviction
- Paged KV Cache -- block-aware, reduced fragmentation
- Disk Cache -- persistent spillover for large contexts
- KV Quantization -- q4/q8 compression at storage boundary (2-4x memory savings)
- Hybrid SSM Cache -- Mamba + Attention architectures
- Auto cache type selection, warming API, stats API
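The stats and warm endpoints are plain HTTP. A sketch with `requests` -- note the warm payload shape is an assumption, since the docs only say the endpoint accepts prompts:

```python
import requests

BASE = "http://localhost:8000"

# Inspect hit rates and memory usage.
print(requests.get(f"{BASE}/v1/cache/stats").json())

# Pre-warm the cache with a common system prompt.
# NOTE: this JSON shape is an assumption; check the server for the exact schema.
requests.post(f"{BASE}/v1/cache/warm", json={
    "prompts": ["You are a helpful assistant."],
})
```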
### Sampling Parameters
- Temperature, Top-P, Top-K, Min-P, Repetition Penalty
- Stop sequences, max tokens (up to 131072)
- Structured output (`json_object` and `json_schema` modes) -- see the sketch below
- Streaming with proper Unicode handling (emoji, CJK, Arabic)
- Usage stats in streaming (`stream_options.include_usage`)
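Structured output follows the OpenAI `response_format` envelope. A minimal `json_schema` sketch (the schema and model name are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Extract: 'Alice is 30.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",  # illustrative schema
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
    temperature=0.0,
)
print(response.choices[0].message.content)  # e.g. {"name": "Alice", "age": 30}
```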
### Image Generation
- Flux Schnell (4 steps), Dev (20 steps), Kontext, Krea, Klein
- Z-Image Turbo (4-bit, 8-bit, full precision)
- Configurable steps, guidance, size, seed, sampler
- Quantized model support (2-bit to 8-bit)
- OpenAI-compatible `/v1/images/generations` with `usage` field
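Generation goes through the OpenAI images client. A sketch -- the model name is an assumption; use whatever Flux / Z-Image checkpoint the server has loaded:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

result = client.images.generate(
    model="local",  # assumption: name of the loaded Flux / Z-Image model
    prompt="A watercolor painting of Half Dome at sunrise",
    size="1024x1024",
    n=1,
    response_format="b64_json",  # ask for base64 image data in the response
)
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("out.png", "wb") as f:
    f.write(image_bytes)
```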
### Model Conversion
- 16-bit to MLX -- convert safetensors to MLX format
- 16-bit to quantized -- 2/4/8-bit MLX quantization
- GGUF to MLX -- import GGUF models
- MLX to JANG -- adaptive mixed-precision (different bits per layer type)
## CLI Reference
```
vmlx serve <model> [OPTIONS]
  --port 8000
  --host 0.0.0.0
  --continuous-batching
  --enable-prefix-cache
  --cache-type [auto|kv|prefix|paged]
  --cache-memory-percent 0.30
  --max-num-seqs 32
  --prefill-batch-size 4
  --completion-batch-size 16
  --tool-call-parser [auto|qwen|llama|mistral|hermes|deepseek|glm47|minimax|nemotron|granite|functionary|xlam|kimi|step3p5]
  --reasoning-parser [auto|qwen3|deepseek_r1|gptoss]
  --enable-thinking
  --enable-auto-tool-choice
  --api-key <secret>
  --rate-limit 60
  --enable-jit
  --mcp-config mcp.json
  --served-model-name <alias>
  --log-level [INFO|DEBUG]
```

```
vmlx bench <model> [OPTIONS]
  --num-prompts 10
  --num-completions 50
  --batch-size 1
```
## Advanced Quantization
JANG adaptive mixed-precision assigns different bit widths per layer type for better quality at the same model size.
```
vmlx convert model --jang-profile JANG_3M
```
- Pre-quantized models: JANGQ-AI on HuggingFace
- Stays quantized in GPU memory via native `QuantizedLinear` + `quantized_matmul`
- Compatible with all cache layers (prefix, paged, disk, KV quant)
## Project Structure
```
vmlx/
├── vmlx_engine/        # Python inference engine
├── panel/              # Electron desktop app (MLX Studio)
│   ├── src/main/       # Main process (sessions, chat, tools, DB)
│   ├── src/renderer/   # React UI
│   └── bundled-python/ # Bundled Python 3.12 interpreter
├── tests/              # Engine test suite (1894+ tests)
└── docs/               # Documentation
```
## Links
| Resource | Link |
|---|---|
| Desktop App | github.com/jjang-ai/mlxstudio |
| PyPI | pypi.org/project/vmlx |
| MLX Models | huggingface.co/mlx-community |
| JANG Models | huggingface.co/JANGQ-AI |
| Website | vmlx.net |
## License
Apache License 2.0
Built by Jinho Jang • eric@jangq.ai • JANGQ AI