
Local AI inference for Apple Silicon — Text, Image, Video & Audio generation on Mac


vMLX

The most complete MLX inference engine for Apple Silicon.

Run local LLMs, VLMs, and image generation models with full GPU acceleration via MLX -- continuous batching, 5-layer cache stack, 14 tool call parsers, Anthropic + OpenAI API compatibility, vision/video/audio multimodal, image generation, and JANG adaptive quantization.

pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit


Desktop app: Download the full GUI experience from MLX Studio -- no terminal required.


Stack layout (all-in-one, vendored)

vmlx now ships both implementations under one roof. The Swift side controls the entire stack from Metal kernels up — one Package.swift, one .build, one swift test run:

/Users/eric/vmlx/
├── swift/                      ← Swift stack — SwiftPM, 21 local targets, 225 tests
│   ├── Package.swift           ← 21 targets, 5 external deps only
│   ├── Sources/
│   │   │ ─── MLX runtime (merged from mlx-swift @ vmlx-0.31.3) ───
│   │   ├── Cmlx/               ← 23 MB mlx + mlx-c submodule w/ Metal kernels
│   │   ├── MLX/                ← core tensor API
│   │   ├── MLXNN/              ← nn.Module + layers
│   │   ├── MLXFast/            ← SDPA, layer norm, rope
│   │   ├── MLXFFT/ MLXLinalg/ MLXOptimizers/ MLXRandom/
│   │   │
│   │   │ ─── vMLX layer (our code) ───
│   │   ├── vMLXLMCommon/       ← cache, batch, FlashMoE, TurboQuant
│   │   ├── vMLXLLM/            ← ~50 LLM models
│   │   ├── vMLXVLM/            ← ~15 VLM models
│   │   ├── vMLXEmbedders/      ← embedding models
│   │   ├── vMLXFlux*/          ← image/video diffusion
│   │   ├── vMLXEngine/         ← Engine, Settings, Stream, Cache, MCP, FlashMoE
│   │   ├── vMLXServer/         ← Hummingbird routes
│   │   ├── vMLXApp/            ← SwiftUI 5-mode app
│   │   ├── vMLXTheme/
│   │   └── vMLXCLI/            ← `vmlxctl` binary
│   └── PROGRESS.md             ← full multi-session changelog
├── engine/vmlx_engine → /Users/eric/mlx/vllm-mlx/vmlx_engine  (Python engine)
├── app/panel → /Users/eric/mlx/vllm-mlx/panel                 (Electron UI)
├── inference/                  ← benchmarks + configs
├── docs/                       ← architecture docs
├── tests/                      ← cross-matrix regression tests
└── PROGRESS-2026-04-13.md      ← top-level multi-session summary

External Swift deps (5 only): swift-numerics, hummingbird, swift-argument-parser, swift-transformers, Jinja. Everything else — including the MLX runtime — is vendored in-tree.

Build the Swift stack:

cd /Users/eric/vmlx/swift
swift build            # ~1 min clean, 21 targets (8 MLX + 13 vMLX)
swift test             # 225 tests, ~15s
swift run vmlxctl serve --model /path/to/model

Build the Python stack:

pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

See PROGRESS-2026-04-13.md for the full state of the Swift rewrite, swift/APP-SURFACE-AUDIT-2026-04-13.md for per-surface REAL/STUB/MISSING inventory, and swift/SWIFT-ENGINE-ISSUES-AUDIT.md for the GH issue cross-reference against the Swift engine.


Features

Model Support (65+ Families)

  • Text LLMs -- Qwen 2/2.5/3/3.5, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, Gemma 2/3, Phi-3/4, DeepSeek V2/V3/R1, GLM-4/4.7, Nemotron, MiniMax, Kimi, Step, and any mlx-lm model
  • Vision LLMs -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n, Phi-3-Vision
  • Mixture-of-Experts -- Qwen 3.5 MoE, Mixtral, DeepSeek V2/V3, MiniMax M2.5, Llama 4
  • Hybrid SSM -- Nemotron-H, Jamba, GatedDeltaNet (Mamba + Attention)
  • Image Generation -- Flux Schnell/Dev/Kontext/Krea, Z-Image Turbo, Flux Klein (via mflux)
  • Audio -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
  • JANG -- Adaptive mixed-precision quantized models that stay quantized in GPU memory via native QuantizedLinear

API Endpoints

OpenAI + Anthropic compatible -- point any SDK at your local server:

Method Path Description
POST /v1/chat/completions OpenAI Chat Completions (streaming, tools, vision, structured output)
POST /v1/messages Anthropic Messages API -- drop-in Claude replacement
POST /v1/responses OpenAI Responses API (agentic format)
POST /v1/completions Text completions
POST /v1/images/generations Image generation (Flux/Z-Image, OpenAI format)
POST /v1/embeddings Text embeddings with dimension control
POST /v1/rerank Document reranking
POST /v1/audio/speech Text-to-speech (Kokoro)
POST /v1/audio/transcriptions Speech-to-text (Whisper)
GET /v1/models List loaded models
GET /health Server health, VRAM, queue length
GET /v1/cache/stats Cache hit rates and memory usage
POST /v1/cache/warm Pre-warm cache with prompts
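
For example, with the OpenAI Python SDK -- a minimal sketch; the "local" model alias and "none" API key mirror the Anthropic example below:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)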

Anthropic API Compatibility

Use the Anthropic Python/TypeScript SDK -- just change base_url:

from anthropic import Anthropic
client = Anthropic(base_url="http://localhost:8000", api_key="none")
response = client.messages.create(
    model="local",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
  • Full /v1/messages endpoint with streaming
  • Anthropic tool calling format (auto-translated)
  • Vision/multimodal via Anthropic content blocks

Tool Calling (14 Parsers)

Auto-detected from model config -- no manual setup:

Parser Models
qwen Qwen3, Qwen2.5, QwQ
llama3 Llama 3/3.1/3.2/3.3/4
mistral Mistral, Mixtral, Codestral
hermes Hermes, NousResearch
deepseek DeepSeek V2/V3
glm47 GLM-4.7, ChatGLM4
minimax MiniMax M2.5
nemotron Nemotron, Llama-Nemotron
granite IBM Granite
functionary Functionary v3
xlam Salesforce xLAM
kimi Moonshot Kimi
step3p5 StepFun Step-3.5
auto Auto-detect from config.json
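
A sketch of a tool-call round trip in the OpenAI format; get_weather is a hypothetical tool for illustration, and the server translates the model's native output through whichever parser it auto-detected:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # normalized OpenAI tool_calls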

Reasoning Models (4 Parsers)

  • Qwen3 / Qwen3.5 -- <think>...</think> blocks
  • DeepSeek-R1 -- DeepSeek reasoning format
  • GPT-OSS / GLM-4.7 -- thinking format
  • Phi-4-reasoning -- reasoning content
  • Enable/disable per request, reasoning effort control (low/medium/high)
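
A sketch of per-request reasoning control. The extra_body field names here (enable_thinking, reasoning_effort) are assumptions inferred from the --enable-thinking CLI flag and the effort levels above, not documented parameters:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    extra_body={
        "enable_thinking": True,     # assumed name, mirrors --enable-thinking
        "reasoning_effort": "high",  # assumed name; low/medium/high per above
    },
)
print(response.choices[0].message.content)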

Vision & Multimodal

  • Images -- PNG, JPEG, WebP via base64 or URL (up to 50 MB), detail levels (auto/low/high)
  • Video -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames)
  • Audio -- base64 or URL input (Qwen3-Audio)
  • Dedicated MLLM cache for image/video embeddings
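
For example, an image request using the OpenAI content-block format -- a sketch, with image.png as a placeholder path:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
    model="local",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "auto"}},
        ],
    }],
)
print(response.choices[0].message.content)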

Continuous Batching

  • Handle 32+ concurrent requests with dynamic slot allocation
  • Configurable prefill and completion batch sizes
  • Stream interval control
  • Request pooling for shared GPU memory
  • Rate limiting and API key authentication
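
A sketch of what the batching path absorbs -- 32 concurrent requests from one async client (matching the default --max-num-seqs), which the server packs into shared prefill/decode batches:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def ask(prompt: str) -> str:
    r = await client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

async def main() -> None:
    prompts = [f"One fact about Apple Silicon, #{i}" for i in range(32)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(f"{len(answers)} completions")

asyncio.run(main())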

5-Layer Cache Stack

  • Prefix Cache -- token-level semantic caching with LRU eviction
  • Paged KV Cache -- block-aware, reduced fragmentation
  • Disk Cache -- persistent spillover for large contexts
  • KV Quantization -- q4/q8 compression at the storage boundary (2-4x memory savings)
  • Hybrid SSM Cache -- Mamba + Attention architectures
  • Auto cache type selection, warming API, stats API
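
The stats and warming endpoints can be exercised over plain HTTP. The warm payload shape used here ({"prompts": [...]}) is an assumption, not a documented schema:

import requests

base = "http://localhost:8000"
print(requests.get(f"{base}/v1/cache/stats").json())  # hit rates, memory usage

# Pre-warm the prefix cache with a shared system prompt (payload shape assumed).
requests.post(f"{base}/v1/cache/warm",
              json={"prompts": ["You are a helpful assistant."]})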

Sampling Parameters

  • Temperature, Top-P, Top-K, Min-P, Repetition Penalty
  • Stop sequences, max tokens (up to 131072)
  • Structured output (json_object and json_schema modes)
  • Streaming with proper Unicode handling (emoji, CJK, Arabic)
  • Usage stats in streaming (stream_options.include_usage)
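
For example, json_schema mode through the OpenAI response_format field -- a sketch with an illustrative schema:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Name a city and its population."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_info",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "population": {"type": "integer"},
                },
                "required": ["city", "population"],
            },
        },
    },
)
print(response.choices[0].message.content)  # JSON constrained to the schema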

Image Generation

  • Flux Schnell (4 steps), Dev (20 steps), Kontext, Krea, Klein
  • Z-Image Turbo (4-bit, 8-bit, full precision)
  • Configurable steps, guidance, size, seed, sampler
  • Quantized model support (2-bit to 8-bit)
  • OpenAI-compatible /v1/images/generations with usage field
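
A sketch against the endpoint; the flux-schnell model alias and b64_json support are assumptions, and step/seed control would ride on extension fields rather than the OpenAI basics shown here:

import base64
import requests

r = requests.post(
    "http://localhost:8000/v1/images/generations",
    json={
        "model": "flux-schnell",  # assumed alias for Flux Schnell
        "prompt": "a watercolor fox, autumn forest",
        "size": "1024x1024",
        "response_format": "b64_json",
    },
)
with open("fox.png", "wb") as f:
    f.write(base64.b64decode(r.json()["data"][0]["b64_json"]))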

Model Conversion

  • 16-bit to MLX -- convert safetensors to MLX format
  • 16-bit to quantized -- 2/4/8-bit MLX quantization
  • GGUF to MLX -- import GGUF models
  • MLX to JANG -- adaptive mixed-precision (different bits per layer type)
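
For the 16-bit-to-quantized path, the equivalent operation in mlx-lm looks like the sketch below; whether vmlx convert wraps this exact call is an assumption:

from mlx_lm import convert

# Convert a 16-bit Hugging Face checkpoint to 4-bit MLX on disk.
convert(
    hf_path="Qwen/Qwen3-8B",
    mlx_path="./qwen3-8b-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)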

CLI Reference

vmlx serve <model> [OPTIONS]
  --port 8000
  --host 0.0.0.0
  --continuous-batching
  --enable-prefix-cache
  --cache-type [auto|kv|prefix|paged]
  --cache-memory-percent 0.30
  --max-num-seqs 32
  --prefill-batch-size 4
  --completion-batch-size 16
  --tool-call-parser [auto|qwen|llama|mistral|hermes|deepseek|glm47|minimax|nemotron|granite|functionary|xlam|kimi|step3p5]
  --reasoning-parser [auto|qwen3|deepseek_r1|gptoss]
  --enable-thinking
  --enable-auto-tool-choice
  --api-key <secret>
  --rate-limit 60
  --enable-jit
  --mcp-config mcp.json
  --served-model-name <alias>
  --log-level [INFO|DEBUG]
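
For example, a launch combining the options above:

vmlx serve mlx-community/Qwen3-8B-4bit \
  --continuous-batching \
  --enable-prefix-cache \
  --max-num-seqs 32 \
  --tool-call-parser auto \
  --enable-auto-tool-choice \
  --api-key mysecret \
  --rate-limit 60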

vmlx bench <model> [OPTIONS]
  --num-prompts 10
  --num-completions 50
  --batch-size 1

Advanced Quantization

JANG adaptive mixed-precision assigns different bit widths per layer type for better quality at the same model size.

vmlx convert model --jang-profile JANG_3M
  • Pre-quantized models: JANGQ-AI on HuggingFace
  • Stays quantized in GPU memory via native QuantizedLinear + quantized_matmul
  • Compatible with all cache layers (prefix, paged, disk, KV quant)
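
What "stays quantized" means at the MLX level, as a minimal sketch: mlx.nn.quantize swaps Linear layers for QuantizedLinear in place, so the forward pass runs quantized_matmul instead of dequantizing. (JANG's per-layer-type bit widths are not shown here.)

import mlx.core as mx
import mlx.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
nn.quantize(model, group_size=64, bits=4)  # Linear -> QuantizedLinear in place
x = mx.random.normal((1, 512))
y = model(x)                               # runs quantized_matmul on the GPU
mx.eval(y)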

Project Structure

vmlx/
├── vmlx_engine/           # Python inference engine
├── panel/                 # Electron desktop app (MLX Studio)
│   ├── src/main/          # Main process (sessions, chat, tools, DB)
│   ├── src/renderer/      # React UI
│   └── bundled-python/    # Bundled Python 3.12 interpreter
├── tests/                 # Engine test suite (1894+ tests)
└── docs/                  # Documentation

Links

Resource Link
Desktop App github.com/jjang-ai/mlxstudio
PyPI pypi.org/project/vmlx
MLX Models huggingface.co/mlx-community
JANG Models huggingface.co/JANGQ-AI
Website vmlx.net

License

Apache License 2.0


Built by Jinho Jang · eric@jangq.ai · JANGQ AI
