# vMLX
The most complete MLX inference engine for Apple Silicon.
Run local LLMs, VLMs, and diffusion models with full GPU acceleration via MLX -- continuous batching, a 5-layer cache stack, 14 tool-call parsers, Anthropic + OpenAI API compatibility, vision/video/audio multimodal input, image generation, and JANG adaptive quantization.
```
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```
Desktop app: Download the full GUI experience from MLX Studio -- no terminal required.
## Stack layout (all-in-one, vendored)
vmlx now ships both the Swift and Python implementations under one roof. The Swift side controls the entire stack from Metal kernels up -- one `Package.swift`, one `.build`, one `swift test` run:
```
/Users/eric/vmlx/
├── swift/                  ← Swift stack — SwiftPM, 21 local targets, 225 tests
│   ├── Package.swift       ← 21 targets, 5 external deps only
│   ├── Sources/
│   │   │ ─── MLX runtime (merged from mlx-swift @ vmlx-0.31.3) ───
│   │   ├── Cmlx/           ← 23 MB mlx + mlx-c submodule w/ Metal kernels
│   │   ├── MLX/            ← core tensor API
│   │   ├── MLXNN/          ← nn.Module + layers
│   │   ├── MLXFast/        ← SDPA, layer norm, rope
│   │   ├── MLXFFT/ MLXLinalg/ MLXOptimizers/ MLXRandom/
│   │   │
│   │   │ ─── vMLX layer (our code) ───
│   │   ├── vMLXLMCommon/   ← cache, batch, FlashMoE, TurboQuant
│   │   ├── vMLXLLM/        ← ~50 LLM models
│   │   ├── vMLXVLM/        ← ~15 VLM models
│   │   ├── vMLXEmbedders/  ← embedding models
│   │   ├── vMLXFlux*/      ← image/video diffusion
│   │   ├── vMLXEngine/     ← Engine, Settings, Stream, Cache, MCP, FlashMoE
│   │   ├── vMLXServer/     ← Hummingbird routes
│   │   ├── vMLXApp/        ← SwiftUI 5-mode app
│   │   ├── vMLXTheme/
│   │   └── vMLXCLI/        ← `vmlxctl` binary
│   └── PROGRESS.md         ← full multi-session changelog
├── engine/vmlx_engine → /Users/eric/mlx/vllm-mlx/vmlx_engine (Python engine)
├── app/panel → /Users/eric/mlx/vllm-mlx/panel (Electron UI)
├── inference/              ← benchmarks + configs
├── docs/                   ← architecture docs
├── tests/                  ← cross-matrix regression tests
└── PROGRESS-2026-04-13.md  ← top-level multi-session summary
```
External Swift deps (5 only): swift-numerics, hummingbird,
swift-argument-parser, swift-transformers, Jinja. Everything
else — including the MLX runtime — is vendored in-tree.
Build the Swift stack:
```
cd /Users/eric/vmlx/swift
swift build   # ~1 min clean, 21 targets (8 MLX + 13 vMLX)
swift test    # 225 tests, ~15s
swift run vmlxctl serve --model /path/to/model
```
Build the Python stack:
```
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```
See `PROGRESS-2026-04-13.md` for the full state of the Swift rewrite, `swift/APP-SURFACE-AUDIT-2026-04-13.md` for the per-surface REAL/STUB/MISSING inventory, and `swift/SWIFT-ENGINE-ISSUES-AUDIT.md` for the GitHub issue cross-reference against the Swift engine.
## Features
### Model Support (65+ Families)
- Text LLMs -- Qwen 2/2.5/3/3.5, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, Gemma 2/3, Phi-3/4, DeepSeek V2/V3/R1, GLM-4/4.7, Nemotron, MiniMax, Kimi, Step, and any mlx-lm model
- Vision LLMs -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n, Phi-3-Vision
- Mixture-of-Experts -- Qwen 3.5 MoE, Mixtral, DeepSeek V2/V3, MiniMax M2.5, Llama 4
- Hybrid SSM -- Nemotron-H, Jamba, GatedDeltaNet (Mamba + Attention)
- Image Generation -- Flux Schnell/Dev/Kontext/Krea, Z-Image Turbo, Flux Klein (via mflux)
- Audio -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
- JANG -- Adaptive mixed-precision quantized models that stay quantized in GPU memory via native `QuantizedLinear`
### API Endpoints
OpenAI + Anthropic compatible -- point any SDK at your local server:
| Method | Path | Description |
|---|---|---|
| POST | `/v1/chat/completions` | OpenAI Chat Completions (streaming, tools, vision, structured output) |
| POST | `/v1/messages` | Anthropic Messages API -- drop-in Claude replacement |
| POST | `/v1/responses` | OpenAI Responses API (agentic format) |
| POST | `/v1/completions` | Text completions |
| POST | `/v1/images/generations` | Image generation (Flux/Z-Image, OpenAI format) |
| POST | `/v1/embeddings` | Text embeddings with dimension control |
| POST | `/v1/rerank` | Document reranking |
| POST | `/v1/audio/speech` | Text-to-speech (Kokoro) |
| POST | `/v1/audio/transcriptions` | Speech-to-text (Whisper) |
| GET | `/v1/models` | List loaded models |
| GET | `/health` | Server health, VRAM, queue length |
| GET | `/v1/cache/stats` | Cache hit rates and memory usage |
| POST | `/v1/cache/warm` | Pre-warm cache with prompts |
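Any OpenAI SDK can talk to these routes directly. A minimal sketch with the official `openai` Python package -- the `local` model alias is an assumption; use whatever name `/v1/models` reports:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vmlx server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="local",  # assumption: substitute the name listed by /v1/models
    messages=[{"role": "user", "content": "Summarize MLX in one sentence."}],
)
print(response.choices[0].message.content)
```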
### Anthropic API Compatibility
Use the Anthropic Python/TypeScript SDK -- just change `base_url`:
```python
from anthropic import Anthropic

# The SDK appends /v1/messages itself, so point base_url at the server root.
client = Anthropic(base_url="http://localhost:8000", api_key="none")
response = client.messages.create(
    model="local",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
```
- Full `/v1/messages` endpoint with streaming (see the sketch below)
- Anthropic tool calling format (auto-translated)
- Vision/multimodal via Anthropic content blocks
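Streaming should go through the SDK's standard helper as well; a minimal sketch that relies only on the documented `/v1/messages` streaming support:

```python
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="none")

# Stream tokens via the Anthropic SDK's standard streaming helper.
with client.messages.stream(
    model="local",
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a haiku about Metal."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```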
### Tool Calling (14 Parsers)
Auto-detected from model config -- no manual setup:
| Parser | Models |
|---|---|
| `qwen` | Qwen3, Qwen2.5, QwQ |
| `llama3` | Llama 3/3.1/3.2/3.3/4 |
| `mistral` | Mistral, Mixtral, Codestral |
| `hermes` | Hermes, NousResearch |
| `deepseek` | DeepSeek V2/V3 |
| `glm47` | GLM-4.7, ChatGLM4 |
| `minimax` | MiniMax M2.5 |
| `nemotron` | Nemotron, Llama-Nemotron |
| `granite` | IBM Granite |
| `functionary` | Functionary v3 |
| `xlam` | Salesforce xLAM |
| `kimi` | Moonshot Kimi |
| `step3p5` | StepFun Step-3.5 |
| `auto` | Auto-detect from config.json |
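Because parsing happens server-side, clients just send standard OpenAI-format tools and get back standard `tool_calls` objects. A minimal sketch -- the `get_weather` function and `local` model name are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "What's the weather in Seoul?"}],
    tools=tools,
)
# The active parser converts the model's native tool syntax into
# standard OpenAI tool_calls objects.
print(response.choices[0].message.tool_calls)
```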
### Reasoning Models (4 Parsers)
- Qwen3 / Qwen3.5 -- `<think>...</think>` blocks
- DeepSeek-R1 -- DeepSeek reasoning format
- GPT-OSS / GLM-4.7 -- thinking format
- Phi-4-reasoning -- reasoning content
- Enable/disable per request, reasoning effort control (low/medium/high) -- see the sketch below
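A sketch of per-request control. The field names here are assumptions: `reasoning_effort` mirrors the OpenAI parameter of the same name, and `enable_thinking` is modeled on common MLX/vLLM servers -- check vmlx's docs for the exact request schema.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    # Both fields below are assumptions about vmlx's request schema:
    extra_body={
        "reasoning_effort": "high",   # low / medium / high, per the list above
        "enable_thinking": True,      # per-request thinking toggle
    },
)
print(response.choices[0].message.content)
```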
### Vision & Multimodal
- Images -- PNG, JPEG, WebP via base64 or URL (up to 50 MB), detail levels (auto/low/high)
- Video -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames)
- Audio -- base64 or URL input (Qwen3-Audio)
- Dedicated MLLM cache for image/video embeddings
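Images ride in standard OpenAI content blocks. A minimal sketch sending a local PNG as base64 -- the model name is illustrative and assumes a VLM (e.g. a Qwen2.5-VL checkpoint) is being served:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("photo.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="local",  # assumption: a loaded vision model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{b64}",
                "detail": "high",  # auto / low / high, per the list above
            }},
        ],
    }],
)
print(response.choices[0].message.content)
```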
### Continuous Batching
- Handle 32+ concurrent requests with dynamic slot allocation
- Configurable prefill and completion batch sizes
- Stream interval control
- Request pooling for shared GPU memory
- Rate limiting and API key authentication
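From the client side batching is transparent: fire concurrent requests and the scheduler shares the GPU. A sketch with the async OpenAI client (the prompt set and model name are illustrative):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main():
    prompts = [f"Give me one fact about Apple Silicon (fact #{i})." for i in range(32)]
    # All 32 requests run concurrently; the server batches them into
    # shared GPU forward passes rather than serving them one by one.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(len(answers), "completions")

asyncio.run(main())
```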
### 5-Layer Cache Stack
- Prefix Cache -- token-level semantic caching with LRU eviction
- Paged KV Cache -- block-aware, reduced fragmentation
- Disk Cache -- persistent spillover for large contexts
- KV Quantization -- q4/q8 compression at storage boundary (2-4x memory savings)
- Hybrid SSM Cache -- Mamba + Attention architectures
- Auto cache type selection, warming API, stats API
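The stats and warm endpoints are plain HTTP. A sketch with `requests` -- note the warm payload shape is an assumption, since the docs only say the endpoint accepts prompts:

```python
import requests

BASE = "http://localhost:8000"

# Inspect hit rates and memory usage.
print(requests.get(f"{BASE}/v1/cache/stats").json())

# Pre-warm the cache with a common system prompt.
# NOTE: this JSON shape is an assumption; check the server for the exact schema.
requests.post(f"{BASE}/v1/cache/warm", json={
    "prompts": ["You are a helpful assistant."],
})
```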
### Sampling Parameters
- Temperature, Top-P, Top-K, Min-P, Repetition Penalty
- Stop sequences, max tokens (up to 131072)
- Structured output (`json_object` and `json_schema` modes) -- see the sketch below
- Streaming with proper Unicode handling (emoji, CJK, Arabic)
- Usage stats in streaming (`stream_options.include_usage`)
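Structured output follows the OpenAI `response_format` envelope. A minimal `json_schema` sketch (the schema and model name are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Extract: 'Alice is 30.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",  # illustrative schema
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
    temperature=0.0,
)
print(response.choices[0].message.content)  # e.g. {"name": "Alice", "age": 30}
```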
### Image Generation
- Flux Schnell (4 steps), Dev (20 steps), Kontext, Krea, Klein
- Z-Image Turbo (4-bit, 8-bit, full precision)
- Configurable steps, guidance, size, seed, sampler
- Quantized model support (2-bit to 8-bit)
- OpenAI-compatible `/v1/images/generations` with `usage` field
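Generation goes through the OpenAI images client. A sketch -- the model name is an assumption; use whatever Flux / Z-Image checkpoint the server has loaded:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

result = client.images.generate(
    model="local",  # assumption: name of the loaded Flux / Z-Image model
    prompt="A watercolor painting of Half Dome at sunrise",
    size="1024x1024",
    n=1,
    response_format="b64_json",  # ask for base64 image data in the response
)
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("out.png", "wb") as f:
    f.write(image_bytes)
```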
### Model Conversion
- 16-bit to MLX -- convert safetensors to MLX format
- 16-bit to quantized -- 2/4/8-bit MLX quantization
- GGUF to MLX -- import GGUF models
- MLX to JANG -- adaptive mixed-precision (different bits per layer type)
## CLI Reference
```
vmlx serve <model> [OPTIONS]
  --port 8000
  --host 0.0.0.0
  --continuous-batching
  --enable-prefix-cache
  --cache-type [auto|kv|prefix|paged]
  --cache-memory-percent 0.30
  --max-num-seqs 32
  --prefill-batch-size 4
  --completion-batch-size 16
  --tool-call-parser [auto|qwen|llama|mistral|hermes|deepseek|glm47|minimax|nemotron|granite|functionary|xlam|kimi|step3p5]
  --reasoning-parser [auto|qwen3|deepseek_r1|gptoss]
  --enable-thinking
  --enable-auto-tool-choice
  --api-key <secret>
  --rate-limit 60
  --enable-jit
  --mcp-config mcp.json
  --served-model-name <alias>
  --log-level [INFO|DEBUG]
```

```
vmlx bench <model> [OPTIONS]
  --num-prompts 10
  --num-completions 50
  --batch-size 1
```
## Advanced Quantization
JANG adaptive mixed-precision assigns different bit widths per layer type for better quality at the same model size.
```
vmlx convert model --jang-profile JANG_3M
```
- Pre-quantized models: JANGQ-AI on HuggingFace
- Stays quantized in GPU memory via native `QuantizedLinear` + `quantized_matmul`
- Compatible with all cache layers (prefix, paged, disk, KV quant)
## Project Structure
```
vmlx/
├── vmlx_engine/        # Python inference engine
├── panel/              # Electron desktop app (MLX Studio)
│   ├── src/main/       # Main process (sessions, chat, tools, DB)
│   ├── src/renderer/   # React UI
│   └── bundled-python/ # Bundled Python 3.12 interpreter
├── tests/              # Engine test suite (1894+ tests)
└── docs/               # Documentation
```
## Links
| Resource | Link |
|---|---|
| Desktop App | github.com/jjang-ai/mlxstudio |
| PyPI | pypi.org/project/vmlx |
| MLX Models | huggingface.co/mlx-community |
| JANG Models | huggingface.co/JANGQ-AI |
| Website | vmlx.net |
## License
Apache License 2.0
Built by Jinho Jang • eric@jangq.ai • JANGQ AI