Local AI inference for Apple Silicon — Text, Image, Video & Audio generation on Mac
vMLX — Swift (dev branch)
For a production Swift engine with a more streamlined, user-ready feature set, please use osaurus.ai.
This tree is where the more experimental engine paths live — DDFlash, JANG smelt, JANGTQ, Flash MoE, hybrid-SSM cache work, and other research-y directions that aren't ready for a shipping product yet.
⚠️ EXPERIMENTAL SANDBOX — NOT A REAL APP ⚠️
This `dev` branch is NOT a product. It is not meant to be used. It is not meant to be "tested" the way you'd test an app. This tree is an experimental workbench for trying new low-level research directions that have no place in the stable user app yet. Specifically:
- DDFlash / JANG-DFlash speculative decoding — block-diffusion drafter + DDTree beam + coordinator-aware verify. See `Sources/vMLXLMCommon/DFlash/`.
- JANG smelt — partial-expert streaming / smelting. The UI flag is live; the Swift-side loader is not — Stream emits an honest per-request warning when the toggle is on.
- JANGTQ / MXTQ quant formats — native Swift repack, TurboQuant KV cache, MXTQ packed dequant (PRNG parity still WIP). See `Sources/vMLXLMCommon/JangMXTQDequant.swift`, `JangSpecBundleLoader.swift`, and `TurboQuantKVCache` in `Sources/vMLXLMCommon/TurboQuant/`.
- Flash MoE expert streaming from SSD, cache coordinator work, hybrid-SSM + sliding-window disk round-trip, Llama 4 iRope, etc.
If you want to use vMLX, run the stable Electron + bundled-Python app at jjang-ai/mlxstudio (current release v1.3.54). That is the shipping product.
If you want to use this Swift dev branch: don't. It's not for that. It exists so a handful of people can try experimental engine paths end-to-end and break things in isolation without putting user installs at risk.
Do not treat dev DMGs as app beta builds.
`vdev-*` tags on mlxstudio releases are engineering snapshots — prereleased so the five people touching this tree can grab a notarized binary quickly. They are never promoted to user channels. Do not file "this is broken" bug reports against them as if they were products; reproductions are welcome, expectations of stability are not.
Things break. APIs shift. Whole model families don't work yet. Cache tiers have known correctness gaps. Multi-turn reuse is incomplete on some hybrid paths. Do not deploy this. Do not integrate it into anything that matters. Do not rely on it for any production workload.
What this is
The entire MLX inference stack — Metal kernels, quant paths, attention
paths, scheduler, tokenizer, chat loop, HTTP routes, desktop UI —
in a single SwiftPM package. No external path: dependencies, no
upstream drift risk. We control every layer, and when we need a
change we commit it here directly instead of coordinating with
mlx-swift, mlx-swift-examples, or vmlx-swift-lm upstreams.
As of 2026-04-13 the vendoring is complete: Cmlx + all 8 MLX Swift
targets live next to the vMLX* targets under one Package.swift
(23 targets, 5 external deps only).
Binaries produced:
- `vmlxctl` — CLI (`serve`, `chat`, `pull`, `ls`, `dflash-smoke`)
- `vMLX` — SwiftUI macOS app (5 modes: Chat, Server, Image, Terminal, API)
Status & limitations
Honest snapshot of the dev branch. Anything not listed here should be
assumed broken or absent.
✅ Works (live-verified today)
| Area | State |
|---|---|
| Model load + text generation | ✅ Llama 3.2 1B, Qwen3 0.6B, Gemma 4 e2b (4-bit MLX) |
| Full `swift build` | ✅ clean, all 23 targets |
| Dev DMG build | ✅ notarized + stapled, vdev-20260415-0612 shipped |
| `vmlxctl chat / serve / pull / ls` | ✅ basic paths |
| HTTP server: OpenAI `/v1/chat/completions`, `/v1/models` | ✅ streaming + non-streaming |
| HTTP server: Ollama `/api/{chat,generate,tags,...}` | ✅ |
| HTTP server: Anthropic `/v1/messages` | ✅ |
| Admin: `/admin/{soft,deep}-sleep`, `/wake`, `/cache/*`, `/dflash/*` | ✅ |
| Model library auto-scan (HuggingFace cache) | ✅ |
| Model-family auto-detection (4-tier: JANG gold → silver → bronze → fallback) | ✅ |
| Cache stack: paged (L1) + memory (L1.5) + disk (L2) + SSM companion | ✅ live, with known gaps below |
| Prefix cache multi-turn reuse | ✅ standard path; ⚠️ partial on DFlash |
| Sliding-window attention disk round-trip (Gemma 3/4, Mistral 4 maxKVSize) | ✅ via `.rotating` LayerKind |
| BaichuanM1 CacheList disk serialization | ✅ via `.cachelist` LayerKind (F-G2) |
| JANG-DFlash speculative decoding | ✅ text-only, MiniMax / Mistral 4 / DeepSeek V3 targets, coordinator-aware |
| Flash MoE expert streaming (Phase 2b on some families) | ✅ opt-in per model |
| TurboQuant KV cache integration | ✅ via make_cache patch |
| Reasoning/tool parsers | ✅ 13 reasoning + 15 tool parsers registered |
| §15 reasoning-off → content reroute | ✅ |
| Logs: LogStore ring + SwiftUI LogsPanel + per-request RequestLogger | ✅ |
| Settings: 4-tier merge (global → session → chat → request) | ✅ |
| Download manager: HF auth, byte resume, progress bar | ✅ |
| Embeddings endpoint | ✅ |
| Whisper ASR `/v1/audio/transcriptions` | ✅ |
| TTS `/v1/audio/speech` | ⚠️ placeholder tone backend only (Kokoro scaffolded) |
| MCP stdio JSON-RPC + tool dispatch | ✅ |
| Terminal mode with bash tool auto-inject | ✅ |
| `/metrics` Prometheus endpoint | ✅ |
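Several of the rows above reduce to small, checkable mechanisms. The prefix-cache multi-turn reuse row, for instance, comes down to finding the longest shared token prefix between the cached conversation and the re-sent prompt, then prefilling only the suffix. A minimal Python sketch (toy token lists; the names and shapes here are illustrative, not the in-tree paged-cache API):

```python
def reusable_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Length of the longest common prefix between cached tokens and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def plan_prefill(cached: list[int], prompt: list[int]) -> tuple[int, list[int]]:
    """Return (tokens reused from cache, tokens that still need prefill)."""
    keep = reusable_prefix_len(cached, prompt)
    return keep, prompt[keep:]

# Turn 1 cached system + user tokens; turn 2 re-sends the history plus new tokens.
cached = [1, 5, 9, 9, 2, 7]
prompt = [1, 5, 9, 9, 2, 7, 3, 8]
keep, suffix = plan_prefill(cached, prompt)
assert (keep, suffix) == (6, [3, 8])  # only the 2 new tokens hit the prefill path
```

The real paged cache works in blocks rather than individual tokens, but the reuse decision is the same comparison.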
⚠️ Partial / flaky
- MXTQ PRNG mismatch — Swift `JangMXTQDequant` uses POSIX `srand48`; the Python writer uses NumPy PCG64. Sign sequences diverge, so some MXTQ bundles decode garbage weights. Known blocker for several JANGTQ checkpoints. Fix: port PCG64 to Swift or re-seed the Python writer.
- Engine.LoadOptions default drift — stricter cache defaults (`cacheMemoryPercent=0.10`, `maxCacheBlocks=500`) than `GlobalSettings` (0.30 / 1000). Code paths that construct `LoadOptions()` directly get smaller caches than configured.
- Smelt mode — UI toggle + settings field exist, but the Swift engine has no partial-expert loader equivalent to Python's `smelt_loader.py`. Labelled honestly as "Python engine only"; Stream emits a per-request warning when `smelt=true`.
- DFlash streaming reasoning split — DFlash emits N-token blocks; v1 routes all decoded text as `.content`. The client-side reasoning extractor still works after the fact, but live `<think>...</think>` delta routing is not live on the DFlash path.
- DFlash tools/images — falls back to the standard path (logged) when the request includes tools or images. Text-only path only for v1.
- xcodegen + SwiftPM app bundling — the `xcodebuild archive` path produces a bare executable instead of a `.app`. The ship-DMG script assembles the bundle manually from DerivedData products. Root cause not yet diagnosed.
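The MXTQ PRNG mismatch is easy to demonstrate in miniature: two correct PRNGs seeded identically still produce unrelated streams if the algorithms differ. A pure-Python sketch (the drand48 constants are the POSIX ones; the seeding details and the sign mapping are illustrative assumptions, and Mersenne Twister stands in for NumPy's PCG64):

```python
import random

# Minimal model of POSIX drand48: a 48-bit LCG, X' = (a*X + c) mod 2^48.
# Illustrative only: the exact seeding in JangMXTQDequant and the writer's
# PCG64 stream are assumptions here, not vMLX code.
A, C, M = 0x5DEECE66D, 0xB, 1 << 48

def drand48_stream(seed: int, n: int) -> list[float]:
    x = ((seed & 0xFFFFFFFF) << 16) | 0x330E   # srand48-style seeding
    out = []
    for _ in range(n):
        x = (A * x + C) % M
        out.append(x / M)
    return out

def sign_bits(values: list[float]) -> list[int]:
    # Stand-in for however the quant format maps PRNG draws to +/- signs.
    return [1 if v >= 0.5 else 0 for v in values]

lcg_signs = sign_bits(drand48_stream(42, 64))

# Any other algorithm seeded identically yields an unrelated stream
# (Mersenne Twister here as a stand-in for NumPy's PCG64):
rng = random.Random(42)
mt_signs = sign_bits([rng.random() for _ in range(64)])
assert lcg_signs != mt_signs  # same seed, different algorithm: signs diverge
```

This is why the fix has to be algorithmic parity (port PCG64 to Swift, or re-seed the writer), not a seed tweak.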
❌ Broken or not yet implemented
- Image generation `.generate()` bodies — Flux / Qwen-Image / Z-Image / SeedVR2 / FIBO DiT forward passes are still scaffolded. Biggest user-visible gap versus the Electron app.
- Several model families stamped in the silver table but with no Swift class — CogVLM, Molmo, InternVL, Florence, GOT-OCR (F-G22), DeepSeek VL (F-G10). Silver rows resolve but `Engine.load` falls through.
- FlashMoE family conformance gaps — DeepSeek V3 (F-G6), GLM-5 `glm_moe_dsa` (F-G7), Jamba (F-G8), Granite3 MoE (F-G9). The protocol is ready; per-model `FlashMoEReplaceable` conformance is not wired.
- Tool-parser silver rows missing — XLaM, Functionary (F-G11); OlmoE, BailingMoe (F-G12). Fall through to `native`.
- Mistral 4 VLM dedicated text config (F-G13), Phi-4 reasoning parser verification (F-G14), Llama 4 tool format verification (F-G15), RWKVCache dedicated class (F-G16), MiniMax Lightning Attention cache verification (F-G17).
- CacheList multi-sub-cache walker (F-G18) — partially addressed by F-G2 for the (Mamba, KV) case; hypothetical (Mamba, Rotating, Full) wrappers would still lose the second KV.
- Step-3.5 reasoning + tool-call interleave (F-G19) — no test coverage.
- GPT-OSS tool parser verification (F-G20) — silver row stamps the `glm47` parser; unverified against a real checkpoint.
- Gemma 4 image-token defensive assertion (F-G21).
- PaliGemma template probe edge case (F-G23).
- Kokoro neural TTS backend — scaffolded in `vMLXTTS/Kokoro/KokoroBackend.swift`; `/v1/audio/speech` returns a deterministic placeholder-tone WAV until the 9-step port lands.
- Whisper sliding-window for audio > 30s — single-pass only; no temperature fallback, no beam search, no word-level timestamps.
- Universal binary — arm64-only. `HasDType` gates Float16 on `#if !arch(x86_64)` upstream, so x86_64 slices won't type-check.
- App target bundling via xcodegen — see above; manual assembly works but needs a proper fix.
Stack layout
```
Package.swift        23 local targets + 5 external deps
project.yml          XcodeGen spec for the .app wrapper
scripts/
  stage-metallib.sh  stages mlx.metallib next to SwiftPM binaries
Sources/
  Cmlx/              vendored MLX + mlx-c C++ (metal kernels inline)
                     + default.metallib (prebuilt, for SwiftPM)
  MLX/ MLXNN/ MLXFast/       MLX Swift runtime (vendored from mlx-swift)
  MLXFFT/ MLXLinalg/
  MLXOptimizers/ MLXRandom/
  vMLXLMCommon/      caches, paged cache, SSM companion, TQ,
                     Flash MoE, DFlash, evaluate loop, JANG loader,
                     MXTQ dequant
  vMLXLLM/           ~50 LLM model classes
  vMLXVLM/           ~15 vision-language model classes
  vMLXEmbedders/     embedding model classes
  vMLXFluxKit/ vMLXFluxModels/ vMLXFluxVideo/ vMLXFlux/
                     image / video generation stack (WIP)
  vMLXWhisper/ vMLXTTS/      audio IO
  vMLXEngine/        Engine actor: load, stream, cache, MCP,
                     Flash MoE, DFlash, ModelCapabilities,
                     CapabilityDetector, settings, metrics
  vMLXServer/        Hummingbird routes: OpenAI / Ollama / Anthropic /
                     Admin / MCP / Metrics / Gateway
  vMLXApp/           SwiftUI app (5 modes)
  vMLXTheme/         Linear-inspired tokens
  vMLXCLI/           vmlxctl
vMLX/
  Assets.xcassets/   app icon set
  Info.plist         template (xcodegen fills $(...))
  vMLX.entitlements  App Sandbox disabled (Terminal mode)
```
Build
```bash
# Requires: macOS 14+, Xcode 15.4+ (Swift 5.10), xcodegen (brew install xcodegen)
git clone -b dev https://github.com/jjang-ai/vmlx.git
cd vmlx

# --- CLI (SwiftPM) ---
swift build -c release

# CRITICAL: colocate mlx.metallib next to the binary. Without this,
# every model load fails with "Failed to load the default metallib."
# (The SwiftPM flat bundle layout doesn't match what
# `load_swiftpm_library` in device.cpp expects, so we fall through to
# the first-try `load_colocated_library` path instead.)
./scripts/stage-metallib.sh release

./.build/release/vmlxctl serve --model /path/to/model
./.build/release/vmlxctl chat --model /path/to/model
./.build/release/vmlxctl pull mlx-community/Qwen3-32B-4bit
./.build/release/vmlxctl list

# --- SwiftUI app (Xcode archive path) ---
# xcodegen produces vMLX.xcodeproj; the current archive path builds a bare
# executable, not a .app bundle (known bundling quirk). The ship-DMG
# script manually assembles the .app from DerivedData products — see
# PROGRESS.md for the exact steps until we fix the xcodegen config.
xcodegen
open vMLX.xcodeproj
```
arm64 only.
HTTP surfaces
| Family | Endpoints |
|---|---|
| OpenAI | /v1/{chat/completions, completions, responses, embeddings, models, rerank, images/generations, images/edits, audio/transcriptions, audio/speech} |
| Ollama | /api/{chat, generate, embeddings, embed, tags, show, ps, version, pull} |
| Anthropic | /v1/messages (streaming, vision, document, server_tool_use) |
| Admin | /health, /admin/{soft-sleep, deep-sleep, wake, cache/stats, dflash, dflash/load, dflash/unload, models/:id} |
| MCP | /v1/mcp/{tools, servers, execute}, /mcp/:server/:method |
| Metrics | /metrics (Prometheus text) |
| Gateway | multiplexes single base-URL across N model sessions |
The Responses API (`/v1/responses`) covers both string and structured input shapes (`message` / `function_call` / `function_call_output` / `input_text` / `input_image`), `tools`, `tool_choice`, `reasoning.effort` bucketing, and streams the full Responses event family (`response.created`, `output_item.added`, `output_text.delta`, `reasoning_summary_text.delta`, `function_call_arguments.delta`, `output_item.done`, `response.completed`, `[DONE]`).
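On the client side, the streaming chat surface emits standard OpenAI-style SSE chunks. A minimal Python consumer that reassembles the deltas (the sample payloads below are hand-written in the chunk shape, not captured server output):

```python
import json

def collect_stream(sse_lines):
    """Reassemble assistant text from OpenAI-style `data:` SSE lines."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue            # skip comments / blank keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Hand-written sample chunks in the shape the endpoint streams:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":", world"}}]}',
    "data: [DONE]",
]
assert collect_stream(sample) == "Hello, world"
```

The same loop works against the non-DFlash path today; on DFlash, reasoning text arrives in `content` deltas as noted above.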
Roadmap / TODO
Tracked day-to-day in PROGRESS.md. High-level headline items in
priority order:
Blockers for a first user-visible release:
- MXTQ PRNG parity — currently produces garbage for some JANGTQ bundles.
- Image generation `.generate()` bodies — Flux / Qwen-Image / Z-Image forward passes still scaffolded.
- xcodegen .app bundling — manual assembly works; fix so `xcodebuild archive` produces a proper `.app`.
Model-family coverage (F-G matrix — 23 items tracked):
- F-G1 ✅ SSMStateCache mediaSalt
- F-G2 ✅ BaichuanM1 CacheList disk walker
- F-G3 ✅ Llama 4 dedicated model class (iRope + MoE + ChunkedKVCache)
- F-G4 ✅ Gemma 3 tool parser hermes → gemma4
- F-G5 ✅ Gemma 3 mixed SWA+full detection
- F-G6..F-G23 🟡 pending — see `PROGRESS.md` + `SWIFT-PER-FAMILY-MATRIX-2026-04-15.md`
- FlashMoE conformance: DeepSeek V3, GLM-5, Jamba, Granite3 MoE
- Missing Swift classes: DeepSeek VL, CogVLM, Molmo, InternVL, Florence, GOT-OCR
- Tool parser silver rows: XLaM, Functionary, OlmoE, BailingMoe
- Verification: Phi-4 reasoning, Llama 4 tool format, MiniMax Lightning Attention, GPT-OSS tool parser
- Cache correctness: RWKVCache dedicated class, CacheList multi-sub walker
- Defensive: Gemma 4 image token assertion, PaliGemma probe
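Most of the F-G rows above amount to registering a family in one of the detection tiers; the 4-tier fallback itself is a first-hit chain. A sketch with hypothetical table contents (the real tables live in the Swift CapabilityDetector, and these entries are made up for illustration):

```python
# Hypothetical tier tables; real entries live in the Swift CapabilityDetector.
GOLD   = {"llama-3.2-1b": "LlamaModel"}
SILVER = {"qwen3-0.6b": "QwenModel", "cogvlm": "CogVLMModel"}  # stamped, may lack a class
BRONZE = {"gemma-4-e2b": "GemmaModel"}

def detect_family(model_id: str) -> tuple[str, str]:
    """Return (tier, model class) from the first tier that knows the id."""
    for tier, table in (("gold", GOLD), ("silver", SILVER), ("bronze", BRONZE)):
        if model_id in table:
            return tier, table[model_id]
    return "fallback", "GenericTransformer"

assert detect_family("llama-3.2-1b") == ("gold", "LlamaModel")
assert detect_family("totally-unknown") == ("fallback", "GenericTransformer")
```

The "silver row resolves but `Engine.load` falls through" failure mode is exactly a hit in the middle tier whose named class doesn't exist yet.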
Cache correctness tail:
- Engine.LoadOptions ↔ GlobalSettings default drift
- Smelt partial-expert loader in Swift (or honestly strip the toggle)
- DFlash streaming reasoning parser
- DFlash tool-call path
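The LoadOptions ↔ GlobalSettings drift is a layered-defaults problem: construction sites that skip the merge get the stricter built-ins. The intended 4-tier merge (global → session → chat → request) can be sketched as a chain where later tiers override only the keys they explicitly set (names and values below are illustrative, taken from the drift note):

```python
# Illustrative values from the drift note: GlobalSettings configures 0.30/1000,
# while direct LoadOptions() construction silently falls back to 0.10/500.
GLOBAL = {"cacheMemoryPercent": 0.30, "maxCacheBlocks": 1000}

def merge(*tiers: dict) -> dict:
    """Later tiers win, but only for keys they explicitly set (non-None)."""
    out: dict = {}
    for tier in tiers:
        out.update({k: v for k, v in tier.items() if v is not None})
    return out

session = {"maxCacheBlocks": 800}
chat = {}
request = {"cacheMemoryPercent": 0.25}
effective = merge(GLOBAL, session, chat, request)
assert effective == {"cacheMemoryPercent": 0.25, "maxCacheBlocks": 800}
```

Routing every construction site through one merge like this is one way to make the drift structurally impossible.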
Audio:
- Kokoro neural TTS backend (9-step port plan in `vMLXTTS/Kokoro/KokoroBackend.swift`)
- Whisper long-form (sliding window, temperature fallback, word timestamps)
- TTS mp3 / opus / flac transcoding
Build + shipping:
- xcodegen/SwiftPM application-bundle wrapping (see quirk above)
- Automated ship-DMG script for future dev releases
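For orientation, the DFlash items all sit on top of the generic speculative-decoding loop: a cheap drafter proposes a block of tokens, the target model verifies them, and the longest agreeing prefix is accepted. A greedy toy sketch of that control flow (deterministic toy "models"; none of the DDTree beam or coordinator logic is modeled here):

```python
def speculative_step(draft_fn, target_fn, prefix: list[int], block: int):
    """Draft `block` tokens, verify greedily against the target, and return
    (accepted draft tokens, the target's correction/bonus token)."""
    draft, ctx = [], list(prefix)
    for _ in range(block):
        t = draft_fn(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in draft:
        want = target_fn(ctx)
        if want != t:
            return accepted, want       # first disagreement: take the target's token
        accepted.append(t)
        ctx.append(t)
    return accepted, target_fn(ctx)     # whole block accepted, plus one bonus token

# Toy models: the target counts up; the drafter agrees but caps at 3.
target = lambda ctx: ctx[-1] + 1
draft  = lambda ctx: min(ctx[-1] + 1, 3)

acc, fix = speculative_step(draft, target, [0], block=4)
assert acc == [1, 2, 3] and fix == 4   # 3 draft tokens accepted, 1 correction
```

Output always matches what the target alone would have produced; the drafter only changes how many target passes are needed, which is where the speedup comes from.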
Related docs in-tree
- `PROGRESS.md` — per-session changelog, newest at top. Read this first if you want to know what moved recently.
- `NO-REGRESSION-CHECKLIST.md` — release regression matrix.
- `Sources/vMLXLMCommon/FlashMoE/README.md` — Flash MoE phase architecture.
- `SWIFT-PER-FAMILY-MATRIX-2026-04-15.md` (local only, not pushed) — per-family F-G1..F-G23 audit with file:line citations.
Not in this tree
- Production Electron app — `panel/` subtree kept for reference but never built in this pipeline. Real v1.3.x releases come out of the jjang-ai/vmlx `main` branch via electron-builder.
- User documentation — this README is a dev log, not a user manual. User docs live on jjang-ai/mlxstudio when the Swift path is mature enough to document for users.
Legal
© 2026 Jinho Jang. Source licensed as-is, no warranty, dev-preview only. Do not redistribute dev DMGs to end users.