Local AI inference for Apple Silicon — Text, Image, Video & Audio generation on Mac
Local AI Engine for Apple Silicon
Run LLMs, VLMs, and image generation models entirely on your Mac.
OpenAI + Anthropic compatible API. No cloud. No API keys. No data leaving your machine.
Quickstart • Models • Features • Image Gen • API • Desktop App • JANG • CLI • Config • Contributing
Screenshots: chat with any MLX model (thinking mode, streaming, and syntax highlighting) • agentic chat with full coding capabilities (tool use and structured output).
Quickstart
Install from PyPI
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
Your local AI server is now running at http://localhost:8000 with an OpenAI-compatible API. It works with any model from mlx-community -- thousands of models, ready to go.
Or download the desktop app
Get MLX Studio -- a native macOS app with chat UI, model management, image generation, and developer tools. No terminal required.
Use with OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="local",
messages=[{"role": "user", "content": "Hello!"}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or "", end="", flush=True)
Use with Anthropic SDK
import anthropic
client = anthropic.Anthropic(base_url="http://localhost:8000", api_key="not-needed")  # the SDK appends /v1/messages itself
message = client.messages.create(
model="local",
max_tokens=1024,
messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
Use with curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Model Support
vMLX runs any MLX model. Point it at a HuggingFace repo or local path and go.
| Type | Models |
|---|---|
| Text LLMs | Qwen 2/2.5/3/3.5, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral, Gemma 3, Phi-4, DeepSeek, GLM-4, MiniMax, Nemotron, StepFun, and any mlx-lm model |
| Vision LLMs | Qwen-VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n |
| MoE Models | Qwen 3.5 MoE (A3B/A10B), Mixtral, DeepSeek V2/V3, MiniMax M2.5, Llama 4 |
| Hybrid SSM | Nemotron-H, Jamba, GatedDeltaNet (Mamba + Attention) |
| Image Gen | Flux Schnell/Dev, Z-Image Turbo, Flux Klein (via mflux) |
| Embeddings | Any mlx-lm compatible embedding model |
| Reranking | Cross-encoder reranking models |
| Audio | Kokoro TTS, Whisper STT (via mlx-audio) |
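To see which models a running server has loaded, query the /v1/models endpoint (listed under API Reference below). A minimal sketch using the OpenAI SDK, assuming the standard OpenAI model-list response shape:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# /v1/models is OpenAI-compatible, so the stock list call works as-is.
for model in client.models.list():
    print(model.id)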
Features
Inference Engine
| Feature | Description |
|---|---|
| Continuous Batching | Handle multiple concurrent requests efficiently |
| Prefix Cache | Reuse KV states for repeated prompts -- makes follow-up messages instant |
| Paged KV Cache | Block-based caching with content-addressable deduplication |
| KV Cache Quantization | Compress cached states to q4/q8 for 2-4x memory savings |
| Disk Cache (L2) | Persist prompt caches to SSD -- survives server restarts |
| Block Disk Cache | Per-block persistent cache paired with paged KV cache |
| Speculative Decoding | Small draft model proposes tokens for 20-90% speedup |
| JIT Compilation | mx.compile Metal kernel fusion (experimental) |
| Hybrid SSM Support | Mamba/GatedDeltaNet layers handled correctly alongside attention |
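As one concrete example, continuous batching interleaves concurrent requests on a single loaded model instead of queueing them. A quick sketch to exercise it, assuming the server was started with --continuous-batching:

from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Fired together, these requests are batched on the GPU rather than serialized.
prompts = [f"One-sentence fun fact about the number {i}." for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)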
5-Layer Cache Architecture
Request -> Tokens
|
L1: Memory-Aware Prefix Cache (or Paged Cache)
| miss
L2: Disk Cache (or Block Disk Store)
| miss
Inference -> float16 KV states
|
KV Quantization -> q4/q8 for storage
|
Store back into L1 + L2
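Cache behavior is observable at runtime via the /v1/cache/stats endpoint (see API Reference). The field names in the payload are server-defined, so this sketch just pretty-prints whatever comes back:

import json

import requests

# Dump hit/miss counters and sizes; the exact keys are server-defined.
stats = requests.get("http://localhost:8000/v1/cache/stats").json()
print(json.dumps(stats, indent=2))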
Tool Calling
Auto-detected parsers for every major model family:
qwen - llama - mistral - hermes - deepseek - glm47 - minimax - nemotron - granite - functionary - xlam - kimi - step3p5
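Because parsing is auto-detected, tool calls surface in the standard OpenAI response shape regardless of model family. A minimal sketch mirroring the curl example under API Reference:

import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City name"}},
            "required": ["location"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "What is the weather in Tokyo?"}],
    tools=tools,
)
# The parser normalizes each model family's tool syntax into this shape.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))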
Reasoning / Thinking Mode
Auto-detected reasoning parsers that extract <think> blocks:
qwen3 (Qwen3, QwQ, MiniMax, StepFun) - deepseek_r1 (DeepSeek R1, Gemma 3, GLM, Phi-4) - openai_gptoss (GLM Flash, GPT-OSS)
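The enable_thinking request field (shown in the curl examples below) is a vMLX extension to the Chat Completions API; with the OpenAI SDK it can be passed through extra_body. How the extracted reasoning is surfaced in the response depends on the parser, so this sketch only streams the visible content:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Solve: what is 23 * 47?"}],
    stream=True,
    extra_body={"enable_thinking": True},  # vMLX extension field
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)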
Audio
| Feature | Description |
|---|---|
| Text-to-Speech | Kokoro TTS via mlx-audio -- multiple voices, streaming output |
| Speech-to-Text | Whisper STT via mlx-audio -- transcription and translation |
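A Python counterpart to the TTS curl example under API Reference, assuming WAV output as in that example:

import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"model": "kokoro", "input": "Hello, welcome to vMLX!", "voice": "af_heart"},
)
resp.raise_for_status()
with open("speech.wav", "wb") as f:  # assumes WAV output, as in the curl example
    f.write(resp.content)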
Image Generation
Generate images locally with Flux models via mflux.
pip install vmlx[image]
vmlx serve ~/.mlxstudio/models/image/flux1-schnell-4bit
API
# curl
curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "schnell",
"prompt": "A cat astronaut floating in space with Earth in the background",
"size": "1024x1024",
"n": 1
}'
# Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.images.generate(
model="schnell",
prompt="A cat astronaut floating in space with Earth in the background",
size="1024x1024",
n=1,
)
Supported Models
| Model | Steps | Speed | Quality |
|---|---|---|---|
| Flux Schnell | 4 | Fastest | Good |
| Flux Dev | 20 | Slow | Best |
| Z-Image Turbo | 4 | Fast | Sharp |
| Flux Klein 4B | 20 | Medium | Compact |
| Flux Klein 9B | 20 | Medium | Balanced |
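The response follows the OpenAI images schema, so the generated image arrives as base64 data or a URL; the sketch below assumes the base64 (b64_json) form:

import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.images.generate(
    model="schnell",
    prompt="A mountain landscape at sunset",
    size="1024x1024",
)
# Assumes base64 data in the OpenAI b64_json field; adjust if the server
# returns a URL instead.
with open("out.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))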
API Reference
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | OpenAI Chat Completions API (streaming + non-streaming) |
| POST | /v1/messages | Anthropic Messages API |
| POST | /v1/responses | OpenAI Responses API |
| POST | /v1/completions | Text completions |
| POST | /v1/images/generations | Image generation |
| POST | /v1/embeddings | Text embeddings |
| POST | /v1/rerank | Document reranking |
| POST | /v1/audio/transcriptions | Speech-to-text (Whisper) |
| POST | /v1/audio/speech | Text-to-speech (Kokoro) |
| GET | /v1/models | List loaded models |
| GET | /v1/cache/stats | Cache statistics |
| GET | /health | Server health check |
curl Examples
Chat completion (streaming)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Explain quantum computing in 3 sentences."}],
"stream": true,
"temperature": 0.7
}'
Chat completion with thinking mode
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Solve: what is 23 * 47?"}],
"enable_thinking": true,
"stream": true
}'
Tool calling
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}]
}'
Anthropic Messages API
curl http://localhost:8000/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: not-needed" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "local",
"max_tokens": 1024,
"messages": [{"role": "user", "content": "Hello!"}]
}'
Embeddings
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"input": "The quick brown fox jumps over the lazy dog"
}'
Text-to-speech
curl http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro",
"input": "Hello, welcome to vMLX!",
"voice": "af_heart"
}' --output speech.wav
Speech-to-text
curl http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.wav \
-F model=whisper
Image generation
curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "schnell",
"prompt": "A mountain landscape at sunset",
"size": "1024x1024"
}'
Reranking
curl http://localhost:8000/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"query": "What is machine learning?",
"documents": [
"ML is a subset of AI",
"The weather is sunny today",
"Neural networks learn from data"
]
}'
Cache stats
curl http://localhost:8000/v1/cache/stats
Health check
curl http://localhost:8000/health
Desktop App
vMLX includes a native macOS desktop app (MLX Studio) with 5 modes:
| Mode | Description |
|---|---|
| Chat | Conversation interface with chat history, thinking mode, tool calling, agentic coding |
| Server | Manage model sessions -- start, stop, configure, monitor |
| Image | Text-to-image generation with Flux models |
| Tools | Model converter, GGUF-to-MLX, inspector, diagnostics |
| API | Live endpoint reference with copy-pasteable code snippets |
Screenshots: image generation with Flux model selection • developer tools for model conversion and diagnostics • Anthropic Messages API endpoint (full compatibility) • GGUF-to-MLX conversion (bring your own models).
Download
Get the latest DMG from MLX Studio Releases, or build from source:
git clone https://github.com/jjang-ai/vmlx.git
cd vmlx/panel
npm install && npm run build
npx electron-builder --mac dmg
Menu Bar
vMLX lives in your menu bar showing all running models, GPU memory usage, and quick controls.
Advanced Quantization
vMLX supports standard MLX quantization (4-bit, 8-bit uniform) out of the box. For users who want to push further, JANG adaptive mixed-precision assigns different bit widths to different layer types -- attention gets more bits, MLP layers get fewer -- achieving better quality at the same model size.
JANG Profiles
| Profile | Attention | Embeddings | MLP | Avg Bits | Use Case |
|---|---|---|---|---|---|
| JANG_2M | 8-bit | 4-bit | 2-bit | ~2.5 | Balanced compression |
| JANG_2L | 8-bit | 6-bit | 2-bit | ~2.7 | Quality 2-bit |
| JANG_3M | 8-bit | 3-bit | 3-bit | ~3.2 | Recommended |
| JANG_4M | 8-bit | 4-bit | 4-bit | ~4.2 | Standard quality |
| JANG_6M | 8-bit | 6-bit | 6-bit | ~6.2 | Near lossless |
Convert
pip install vmlx[jang]
# Standard MLX quantization
vmlx convert my-model --bits 4
# JANG adaptive quantization
vmlx convert my-model --jang-profile JANG_3M
# Activation-aware calibration (better at 2-3 bit)
vmlx convert my-model --jang-profile JANG_2L --calibration-method activations
# Serve the converted model
vmlx serve ./my-model-JANG_3M --continuous-batching --use-paged-cache
Pre-quantized JANG models are available at JANGQ-AI on HuggingFace.
CLI Commands
vmlx serve <model> # Start inference server
vmlx convert <model> --bits 4 # MLX uniform quantization
vmlx convert <model> -j JANG_3M # JANG adaptive quantization
vmlx info <model> # Model metadata and config
vmlx doctor <model> # Run diagnostics
vmlx bench <model> # Performance benchmarks
Configuration
Server Options
vmlx serve <model> \
--host 0.0.0.0 \ # Bind address (default: 0.0.0.0)
--port 8000 \ # Port (default: 8000)
--api-key sk-your-key \ # Optional API key authentication
--continuous-batching \ # Enable concurrent request handling
--enable-prefix-cache \ # Reuse KV states for repeated prompts
--use-paged-cache \ # Block-based KV cache with dedup
--kv-cache-quantization q8 \ # Quantize cache: q4 or q8
--enable-disk-cache \ # Persist cache to SSD
--enable-jit \ # JIT Metal kernel compilation
--tool-call-parser auto \ # Auto-detect tool call format
--reasoning-parser auto \ # Auto-detect thinking format
--log-level INFO \ # Logging: DEBUG, INFO, WARNING, ERROR
--max-model-len 8192 \ # Max context length
--speculative-model <model> \ # Draft model for speculative decoding
--cors-origins "*" # CORS allowed origins
Quantization Options
vmlx convert <model> \
--bits 4 \ # Uniform quantization bits: 2, 3, 4, 6, 8
--group-size 64 \ # Quantization group size (default: 64)
--output ./output-dir \ # Output directory
--jang-profile JANG_3M \ # JANG mixed-precision profile
--calibration-method activations # Activation-aware calibration
Image Generation Options
pip install vmlx[image]
vmlx serve <flux-model> \
--port 8001 \ # Run on separate port from text model
--host 0.0.0.0
Audio Options
TTS and STT require the mlx-audio package:
pip install mlx-audio
# TTS: serve Kokoro model
vmlx serve kokoro --port 8002
# STT: serve Whisper model
vmlx serve whisper --port 8003
Optional Dependencies
pip install vmlx # Core: text LLMs, VLMs, embeddings, reranking
pip install vmlx[image] # + Image generation (mflux)
pip install vmlx[jang] # + JANG quantization tools
pip install vmlx[dev] # + Development/testing tools
pip install vmlx[image,jang] # Multiple extras
Architecture
+--------------------------------------------+
| Desktop App (Electron) |
| Chat | Server | Image | Tools | API |
+--------------------------------------------+
| Session Manager (TypeScript) |
| Process spawn | Health monitor | Tray |
+--------------------------------------------+
| vMLX Engine (Python / FastAPI) |
| +--------+ +---------+ +-----------+ |
| |Simple | | Batched | | ImageGen | |
| |Engine | | Engine | | Engine | |
| +---+----+ +----+----+ +-----+-----+ |
| | | | |
| +---+------------+--+ +-----+-----+ |
| | mlx-lm / mlx-vlm | | mflux | |
| +--------+-----------+ +-----------+ |
| | |
| +--------+----------------------------+ |
| | MLX Metal GPU Backend | |
| | quantized_matmul | KV cache | SDPA | |
| +--------------------------------------+ |
+--------------------------------------------+
| L1: Prefix Cache (Memory-Aware / Paged) |
| L2: Disk Cache (Persistent / Block Store) |
| KV Quant: q4/q8 at storage boundary |
+--------------------------------------------+
Contributing
Contributions are welcome. Here is how to set up a development environment:
git clone https://github.com/jjang-ai/vmlx.git
cd vmlx
# Python engine
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,jang,image]"
pytest tests/ -k "not Async" # 1894+ tests
# Electron desktop app
cd panel && npm install
npm run dev # Development mode with hot reload
npx vitest run # 1253+ tests
Project Structure
vmlx/
vmlx_engine/ # Python inference engine (FastAPI server)
panel/ # Electron desktop app (React + TypeScript)
src/main/ # Electron main process
src/renderer/ # React frontend
src/preload/ # IPC bridge
tests/ # Python test suite
assets/ # Screenshots and logos
Guidelines
- Run the full test suite before submitting PRs
- Follow existing code style and patterns
- Include tests for new features
- Update documentation for user-facing changes
License
Apache License 2.0 -- see LICENSE.
Built by Jinho Jang (eric@jangq.ai)
JANGQ AI • PyPI • GitHub • Downloads