MLX-VLM
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.
Table of Contents
- Installation
- Usage
- Activation Quantization (CUDA)
- Multi-Image Chat Support
- Model-Specific Documentation
- Fine-tuning
Model-Specific Documentation
Some models have detailed documentation with prompt formats, examples, and best practices:
| Model | Documentation |
|---|---|
| DeepSeek-OCR | Docs |
| DeepSeek-OCR-2 | Docs |
| DOTS-OCR | Docs |
| GLM-OCR | Docs |
Installation
The easiest way to get started is to install the mlx-vlm package using pip:
pip install -U mlx-vlm
Some models (e.g., Qwen2-VL) require additional dependencies from the torch extra:
pip install -U "mlx-vlm[torch]"
This installs torch, torchvision, and other dependencies needed by certain model processors.
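To see whether the optional dependencies are already present before installing the extra, a short check like the one below may help. It is a minimal sketch that uses only the standard library; the helper name `missing_modules` is hypothetical and not part of mlx-vlm.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# torch and torchvision are the packages the torch extra is said to install.
missing = missing_modules(["torch", "torchvision"])
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("torch and torchvision are available")
```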
Usage
Command Line Interface (CLI)
Generate output from a model using the CLI:
# Text generation
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Hello, how are you?"
# Generation from an image
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg
# Audio generation (New)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you hear" --audio /path/to/audio.wav
# Multi-modal generation (Image + Audio)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you see and hear" --image /path/to/image.jpg --audio /path/to/audio.wav
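When the CLI is driven from a script, it can be convenient to assemble the argument list in Python rather than concatenating strings. The sketch below uses only the flags shown in the commands above; `build_generate_command` is a hypothetical helper, not part of mlx-vlm.

```python
import shlex


def build_generate_command(model, prompt=None, images=(), audio=None,
                           max_tokens=100, temperature=None):
    """Build an argument list for mlx_vlm.generate from the flags shown above."""
    cmd = ["mlx_vlm.generate", "--model", model, "--max-tokens", str(max_tokens)]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    if images:
        # Several image paths follow a single --image flag.
        cmd += ["--image", *images]
    if audio is not None:
        cmd += ["--audio", audio]
    return cmd


cmd = build_generate_command(
    "mlx-community/gemma-3n-E2B-it-4bit",
    prompt="Describe what you see and hear",
    images=["/path/to/image.jpg"],
    audio="/path/to/audio.wav",
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` runs the command without going through a shell, so prompts containing quotes need no extra escaping.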
Chat UI with Gradio
Launch a chat interface using Gradio:
mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
Python Script
Here's an example of how to use MLX-VLM in a Python script:
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# image = [Image.open("...")] can also be used with PIL.Image.Image objects
prompt = "Describe this image."
# Apply chat template
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=len(image)
)
# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
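The image list in the script above may mix URLs and local paths. A pre-flight check along these lines can report missing local files before the model is loaded. This is only a sketch: `split_sources` is a hypothetical helper, and it assumes anything that is not an http(s) URL is meant to be a local file.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_sources(sources):
    """Split inputs into (urls, existing local files, missing local files)."""
    urls, found, missing = [], [], []
    for src in sources:
        if urlparse(src).scheme in ("http", "https"):
            urls.append(src)
        elif Path(src).is_file():
            found.append(src)
        else:
            missing.append(src)
    return urls, found, missing


urls, found, missing = split_sources([
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "/path/to/image.jpg",
])
if missing:
    print("Not found on disk:", missing)
```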
Audio Example
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load model with audio support
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config
# Prepare audio input
audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."
# Apply chat template with audio
formatted_prompt = apply_chat_template(
processor, config, prompt, num_audios=len(audio)
)
# Generate output with audio
output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)
Multi-Modal Example (Image + Audio)
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load multi-modal model
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config
# Prepare inputs
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = "Describe what you see and hear."
# Apply chat template
formatted_prompt = apply_chat_template(
processor, config, prompt,
num_images=len(image),
num_audios=len(audio)
)
# Generate output
output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output)
Server (FastAPI)
Start the server:
mlx_vlm.server --port 8080
# With trust remote code enabled (required for some models)
mlx_vlm.server --trust-remote-code
Server Options
- --host: Host address (default: 0.0.0.0)
- --port: Port number (default: 8080)
- --trust-remote-code: Trust remote code when loading models from Hugging Face Hub
You can also set trust remote code via environment variable:
MLX_TRUST_REMOTE_CODE=true mlx_vlm.server
The server provides multiple endpoints for different use cases and supports dynamic model loading/unloading with caching (one model at a time).
Available Endpoints
- /models - List models available locally
- /chat/completions - OpenAI-compatible chat-style interaction endpoint with support for images, audio, and text
- /responses - OpenAI-compatible responses endpoint
- /health - Check server status
- /unload - Unload current model from memory
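The read-only endpoints can also be queried from Python with the standard library. The sketch below assumes the server runs on localhost at the default port and that /health and /models return JSON bodies; `endpoint_url` and `get_json` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def endpoint_url(path, host="localhost", port=8080):
    """Build the URL for a server endpoint such as /health or /models."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def get_json(path, host="localhost", port=8080, timeout=5):
    """GET an endpoint and decode the body as JSON (assumes a JSON response)."""
    with urlopen(endpoint_url(path, host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(endpoint_url("/health"))
# With a server running:
#   print(get_json("/health"))
#   print(get_json("/models"))
```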
Usage Examples
List available models
curl "http://localhost:8080/models"
Text Input
curl -X POST "http://localhost:8080/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
"messages": [
{
"role": "user",
"content": "Hello, how are you"
}
],
"stream": true,
"max_tokens": 100
}'
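With "stream": true the response arrives incrementally. The sketch below assumes the stream follows the OpenAI server-sent-events convention (lines of the form `data: {...}` carrying `choices[0].delta.content`, ending with `data: [DONE]`); that format is an assumption based on the endpoint being described as OpenAI-compatible, so check it against actual server output. `iter_stream_text` is a hypothetical helper.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample lines in the assumed format, not captured server output.
sample = [
    b'data: {"choices": [{"delta": {"content": "Hello"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": ", world"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_stream_text(sample)))
```

An HTTP response object from `urllib.request.urlopen` can be iterated line by line, so it could be passed directly as `lines`.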
Image Input
curl -X POST "http://localhost:8080/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2.5-VL-32B-Instruct-8bit",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "This chart shows energy demand in California today. Can you provide an analysis of the chart and comment on the implications for renewable energy in California?"
},
{
"type": "input_image",
"image_url": "/path/to/repo/examples/images/renewables_california.png"
}
]
}
],
"stream": true,
"max_tokens": 1000
}'
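Request bodies with nested content parts are easy to get wrong by hand, and apostrophes in the prompt clash with the single-quoted shell string passed to curl. Building the body in Python avoids both. The sketch below mirrors the content-part shapes used in the curl examples ("text", "input_image", "input_audio"); `build_chat_payload` is a hypothetical helper.

```python
import json


def build_chat_payload(model, text, images=(), audio=(), system=None, **options):
    """Build a /chat/completions request body with text, image, and audio parts."""
    content = [{"type": "text", "text": text}]
    content += [{"type": "input_image", "image_url": url} for url in images]
    content += [{"type": "input_audio", "input_audio": src} for src in audio]
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": content})
    # Extra keyword arguments (max_tokens, stream, ...) go in at the top level.
    return {"model": model, "messages": messages, **options}


payload = build_chat_payload(
    "mlx-community/Qwen2.5-VL-32B-Instruct-8bit",
    "This is today's chart for energy demand in California. Can you provide an analysis?",
    images=["/path/to/repo/examples/images/renewables_california.png"],
    system="You are a helpful assistant.",
    stream=True,
    max_tokens=1000,
)
body = json.dumps(payload).encode("utf-8")
print(json.dumps(payload, indent=2))
```

The encoded `body` can be sent with `urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})`.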
Audio Support (New)
curl -X POST "http://localhost:8080/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/gemma-3n-E2B-it-4bit",
"messages": [
{
"role": "user",
"content": [
{ "type": "text", "text": "Describe what you hear in these audio files" },
{ "type": "input_audio", "input_audio": "/path/to/audio1.wav" },
{ "type": "input_audio", "input_audio": "https://example.com/audio2.mp3" }
]
}
],
"stream": true,
"max_tokens": 500
}'
Multi-Modal (Image + Audio)
curl -X POST "http://localhost:8080/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/gemma-3n-E2B-it-4bit",
"messages": [
{
"role": "user",
"content": [
{"type": "input_image", "image_url": "/path/to/image.jpg"},
{"type": "input_audio", "input_audio": "/path/to/audio.wav"}
]
}
],
"max_tokens": 100
}'
Responses Endpoint
curl -X POST "http://localhost:8080/responses" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
"messages": [
{
"role": "user",
"content": [
{"type": "input_text", "text": "What is in this image?"},
{"type": "input_image", "image_url": "/path/to/image.jpg"}
]
}
],
"max_tokens": 100
}'
Request Parameters
- model: Model identifier (required)
- messages: Chat messages for chat/OpenAI endpoints
- max_tokens: Maximum tokens to generate
- temperature: Sampling temperature
- top_p: Top-p sampling parameter
- stream: Enable streaming responses
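Only model is marked as required, so the remaining parameters can be omitted from a request. The sketch below collects the optional ones and drops any that are unset. The range checks reflect conventional sampling bounds and are an assumption, not documented server behavior; `sampling_options` is a hypothetical helper.

```python
def sampling_options(max_tokens=None, temperature=None, top_p=None, stream=None):
    """Return only the optional request parameters that were actually set."""
    # Conventional bounds, assumed here rather than taken from the server.
    if max_tokens is not None and max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if temperature is not None and temperature < 0:
        raise ValueError("temperature must be non-negative")
    if top_p is not None and not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    options = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stream": stream,
    }
    return {key: value for key, value in options.items() if value is not None}


request = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello, how are you"}],
    **sampling_options(max_tokens=100, stream=True),
}
print(request)
```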
Activation Quantization (CUDA)
When running on NVIDIA GPUs with MLX CUDA, models quantized with mxfp8 or nvfp4 modes require activation quantization to work properly. This converts QuantizedLinear layers to QQLinear layers which quantize both weights and activations.
Command Line
Use the -qa or --quantize-activations flag:
mlx_vlm.generate --model /path/to/mxfp8-model --prompt "Describe this image" --image /path/to/image.jpg -qa
Python API
Pass quantize_activations=True to the load function:
from mlx_vlm import load, generate
# Load with activation quantization enabled
model, processor = load(
"path/to/mxfp8-quantized-model",
quantize_activations=True
)
# Generate as usual
output = generate(model, processor, "Describe this image", image=["image.jpg"])
Supported Quantization Modes
- mxfp8 - 8-bit MX floating point
- nvfp4 - 4-bit NVIDIA floating point
Note: This feature is required for mxfp8- and nvfp4-quantized models on CUDA. On Apple Silicon (Metal), these models work without the flag.
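The rule in the note can be written as a small predicate for scripts that run on both backends. This is a sketch: the backend labels "cuda" and "metal" and both helper names are hypothetical, and how the backend is detected is left to the caller.

```python
# Quantization modes listed above as requiring activation quantization on CUDA.
ACTIVATION_QUANT_MODES = {"mxfp8", "nvfp4"}


def needs_activation_quantization(mode, backend):
    """True when a model quantized with `mode` runs on the CUDA backend."""
    return backend.lower() == "cuda" and mode.lower() in ACTIVATION_QUANT_MODES


def load_kwargs(mode, backend):
    """Extra keyword arguments to pass to mlx_vlm.load for this combination."""
    if needs_activation_quantization(mode, backend):
        return {"quantize_activations": True}
    return {}


print(load_kwargs("mxfp8", "cuda"))
print(load_kwargs("mxfp8", "metal"))
```

The result could then be splatted into the call, for example `load(model_path, **load_kwargs("mxfp8", "cuda"))`.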
Multi-Image Chat Support
MLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.
Usage Examples
Python Script
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = model.config
images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=len(images)
)
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
Command Line
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg
Video Understanding
MLX-VLM also supports video analysis such as captioning, summarization, and more, with select models.
Supported Models
The following models support video chat:
- Qwen2-VL
- Qwen2.5-VL
- Idefics3
- LLaVA
More models are coming soon.
Usage Examples
Command Line
mlx_vlm.video_generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Describe this video" --video path/to/video.mp4 --max-pixels 224 224 --fps 1.0
These examples demonstrate how to use multiple images and video with MLX-VLM for more complex visual reasoning tasks.
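For the video command, the --fps value largely determines how much visual input the model receives. The sketch below assumes --fps is the rate at which frames are sampled from the clip; that reading is an assumption, and the actual sampler may cap or round the count differently. `estimated_frame_count` is a hypothetical helper.

```python
import math


def estimated_frame_count(duration_seconds, fps=1.0):
    """Rough number of frames sampled from a clip at the given rate (assumed)."""
    if duration_seconds < 0 or fps <= 0:
        raise ValueError("duration must be non-negative and fps positive")
    # At least one frame is assumed to be taken even from very short clips.
    return max(1, math.floor(duration_seconds * fps))


for fps in (0.5, 1.0, 2.0):
    print(f"90 s clip at --fps {fps}: ~{estimated_frame_count(90, fps)} frames")
```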
Fine-tuning
MLX-VLM supports fine-tuning models with LoRA and QLoRA.
LoRA & QLoRA
To learn more about LoRA, please refer to the LoRA.md file.