
MLX-VLM

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.


Model-Specific Documentation

Some models have detailed documentation with prompt formats, examples, and best practices:

  • DeepSeek-OCR
  • DeepSeek-OCR-2
  • DOTS-OCR
  • DOTS-MOCR
  • GLM-OCR
  • Phi-4 Reasoning Vision
  • MiniCPM-o
  • Phi-4 Multimodal
  • MolmoPoint
  • Moondream3
  • Gemma 4
  • Falcon-OCR
  • Granite Vision 3.2
  • Granite 4.0 Vision

Installation

The easiest way to get started is to install the mlx-vlm package using pip:

pip install -U mlx-vlm

Usage

Command Line Interface (CLI)

Generate output from a model using the CLI:

# Text generation
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Hello, how are you?"

# Generation with an image input
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image http://images.cocodataset.org/val2017/000000039769.jpg

# Generation with an audio input (New)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you hear" --audio /path/to/audio.wav

# Multi-modal generation (Image + Audio)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you see and hear" --image /path/to/image.jpg --audio /path/to/audio.wav

Thinking Budget

For thinking models (e.g., Qwen3.5), you can limit the number of tokens spent in the thinking block:

mlx_vlm.generate --model mlx-community/Qwen3.5-2B-4bit \
  --thinking-budget 50 \
  --thinking-start-token "<think>" \
  --thinking-end-token "</think>" \
  --enable-thinking \
  --prompt "Solve 2+2"

  • --enable-thinking: Activate thinking mode in the chat template
  • --thinking-budget: Max tokens allowed inside the thinking block
  • --thinking-start-token: Token that opens a thinking block (default: <think>)
  • --thinking-end-token: Token that closes a thinking block (default: </think>)

When the budget is exceeded, the model is forced to emit \n</think> and transition to the answer. If --enable-thinking is passed but the model's chat template does not support it, the budget is applied only if the model generates the start token on its own.
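
The same budget can be enforced by hand from Python with the streaming API. A minimal sketch, assuming mlx_vlm exports stream_generate alongside generate and that each streamed chunk exposes a .text field (both are assumptions; check your installed version):

from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("mlx-community/Qwen3.5-2B-4bit")
formatted_prompt = apply_chat_template(
    processor, model.config, "Solve 2+2", num_images=0
)

budget = 50
thinking_chunks = 0
output = ""
for chunk in stream_generate(model, processor, formatted_prompt, max_tokens=512):
    output += chunk.text
    # Count chunks only while a thinking block is open (~1 token per chunk).
    if "<think>" in output and "</think>" not in output:
        thinking_chunks += 1
        if thinking_chunks > budget:
            output += "\n</think>"  # close the block ourselves, as the CLI does
            break
print(output)

Unlike the CLI, this sketch stops at the budget instead of continuing into the answer; a second stream_generate call on the prompt plus the truncated text would approximate the forced transition.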

Chat UI with Gradio

Launch a chat interface using Gradio:

mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit

Python Script

Here's an example of how to use MLX-VLM in a Python script:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# PIL images also work, e.g. image = [Image.open("path/to/image.jpg")] (requires `from PIL import Image`)
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)

Audio Example

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

# Load model with audio support
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare audio input
audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."

# Apply chat template with audio
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_audios=len(audio)
)

# Generate output with audio
output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)

Multi-Modal Example (Image + Audio)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

# Load multi-modal model
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare inputs
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = ""

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt,
    num_images=len(image),
    num_audios=len(audio)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output)

Server (FastAPI)

Start the server:

mlx_vlm.server --port 8080

# Preload a model at startup (Hugging Face repo or local path)
mlx_vlm.server --model <hf_repo_or_local_path>

# Preload a model with adapter
mlx_vlm.server --model <hf_repo_or_local_path> --adapter-path <adapter_path>

# With trust remote code enabled (required for some models)
mlx_vlm.server --trust-remote-code

Server Options

  • --model: Preload a model at server startup; accepts a Hugging Face repo ID or local path (optional; if omitted, the model loads lazily on the first request)
  • --adapter-path: Path for adapter weights to use with the preloaded model
  • --host: Host address (default: 0.0.0.0)
  • --port: Port number (default: 8080)
  • --trust-remote-code: Trust remote code when loading models from Hugging Face Hub

You can also set trust remote code via environment variable:

MLX_TRUST_REMOTE_CODE=true mlx_vlm.server

The server provides multiple endpoints for different use cases and supports dynamic model loading/unloading with caching (one model at a time).

Available Endpoints

  • /models and /v1/models - List models available locally
  • /chat/completions and /v1/chat/completions - OpenAI-compatible chat-style interaction endpoint with support for images, audio, and text
  • /responses and /v1/responses - OpenAI-compatible responses endpoint
  • /health - Check server status
  • /unload - Unload current model from memory

Usage Examples

List available models
curl "http://localhost:8080/models"
Text Input
curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you"
      }
    ],
    "stream": true,
    "max_tokens": 100
  }'
Image Input
curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-VL-32B-Instruct-8bit",
    "messages":
    [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "This is today's chart for energy demand in California. Can you provide an analysis of the chart and comment on the implications for renewable energy in California?"
          },
          {
            "type": "input_image",
            "image_url": "/path/to/repo/examples/images/renewables_california.png"
          }
        ]
      }
    ],
    "stream": true,
    "max_tokens": 1000
  }'
Audio Support (New)
curl -X POST "http://localhost:8080/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3n-E2B-it-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe what you hear in these audio files" },
          { "type": "input_audio", "input_audio": "/path/to/audio1.wav" },
          { "type": "input_audio", "input_audio": "https://example.com/audio2.mp3" }
        ]
      }
    ],
    "stream": true,
    "max_tokens": 500
  }'
Multi-Modal (Image + Audio)
curl -X POST "http://localhost:8080/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3n-E2B-it-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "input_image", "image_url": "/path/to/image.jpg"},
          {"type": "input_audio", "input_audio": "/path/to/audio.wav"}
        ]
      }
    ],
    "max_tokens": 100
  }'
Responses Endpoint
curl -X POST "http://localhost:8080/responses" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "What is in this image?"},
          {"type": "input_image", "image_url": "/path/to/image.jpg"}
        ]
      }
    ],
    "max_tokens": 100
  }'
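Health Check and Unload

The management endpoints round out the API. The POST method for /unload is an assumption based on the endpoint list above:

# Check server status
curl "http://localhost:8080/health"

# Unload the current model from memory
curl -X POST "http://localhost:8080/unload"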

Request Parameters

  • model: Model identifier (required)
  • messages: Chat messages for chat/OpenAI endpoints
  • max_tokens: Maximum tokens to generate
  • temperature: Sampling temperature
  • top_p: Top-p sampling parameter
  • top_k: Top-k sampling cutoff
  • min_p: Min-p sampling threshold
  • repetition_penalty: Penalty applied to repeated tokens
  • stream: Enable streaming responses
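
For example, several sampling parameters can be combined in a single request (the values below are illustrative):

curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Write a haiku about the sea."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1
  }'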

Activation Quantization (CUDA)

When running on NVIDIA GPUs with MLX CUDA, models quantized with mxfp8 or nvfp4 modes require activation quantization to work properly. This converts QuantizedLinear layers to QQLinear layers, which quantize both weights and activations.

Command Line

Use the -qa or --quantize-activations flag:

mlx_vlm.generate --model /path/to/mxfp8-model --prompt "Describe this image" --image /path/to/image.jpg -qa

Python API

Pass quantize_activations=True to the load function:

from mlx_vlm import load, generate

# Load with activation quantization enabled
model, processor = load(
    "path/to/mxfp8-quantized-model",
    quantize_activations=True
)

# Generate as usual
output = generate(model, processor, "Describe this image", image=["image.jpg"])

Supported Quantization Modes

  • mxfp8 - 8-bit MX floating point
  • nvfp4 - 4-bit NVIDIA floating point

Note: This feature is required for mxfp/nvfp quantized models on CUDA. On Apple Silicon (Metal), these models work without the flag.

Multi-Image Chat Support

MLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.

Usage Examples

Python Script

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = model.config

images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)

output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)

Command Line

mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg

Video Understanding

MLX-VLM also supports video analysis such as captioning, summarization, and more, with select models.

Supported Models

The following models support video chat:

  1. Qwen2-VL
  2. Qwen2.5-VL
  3. Idefics3
  4. LLaVA

With more coming soon.

Usage Examples

Command Line

mlx_vlm.video_generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Describe this video" --video path/to/video.mp4 --max-pixels 224 224 --fps 1.0

These examples demonstrate how to use video inputs with MLX-VLM for captioning, summarization, and other video-understanding tasks.
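
A Python route is also possible by reusing the multi-image pipeline: sample frames from the video and pass them as images. A minimal sketch, assuming OpenCV (cv2) and Pillow are installed; the one-frame-per-second rate mirrors the --fps 1.0 flag above:

import cv2
from PIL import Image
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")

# Sample roughly one frame per second from the video.
cap = cv2.VideoCapture("path/to/video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
frames, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % max(int(fps), 1) == 0:
        # OpenCV decodes to BGR; convert to RGB for PIL.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    idx += 1
cap.release()

formatted_prompt = apply_chat_template(
    processor, model.config, "Describe this video.", num_images=len(frames)
)
output = generate(model, processor, formatted_prompt, frames, verbose=False)
print(output)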

TurboQuant KV Cache

TurboQuant compresses the KV cache during generation, enabling longer context lengths with less memory while maintaining quality.

Quick Start

# 3.5-bit KV cache quantization (3-bit keys + 4-bit values)
mlx_vlm.generate \
  --model mlx-community/Qwen3.5-4B-4bit \
  --kv-bits 3.5 \
  --kv-quant-scheme turboquant \
  --prompt "Your long prompt here..."

Or from Python, with model, processor, and prompt prepared as in the earlier examples:

from mlx_vlm import generate

result = generate(
    model, processor, prompt,
    kv_bits=3.5,
    kv_quant_scheme="turboquant",
    max_tokens=256,
)

How It Works

TurboQuant uses random rotation + codebook quantization (arXiv:2504.19874) to compress KV cache entries from 16-bit to 2-4 bits per dimension:

  • Keys: ProdCodec (MSE codebook + QJL sign residual) for accurate attention scoring
  • Values: MSE codebook for reconstruction quality
  • Fractional bits (e.g. 3.5): uses lower bits for keys, higher for values (3-bit K + 4-bit V)

Custom Metal kernels fuse score computation and value aggregation directly on packed quantized data, avoiding full dequantization during decode.
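
To make the rotation-plus-codebook idea concrete, here is a toy NumPy sketch quantizing a single cache row. It is illustrative only: the library fits MSE-optimal codebooks and runs fused Metal kernels, whereas this uses a uniform codebook in pure Python:

import numpy as np

rng = np.random.default_rng(0)
d = 128                                        # head dimension

# Random orthogonal rotation (QR of a Gaussian matrix) spreads outliers
# evenly across dimensions before quantization.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d).astype(np.float32)  # one KV-cache row
xr = Q @ x                                     # rotate

bits = 3
levels = 2 ** bits
# Toy uniform codebook; TurboQuant fits an MSE-optimal one instead.
codebook = np.linspace(xr.min(), xr.max(), levels)
codes = np.abs(xr[:, None] - codebook[None, :]).argmin(axis=1)

x_hat = Q.T @ codebook[codes]                  # dequantize, then un-rotate
rel_mse = float(((x - x_hat) ** 2).mean() / (x ** 2).mean())
print(f"{bits}-bit relative MSE: {rel_mse:.4f}")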

Performance

Tested on Qwen3.5-4B-4bit at 128k context:

Metric        Baseline   TurboQuant 3.5-bit
KV Memory     4.1 GB     0.97 GB (76% reduction)
Peak Memory   18.3 GB    17.3 GB (-1.0 GB)

At 512k+ contexts, TurboQuant's per-layer attention is faster than FP16 SDPA due to reduced memory bandwidth requirements.

Tested on gemma-4-31b-it at 128k context:

Metric        Baseline   TurboQuant 3.5-bit
KV Memory     13.3 GB    4.9 GB (63% reduction)
Peak Memory   75.2 GB    65.8 GB (-9.4 GB)

Supported Bit Widths

Bits   Compression   Best For
2      ~8x           Maximum compression, some quality loss
3      ~5x           Good balance of quality and compression
3.5    ~4.5x         Recommended default (3-bit keys + 4-bit values)
4      ~4x           Best quality, moderate compression

Compatibility

TurboQuant automatically quantizes KVCache layers (global attention). Models with RotatingKVCache (sliding window) or ArraysCache (MLA/absorbed keys) keep their native cache format for those layers since they are already memory-efficient.

Fine-tuning

MLX-VLM supports fine-tuning models with LoRA and QLoRA.

LoRA & QLoRA

To learn more about LoRA, please refer to the LoRA.md file.
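
For inference with a trained adapter, the Python loader appears to mirror the server's --adapter-path flag. A minimal sketch, where the adapter_path keyword and the adapter directory are assumptions:

from mlx_vlm import load

# Assumption: load() accepts adapter_path, mirroring the server's --adapter-path flag.
model, processor = load(
    "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    adapter_path="path/to/adapters",
)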

Download files

Download the file for your platform.

Source Distribution

mlx_vlm-0.4.3.tar.gz (816.7 kB)

Built Distribution

mlx_vlm-0.4.3-py3-none-any.whl (995.7 kB)

File details

Details for the file mlx_vlm-0.4.3.tar.gz.

File metadata

  • Download URL: mlx_vlm-0.4.3.tar.gz
  • Size: 816.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for mlx_vlm-0.4.3.tar.gz

Algorithm     Hash digest
SHA256        70c020edbc629ec6091f530a93d913b328b2c74f83da661322ba6e1bcbdf5035
MD5           79d2f0cd3d0791f7055a96ee86521849
BLAKE2b-256   ddd2cc80916001d73d7f877360054c2fcb5ca54a07fcfee8a325e6d121b3d98b


File details

Details for the file mlx_vlm-0.4.3-py3-none-any.whl.

File metadata

  • Download URL: mlx_vlm-0.4.3-py3-none-any.whl
  • Size: 995.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for mlx_vlm-0.4.3-py3-none-any.whl

Algorithm     Hash digest
SHA256        225d1b6b5467c15cce92e681f54c3f3c58bbcc180fc0aeefa35386f2db56bb08
MD5           b3f29c929281a31a752558b40b59b21b
BLAKE2b-256   58491815a7a74070e923140a065c81de5737e007f525fcd9f8d922a3584e301e

