fork of mlx-vlm for fount

MLX-VLM

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.

Model-Specific Documentation

Some models have detailed documentation with prompt formats, examples, and best practices:

Model            Documentation
DeepSeek-OCR     Docs
DeepSeek-OCR-2   Docs
GLM-OCR          Docs

Installation

The easiest way to get started is to install the mlx-vlm package using pip:

pip install -U mlx-vlm

Usage

Command Line Interface (CLI)

Generate output from a model using the CLI:

# Text generation
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Hello, how are you?"

# Generation with an image input
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg

# Generation with an audio input (New)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you hear" --audio /path/to/audio.wav

# Multi-modal generation (Image + Audio)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you see and hear" --image /path/to/image.jpg --audio /path/to/audio.wav
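
When the CLI is driven from another program, it can help to assemble the argument list programmatically rather than by string concatenation. The sketch below uses only the flags that appear in this README (--model, --max-tokens, --temperature, --prompt, --image, --audio). The helper name build_generate_command is hypothetical and is not part of mlx-vlm.

```python
import shlex
from typing import List, Optional, Sequence


def build_generate_command(
    model: str,
    prompt: Optional[str] = None,
    images: Sequence[str] = (),
    audio: Optional[str] = None,
    max_tokens: int = 100,
    temperature: Optional[float] = None,
) -> List[str]:
    """Assemble an argv list for mlx_vlm.generate (hypothetical helper)."""
    cmd = ["mlx_vlm.generate", "--model", model, "--max-tokens", str(max_tokens)]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    if images:
        # Several image paths follow a single --image flag.
        cmd += ["--image", *images]
    if audio is not None:
        cmd += ["--audio", audio]
    return cmd


cmd = build_generate_command(
    "mlx-community/gemma-3n-E2B-it-4bit",
    prompt="Describe what you see and hear",
    images=["/path/to/image.jpg"],
    audio="/path/to/audio.wav",
)
# shlex.join quotes arguments that contain spaces, so the line is safe to paste.
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting problems with prompts that contain spaces or quotes.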

Chat UI with Gradio

Launch a chat interface using Gradio:

mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit

Python Script

Here's an example of how to use MLX-VLM in a Python script:

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# image = [Image.open("...")] can also be used with PIL.Image.Image objects
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)

Audio Example

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

# Load model with audio support
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare audio input
audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."

# Apply chat template with audio
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_audios=len(audio)
)

# Generate output with audio
output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)

Multi-Modal Example (Image + Audio)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

# Load multi-modal model
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare inputs
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = ""

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt,
    num_images=len(image),
    num_audios=len(audio)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output)

Server (FastAPI)

Start the server:

mlx_vlm.server --port 8080

# With trust remote code enabled (required for some models)
mlx_vlm.server --trust-remote-code

Server Options

  • --host: Host address (default: 0.0.0.0)
  • --port: Port number (default: 8080)
  • --trust-remote-code: Trust remote code when loading models from Hugging Face Hub

You can also set trust remote code via environment variable:

MLX_TRUST_REMOTE_CODE=true mlx_vlm.server

The server provides multiple endpoints for different use cases and supports dynamic model loading/unloading with caching (one model at a time).
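
The "one model at a time" caching behaviour can be pictured as a single-slot cache: a request for the model that is already loaded reuses it, while a request for a different model evicts the current one first. The sketch below is illustrative only and is not the server's actual implementation. SingleModelCache and fake_loader are hypothetical names.

```python
from typing import Any, Callable, List, Optional


class SingleModelCache:
    """Keep at most one loaded model in memory (illustrative sketch)."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._name: Optional[str] = None
        self._model: Any = None

    def get(self, name: str) -> Any:
        # Cache hit: the requested model is already loaded.
        if name == self._name:
            return self._model
        # Cache miss: drop the current model before loading the new one.
        self.unload()
        self._model = self._loader(name)
        self._name = name
        return self._model

    def unload(self) -> None:
        # Forget the current model so its memory can be reclaimed.
        self._name = None
        self._model = None


load_calls: List[str] = []


def fake_loader(name: str) -> dict:
    # Stand-in for a real model loader; records each load for demonstration.
    load_calls.append(name)
    return {"name": name}


cache = SingleModelCache(fake_loader)
cache.get("model-a")
cache.get("model-a")  # reused, no second load
cache.get("model-b")  # evicts model-a, then loads model-b
print(load_calls)  # ['model-a', 'model-b']
```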

Available Endpoints

  • /models - List models available locally
  • /chat/completions - OpenAI-compatible chat-style interaction endpoint with support for images, audio, and text
  • /responses - OpenAI-compatible responses endpoint
  • /health - Check server status
  • /unload - Unload current model from memory
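
The endpoints above can also be addressed from Python using only the standard library. The sketch below builds the requests without sending them. BASE_URL assumes the server was started with --port 8080 on the local machine, and endpoint_request is a hypothetical helper, not part of mlx-vlm.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8080"  # assumption: local server on port 8080


def endpoint_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = BASE_URL.rstrip("/") + "/" + path.lstrip("/")
    if payload is None:
        return urllib.request.Request(url, method="GET")
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


health = endpoint_request("/health")
models = endpoint_request("/models")
print(health.get_method(), health.full_url)
print(models.get_method(), models.full_url)
```

To actually send a request, pass it to urllib.request.urlopen while the server is running. The response schema is not described in this README, so inspect the returned JSON before relying on specific fields.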

Usage Examples

List available models
curl "http://localhost:8080/models"
Text Input
curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you"
      }
    ],
    "stream": true,
    "max_tokens": 100
  }'
Image Input
curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-VL-32B-Instruct-8bit",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "This chart shows energy demand in California today. Can you provide an analysis of the chart and comment on the implications for renewable energy in California?"
          },
          {
            "type": "input_image",
            "image_url": "/path/to/repo/examples/images/renewables_california.png"
          }
        ]
      }
    ],
    "stream": true,
    "max_tokens": 1000
  }'
Audio Support (New)
curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3n-E2B-it-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe what you hear in these audio files" },
          {"type": "input_audio", "input_audio": "/path/to/audio1.wav"},
          {"type": "input_audio", "input_audio": "https://example.com/audio2.mp3"}
        ]
      }
    ],
    "stream": true,
    "max_tokens": 500
  }'
Multi-Modal (Image + Audio)
curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3n-E2B-it-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "input_image", "image_url": "/path/to/image.jpg"},
          {"type": "input_audio", "input_audio": "/path/to/audio.wav"}
        ]
      }
    ],
    "max_tokens": 100
  }'
Responses Endpoint
curl -X POST "http://localhost:8080/responses" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "What is in this image?"},
          {"type": "input_image", "image_url": "/path/to/image.jpg"}
        ]
      }
    ],
    "max_tokens": 100
  }'
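
The JSON bodies in the curl examples above can also be assembled in Python, where json.dumps guarantees well-formed output (no stray commas or unbalanced quotes). This is a minimal sketch for the /chat/completions body that reuses the content-part shapes shown above ("text", "input_image", "input_audio"). The helper name build_chat_payload is hypothetical.

```python
import json
from typing import Any, Dict, List, Sequence


def build_chat_payload(
    model: str,
    text: str,
    images: Sequence[str] = (),
    audio: Sequence[str] = (),
    **options: Any,
) -> Dict[str, Any]:
    """Build a chat request body with optional image and audio parts."""
    content: List[Dict[str, str]] = [{"type": "text", "text": text}]
    content += [{"type": "input_image", "image_url": path} for path in images]
    content += [{"type": "input_audio", "input_audio": path} for path in audio]
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    # Optional fields such as max_tokens, temperature, top_p, or stream.
    # Options left as None are omitted from the body.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


payload = build_chat_payload(
    "mlx-community/gemma-3n-E2B-it-4bit",
    "Describe what you see and hear",
    images=["/path/to/image.jpg"],
    audio=["/path/to/audio.wav"],
    max_tokens=100,
)
print(json.dumps(payload, indent=2))
```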

Request Parameters

  • model: Model identifier (required)
  • messages: Chat messages for chat/OpenAI endpoints
  • max_tokens: Maximum tokens to generate
  • temperature: Sampling temperature
  • top_p: Top-p sampling parameter
  • stream: Enable streaming responses
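
When stream is enabled, the client has to reassemble the reply from incremental chunks. The sketch below assumes the server follows the usual OpenAI-style streaming convention: server-sent-event lines of the form "data: {json}", text carried in choices[0].delta.content, and a final "data: [DONE]" line. This README does not confirm that format, so treat it as an assumption and check the actual output of your server. iter_stream_text is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker in the assumed format
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines in the assumed format, not captured server output.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # Hello
```

With a live server, the same function could consume the decoded lines of an HTTP response body instead of the sample list.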

Multi-Image Chat Support

MLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.

Usage Examples

Python Script

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = model.config

images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)

output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)

Command Line

mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg

Video Understanding

MLX-VLM also supports video analysis such as captioning, summarization, and more, with select models.

Supported Models

The following models support video chat:

  1. Qwen2-VL
  2. Qwen2.5-VL
  3. Idefics3
  4. LLaVA

Support for more models is coming soon.

Usage Examples

Command Line

mlx_vlm.video_generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Describe this video" --video path/to/video.mp4 --max-pixels 224 224 --fps 1.0

This example demonstrates how to pass a video to MLX-VLM for tasks such as captioning and summarization.

Fine-tuning

MLX-VLM supports fine-tuning models with LoRA and QLoRA.

LoRA & QLoRA

To learn more about LoRA, please refer to the LoRA.md file.
