
Molmo Utils - PyTorch


molmo-utils

molmo-utils provides helper functions for processing and integrating visual inputs with Molmo, Ai2's family of state-of-the-art open multimodal language models.

Installation

pip install molmo-utils                  # basic usage
pip install "molmo-utils[torchcodec]"    # recommended for video inputs (quotes avoid shell globbing of the brackets)
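The `torchcodec` extra is only needed for decoding video inputs. A quick way to check whether it is present in the current environment, using only the standard library (an illustrative sketch, not part of molmo-utils):

```python
import importlib.util

# True if the optional torchcodec extra is installed in this environment.
has_torchcodec = importlib.util.find_spec("torchcodec") is not None
print("torchcodec available:", has_torchcodec)
```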

Usage

Molmo2

from transformers import AutoProcessor, AutoModelForImageTextToText
from molmo_utils import process_vision_info

model_path = "allenai/Molmo2-8B"

model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

# You can directly use a local file path, a URL, or a base64-encoded image.
# The processed visual tokens will always be inserted at the beginning of the input sequence.

messages = [
    # Image
    ## Local file path
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "file:///path/to/your/image.jpg"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    ## Image URL
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "http://path/to/your/image.jpg"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    ## Base64-encoded image
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "data:image;base64,/9j/..."},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    ## PIL.Image.Image (pil_image must already be a loaded PIL image)
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": pil_image},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    # Video
    ## Local video path
    [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "file:///path/to/video1.mp4"},
                {"type": "text", "text": "Describe this video."},
            ],
        }
    ],
    ## Local video frames (timestamps must be provided)
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": [
                        "file:///path/to/extracted_frame1.jpg",
                        "file:///path/to/extracted_frame2.jpg",
                        "file:///path/to/extracted_frame3.jpg",
                    ],
                    "timestamps": [0.0, 0.5, 1.0],
                },
                {"type": "text", "text": "Describe this video."},
            ],
        }
    ],
    ## Frame sampling can be tuned per request: the sampling mode,
    ## maximum number of frames, maximum sampling FPS, etc.
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "file:///path/to/video1.mp4",
                    "frame_sampling_mode": "uniform_last_frame",
                    "num_frames": 384,
                    "max_fps": 8.0,
                },
                {"type": "text", "text": "Describe this video."},
            ],
        }
    ],
]


text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

images, videos, video_kwargs = process_vision_info(messages)

if videos is not None:
    videos, video_metadatas = zip(*videos)
    videos = list(videos)
    video_metadatas = list(video_metadatas)
else:
    video_metadatas = None

inputs = processor(
    text=text,
    images=images,
    videos=videos,
    video_metadata=video_metadatas,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_text = processor.post_process_image_text_to_text(
    generated_ids[:, inputs["input_ids"].size(1):],
    skip_special_tokens=True,
)
print(generated_text)
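For the base64 form shown above, the `data:image;base64,...` string can be built from raw image bytes with the standard library. `to_data_uri` is a hypothetical helper for illustration, not part of molmo-utils:

```python
import base64

def to_data_uri(image_bytes: bytes) -> str:
    # Encode raw image bytes into the data-URI form accepted by the
    # "image" field of a message content entry.
    return "data:image;base64," + base64.b64encode(image_bytes).decode("ascii")

# Stand-in bytes; in practice, read them from an actual JPEG/PNG file.
jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16
entry = {"type": "image", "image": to_data_uri(jpeg_bytes)}
```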

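When passing pre-extracted frames, each frame needs a timestamp in seconds. If the frames were sampled at a known rate, the list can be generated rather than typed out (a hypothetical helper, assuming evenly spaced frames starting at 0):

```python
def frame_timestamps(num_frames: int, fps: float) -> list[float]:
    # Evenly spaced timestamps (in seconds) for frames sampled at `fps`.
    return [i / fps for i in range(num_frames)]

timestamps = frame_timestamps(3, 2.0)  # frames extracted at 2 FPS
```

`frame_timestamps(3, 2.0)` yields `[0.0, 0.5, 1.0]`, matching the timestamps in the extracted-frames example above.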