
Official code for "MANTIS: Interleaved Multi-Image Instruction Tuning"


Mantis: Multi-Image Instruction Tuning



🤔 Recent years have witnessed a great array of large multimodal models (LMMs) that effectively solve single-image vision-language tasks. However, their ability to solve multi-image vision-language tasks remains limited.

😦 Existing multi-image LMMs (e.g., OpenFlamingo, Emu, Idefics) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text pairs from the web, which is neither efficient nor effective.

🔥 Therefore, we present Mantis, an LLaMA-3-based LMM that takes interleaved text and images as input, trained on Mantis-Instruct with academic-level resources (i.e., 36 hours on 16×A100-40G).

🚀 Mantis achieves state-of-the-art performance on 5 multi-image benchmarks (NLVR2, Q-Bench, BLINK, MVBench, Mantis-Eval), while maintaining strong single-image performance on par with CogVLM and Emu2.


Installation

conda create -n mantis python=3.10
conda activate mantis
pip install -e .
# install flash-attention
pip install flash-attn --no-build-isolation
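
To sanity-check the environment before moving on, a quick script like the following (a hypothetical helper, not part of the repo) confirms that PyTorch sees your GPU and that flash-attn imports cleanly:

# check_env.py -- hypothetical sanity check, not part of the repo
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
try:
    import flash_attn  # only present if flash-attn was installed above
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; attention will use the default kernels")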

Inference

You can run inference with the following command:

cd examples
python run_mantis.py
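
Alternatively, the Idefics2-based checkpoint can be loaded with the standard transformers API. This is a minimal sketch: the repo id TIGER-Lab/Mantis-8B-Idefics2 and the image paths are assumptions, so consult the model card for the authoritative snippet.

# Sketch: load a Mantis checkpoint from the Hugging Face hub and run
# multi-image inference. Repo id and image paths are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "TIGER-Lab/Mantis-8B-Idefics2"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "What is different between the two images?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])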

Training

Install the requirements with the following command:

pip install -e .[train,eval]
cd mantis/train

Our training scripts follow the coding format and model structure of Hugging Face. Unlike the LLaVA GitHub repo, you can load our models directly from the Hugging Face model hub.

Training examples with different data formats

(These example data are all pre-prepared in the data/examples/ folder, so you can check the data format and debug the training scripts directly; a hypothetical record illustrating the interleaved format appears after the list below. Set CUDA_VISIBLE_DEVICES to the GPU you want to use.)

  • training with text-image interleaved data (see example data)
cd mantis/train
bash scripts/train_example_chat.sh # Q-lora, 1 GPU required
  • training with video-text interleaved data (see example data)
cd mantis/train
bash scripts/train_example_video.sh # Q-lora, 1 GPU required
  • training with classification data (see example data)
cd mantis/train
bash scripts/train_example_classification.sh # full fine-tune, might need 8 GPUs or more
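
For orientation, a text-image interleaved record generally has the shape sketched below. The field names here are hypothetical; data/examples/ in the repo is the authoritative reference for the actual schema.

# Hypothetical record shape for interleaved chat data; field names are
# illustrative -- consult data/examples/ for the real schema.
example_record = {
    "id": "example-0",
    "images": ["images/cat.jpg", "images/dog.jpg"],  # referenced in order
    "conversation": [
        {
            "role": "user",
            # "<image>" placeholders mark where each image is interleaved
            "content": "<image> <image> Which of the two animals looks heavier?",
        },
        {"role": "assistant", "content": "The dog in the second image."},
    ],
}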

Training examples with different models

We support training Mantis on either the Fuyu architecture or the LLaVA architecture. You can train the models with the following commands:

Training Mantis based on LLaMA3 with CLIP/SigLIP encoder:

  • Pretrain the Mantis-LLaMA3 multimodal projector on the pretraining data (Stage 1, sketched conceptually after this list):
bash scripts/pretrain_mllava.sh
  • Fine-tune the pretrained Mantis-LLaMA3 on Mantis-Instruct (Stage 2):
bash scripts/train_mllava.sh
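
Conceptually, Stage 1 trains only the multimodal projector while the vision encoder and language model stay frozen. The sketch below illustrates that idea on a stock llava-hf checkpoint from transformers; it is not the repo's actual training code.

# Conceptual Stage-1 sketch: freeze everything except the multimodal projector.
# Uses a stock llava-hf checkpoint for illustration, not a Mantis checkpoint.
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
for name, param in model.named_parameters():
    # only projector weights receive gradients in Stage 1
    param.requires_grad = "multi_modal_projector" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")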

Training Mantis based on Fuyu-8B:

  • Fine-tune Fuyu-8B on Mantis-Instruct to get Mantis-Fuyu:
bash scripts/train_fuyu.sh

Note:

  • Our training scripts automatically infer the number of GPUs and GPU nodes to use for training, so you only need to modify the data config path and the base models.
  • The training data will be automatically downloaded from Hugging Face when you run the training scripts.

See mantis/train/README.md for more details.

Check all the training scripts in mantis/train/scripts.

Evaluation

To reproduce our evaluation results, please check mantis/benchmark/README.md

Data

Downloading

You can download and prepare Mantis-Instruct with the following command (downloading and extracting might take about an hour):

python data/download_mantis_instruct.py --max_workers 8
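
If you prefer to pull individual subsets programmatically, the datasets library can load them straight from the hub. A minimal sketch, assuming the dataset lives at TIGER-Lab/Mantis-Instruct and that the subset name matches the dataset card:

# Sketch: load one Mantis-Instruct subset from the hub. The repo id and the
# subset name are assumptions; check the dataset card for available configs.
from datasets import load_dataset

subset = load_dataset("TIGER-Lab/Mantis-Instruct", "nlvr2", split="train")
print(len(subset), subset[0].keys())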

Model Zoo

Mantis Models

We provide the following models in the 🤗 Hugging Face model hub:

Run models

  • Run Mantis-8B-Idefics2:
cd examples && python run_mantis_idefics2.py
  • Run Mantis-8B-siglip-llama3:
cd examples && python run_mantis.py
  • Run Mantis-8B-Fuyu:
cd examples && python run_mantis_fuyu.py

Chat CLI

We provide a simple chat CLI for Mantis models. You can run the following command to chat with Mantis-8B-siglip-llama3:

python examples/chat_mantis.py

Intermediate Checkpoints

The following intermediate checkpoints, saved after pre-training the multimodal projectors, are also available for reproducing our experiments. Please note that these checkpoints still need further fine-tuning on Mantis-Instruct before they become capable models; they are not working models on their own:

Acknowledgement

  • Thanks to the LLaVA and LLaVA-hf teams for providing the LLaVA codebase and Hugging Face compatibility!
  • Thanks to Haoning Wu for providing the MVBench evaluation code!

Star History

Star History Chart

Citation

@article{jiang2024mantis,
  title={MANTIS: Interleaved Multi-Image Instruction Tuning},
  author={Jiang, Dongfu and He, Xuan and Zeng, Huaye and Wei, Con and Ku, Max and Liu, Qian and Chen, Wenhu},
  journal={arXiv preprint arXiv:2405.01483},
  year={2024}
}
