A small example package

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng*

LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports the understanding of images, high-resolution images, and videos. Guided by an analysis of how LMMs process visual tokens, LLaVA-Mini significantly improves efficiency while preserving vision capabilities. The model and demo of LLaVA-Mini are available now.

Note: LLaVA-Mini requires only 1 vision token to represent each image, which improves the efficiency of image and video understanding, including:

Computational effort: 77% reduction in FLOPs

Response latency: reduced from 100 milliseconds to 40 milliseconds

VRAM usage: reduced from 360 MB/image to 0.6 MB/image, supporting 3-hour video processing

💡Highlight:

Good Performance: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (compression rate of 0.17%).
High Efficiency: LLaVA-Mini reduces FLOPs by 77%, delivers low-latency responses within 40 milliseconds, and processes over 10,000 video frames on a GPU with 24 GB of memory.
Insights: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conduct a preliminary analysis to explore how large multimodal models (LMMs) process visual tokens. Please refer to our paper for a detailed analysis and our conclusions.
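The figures above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: the sampling rate of 1 frame per second and the decimal conversion of 1000 MB per GB are assumptions, and the estimate counts only the per-image memory quoted above, not model weights or other activations.

```python
# Back-of-the-envelope check of the efficiency figures quoted in this README.
TOKENS_LLAVA_V15 = 576         # vision tokens per image in LLaVA-v1.5 (from the text)
TOKENS_LLAVA_MINI = 1          # vision tokens per image in LLaVA-Mini (from the text)
MB_PER_IMAGE_BASELINE = 360.0  # quoted VRAM per image before compression
MB_PER_IMAGE_MINI = 0.6        # quoted VRAM per image with LLaVA-Mini
ASSUMED_FPS = 1                # assumption: one sampled frame per second of video


def compression_rate_percent(kept_tokens, original_tokens):
    """Share of vision tokens that remain after compression, in percent."""
    return 100.0 * kept_tokens / original_tokens


def frames_in_video(hours, fps):
    """Number of sampled frames in a video of the given length."""
    return int(hours * 3600 * fps)


def vram_gb(frames, mb_per_frame):
    """Per-image memory for all frames, in GB (assuming 1000 MB per GB)."""
    return frames * mb_per_frame / 1000.0


rate = compression_rate_percent(TOKENS_LLAVA_MINI, TOKENS_LLAVA_V15)
frames = frames_in_video(3, ASSUMED_FPS)
mini_gb = vram_gb(frames, MB_PER_IMAGE_MINI)
baseline_gb = vram_gb(frames, MB_PER_IMAGE_BASELINE)

print(f"compression rate: {rate:.2f}%")
print(f"frames in a 3-hour video at {ASSUMED_FPS} fps: {frames}")
print(f"per-image memory, LLaVA-Mini: {mini_gb:.2f} GB")
print(f"per-image memory, 360 MB/image: {baseline_gb:.0f} GB")
```

Under these assumptions, a 3-hour video corresponds to 10,800 frames, which needs roughly 6.5 GB at 0.6 MB/image and therefore fits on a 24 GB GPU, whereas 360 MB/image would require several terabytes.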

🖥 Demo

Download the LLaVA-Mini model from here.

Run these scripts and interact with LLaVA-Mini in your browser:

# Launch a controller
python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &

# Build the API of LLaVA-Mini
CUDA_VISIBLE_DEVICES=0 python -m llavamini.serve.model_worker \
    --host 0.0.0.0 --controller http://localhost:10000 \
    --port 40000 --worker http://localhost:40000 \
    --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &

# Start the interactive interface
python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --port 7860
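Because the controller and the model worker are started in the background with `&`, the next command may run before the previous service is listening. The following is a minimal sketch of a readiness check, not part of LLaVA-Mini itself: the helper names `port_is_open` and `wait_for_port` are hypothetical, and the ports are the ones used in the commands above.

```python
import socket
import time

# Ports taken from the launch commands in this README.
SERVICES = {"controller": 10000, "model worker": 40000}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, total_timeout=60.0, interval=0.5):
    """Poll host:port until it accepts connections or total_timeout elapses."""
    deadline = time.monotonic() + total_timeout
    while time.monotonic() < deadline:
        if port_is_open(host, port):
            return True
        time.sleep(interval)
    return False


# One-off status report; each probe gives up after half a second.
for name, port in SERVICES.items():
    state = "listening" if port_is_open("localhost", port, timeout=0.5) else "not reachable"
    print(f"{name} (port {port}): {state}")
```

For example, calling `wait_for_port("localhost", 10000)` before launching the model worker, and `wait_for_port("localhost", 40000)` before starting the web interface, would avoid starting a step too early. Note that an open port only shows that the process is listening, not that the model has finished loading.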

🔥 Quick Start

Requirements

Install packages:

conda create -n llavamini python=3.10 -y
conda activate llavamini
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
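After installation, it can help to confirm that the key modules are importable in the new environment. This is a minimal sketch under stated assumptions: `llavamini` is the module name used by the commands in this README, `flash_attn` is the import name of the `flash-attn` package, and `torch` is assumed to be installed as a dependency. The helper `missing_modules` is hypothetical.

```python
import importlib.util
import sys

# Top-level modules expected after the install steps above.
# Assumption: torch is pulled in as a dependency of the package.
REQUIRED_MODULES = ["torch", "flash_attn", "llavamini"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python", sys.version.split()[0])
missing = missing_modules(REQUIRED_MODULES)
print("missing modules:", ", ".join(missing) if missing else "none")
```

The check only verifies that the modules can be located; it does not import them or test GPU support.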

Command Interaction

Image understanding, using --image-file:

# Image Understanding
CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
    --model-path ICTNLP/llava-mini-llama-3.1-8b \
    --image-file llavamini/serve/examples/baby_cake.png \
    --conv-mode llava_llama_3_1 --model-name "llava-mini" \
    --query "What's the text on the cake?"

Video understanding, using --video-file:

# Video Understanding
CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
    --model-path ICTNLP/llava-mini-llama-3.1-8b \
    --video-file llavamini/serve/examples/fifa.mp4 \
    --conv-mode llava_llama_3_1 --model-name "llava-mini" \
    --query "What happened in this video?"
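The two commands above differ only in the media flag. As an illustration, the sketch below builds the argument list from Python and picks `--image-file` or `--video-file` from the file extension. The helper `build_command` and the extension list are assumptions made for this example; the script path and flags are the ones shown above.

```python
import shlex
from pathlib import Path

# Assumption: extensions treated as video; anything else is passed as an image.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}


def build_command(media_path, query, model_path="ICTNLP/llava-mini-llama-3.1-8b"):
    """Build the run_llava_mini.py argument list for an image or video file."""
    suffix = Path(media_path).suffix.lower()
    media_flag = "--video-file" if suffix in VIDEO_EXTENSIONS else "--image-file"
    return [
        "python", "llavamini/eval/run_llava_mini.py",
        "--model-path", model_path,
        media_flag, media_path,
        "--conv-mode", "llava_llama_3_1",
        "--model-name", "llava-mini",
        "--query", query,
    ]


cmd = build_command("llavamini/serve/examples/fifa.mp4", "What happened in this video?")
# shlex.join quotes the query so the printed line can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` from the repository root, with `CUDA_VISIBLE_DEVICES` set in the environment as in the commands above.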

Reproduction and Evaluation

Refer to Evaluation.md for the evaluation of LLaVA-Mini on image/video benchmarks.

Cases

LLaVA-Mini achieves high-quality image understanding and video understanding.

[Figures: example cases (case1 to case4)]

LLaVA-Mini dynamically compresses images to capture important visual information (brighter areas are more heavily weighted during compression).

[Figure: compression visualization]

🤝 Acknowledgement

LLaVA: LLaVA-Mini is built upon the LLaVA codebase, a large language and vision assistant.
Video-ChatGPT: The training of LLaVA-Mini involves the video instruction data provided by Video-ChatGPT.
LLaVA-OneVision: The training of LLaVA-Mini involves the image instruction data provided by LLaVA-OneVision.

🖋Citation

If this repository is useful to you, please cite it as:

@misc{llavamini,
      title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token}, 
      author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng},
      year={2025},
      eprint={2501.03895},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.03895}, 
}

If you have any questions, please feel free to submit an issue or contact zhangshaolei20z@ict.ac.cn.

Project details

Release history

This version

0.0.1

Feb 26, 2025

Download files

Source Distribution

xjhtrans-0.0.1.tar.gz (11.7 MB)

Uploaded Feb 26, 2025 Source

Built Distribution

xjhtrans-0.0.1-py3-none-any.whl (13.9 MB)

Uploaded Feb 26, 2025 Python 3

File details

Details for the file xjhtrans-0.0.1.tar.gz.

File metadata

Download URL: xjhtrans-0.0.1.tar.gz
Upload date: Feb 26, 2025
Size: 11.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for xjhtrans-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`fbeb3cd93fc0c7759b76691d1aaf0cdd44ddbfeb2252b702d78873360a9c5ee2`
MD5	`6f1eb18316b32e4f7abae782842d7f4f`
BLAKE2b-256	`bb1391b8579384dc21e813dad6ccf5944120b26749f60808497a4798e4569067`

File details

Details for the file xjhtrans-0.0.1-py3-none-any.whl.

File metadata

Download URL: xjhtrans-0.0.1-py3-none-any.whl
Upload date: Feb 26, 2025
Size: 13.9 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for xjhtrans-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2df499682afcfa42489ec86edfbd504cf598c54b863a8606b3bd8e88a1992d1d`
MD5	`d4c2269fec7bf4e4e23ba0381f2c4556`
BLAKE2b-256	`0f30299b47e70cf55f70f7a7021cb513929a473dd474f397c93b3776fa5eaf6f`
