A small example package
Project description
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Guided by the interpretability within LMM, LLaVA-Mini significantly improves efficiency while ensuring vision capabilities. Model and demo of LLaVA-Mini are available now!
Note: LLaVA-Mini requires only 1 vision token to represent each image, which improves the efficiency of image and video understanding, including:
- Computational effort: 77% FLOPs reduction
- Response latency: reduced from 100 milliseconds to 40 milliseconds
- VRAM usage: reduced from 360 MB/image to 0.6 MB/image, supporting 3-hour video processing
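The memory figures above can be sanity-checked with a little arithmetic. The sketch below uses the two per-image costs quoted in the note; the sampling rate of one frame per second is an assumption made here for illustration and is not stated in this README, and `video_memory_gb` is a hypothetical helper name.

```python
# Per-image memory costs quoted in the note above (MB per image).
BASELINE_MB_PER_IMAGE = 360.0
LLAVA_MINI_MB_PER_IMAGE = 0.6

# Assumption for illustration only: the video is sampled at 1 frame per second.
ASSUMED_FPS = 1


def video_memory_gb(hours: float, mb_per_image: float, fps: int = ASSUMED_FPS) -> float:
    """Return the memory in GB (1 GB = 1000 MB) needed to hold every sampled frame."""
    frames = int(hours * 3600 * fps)
    return frames * mb_per_image / 1000.0


if __name__ == "__main__":
    for name, cost in [("baseline", BASELINE_MB_PER_IMAGE),
                       ("LLaVA-Mini", LLAVA_MINI_MB_PER_IMAGE)]:
        print(f"{name}: {video_memory_gb(3, cost):.2f} GB for a 3-hour video")
```

Under that sampling assumption, a 3-hour video yields 10,800 frames, so the per-frame cost decides whether the frames fit on a single GPU.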
💡Highlight:
- Good Performance: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (compression rate of 0.17%).
- High Efficiency: LLaVA-Mini reduces FLOPs by 77%, delivers low-latency responses within 40 milliseconds, and processes over 10,000 video frames on a GPU with 24 GB of memory.
- Insights: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conduct a preliminary analysis to explore how large multimodal models (LMMs) process visual tokens. Please refer to our paper for a detailed analysis and our conclusions.
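The 0.17% compression rate follows directly from the token counts quoted above. A minimal sketch of that arithmetic, with hypothetical helper names, also shows how the vision-token count scales for the 10,000-frame case:

```python
VISION_TOKENS_LLAVA_V15 = 576  # vision tokens per image in LLaVA-v1.5 (from the text)
VISION_TOKENS_LLAVA_MINI = 1   # vision tokens per image in LLaVA-Mini (from the text)


def compression_rate_percent(compressed: int, original: int) -> float:
    """Return the share of vision tokens kept, as a percentage."""
    return 100.0 * compressed / original


def total_vision_tokens(frames: int, tokens_per_frame: int) -> int:
    """Return the number of vision tokens needed for a sequence of frames."""
    return frames * tokens_per_frame


if __name__ == "__main__":
    rate = compression_rate_percent(VISION_TOKENS_LLAVA_MINI, VISION_TOKENS_LLAVA_V15)
    print(f"compression rate: {rate:.2f}%")
    print("tokens for 10,000 frames at 576 per frame:",
          total_vision_tokens(10_000, VISION_TOKENS_LLAVA_V15))
    print("tokens for 10,000 frames at 1 per frame:",
          total_vision_tokens(10_000, VISION_TOKENS_LLAVA_MINI))
```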
🖥 Demo
- Download LLaVA-Mini model from here.
- Run these scripts and interact with LLaVA-Mini in your browser:

      # Launch a controller
      python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &

      # Build the API of LLaVA-Mini
      CUDA_VISIBLE_DEVICES=0 python -m llavamini.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &

      # Start the interactive interface
      python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --port 7860
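The three demo commands can also be driven from one Python script. The following is only a sketch: the module names, flags, ports, and model path are copied from the commands above, while `demo_commands` and `launch_demo` are hypothetical helpers that are not part of the package.

```python
import os
import subprocess
import sys
from typing import List

CONTROLLER_URL = "http://localhost:10000"
WORKER_URL = "http://localhost:40000"
MODEL_PATH = "ICTNLP/llava-mini-llama-3.1-8b"


def demo_commands(python: str = sys.executable) -> List[List[str]]:
    """Return the controller, model worker, and web UI commands, in launch order."""
    controller = [python, "-m", "llavamini.serve.controller",
                  "--host", "0.0.0.0", "--port", "10000"]
    worker = [python, "-m", "llavamini.serve.model_worker",
              "--host", "0.0.0.0", "--controller", CONTROLLER_URL,
              "--port", "40000", "--worker", WORKER_URL,
              "--model-path", MODEL_PATH, "--model-name", "llava-mini"]
    web_server = [python, "-m", "llavamini.serve.gradio_web_server",
                  "--controller", CONTROLLER_URL,
                  "--model-list-mode", "reload", "--port", "7860"]
    return [controller, worker, web_server]


def launch_demo(gpu: str = "0") -> None:
    """Start the controller and worker in the background, then run the web UI."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    controller_cmd, worker_cmd, web_cmd = demo_commands()
    background = [subprocess.Popen(controller_cmd),
                  subprocess.Popen(worker_cmd, env=env)]
    try:
        subprocess.run(web_cmd, check=True)  # blocks until the web UI exits
    finally:
        # Stop the background services once the web UI is closed.
        for proc in background:
            proc.terminate()
```

Calling `launch_demo()` mirrors the shell version: the first two processes run in the background and the web server runs in the foreground.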
🔥 Quick Start
Requirements
- Install packages:

      conda create -n llavamini python=3.10 -y
      conda activate llavamini
      pip install -e .
      pip install -e ".[train]"
      pip install flash-attn --no-build-isolation
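After installation, a quick check can confirm that the environment matches the steps above. This is a minimal sketch: `check_environment` is a hypothetical helper, `llavamini` is the module name used by the `python -m llavamini...` commands in this README, and `flash_attn` is assumed to be the import name of the `flash-attn` package.

```python
import importlib.util
import sys
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report whether the interpreter and key packages look as expected."""
    return {
        # The conda environment above is created with python=3.10.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # Installed by `pip install -e .` from the repository root.
        "llavamini": importlib.util.find_spec("llavamini") is not None,
        # Assumed import name for the flash-attn package.
        "flash_attn": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```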
Command Interaction
- Image understanding, using `--image-file`:

      # Image Understanding
      CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
          --model-path ICTNLP/llava-mini-llama-3.1-8b \
          --image-file llavamini/serve/examples/baby_cake.png \
          --conv-mode llava_llama_3_1 --model-name "llava-mini" \
          --query "What's the text on the cake?"

- Video understanding, using `--video-file`:

      # Video Understanding
      CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
          --model-path ICTNLP/llava-mini-llama-3.1-8b \
          --video-file llavamini/serve/examples/fifa.mp4 \
          --conv-mode llava_llama_3_1 --model-name "llava-mini" \
          --query "What happened in this video?"
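The two commands differ only in the media flag, so a small wrapper can pick it automatically. The sketch below is an illustration, not part of the package: `build_query_command` is a hypothetical helper, and treating `.mp4` as the only video suffix is an assumption based on the example file above. All flags and default values come from the commands shown.

```python
from pathlib import Path
from typing import List

# Assumption: only .mp4 is treated as video, matching the example above.
VIDEO_SUFFIXES = {".mp4"}


def build_query_command(media_file: str, query: str,
                        model_path: str = "ICTNLP/llava-mini-llama-3.1-8b") -> List[str]:
    """Build the run_llava_mini.py command for an image or a video file."""
    is_video = Path(media_file).suffix.lower() in VIDEO_SUFFIXES
    media_flag = "--video-file" if is_video else "--image-file"
    return [
        "python", "llavamini/eval/run_llava_mini.py",
        "--model-path", model_path,
        media_flag, media_file,
        "--conv-mode", "llava_llama_3_1",
        "--model-name", "llava-mini",
        "--query", query,
    ]


if __name__ == "__main__":
    print(" ".join(build_query_command(
        "llavamini/serve/examples/baby_cake.png", "What's the text on the cake?")))
```

The returned list can be passed to `subprocess.run` with `CUDA_VISIBLE_DEVICES` set in the environment, as in the shell commands.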
Reproduction and Evaluation
- Refer to Evaluation.md for the evaluation of LLaVA-Mini on image/video benchmarks.
Cases
- LLaVA-Mini achieves high-quality image understanding and video understanding.
More cases
- LLaVA-Mini dynamically compresses each image to capture important visual information (brighter areas are weighted more heavily during compression).
🤝 Acknowledgement
- LLaVA: LLaVA-Mini is built upon the LLaVA codebase, a large language and vision assistant.
- Video-ChatGPT: The training of LLaVA-Mini involves the video instruction data provided by Video-ChatGPT.
- LLaVA-OneVision: The training of LLaVA-Mini involves the image instruction data provided by LLaVA-OneVision.
🖋Citation
If this repository is useful to you, please cite it as:
@misc{llavamini,
title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token},
author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng},
year={2025},
eprint={2501.03895},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.03895},
}
If you have any questions, please feel free to submit an issue or contact zhangshaolei20z@ict.ac.cn.
Download files
File details
Details for the file xjhtrans-0.0.1.tar.gz.
File metadata
- Download URL: xjhtrans-0.0.1.tar.gz
- Upload date:
- Size: 11.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fbeb3cd93fc0c7759b76691d1aaf0cdd44ddbfeb2252b702d78873360a9c5ee2 |
| MD5 | 6f1eb18316b32e4f7abae782842d7f4f |
| BLAKE2b-256 | bb1391b8579384dc21e813dad6ccf5944120b26749f60808497a4798e4569067 |
File details
Details for the file xjhtrans-0.0.1-py3-none-any.whl.
File metadata
- Download URL: xjhtrans-0.0.1-py3-none-any.whl
- Upload date:
- Size: 13.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2df499682afcfa42489ec86edfbd504cf598c54b863a8606b3bd8e88a1992d1d |
| MD5 | d4c2269fec7bf4e4e23ba0381f2c4556 |
| BLAKE2b-256 | 0f30299b47e70cf55f70f7a7021cb513929a473dd474f397c93b3776fa5eaf6f |
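
A downloaded file can be checked against the SHA256 digests listed above before installing it. The following is a minimal sketch using the standard `hashlib` module; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the file is assumed to keep its published name.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "xjhtrans-0.0.1.tar.gz":
        "fbeb3cd93fc0c7759b76691d1aaf0cdd44ddbfeb2252b702d78873360a9c5ee2",
    "xjhtrans-0.0.1-py3-none-any.whl":
        "2df499682afcfa42489ec86edfbd504cf598c54b863a8606b3bd8e88a1992d1d",
}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected
```

For example, `matches_published_hash("xjhtrans-0.0.1.tar.gz")` returns `False` for a corrupted or renamed download.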