LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng*

LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports the understanding of images, high-resolution images, and videos. Guided by an interpretability analysis of how LMMs process vision tokens, LLaVA-Mini significantly improves efficiency while preserving vision capabilities. The model and demo of LLaVA-Mini are available now!

[!Note] LLaVA-Mini requires only 1 vision token to represent each image, which improves the efficiency of image and video understanding in terms of:

  • Computational effort: 77% reduction in FLOPs
  • Response latency: reduced from 100 milliseconds to 40 milliseconds
  • VRAM usage: reduced from 360 MB/image to 0.6 MB/image, supporting 3-hour video processing

💡Highlight:

  1. Good Performance: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (compression rate of 0.17%).
  2. High Efficiency: LLaVA-Mini reduces FLOPs by 77%, delivers low-latency responses within 40 milliseconds, and processes over 10,000 video frames on a GPU with 24 GB of memory.
  3. Insights: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conduct a preliminary analysis to explore how large multimodal models (LMMs) process visual tokens. Please refer to our paper for a detailed analysis and our conclusions.
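
The figures quoted above can be cross-checked with some quick arithmetic. The following is a rough sketch, not part of the LLaVA-Mini codebase: the helper names are hypothetical, the 1 frame-per-second sampling rate is an assumption (the text does not state one), and the memory estimate covers only the per-image figure quoted above, ignoring model weights and other overhead.

```python
# Back-of-the-envelope check of the efficiency figures quoted in this README.
# All helper names are hypothetical; they are not part of llavamini.

VISION_TOKENS_BASELINE = 576    # vision tokens per image, LLaVA-v1.5 (from the text)
VISION_TOKENS_MINI = 1          # vision tokens per image, LLaVA-Mini (from the text)
MB_PER_IMAGE_BASELINE = 360.0   # VRAM per image before compression (from the text)
MB_PER_IMAGE_MINI = 0.6         # VRAM per image with LLaVA-Mini (from the text)
LATENCY_MS_BASELINE = 100.0     # response latency before (from the text)
LATENCY_MS_MINI = 40.0          # response latency after (from the text)

ASSUMED_FPS = 1.0               # ASSUMPTION: frame sampling rate, not stated in the text


def compression_rate_percent(kept_tokens, original_tokens):
    """Share of vision tokens that remain, as a percentage."""
    return 100.0 * kept_tokens / original_tokens


def frames_in_video(hours, fps):
    """Number of sampled frames in a video of the given length."""
    return int(hours * 3600 * fps)


def vision_memory_gb(frames, mb_per_frame):
    """Memory for the frames alone, in decimal GB (1 GB = 1000 MB)."""
    return frames * mb_per_frame / 1000.0


rate = compression_rate_percent(VISION_TOKENS_MINI, VISION_TOKENS_BASELINE)
frames = frames_in_video(3, ASSUMED_FPS)
latency_saving = 100.0 * (1 - LATENCY_MS_MINI / LATENCY_MS_BASELINE)

print(f"compression rate: {rate:.2f}%")
print(f"frames in a 3-hour video at {ASSUMED_FPS:g} fps: {frames}")
print(f"LLaVA-Mini frame memory: {vision_memory_gb(frames, MB_PER_IMAGE_MINI):.2f} GB")
print(f"baseline frame memory:   {vision_memory_gb(frames, MB_PER_IMAGE_BASELINE):.0f} GB")
print(f"latency reduction: {latency_saving:.0f}%")
```

Under these assumptions, a 3-hour video sampled at 1 fps yields roughly 10,000 frames, and at 0.6 MB per frame their footprint stays in the single-digit GB range, which is consistent with the 24 GB claim above.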

🖥 Demo

  • Download the LLaVA-Mini model from here.

  • Run these scripts and interact with LLaVA-Mini in your browser:

    # Launch a controller
    python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &
    
    # Build the API of LLaVA-Mini
    CUDA_VISIBLE_DEVICES=0  python -m llavamini.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &
    
    # Start the interactive interface
    python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload  --port 7860
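
Because the controller and model worker are launched in the background, it can help to confirm that each service is listening before opening the browser. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and not part of llavamini, and the ports are simply the ones used in the commands above.

```python
import socket

# Ports taken from the launch commands above.
DEMO_PORTS = {
    "controller": 10000,
    "model_worker": 40000,
    "gradio_web_server": 7860,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def demo_status(host="localhost", ports=DEMO_PORTS):
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in demo_status().items():
        print(f"{name}: {'listening' if up else 'not reachable'}")
```

An open port only shows that the process is accepting connections; the model worker may still be loading weights, so the interface can take a little longer to respond.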
    

🔥 Quick Start

Requirements

  • Install packages:

    conda create -n llavamini python=3.10 -y
    conda activate llavamini
    pip install -e .
    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
    

Command Interaction

  • Image understanding, using --image-file:

    # Image Understanding
    CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
        --model-path  ICTNLP/llava-mini-llama-3.1-8b \
        --image-file llavamini/serve/examples/baby_cake.png \
        --conv-mode llava_llama_3_1 --model-name "llava-mini" \
        --query "What's the text on the cake?"
    
  • Video understanding, using --video-file:

    # Video Understanding
    CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
        --model-path  ICTNLP/llava-mini-llama-3.1-8b \
        --video-file llavamini/serve/examples/fifa.mp4 \
        --conv-mode llava_llama_3_1 --model-name "llava-mini" \
        --query "What happened in this video?"
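
The two commands above differ only in the media flag and the query, so they can be driven from Python when several files need to be processed. The following is a sketch under stated assumptions: `build_command` and `run` are hypothetical helpers, not part of llavamini, and they use only the flags that appear in the commands above. The script is assumed to be run from the repository root.

```python
import os
import shlex
import subprocess
import sys

SCRIPT = "llavamini/eval/run_llava_mini.py"       # script path from the commands above
MODEL_PATH = "ICTNLP/llava-mini-llama-3.1-8b"     # model path from the commands above


def build_command(query, image_file=None, video_file=None):
    """Build the argv list for one query; exactly one media file must be given."""
    if (image_file is None) == (video_file is None):
        raise ValueError("pass exactly one of image_file or video_file")
    if image_file is not None:
        media_args = ["--image-file", image_file]
    else:
        media_args = ["--video-file", video_file]
    return [
        sys.executable, SCRIPT,
        "--model-path", MODEL_PATH,
        *media_args,
        "--conv-mode", "llava_llama_3_1",
        "--model-name", "llava-mini",
        "--query", query,
    ]


def run(query, gpu="0", **media):
    """Run one query on the given GPU (hypothetical wrapper around the CLI)."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    return subprocess.run(build_command(query, **media), env=env, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    cmd = build_command(
        "What's the text on the cake?",
        image_file="llavamini/serve/examples/baby_cake.png",
    )
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with queries that contain apostrophes or question marks.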
    

Reproduction and Evaluation

  • Refer to Evaluation.md for the evaluation of LLaVA-Mini on image/video benchmarks.

Cases

  • LLaVA-Mini achieves high-quality image understanding and video understanding.

  • LLaVA-Mini dynamically compresses the image to capture important visual information (brighter areas are weighted more heavily during compression).

🤝 Acknowledgement

  • LLaVA: LLaVA-Mini is built upon the LLaVA codebase, a large language and vision assistant.
  • Video-ChatGPT: The training of LLaVA-Mini involves the video instruction data provided by Video-ChatGPT.
  • LLaVA-OneVision: The training of LLaVA-Mini involves the image instruction data provided by LLaVA-OneVision.

🖋Citation

If this repository is useful to you, please cite it as:

@misc{llavamini,
      title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token}, 
      author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng},
      year={2025},
      eprint={2501.03895},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.03895}, 
}

If you have any questions, please feel free to submit an issue or contact zhangshaolei20z@ict.ac.cn.
