
VisionZip: Longer is Better but Not Necessary in Vision Language Models


TABLE OF CONTENTS

  1. News
  2. Highlights
  3. Video
  4. Demo
  5. Installation
  6. Quick Start
  7. Evaluation
  8. Examples
  9. Citation
  10. Acknowledgement
  11. License

News

  • [2024.11.30] We release the paper and this GitHub repo, including code for LLaVA.

VisionZip: Longer is Better but Not Necessary in Vision Language Models [Paper]
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia

Highlights


  1. Our VisionZip is a text-agnostic method that outperforms current state-of-the-art efficient VLM methods. By retaining only 10% of visual tokens, it achieves nearly 95% of the original performance.
  2. VisionZip significantly reduces the prefilling time and the total inference time (with KV cache enabled).
  3. Why does this simple, text-agnostic method outperform text-relevant methods? We conduct an in-depth analysis in the paper and provide a demo to visualize these findings.
  4. Since VisionZip is a text-agnostic method that reduces visual tokens before input into the LLM, it can adapt to any existing LLM acceleration algorithms and is applicable to any task that a vanilla VLM can perform, such as multi-turn conversations.
  5. VisionZip can be applied during the inference stage (without incurring any additional training cost), the efficient tuning stage (to achieve better results), and the training stage (saving 2× memory and 2× time).
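The retention scheme described above (54 dominant tokens plus 10 contextual tokens) can be sketched in a few lines. This is a minimal, hypothetical illustration of text-agnostic token reduction, not the actual VisionZip implementation: the `reduce_tokens` helper, the per-token attention scores, and the simple chunked average pooling are all assumptions made for the example.

```python
import random

def reduce_tokens(tokens, attn, dominant=54, contextual=10):
    """Keep the `dominant` highest-scoring tokens; pool the rest into
    at most `contextual` averaged tokens (illustrative scheme only)."""
    order = sorted(range(len(tokens)), key=lambda i: attn[i], reverse=True)
    kept = [tokens[i] for i in order[:dominant]]
    rest = [tokens[i] for i in order[dominant:]]
    step = max(1, len(rest) // contextual)
    merged = []
    for start in range(0, len(rest), step):
        chunk = rest[start:start + step]
        dim = len(chunk[0])
        merged.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return kept + merged[:contextual]

# Example: reduce 576 patch tokens (LLaVA-1.5's count) to 54 + 10 = 64.
random.seed(0)
tokens = [[random.random() for _ in range(4)] for _ in range(576)]
scores = [random.random() for _ in range(576)]
reduced = reduce_tokens(tokens, scores)
print(len(reduced))  # 64
```

The key property is that selection depends only on the attention scores from the vision encoder, never on the text query, which is what lets the reduced tokens be reused across any downstream task.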

Video


Demo

Speed Improvement

The input video is about the Titanic, and the question is, "What’s the video talking about?"


Note that the left side shows the vanilla model, which encodes only 16 frames, while the right side shows our VisionZip, which, despite encoding 32 frames, is still twice as fast as the vanilla model.


Visualize Redundancy and Misalignment


Explore visual redundancy and feature misalignment in the demo above. To run it locally, use the following command:

python gradio_demo.py 

Installation

Our code is easy to use.

  1. Install the LLaVA environment.

  2. For general use, install the package from PyPI by running the following command:

pip install visionzip

For development, you can install the package by cloning the repository and running the following command:

git clone https://github.com/dvlab-research/VisionZip
cd VisionZip
pip install -e .

Quick Start

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
from visionzip import visionzip
model_path = "liuhaotian/llava-v1.5-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)
## VisionZip retains 54 dominant tokens and 10 contextual tokens
model = visionzip(model, dominant=54, contextual=10)

Evaluation

The evaluation code follows the structure of LLaVA or lmms-eval. After loading the model, simply add two lines as shown below:

## Load LLaVA Model (code from llava.eval.model_vqa_loader)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
## add VisionZip
from visionzip import visionzip
model = visionzip(model, dominant=54, contextual=10)

Examples

Multi-turn Conversations

Because VisionZip selects tokens without reference to the text query, the reduced tokens remain valid across turns, making it better suited for multi-turn dialogue.

Longer Videos with More Frames

VisionZip reduces the number of visual tokens per frame, allowing more frames to be processed. This improves the model's ability to understand longer videos.
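A back-of-the-envelope calculation makes the point concrete. The numbers below (a 4096-token visual budget, with per-frame token counts of 576 versus 64) are illustrative assumptions for this sketch, not measurements from the paper:

```python
# With a fixed LLM context budget, fewer visual tokens per frame
# means proportionally more frames fit.
context_budget = 4096          # hypothetical token budget for visual input
vanilla_per_frame = 576        # LLaVA-1.5 patch tokens per image
zipped_per_frame = 64          # 54 dominant + 10 contextual tokens

print(context_budget // vanilla_per_frame)  # 7 frames
print(context_budget // zipped_per_frame)   # 64 frames
```

Under these assumptions, a 9x reduction in tokens per frame buys roughly 9x more frames within the same context window.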

Citation

If you find this project useful in your research, please consider citing:

@inproceedings{None,
  author       = {},
  title        = {},
  booktitle    = {},
  year         = {2024},
}

Acknowledgement

License

  • VisionZip is licensed under the Apache License 2.0.

