Training-free visual token sparsification for vision-language models (ICML 2025)


license: apache-2.0
tags:
  • vision-language-model
  • inference-optimization
  • token-pruning
  • qwen2-vl
library_name: sparsevlm

SparseVLM

Training-free visual token pruning for Qwen2.5-VL. Scores visual tokens by how much text attends to them, prunes the unimportant ones from the KV cache, and decodes with the smaller cache.

Based on SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference (ICML 2025).


Install

pip install sparsevlm

Requirements: Python 3.10+, PyTorch 2.1+, transformers 4.49+


Quick start

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from sparsevlm import sparsevlm_generate
from PIL import Image

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

image = Image.open("your_image.jpg")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": "Describe this image in detail."}
]}]
text   = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

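# Count visual tokens: image_grid_thw holds the (t, h, w) patch grid, and
# Qwen2.5-VL merges each 2×2 group of patches into one token, hence the // 4.
n_vis = int(inputs["image_grid_thw"][0].prod().item()) // 4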

output = sparsevlm_generate(
    model, processor, inputs,
    n_vis=n_vis,
    keep_n_vis=n_vis // 4,   # keep 25% of visual tokens
    max_new_tokens=256,
)
print(processor.decode(output[0][1:], skip_special_tokens=True))

Benchmark results

Measured on NVIDIA A100-SXM4-40GB, Qwen2.5-VL-7B-Instruct, bfloat16, SDPA attention.

Real photo — Fuji mountain + Milky Way (4928×2773px, 16320 visual tokens)

| Config | Tokens kept | Time | Speedup | Output quality |
|---|---|---|---|---|
| Baseline | 16320 (100%) | 9738ms | 1.00× | ✅ Identifies Fuji, Milky Way, snow cap, star colors |
| SparseVLM 50% | 8192 | 9441ms | 1.03× | ✅ Same quality |
| SparseVLM 25% | 4080 | 9297ms | 1.05× | ✅ All key details preserved |
| SparseVLM 10% | 1632 | 9425ms | 1.03× | ✅ Still correctly describes scene |

Key result: the full-resolution photo (16,320 visual tokens) runs without OOM. Without SparseVLM's hook-based scoring, scoring this image requires materialising a 15GB attention matrix (16320 × 16320) and crashes. The scorer instead computes only the text→visual submatrix (35 × 16320), which takes about 32MB.
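The two sizes quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes 28 attention heads (taken from the Qwen2.5-VL-7B config, not stated in this README) and 2 bytes per bfloat16 entry, and the names `N_HEADS`, `BYTES_PER_ENTRY`, and `attn_bytes` are illustrative.

```python
# Back-of-the-envelope size of one layer's attention weights.
# Assumptions (not from the README): 28 attention heads, bfloat16 = 2 bytes.
N_HEADS = 28
BYTES_PER_ENTRY = 2

n_vis = 16320   # visual tokens in the 4928×2773 photo
n_text = 35     # text tokens in the prompt


def attn_bytes(n_queries: int, n_keys: int) -> int:
    """Bytes needed to hold attention weights of shape [heads, queries, keys]."""
    return n_queries * n_keys * N_HEADS * BYTES_PER_ENTRY


full_bytes = attn_bytes(n_vis, n_vis)   # full visual-to-visual block
sub_bytes = attn_bytes(n_text, n_vis)   # text→visual submatrix only

print(f"full:      {full_bytes / 1e9:.1f} GB")
print(f"submatrix: {sub_bytes / 1e6:.1f} MB")
```

Under these assumptions the full block comes to about 14.9 GB and the submatrix to about 32.0 MB, which matches the figures quoted above.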

Resized photo (896×504px, 576 visual tokens), batch=1

| Tokens kept | Time | Speedup |
|---|---|---|
| 576 (100%) | 2167ms | 1.00× |
| 288 (50%) | 1685ms | 1.29× |
| 144 (25%) | 1565ms | 1.39× |
| 72 (12%) | 1620ms | 1.34× |

When to expect larger speedup

Speedup grows when the KV cache is large relative to model weights:

| Scenario | Expected speedup |
|---|---|
| Single image, short generation | ~1.1–1.4× |
| Single image, 256+ output tokens | ~1.5–2.5× |
| Batch=32, high-res images | ~2–4× |
| Very long visual context (10K+ tokens) | ~2–4× |
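
To get a feel for when the KV cache becomes large relative to the weights, its size can be estimated directly. The sketch below assumes the Qwen2.5-VL-7B language model uses 28 layers, 4 key/value heads, and a head dimension of 128 (config values, not stated in this README), stores the cache in bfloat16, and ignores text tokens. The helper `kv_cache_bytes` is hypothetical.

```python
# Rough KV-cache size for the visual tokens only.
# Assumed config values (not from the README):
N_LAYERS = 28
N_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ENTRY = 2  # bfloat16


def kv_cache_bytes(n_tokens: int, batch: int = 1) -> int:
    """Bytes for keys plus values across all layers."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ENTRY
    return per_token * n_tokens * batch


for label, n_tokens, batch in [
    ("576 tokens, batch=1", 576, 1),
    ("16320 tokens, batch=1", 16320, 1),
    ("16320 tokens, batch=32", 16320, 32),
]:
    print(f"{label}: {kv_cache_bytes(n_tokens, batch) / 1e9:.2f} GB")
```

With these numbers a single image stays below 1 GB of cache, small next to the bfloat16 weights of a 7B-class model, while a batch of 32 high-resolution images reaches roughly 30 GB. That is consistent with the modest single-image speedups and the larger expected gains for batched or long-context use.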

How it works

Token scoring (no extra parameters)

At decoder layer 2, a lightweight hook intercepts the attention projection and computes:

A_tv = Q_text @ K_visual^T   # only the text→visual submatrix
                              # 35 × 16320 instead of 16320 × 16320
score_i = sum over text tokens of attention to visual token i

Visual tokens with high scores are important to the text query. Low-score tokens are pruned from the KV cache before decoding starts.
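
The scoring step can be sketched in a few lines of PyTorch. This is an illustration, not the package's implementation: `score_visual_tokens` is a hypothetical helper, the softmax is taken over the visual keys only (real attention normalises over all keys), scores are averaged over heads, and the keys are assumed to be already expanded to one set per query head.

```python
import torch


def score_visual_tokens(q_text: torch.Tensor, k_visual: torch.Tensor) -> torch.Tensor:
    """Score each visual token by the attention it receives from text tokens.

    q_text:   [heads, n_text, head_dim] queries of the text tokens
    k_visual: [heads, n_vis, head_dim]  keys of the visual tokens
    returns:  [n_vis] one score per visual token
    """
    head_dim = q_text.shape[-1]
    # Text→visual logits only: [heads, n_text, n_vis]
    logits = q_text @ k_visual.transpose(-1, -2) / head_dim ** 0.5
    # Simplification: normalise over visual keys only, in float32 for stability.
    attn = torch.softmax(logits.float(), dim=-1)
    # Sum over text tokens, then average over heads.
    return attn.sum(dim=1).mean(dim=0)


torch.manual_seed(0)
q_text = torch.randn(4, 35, 16)     # toy sizes: 4 heads, 35 text tokens
k_visual = torch.randn(4, 100, 16)  # 100 visual tokens
scores = score_visual_tokens(q_text, k_visual)
print(scores.shape, scores.topk(5).indices.tolist())
```

Because each text row of the softmax sums to 1, the scores sum to the number of text tokens, which makes them easy to sanity-check.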

KV cache pruning

After scoring, the KV cache is sliced to keep only the top-K visual entries plus all text entries. The model then decodes with a smaller cache — fewer keys to attend over per decode step.

Prefill:  build KV cache for all 16320 visual tokens
Score:    rank each visual token by text attention (32MB op)
Prune:    keep top-K, drop the rest
Decode:   attend over K + N_text keys instead of 16320 + N_text
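
The prune step amounts to an index selection along the sequence axis of each layer's key and value tensors. A minimal sketch follows, assuming a `[batch, heads, seq, head_dim]` layout and visual tokens stored contiguously starting at `vis_start`; `prune_kv` and its parameters are hypothetical names, not part of the sparsevlm API.

```python
import torch


def prune_kv(keys, values, scores, vis_start, keep_n_vis):
    """Keep all non-visual entries plus the top-scoring visual entries.

    keys, values: [batch, heads, seq, head_dim] for one layer
    scores:       [n_vis] importance of each visual token
    vis_start:    index of the first visual token in the sequence
    """
    n_vis = scores.shape[0]
    seq_len = keys.shape[2]
    device = keys.device
    # Top-K visual positions, re-sorted so the original token order is preserved.
    top = torch.topk(scores, keep_n_vis).indices.sort().values.to(device) + vis_start
    before = torch.arange(0, vis_start, device=device)
    after = torch.arange(vis_start + n_vis, seq_len, device=device)
    keep = torch.cat([before, top, after])
    return keys[:, :, keep], values[:, :, keep], keep


# Toy example: 10 positions, visual tokens at positions 2..6, keep 2 of the 5.
keys = torch.randn(1, 2, 10, 4)
values = torch.randn(1, 2, 10, 4)
scores = torch.tensor([0.1, 0.9, 0.2, 0.8, 0.3])
k_small, v_small, keep = prune_kv(keys, values, scores, vis_start=2, keep_n_vis=2)
print(keep.tolist(), k_small.shape)
```

Sorting the kept indices keeps the surviving entries in their original order, so the remaining keys still line up with the positions they were encoded at during prefill.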

Position fix (rope_deltas)

After pruning, Qwen2.5-VL's internal position counter (rope_deltas) is adjusted so decode tokens get correct positional embeddings despite the shorter cache.
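
Why an adjustment is needed can be shown with plain arithmetic. The sketch below assumes the decode position is derived as cache length plus `rope_deltas`, and that the fix adds the number of removed tokens back to the offset. Both are assumptions made for illustration, and the offset value and `next_position` helper are hypothetical.

```python
def next_position(cache_len: int, rope_delta: int) -> int:
    """Position assigned to the next decoded token (assumed: cache length + offset)."""
    return cache_len + rope_delta


n_vis, n_text, keep_n_vis = 16320, 35, 4080
rope_delta = -16000                      # hypothetical offset set during prefill

prefill_len = n_vis + n_text
expected = next_position(prefill_len, rope_delta)   # position without pruning

n_removed = n_vis - keep_n_vis
pruned_len = prefill_len - n_removed

# Without a fix, the shorter cache shifts the position back by n_removed.
shifted = next_position(pruned_len, rope_delta)

# Compensating the offset restores the position the model expects.
fixed = next_position(pruned_len, rope_delta + n_removed)

print(expected, shifted, fixed)
```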


API

sparsevlm_generate

from sparsevlm import sparsevlm_generate

output = sparsevlm_generate(
    model,                  # Qwen2_5_VLForConditionalGeneration
    processor,              # AutoProcessor
    inputs,                 # dict from processor(...)
    n_vis,                  # total visual tokens in the sequence
    keep_n_vis,             # how many to keep (e.g. n_vis // 4 for 25%)
    max_new_tokens=256,     # generation length
    target_layer=2,         # which layer to score from (default 2)
    device="cuda",          # primary device
)
# returns: token ids [B, max_new_tokens]

apply_sparsevlm / remove_hooks (hook-based API)

from sparsevlm import apply_sparsevlm, reset_n_vis, remove_hooks

state = apply_sparsevlm(model, n_vis=256)
reset_n_vis(state, n_vis=256)   # call before each generate
output = model.generate(...)
remove_hooks(state)

Model support

| Model | Status |
|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | ✅ Tested |
| Qwen/Qwen2.5-VL-3B-Instruct | ✅ Should work |
| Qwen/Qwen2.5-VL-72B-Instruct | ✅ Should work |
| Qwen/Qwen2-VL-* | ✅ Legacy support |

Limitations

  • Requires attn_implementation="eager" or "sdpa". Flash Attention 2 (separate package) is not required.
  • Speedup is modest (~1.1–1.4×) for single-image, short-generation use cases. The gain comes from long generations, high-resolution images, or batched serving.
  • Currently tested with Qwen2.5-VL. Other VLM families would need architecture-specific adaptation.

Citation

@inproceedings{zhang2024sparsevlm,
  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and
          Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and
          Nakata, Yohei and Keutzer, Kurt and Zhang, Shanghang},
  booktitle={ICML},
  year={2025}
}

Apache 2.0 license.
