Training-free visual token sparsification for vision-language models (ICML 2025)
license: apache-2.0
tags:
- vision-language-model
- inference-optimization
- token-pruning
- qwen2-vl
library_name: sparsevlm
SparseVLM
Training-free visual token pruning for Qwen2.5-VL. It scores visual tokens by how strongly the text tokens attend to them, prunes the low-scoring ones from the KV cache, and decodes with the smaller cache.
Based on SparseVLM: Visual Token Sparsification for Efficient VLM Inference (ICML 2025).
Install
pip install sparsevlm
Requirements: Python 3.10+, PyTorch 2.1+, transformers 4.49+
Quick start
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from sparsevlm import sparsevlm_generate
from PIL import Image
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
image = Image.open("your_image.jpg")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Describe this image in detail."}
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
# count visual tokens
n_vis = int((inputs["image_grid_thw"][0].prod() / 4).item())
output = sparsevlm_generate(
    model, processor, inputs,
    n_vis=n_vis,
    keep_n_vis=n_vis // 4,  # keep 25% of visual tokens
    max_new_tokens=256,
)
print(processor.decode(output[0][1:], skip_special_tokens=True))
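The division by 4 in `n_vis` corresponds to Qwen2.5-VL merging each 2×2 block of vision patches into one visual token. A minimal sketch of that arithmetic in plain Python follows. It assumes a 14-pixel patch size and a 2×2 merge, neither of which is stated above, and `count_visual_tokens` is a hypothetical helper, not part of sparsevlm.

```python
def count_visual_tokens(grid_thw, merge_size=2):
    """Visual tokens for one image given its (t, h, w) patch grid.

    Assumes each merge_size x merge_size block of patches becomes one
    token, which is what the `/ 4` in the quick start corresponds to.
    """
    t, h, w = grid_thw
    return (t * h * w) // (merge_size * merge_size)


# Assumed 14-pixel patches: an 896x504 image gives a 1 x 36 x 64 patch grid.
grid = (1, 504 // 14, 896 // 14)
print(count_visual_tokens(grid))  # 576
```

The result matches the 576 visual tokens reported for the resized photo in the benchmarks.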
Benchmark results
Measured on NVIDIA A100-SXM4-40GB, Qwen2.5-VL-7B-Instruct, bfloat16, SDPA attention.
Real photo — Fuji mountain + Milky Way (4928×2773px, 16320 visual tokens)
| Config | Tokens kept | Time | Speedup | Output quality |
|---|---|---|---|---|
| Baseline | 16320 (100%) | 9738ms | 1.00× | ✅ Identifies Fuji, Milky Way, snow cap, star colors |
| SparseVLM 50% | 8192 | 9441ms | 1.03× | ✅ Same quality |
| SparseVLM 25% | 4080 | 9297ms | 1.05× | ✅ All key details preserved |
| SparseVLM 10% | 1632 | 9425ms | 1.03× | ✅ Still correctly describes scene |
Key result: the full-resolution image (16K visual tokens) runs without OOM. Without SparseVLM's hook-based scoring, scoring the 16K-token image requires materialising a roughly 15GB attention matrix and crashes. The scorer computes only the text→visual submatrix (35 × 16320 scores, about 32MB, instead of about 15GB).
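The 32MB and 15GB figures can be reproduced with a back-of-the-envelope calculation. The sketch below assumes 28 attention heads and 2 bytes per score (bfloat16) for a single layer. Neither number is stated above, and `attn_matrix_bytes` is a hypothetical helper.

```python
def attn_matrix_bytes(n_queries, n_keys, n_heads=28, bytes_per_score=2):
    """Bytes needed to hold one layer's attention scores.

    n_heads=28 and bytes_per_score=2 (bfloat16) are assumptions.
    """
    return n_queries * n_keys * n_heads * bytes_per_score


n_text, n_vis = 35, 16320
sub = attn_matrix_bytes(n_text, n_vis)  # text -> visual submatrix
full = attn_matrix_bytes(n_vis, n_vis)  # visual -> visual block alone
print(f"submatrix: {sub / 1e6:.0f} MB, full: {full / 1e9:.1f} GB")
```

Under these assumptions the script prints `submatrix: 32 MB, full: 14.9 GB`.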
Resized photo (896×504px, 576 visual tokens), batch=1
| Tokens kept | Time | Speedup |
|---|---|---|
| 576 (100%) | 2167ms | 1.00× |
| 288 (50%) | 1685ms | 1.29× |
| 144 (25%) | 1565ms | 1.38× |
| 72 (12.5%) | 1620ms | 1.34× |
When to expect larger speedup
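Speedup grows when the KV cache is large relative to the model weights, since each decode step then spends a larger share of its time reading the cache. The figures below are expectations, not measurements: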
| Scenario | Expected speedup |
|---|---|
| Single image, short generation | ~1.1–1.4× |
| Single image, 256+ output tokens | ~1.5–2.5× |
| Batch=32, high-res images | ~2–4× |
| Very long visual context (10K+ tokens) | ~2–4× |
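One way to see why these scenarios differ is a rough memory-traffic model of a decode step: each step reads the model weights once plus the whole KV cache, so pruning helps only in proportion to the cache's share of that traffic. The sketch below is an illustration, not a measurement. It assumes about 15 GB of bfloat16 weights, 28 layers, and 4 KV heads of dimension 128, it ignores text tokens, prefill, and compute time, and `decode_speedup` is a hypothetical helper.

```python
WEIGHT_BYTES = 15e9  # assumed: roughly 7.5B parameters in bfloat16
# assumed: (K and V) x 28 layers x 4 KV heads x head dim 128 x 2 bytes
KV_BYTES_PER_TOKEN = 2 * 28 * 4 * 128 * 2


def decode_speedup(n_vis, keep_n_vis, batch_size=1):
    """Estimated per-step speedup if decode time tracks bytes read."""
    before = WEIGHT_BYTES + batch_size * n_vis * KV_BYTES_PER_TOKEN
    after = WEIGHT_BYTES + batch_size * keep_n_vis * KV_BYTES_PER_TOKEN
    return before / after


for batch in (1, 32):
    print(batch, round(decode_speedup(16320, 4080, batch), 2))
```

Under these assumptions the estimate is about 1.05× at batch 1 and about 2× at batch 32: the weights dominate the traffic for a single image, while the KV cache dominates once many high-resolution images are batched.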
How it works
Token scoring (no extra parameters)
At decoder layer 2, a lightweight hook intercepts the attention projection and computes:
A_tv = Q_text @ K_visual^T # only the text→visual submatrix
# 35 × 16320 instead of 16320 × 16320
score_i = sum over text tokens of attention to visual token i
Visual tokens with high scores are important to the text query. Low-score tokens are pruned from the KV cache before decoding starts.
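The scoring rule above can be sketched in plain Python for a single attention head. This is an illustration under assumptions, not the package's implementation: it applies a softmax over the visual keys only (the text does not say whether or how scores are normalised), and `score_visual_tokens` is a hypothetical name.

```python
import math


def score_visual_tokens(q_text, k_visual):
    """Sum, over text queries, of the attention each visual key receives.

    q_text:   list of T query vectors (one per text token)
    k_visual: list of V key vectors (one per visual token)
    Returns a list of V scores. Softmax over visual keys is an assumption.
    """
    dim = len(q_text[0])
    scores = [0.0] * len(k_visual)
    for q in q_text:
        # Scaled dot product between this text query and every visual key.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in k_visual]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        for i, e in enumerate(exps):
            scores[i] += e / total
    return scores


q_text = [[1.0, 0.0], [2.0, 0.0]]
k_visual = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
scores = score_visual_tokens(q_text, k_visual)
print(scores.index(max(scores)))  # index of the most attended visual token
```

Only a T × V block of scores is ever formed, which is the point of computing the text→visual submatrix instead of the full attention matrix.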
KV cache pruning
After scoring, the KV cache is sliced to keep only the top-K visual entries plus all text entries. The model then decodes with a smaller cache — fewer keys to attend over per decode step.
Prefill: build KV cache for all 16320 visual tokens
Score: rank each visual token by text attention (32MB op)
Prune: keep top-K, drop the rest
Decode: attend over K + N_text keys instead of 16320 + N_text
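The prune step can be sketched as an index selection over sequence positions: keep every text position, keep the top-K visual positions by score, and preserve the original order. This is a minimal sketch, assuming the visual tokens form one contiguous run starting at `vis_start`; `kept_positions` is a hypothetical helper, and the real package slices key and value tensors rather than Python lists.

```python
def kept_positions(scores, vis_start, seq_len, keep_n_vis):
    """Sequence positions to keep in the KV cache, in original order.

    scores:     one score per visual token
    vis_start:  position of the first visual token (assumed contiguous run)
    seq_len:    total prefill length (text + visual)
    keep_n_vis: number of visual tokens to keep
    """
    n_vis = len(scores)
    # Indices of the keep_n_vis highest-scoring visual tokens.
    top = sorted(range(n_vis), key=lambda i: scores[i], reverse=True)[:keep_n_vis]
    kept_vis = {vis_start + i for i in top}
    vis_end = vis_start + n_vis
    return [p for p in range(seq_len)
            if p < vis_start or p >= vis_end or p in kept_vis]


# 2 text tokens, 4 visual tokens, 2 text tokens; keep the best 2 visual tokens.
print(kept_positions([0.1, 0.9, 0.3, 0.7], vis_start=2, seq_len=8, keep_n_vis=2))
# [0, 1, 3, 5, 6, 7]
```

The returned list could then be used to index the sequence dimension of each layer's cached keys and values.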
Position fix (rope_deltas)
After pruning, Qwen2.5-VL's internal position counter (rope_deltas) is adjusted so decode tokens get correct positional embeddings despite the shorter cache.
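To illustrate why an adjustment is needed, the sketch below assumes the decode position is computed as the current cache length plus `rope_deltas`. That is an assumption about the model internals, not something stated above, and the package's actual adjustment may differ. Both function names and the example delta are hypothetical.

```python
def decode_position(cache_len, rope_delta):
    """Position assigned to the next decoded token (assumed rule)."""
    return cache_len + rope_delta


def adjusted_rope_delta(rope_delta, n_vis, keep_n_vis):
    """Compensate for the cache entries removed by pruning."""
    return rope_delta + (n_vis - keep_n_vis)


n_text, n_vis, keep_n_vis = 35, 576, 144
rope_delta = -500  # hypothetical value for illustration only

before = decode_position(n_text + n_vis, rope_delta)
after = decode_position(
    n_text + keep_n_vis,
    adjusted_rope_delta(rope_delta, n_vis, keep_n_vis),
)
print(before, after)  # both 111: the next token lands at the same position
```

Without the adjustment, the shorter cache would shift every decoded token's position back by the number of pruned entries.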
API
sparsevlm_generate
from sparsevlm import sparsevlm_generate
output = sparsevlm_generate(
    model,               # Qwen2_5_VLForConditionalGeneration
    processor,           # AutoProcessor
    inputs,              # dict from processor(...)
    n_vis,               # total visual tokens in the sequence
    keep_n_vis,          # how many to keep (e.g. n_vis // 4 for 25%)
    max_new_tokens=256,  # generation length
    target_layer=2,      # which layer to score from (default 2)
    device="cuda",       # primary device
)
# returns: token ids [B, max_new_tokens]
apply_sparsevlm / remove_hooks (hook-based API)
from sparsevlm import apply_sparsevlm, reset_n_vis, remove_hooks
state = apply_sparsevlm(model, n_vis=256)
reset_n_vis(state, n_vis=256) # call before each generate
output = model.generate(...)
remove_hooks(state)
Model support
| Model | Status |
|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | ✅ Tested |
| Qwen/Qwen2.5-VL-3B-Instruct | ✅ Should work |
| Qwen/Qwen2.5-VL-72B-Instruct | ✅ Should work |
| Qwen/Qwen2-VL-* | ✅ Legacy support |
Limitations
- Requires attn_implementation="eager" or "sdpa". Flash Attention 2 (separate package) is not required.
- Speedup is modest (~1.1–1.4×) for single-image, short-generation use cases. The gain comes from long generations, high-resolution images, or batched serving.
- Currently tested with Qwen2.5-VL. Other VLM families would need architecture-specific adaptation.
Citation
@inproceedings{zhang2024sparsevlm,
title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and
Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and
Nakata, Yohei and Keutzer, Kurt and Zhang, Shanghang},
booktitle={ICML},
year={2025}
}
Apache 2.0 license.