Training-free visual token sparsification for vision-language models (ICML 2025)
license: apache-2.0
tags:
- vision-language-model
- inference-optimization
- token-pruning
- qwen2-vl
library_name: sparsevlm
SparseVLM — Production Inference Acceleration for Vision-Language Models
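Training-free visual token sparsification for Qwen2.5-VL in one function call. In the placeholder benchmark below, retaining 25% to 50% of the visual tokens gives a 2.2× to 3.4× speedup; accuracy stays within 3% of the baseline at 50% retention and drops by about 5% to 6% at 25%.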
Based on the ICML 2025 paper by Zhang et al., "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference".
Install
pip install sparsevlm
Requirements: Python 3.10+, PyTorch 2.1+, Triton 2.1+
Quick start
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from sparsevlm import sparsevlm_generate

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # required for attention-weight scoring
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load an image ("example.jpg" is a placeholder path)
image = Image.open("example.jpg")

# Prepare inputs normally
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Describe this image."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

# Run SparseVLM: keep the top 64 of 256 visual tokens (25%)
output = sparsevlm_generate(
    model, processor, inputs,
    n_vis=256,        # visual tokens in your sequence
    keep_n_vis=64,    # keep 25%; tune this
    max_new_tokens=256,
)
print(processor.decode(output[0][1:], skip_special_tokens=True))
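The `n_vis=256` value depends on the image resolution, so it is worth deriving rather than hard-coding. The sketch below assumes the Hugging Face Qwen2-VL-style processor behaviour, where `inputs["image_grid_thw"]` holds one `(t, h, w)` patch grid per image and patches are merged in `merge_size × merge_size` groups. The helpers `count_visual_tokens` and `keep_budget` are hypothetical and not part of `sparsevlm`.

```python
from math import prod


def count_visual_tokens(image_grid_thw, merge_size=2):
    """Visual tokens per image, given (t, h, w) patch grids.

    Assumes merge_size x merge_size patches are merged into one token
    (merge_size=2 is the usual Qwen2-VL setting; check your model config).
    """
    return [prod(grid) // (merge_size ** 2) for grid in image_grid_thw]


def keep_budget(n_vis, keep_ratio, min_keep=32):
    """Tokens to keep for a target ratio, never below min_keep or above n_vis."""
    return min(n_vis, max(min_keep, round(n_vis * keep_ratio)))


# A 1 x 32 x 32 patch grid, e.g. a 448 x 448 image with 14-pixel patches.
n_vis = count_visual_tokens([(1, 32, 32)])[0]
keep_n_vis = keep_budget(n_vis, keep_ratio=0.25)
print(n_vis, keep_n_vis)
```

With real inputs, something like `count_visual_tokens(inputs["image_grid_thw"].tolist())` should give the per-image counts to pass as `n_vis`.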
Benchmark
Setup: A100 40GB, Qwen2.5-VL-7B-Instruct, batch size 1.
The figures below are placeholders; replace them with your own numbers from python benchmark/bench_layer1.py.
| Visual tokens retained | Latency (ms) | Speedup | MME (% of baseline) | TextVQA (% of baseline) |
|---|---|---|---|---|
| 256 (100%) | 48 | 1.0× | 100% | 100% |
| 128 (50%) | 22 | 2.2× | 98.2% | 97.6% |
| 96 (37.5%) | 18 | 2.7× | 97.1% | 96.4% |
| 64 (25%) | 14 | 3.4× | 95.3% | 94.1% |
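When replacing the placeholder rows, the derived columns can be recomputed from the raw measurements. A minimal sketch: `summarise` is a hypothetical helper, and the latencies passed to it are simply the values from the table above.

```python
def summarise(rows, baseline_tokens=256):
    """rows: (tokens_retained, latency_ms); the first row is the unpruned baseline."""
    base_latency = rows[0][1]
    summary = []
    for tokens, latency in rows:
        summary.append({
            "tokens": tokens,
            "retained_pct": 100 * tokens / baseline_tokens,
            "speedup": round(base_latency / latency, 1),
        })
    return summary


for row in summarise([(256, 48), (128, 22), (96, 18), (64, 14)]):
    print(row)
```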
How it works
SparseVLM hooks into the LLM decoder's attention layers and reuses attention weights the model already computes — zero extra parameters.
At each target layer:
- Rater selection — text tokens with above-average visual attention
- Visual token scoring — sum of rater attention per visual token
- Rank-adaptive pruning — rank(A_rater) sets the pruning ratio
- Token recycling — pruned tokens clustered into compact representations
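The first three steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the package's implementation: `attn` is taken to be a head-averaged text-to-visual attention matrix, `sparsify_step` and its `scale` parameter are hypothetical, and the rule "prune `scale * (n_vis - rank)` tokens" is one plausible reading of rank-adaptive pruning. Token recycling is omitted.

```python
import numpy as np


def sparsify_step(attn, min_keep=1, scale=1.0):
    """attn: (n_text, n_vis) attention from text queries to visual keys."""
    # 1. Rater selection: text tokens whose total visual attention is above the mean.
    per_text = attn.sum(axis=1)
    raters = np.flatnonzero(per_text > per_text.mean())
    if raters.size == 0:  # all text tokens attend equally; use all of them
        raters = np.arange(attn.shape[0])
    a_rater = attn[raters]

    # 2. Visual token scoring: sum of rater attention per visual token.
    scores = a_rater.sum(axis=0)

    # 3. Rank-adaptive pruning: a low-rank rater matrix signals redundancy,
    #    so more tokens are pruned (assumed rule, see lead-in).
    n_vis = attn.shape[1]
    rank = np.linalg.matrix_rank(a_rater)
    n_prune = int(scale * (n_vis - rank))
    n_keep = max(min_keep, n_vis - n_prune)

    # Keep the highest-scoring visual tokens, restored to sequence order.
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])
    return raters, keep


attn = np.array([
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.05, 0.00, 0.00],
    [0.10, 0.50, 0.10, 0.10],
])
raters, keep = sparsify_step(attn)
print("raters:", raters.tolist(), "kept visual tokens:", keep.tolist())
```

Here the second text token barely attends to the image, so it is not used as a rater; the remaining two raters span a rank-2 matrix over four visual tokens, so two tokens are pruned.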
Three-layer optimisation stack:
- Layer 1 — Triton sparse attention kernel + sketch rank (15-50× faster than SVD)
- Layer 2 — FlashAttention varlen, variable-length packing (no padding waste)
- Layer 3 — CUDA graph bucketing (zero kernel-launch overhead)
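Two bookkeeping ideas behind Layers 2 and 3 can be shown without any GPU code. FlashAttention's varlen interface addresses sequences packed back to back through cumulative sequence lengths, and CUDA graph bucketing rounds each length up to a shape that was captured ahead of time. The sketch below is illustrative only: `cu_seqlens`, `pick_bucket`, and the bucket sizes are hypothetical choices, not the package's internals.

```python
from bisect import bisect_left
from itertools import accumulate


def cu_seqlens(lengths):
    """Cumulative offsets for sequences packed back to back without padding."""
    return [0, *accumulate(lengths)]


def pick_bucket(seq_len, buckets=(64, 128, 256, 512)):
    """Smallest pre-captured bucket that fits seq_len, or None if none does."""
    i = bisect_left(buckets, seq_len)
    return buckets[i] if i < len(buckets) else None


lengths = [96, 64, 128]  # tokens kept per sample after pruning
print(cu_seqlens(lengths))                 # offsets into the packed token buffer
print([pick_bucket(n) for n in lengths])   # graph shape each sample would replay
```

A `None` bucket would mean falling back to a regular (non-graph) launch for that length.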
Configuration
state = apply_sparsevlm(
    model,
    n_vis=256,            # visual tokens per image
    target_layers=None,   # default: every 4th layer from layer 2
    min_keep=32,          # never prune below this
    tau=0.5,              # recycling fraction
    theta=0.5,            # cluster ratio
)
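To see how these settings interact, the sketch below works through the arithmetic. It rests on assumptions that the package may not share: that the default layers are `range(2, num_layers, 4)`, that `tau` is the fraction of pruned tokens that get recycled, and that `theta` times the recycled count gives the number of clusters. Both helper functions are hypothetical, and 28 is used purely as an illustrative layer count.

```python
def default_target_layers(num_layers, start=2, stride=4):
    """Every `stride`-th decoder layer starting at `start` (assumed default)."""
    return list(range(start, num_layers, stride))


def recycling_plan(n_vis, n_keep, min_keep=32, tau=0.5, theta=0.5):
    """Token counts at one target layer under the assumptions in the lead-in."""
    n_keep = min(n_vis, max(min_keep, n_keep))  # enforce the min_keep floor
    n_pruned = n_vis - n_keep
    n_recycled = int(tau * n_pruned)       # pruned tokens that are recycled
    n_clusters = int(theta * n_recycled)   # compact tokens added back
    return {
        "kept": n_keep,
        "pruned": n_pruned,
        "recycled": n_recycled,
        "clusters": n_clusters,
    }


print(default_target_layers(28))
print(recycling_plan(n_vis=256, n_keep=64))
```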
Citation
@inproceedings{zhang2024sparsevlm,
  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and
          Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and
          Nakata, Yohei and Keutzer, Kurt and Zhang, Shanghang},
  booktitle={ICML},
  year={2025}
}
License
Apache 2.0