Training-free visual token sparsification for vision-language models (ICML 2025)


license: apache-2.0
tags:
- vision-language-model
- inference-optimization
- token-pruning
- qwen2-vl
library_name: sparsevlm

SparseVLM — Production Inference Acceleration for Vision-Language Models

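Training-free visual token sparsification for Qwen2.5-VL. Up to 3.4× faster inference in the reference benchmark below, with a relative accuracy drop under 3% at 50% token retention and about 5–6% at 25%. One function call.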

Based on the ICML 2025 paper by Zhang et al., "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference".

Install

pip install sparsevlm

Requirements: Python 3.10+, PyTorch 2.1+, Triton 2.1+

Quick start

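import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from sparsevlm import sparsevlm_generate

# Load the input image (placeholder path; point this at your own file)
image = Image.open("example.jpg").convert("RGB")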

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",   # required for attention-weight scoring
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Prepare inputs normally
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": "Describe this image."}
]}]
text   = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Run SparseVLM: keep the top 64 of 256 visual tokens (25%)
output = sparsevlm_generate(
    model, processor, inputs,
    n_vis=256,          # visual tokens in your sequence (varies with image resolution)
    keep_n_vis=64,      # keep 25%; tune this for your speed/accuracy target
    max_new_tokens=256,
)
print(processor.decode(output[0][1:], skip_special_tokens=True))
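Because Qwen2.5-VL resizes images dynamically, the number of visual tokens is not always 256. The helper below is a minimal sketch of how `n_vis` could be derived from the processor output. It assumes the usual Qwen2-VL convention that `image_grid_thw` holds one `(t, h, w)` patch grid per image and that patches are merged 2×2 before entering the LLM; `count_visual_tokens` and `keep_for_ratio` are hypothetical names, not part of sparsevlm.

```python
def count_visual_tokens(image_grid_thw, merge_size=2):
    """Count visual tokens seen by the LLM for a list of (t, h, w) patch grids.

    Assumes each merge_size x merge_size block of patches becomes one token.
    """
    total = 0
    for t, h, w in image_grid_thw:
        total += (t * h * w) // (merge_size ** 2)
    return total


def keep_for_ratio(n_vis, ratio):
    """Translate a retention ratio (0 < ratio <= 1) into a token count."""
    return max(1, round(n_vis * ratio))


# A single image with a 32 x 32 patch grid yields 256 visual tokens.
n_vis = count_visual_tokens([(1, 32, 32)])
keep_n_vis = keep_for_ratio(n_vis, 0.25)
print(n_vis, keep_n_vis)
```

With real inputs, the grid list would come from `inputs["image_grid_thw"].tolist()`, and the two results can be passed as `n_vis` and `keep_n_vis`.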

Benchmark

Reference setup: A100 40GB, Qwen2.5-VL-7B-Instruct, batch size 1. Treat the numbers below as illustrative and replace them with your own measurements from python benchmark/bench_layer1.py.

Tokens retained	Latency	Speedup	MME (% of baseline)	TextVQA (% of baseline)
256 (100%)	48 ms	1.0×	100%	100%
128 (50%)	22 ms	2.2×	98.2%	97.6%
96 (37.5%)	18 ms	2.7×	97.1%	96.4%
64 (25%)	14 ms	3.4×	95.3%	94.1%
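The speedup column is the baseline latency divided by the latency at each retention level. The sketch below shows one way to collect and reduce such numbers; `measure_latency_ms` and `speedup_table` are hypothetical helpers and are not the contents of benchmark/bench_layer1.py.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, iters=10):
    """Median wall-clock latency of fn() in milliseconds.

    For GPU work, fn should end with torch.cuda.synchronize() so that
    asynchronous kernels are included in the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup_table(latency_ms, baseline_key):
    """Speedup of every entry relative to latency_ms[baseline_key]."""
    base = latency_ms[baseline_key]
    return {k: round(base / v, 1) for k, v in latency_ms.items()}


# Latencies from the table above, keyed by retained visual tokens.
latencies = {256: 48.0, 128: 22.0, 96: 18.0, 64: 14.0}
speedups = speedup_table(latencies, baseline_key=256)
print(speedups)
```

Using the median rather than the mean keeps a single slow iteration from skewing the result.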

How it works

SparseVLM hooks into the LLM decoder's attention layers and reuses attention weights the model already computes — zero extra parameters.

At each target layer:

Rater selection — text tokens with above-average visual attention
Visual token scoring — sum of rater attention per visual token
Rank-adaptive pruning — rank(A_rater) sets the pruning ratio
Token recycling — pruned tokens clustered into compact representations
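The first three steps can be sketched on a toy attention matrix in plain Python. This is an illustration under stated assumptions, not the package's kernel: `attn[i][j]` is taken to be head-averaged attention from query `i` to key `j`, and the rule `n_prune = lam * (n_vis - rank)` with a hypothetical scale `lam` is one plausible reading of "rank sets the pruning ratio". Token recycling is left out.

```python
def select_raters(attn, txt_idx, vis_idx):
    """Text tokens whose total attention to visual tokens is at least average."""
    mass = {t: sum(attn[t][v] for v in vis_idx) for t in txt_idx}
    mean_mass = sum(mass.values()) / len(mass)
    # Ties with the mean are kept so the rater set is never empty.
    return [t for t in txt_idx if mass[t] >= mean_mass]


def score_visual_tokens(attn, raters, vis_idx):
    """Score of a visual token = attention it receives summed over raters."""
    return {v: sum(attn[r][v] for r in raters) for v in vis_idx}


def numerical_rank(rows, tol=1e-6):
    """Rank of a small dense matrix by Gaussian elimination with pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            f = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= f * m[rank][c]
        rank += 1
        if rank == n_rows:
            break
    return rank


def rank_adaptive_keep(n_vis, rank, lam=1.0, min_keep=1):
    """Assumed rule: prune lam * (n_vis - rank) tokens, never below min_keep."""
    n_prune = int(lam * (n_vis - rank))
    return min(n_vis, max(min_keep, n_vis - n_prune))


def top_k(scores, k):
    """Indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])


# Toy sequence: tokens 0-3 are visual, tokens 4-6 are text.
vis_idx, txt_idx = [0, 1, 2, 3], [4, 5, 6]
attn = [[0.0] * 7 for _ in range(7)]
attn[4][:4] = [0.40, 0.10, 0.05, 0.05]
attn[5][:4] = [0.05, 0.05, 0.30, 0.10]
attn[6][:4] = [0.02, 0.02, 0.03, 0.03]

raters = select_raters(attn, txt_idx, vis_idx)
scores = score_visual_tokens(attn, raters, vis_idx)
rater_matrix = [[attn[r][v] for v in vis_idx] for r in raters]
keep = rank_adaptive_keep(len(vis_idx), numerical_rank(rater_matrix))
kept = top_k(scores, keep)
print(raters, keep, kept)
```

In this toy case, token 6 barely attends to the image and is not a rater, and the two visual tokens that the remaining raters focus on are the ones retained.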

Three-layer optimisation stack:

Layer 1 — Triton sparse attention kernel + sketch rank (15-50× faster than SVD)
Layer 2 — FlashAttention varlen, variable-length packing (no padding waste)
Layer 3 — CUDA graph bucketing (zero kernel-launch overhead)
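Layers 2 and 3 rely on two bookkeeping ideas that are easy to show in isolation. The sketch below packs variable-length sequences into one flat buffer with cumulative offsets (the `cu_seqlens` layout used by FlashAttention's varlen interface) and rounds a sequence length up to a fixed bucket so one captured CUDA graph per bucket can be reused. The function names and bucket sizes are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def pack_varlen(seqs):
    """Concatenate sequences without padding.

    Returns the flat token list, cumulative offsets (sequence i occupies
    flat[cu_seqlens[i]:cu_seqlens[i + 1]]), and the longest sequence length.
    """
    flat = [tok for seq in seqs for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in seqs))
    max_seqlen = max(len(seq) for seq in seqs)
    return flat, cu_seqlens, max_seqlen


def pick_bucket(seq_len, buckets=(128, 256, 512, 1024)):
    """Smallest bucket that fits seq_len; buckets must be sorted ascending."""
    i = bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket")
    return buckets[i]


flat, cu_seqlens, max_seqlen = pack_varlen([[1, 2, 3], [4], [5, 6]])
print(flat, cu_seqlens, max_seqlen)
print(pick_bucket(200))
```

Packing avoids spending compute on padding after pruning leaves each sample with a different length, while bucketing trades a small amount of padding for a fixed set of shapes that can be captured once and replayed.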

Configuration

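from sparsevlm import apply_sparsevlm  # assumed to be exported at the package top level, like sparsevlm_generate

state = apply_sparsevlm(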
    model,
    n_vis=256,          # visual tokens per image
    target_layers=None, # default: every 4th layer from layer 2
    min_keep=32,        # never prune below this
    tau=0.5,            # recycling fraction
    theta=0.5,          # cluster ratio
)
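To make the defaults concrete, the sketch below works through the arithmetic the inline comments suggest. The interpretations are assumptions drawn only from those comments: "every 4th layer from layer 2" is read as `range(2, num_layers, 4)`, `tau` as the fraction of pruned tokens that are recycled, and `theta` as the number of clusters per recycled token. All three function names are hypothetical.

```python
def default_target_layers(num_layers, start=2, stride=4):
    """Assumed default: every 4th decoder layer, starting at layer 2."""
    return list(range(start, num_layers, stride))


def clamp_keep(requested_keep, n_vis, min_keep=32):
    """Apply the min_keep floor without exceeding the available tokens."""
    return min(n_vis, max(min_keep, requested_keep))


def recycling_budget(n_pruned, tau=0.5, theta=0.5):
    """Assumed reading: recycle tau of the pruned tokens into theta * that many clusters."""
    n_recycled = int(tau * n_pruned)
    n_clusters = max(1, int(theta * n_recycled)) if n_recycled else 0
    return n_recycled, n_clusters


# Qwen2.5-VL-7B has a 28-layer decoder.
print(default_target_layers(28))
print(clamp_keep(16, n_vis=256))
print(recycling_budget(n_pruned=192))
```

Under these assumptions, keeping 64 of 256 tokens prunes 192, of which 96 would be recycled into 48 cluster tokens, so the effective sequence holds 112 visual tokens rather than 64.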

Citation

@inproceedings{zhang2024sparsevlm,
  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and
          Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and
          Nakata, Yohei and Keutzer, Kurt and Zhang, Shanghang},
  booktitle={ICML},
  year={2025}
}

License

Apache 2.0
