Training-free visual token sparsification for vision-language models (ICML 2025)

license: apache-2.0
tags: vision-language-model, inference-optimization, token-pruning, qwen2-vl
library_name: sparsevlm

SparseVLM

Training-free visual token pruning for Qwen2.5-VL. It scores visual tokens by how much attention the text tokens pay to them, prunes the low-scoring ones from the KV cache, and decodes with the smaller cache.

Based on SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference (ICML 2025).

Install

pip install sparsevlm

Requirements: Python 3.10+, PyTorch 2.1+, transformers 4.49+

Quick start

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from sparsevlm import sparsevlm_generate
from PIL import Image

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # "sdpa" is also supported; the benchmarks below used SDPA
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

image = Image.open("your_image.jpg")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": "Describe this image in detail."}
]}]
text   = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

# count visual tokens: patches in the (t, h, w) grid divided by the 2x2 spatial merge
n_vis = int(inputs["image_grid_thw"][0].prod().item()) // 4

output = sparsevlm_generate(
    model, processor, inputs,
    n_vis=n_vis,
    keep_n_vis=n_vis // 4,   # keep 25% of visual tokens
    max_new_tokens=256,
)
print(processor.decode(output[0][1:], skip_special_tokens=True))
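
The division by 4 in the token count appears to reflect Qwen2.5-VL merging each 2×2 block of image patches into one visual token. A minimal sketch of that arithmetic follows; count_visual_tokens and merge_size are illustrative names, not part of sparsevlm, and the 14 px patch size in the comment is an assumption.

```python
def count_visual_tokens(grid_thw, merge_size=2):
    """Visual tokens for one image given its (t, h, w) patch grid.

    Assumes each merge_size x merge_size block of patches is merged
    into a single visual token (hypothetical helper, not a sparsevlm API).
    """
    t, h, w = grid_thw
    return (t * h * w) // (merge_size * merge_size)


# 896x504 px at an assumed 14 px patch size gives a 1 x 36 x 64 patch grid.
print(count_visual_tokens((1, 36, 64)))  # 576
```

The result matches the 576 visual tokens reported for the resized photo in the benchmarks.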

Benchmark results

Measured on NVIDIA A100-SXM4-40GB, Qwen2.5-VL-7B-Instruct, bfloat16, SDPA attention.

Real photo — Fuji mountain + Milky Way (4928×2773px, 16320 visual tokens)

Config	Tokens kept	Time	Speedup	Output quality
Baseline	16320 (100%)	9738ms	1.00×	✅ Identifies Fuji, Milky Way, snow cap, star colors
SparseVLM 50%	8192	9441ms	1.03×	✅ Same quality
SparseVLM 25%	4080	9297ms	1.05×	✅ All key details preserved
SparseVLM 10%	1632	9425ms	1.03×	✅ Still correctly describes scene

Key result: the full-resolution image (16,320 visual tokens) runs without running out of memory. Scoring from a full attention map would mean materialising a 16320 × 16320 attention matrix (about 15GB), which crashes. The hook-based scorer computes only the text→visual submatrix (35 × 16320, about 32MB instead of 15GB).
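
The 15GB and 32MB figures can be reproduced with a little arithmetic. The sketch below assumes 28 attention heads and 2 bytes per bfloat16 element for a single layer; both values are assumptions chosen because they match the numbers quoted above, and attention_bytes is an illustrative helper.

```python
def attention_bytes(n_query, n_key, n_heads=28, bytes_per_elem=2):
    """Bytes needed to hold one layer's attention scores.

    n_heads=28 and bytes_per_elem=2 (bfloat16) are assumptions.
    """
    return n_query * n_key * n_heads * bytes_per_elem


n_vis, n_text = 16320, 35
full = attention_bytes(n_vis, n_vis)   # every visual token attends to every visual token
sub = attention_bytes(n_text, n_vis)   # only text queries against visual keys

print(f"full visual x visual: {full / 1e9:.1f} GB")  # 14.9 GB
print(f"text -> visual only:  {sub / 1e6:.1f} MB")   # 32.0 MB
```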

Resized photo (896×504px, 576 visual tokens), batch=1

Tokens kept	Time	Speedup
576 (100%)	2167ms	1.00×
288 (50%)	1685ms	1.29×
144 (25%)	1565ms	1.39×
72 (12%)	1620ms	1.34×

When to expect larger speedup

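Speedup grows when the KV cache is large relative to the model weights, because pruning shrinks only the cache that each decode step reads. The figures below are rough expectations, not measurements. Prefill still processes every visual token, which is likely why the 16,320-token benchmark above gained only 1.03–1.05×: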

Scenario	Expected speedup
Single image, short generation	~1.1–1.4×
Single image, 256+ output tokens	~1.5–2.5×
Batch=32, high-res images	~2–4×
Very long visual context (10K+ tokens)	~2–4×
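
To get a feel for when the KV cache becomes large relative to the weights, the cache size can be estimated per token. The layer count, KV-head count, head dimension, and the nominal 7B-parameter weight size below are assumptions about Qwen2.5-VL-7B-Instruct, not values taken from sparsevlm. Text tokens are ignored for simplicity.

```python
def kv_cache_bytes(n_tokens, batch=1, n_layers=28, n_kv_heads=4,
                   head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size; the leading 2 counts keys and values.

    All architecture defaults are assumptions for a 7B-class model.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens * batch


WEIGHT_BYTES = 7e9 * 2  # nominal 7B parameters in bfloat16 (assumption)

for label, tokens, batch in [
    ("576 tokens, batch 1", 576, 1),
    ("16320 tokens, batch 1", 16320, 1),
    ("16320 tokens, batch 32", 16320, 32),
]:
    size = kv_cache_bytes(tokens, batch)
    print(f"{label}: {size / 1e9:.2f} GB, {size / WEIGHT_BYTES:.0%} of weights")
```

Under these assumptions a single image's cache is a small fraction of the weights, while a large batch of high-resolution images exceeds them, which is consistent with the scenarios in the table.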

How it works

Token scoring (no extra parameters)

At decoder layer 2, a lightweight hook intercepts the attention projection and computes:

A_tv = Q_text @ K_visual^T   # only the text→visual submatrix
                              # 35 × 16320 instead of 16320 × 16320
score_i = sum over text tokens of attention to visual token i

Visual tokens with high scores are important to the text query. Low-score tokens are pruned from the KV cache before decoding starts.
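
The scoring rule can be sketched in plain Python on toy vectors. This is an illustration only: it uses a single head, applies a softmax over the visual keys alone, and scales by the square root of the dimension, all of which are assumptions. The actual hook operates on tensors, and score_visual_tokens and top_k_indices are hypothetical names.

```python
import math


def score_visual_tokens(q_text, k_visual):
    """Sum, over text queries, of softmax attention to each visual key."""
    dim = len(k_visual[0])
    scores = [0.0] * len(k_visual)
    for q in q_text:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in k_visual]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        for i, w in enumerate(weights):
            scores[i] += w / total
    return scores


def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


q_text = [[1.0, 0.0], [0.9, 0.1]]                              # 2 text queries
k_visual = [[4.0, 0.0], [0.0, 4.0], [3.0, 0.0], [-4.0, 0.0]]   # 4 visual keys
scores = score_visual_tokens(q_text, k_visual)
print(top_k_indices(scores, 2))  # [0, 2]
```

Only a text-by-visual block of scores is ever formed, which is the point of the submatrix trick described above.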

KV cache pruning

After scoring, the KV cache is sliced to keep only the top-K visual entries plus all text entries. The model then decodes with a smaller cache — fewer keys to attend over per decode step.

Prefill:  build KV cache for all 16320 visual tokens
Score:    rank each visual token by text attention (32MB op)
Prune:    keep top-K, drop the rest
Decode:   attend over K + N_text keys instead of 16320 + N_text
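
The prune step amounts to choosing which cache positions survive. A minimal sketch, assuming the visual tokens occupy one contiguous span of the sequence; pruned_cache_positions is an illustrative helper, not a sparsevlm function.

```python
def pruned_cache_positions(seq_len, vis_start, n_vis, keep_visual):
    """Cache positions to keep: all text plus the chosen visual tokens.

    keep_visual holds indices relative to the first visual token.
    Assumes the visual tokens form one contiguous span.
    """
    vis_end = vis_start + n_vis
    keep = {vis_start + i for i in keep_visual}
    return [p for p in range(seq_len)
            if not (vis_start <= p < vis_end) or p in keep]


# 3 text tokens, then 6 visual tokens, then 2 text tokens; keep visual 1 and 4.
positions = pruned_cache_positions(seq_len=11, vis_start=3, n_vis=6,
                                   keep_visual=[1, 4])
print(positions)  # [0, 1, 2, 4, 7, 9, 10]
```

With real tensors, the same index list could be used to select entries along the sequence dimension of each layer's keys and values, for example with torch.index_select.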

Position fix (`rope_deltas`)

After pruning, Qwen2.5-VL's internal position counter (rope_deltas) is adjusted so decode tokens get correct positional embeddings despite the shorter cache.
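
One plausible reading of this adjustment, shown as a toy model: if the next decode position is computed as cache length plus rope_deltas, then dropping tokens from the cache must be compensated by raising the offset by the number of dropped tokens. The formula and the example values below are assumptions for illustration, not the library's confirmed logic.

```python
def adjusted_rope_delta(rope_delta, n_vis, keep_n_vis):
    """Offset that keeps decode positions unchanged after pruning (toy model)."""
    return rope_delta + (n_vis - keep_n_vis)


def next_decode_position(cache_len, rope_delta):
    """Assumed rule: position of the next token = cache length + offset."""
    return cache_len + rope_delta


# Hypothetical numbers: 611-token prompt, 576 visual tokens, keep 144.
seq_len, n_vis, keep_n_vis, rope_delta = 611, 576, 144, -540
dropped = n_vis - keep_n_vis

before = next_decode_position(seq_len, rope_delta)
after = next_decode_position(seq_len - dropped,
                             adjusted_rope_delta(rope_delta, n_vis, keep_n_vis))
print(before, after)  # 71 71
```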

API

`sparsevlm_generate`

from sparsevlm import sparsevlm_generate

output = sparsevlm_generate(
    model,                  # Qwen2_5_VLForConditionalGeneration
    processor,              # AutoProcessor
    inputs,                 # dict from processor(...)
    n_vis,                  # total visual tokens in the sequence
    keep_n_vis,             # how many to keep (e.g. n_vis // 4 for 25%)
    max_new_tokens=256,     # generation length
    target_layer=2,         # which layer to score from (default 2)
    device="cuda",          # primary device
)
# returns: token ids [B, max_new_tokens]
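
When the budget is expressed as a ratio rather than a count, a small helper can derive keep_n_vis. This is a convenience sketch; keep_count is a hypothetical name and not part of the sparsevlm API.

```python
def keep_count(n_vis, ratio):
    """Number of visual tokens to keep for a given ratio, at least 1."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    return max(1, int(n_vis * ratio))


print(keep_count(16320, 0.25))  # 4080
print(keep_count(576, 0.125))   # 72
```

The result would then be passed as keep_n_vis in the call above.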

`apply_sparsevlm` / `remove_hooks` (hook-based API)

from sparsevlm import apply_sparsevlm, reset_n_vis, remove_hooks

state = apply_sparsevlm(model, n_vis=256)   # register the hooks on the model
reset_n_vis(state, n_vis=256)               # call before each generate
output = model.generate(...)
remove_hooks(state)                         # remove the hooks when done
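
Because the hooks stay attached until remove_hooks is called, it may be worth making removal exception-safe. The sketch below shows one way to do that with a generic context manager. hooks_installed is a hypothetical helper, and the stand-in functions exist only so the example runs without a model; with the real library they would be apply_sparsevlm and remove_hooks.

```python
from contextlib import contextmanager


@contextmanager
def hooks_installed(apply_fn, remove_fn, *args, **kwargs):
    """Install hooks via apply_fn and always undo them via remove_fn."""
    state = apply_fn(*args, **kwargs)
    try:
        yield state
    finally:
        remove_fn(state)


# Stand-ins for apply_sparsevlm / remove_hooks (illustration only).
log = []


def fake_apply(model, n_vis):
    log.append(("apply", n_vis))
    return {"n_vis": n_vis}


def fake_remove(state):
    log.append(("remove", state["n_vis"]))


with hooks_installed(fake_apply, fake_remove, "model", n_vis=256) as state:
    log.append(("generate", state["n_vis"]))  # model.generate would go here

print(log)
```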

Model support

Model	Status
Qwen/Qwen2.5-VL-7B-Instruct	✅ Tested
Qwen/Qwen2.5-VL-3B-Instruct	Untested, should work
Qwen/Qwen2.5-VL-72B-Instruct	Untested, should work
Qwen/Qwen2-VL-*	✅ Legacy support

Limitations

Requires attn_implementation="eager" or "sdpa". The separate Flash Attention 2 package is not needed.
Speedup is modest (~1.1–1.4×) for single-image, short-generation use cases. The gain comes from long generations, high-resolution images, or batched serving.
Currently tested only with Qwen2.5-VL-7B-Instruct. Other VLM families would need architecture-specific adaptation.

Citation

@inproceedings{zhang2024sparsevlm,
  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and
          Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and
          Nakata, Yohei and Keutzer, Kurt and Zhang, Shanghang},
  booktitle={ICML},
  year={2025}
}

Apache 2.0 license.

Release history

0.1.3 (this version): Jun 5, 2026
0.1.2: Jun 5, 2026
0.1.0: Jun 4, 2026
