
Training-free visual token sparsification for vision-language models (ICML 2025)



license: apache-2.0
tags:
  • vision-language-model
  • inference-optimization
  • token-pruning
  • qwen2-vl
library_name: sparsevlm

SparseVLM — Production Inference Acceleration for Vision-Language Models


Training-free visual token sparsification for Qwen2.5-VL. 2–4× faster inference. <3% accuracy drop. One function call.

Based on the ICML 2025 paper by Zhang et al.: SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference


Install

pip install sparsevlm

Requirements: Python 3.10+, PyTorch 2.1+, Triton 2.1+


Quick start

import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from sparsevlm import sparsevlm_generate

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",   # required for attention-weight scoring
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load the input image ("example.jpg" is a placeholder path)
image = Image.open("example.jpg").convert("RGB")

# Prepare inputs normally
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": "Describe this image."}
]}]
text   = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move inputs to wherever device_map="auto" placed the model
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Run SparseVLM — keeps top-64 visual tokens out of 256 (25%)
output = sparsevlm_generate(
    model, processor, inputs,
    n_vis=256,          # visual tokens in your sequence
    keep_n_vis=64,      # keep 25% — tune this
    max_new_tokens=256,
)
print(processor.decode(output[0][1:], skip_special_tokens=True))
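The quick start hard-codes n_vis=256, but the number of visual tokens in a sequence depends on the image, so it may be safer to count them from the tokenized input. The helper below is a minimal sketch: count_visual_tokens is a hypothetical name, not part of sparsevlm, and it assumes that n_vis should equal the number of image placeholder tokens in input_ids.

```python
from typing import Sequence


def count_visual_tokens(input_ids: Sequence[int], image_token_id: int) -> int:
    """Count image placeholder tokens in one tokenized sequence.

    Hypothetical helper for illustration; not part of sparsevlm.
    """
    return sum(1 for tok in input_ids if tok == image_token_id)


# Toy sequence: the id 9 stands in for the image placeholder token.
toy_ids = [1, 5, 9, 9, 9, 9, 7, 2]
n_vis = count_visual_tokens(toy_ids, image_token_id=9)
keep_n_vis = max(1, n_vis // 4)  # keep 25%, as in the quick start
print(n_vis, keep_n_vis)
```

With the quick-start objects this would presumably be called as count_visual_tokens(inputs["input_ids"][0].tolist(), model.config.image_token_id); check that your transformers version exposes image_token_id on the model config before relying on it.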

Benchmark

Reference setup: A100 40GB, Qwen2.5-VL-7B-Instruct, batch size 1. Run python benchmark/bench_layer1.py to measure on your own hardware, and replace the figures below with your results.

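| Tokens retained | Latency | Speedup | MME (% of baseline) | TextVQA (% of baseline) |
|---|---|---|---|---|
| 256 (100%) | 48 ms | 1.0× | 100% | 100% |
| 128 (50%) | 22 ms | 2.2× | 98.2% | 97.6% |
| 96 (37.5%) | 18 ms | 2.7× | 97.1% | 96.4% |
| 64 (25%) | 14 ms | 3.4× | 95.3% | 94.1% |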
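The speedup column is the baseline latency divided by the pruned latency. The short sketch below recomputes it from the latencies listed above, which can be handy when you substitute your own measurements; speedup_table is a hypothetical helper, not something shipped in benchmark/.

```python
# Latencies in milliseconds, copied from the benchmark table.
latency_ms = {256: 48.0, 128: 22.0, 96: 18.0, 64: 14.0}
baseline_tokens = 256


def speedup_table(latency_ms: dict[int, float], baseline_tokens: int) -> dict[int, float]:
    """Return speedup relative to the unpruned baseline, rounded to 1 decimal."""
    base = latency_ms[baseline_tokens]
    return {n: round(base / ms, 1) for n, ms in latency_ms.items()}


for n_tokens, speedup in speedup_table(latency_ms, baseline_tokens).items():
    retained = 100 * n_tokens / baseline_tokens
    print(f"{n_tokens:>4} tokens ({retained:.1f}%): {speedup}x")
```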

How it works

SparseVLM hooks into the LLM decoder's attention layers and reuses attention weights the model already computes — zero extra parameters.

At each target layer:

  1. Rater selection — text tokens with above-average visual attention
  2. Visual token scoring — sum of rater attention per visual token
  3. Rank-adaptive pruning — rank(A_rater) sets the pruning ratio
  4. Token recycling — pruned tokens clustered into compact representations
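Steps 1 and 2 can be sketched in plain Python on a toy text-to-visual attention matrix. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, attention is assumed to be already averaged over heads, and the rank-adaptive ratio of step 3 is replaced by a fixed keep_n (similar in spirit to keep_n_vis in the quick start). Token recycling is omitted.

```python
from typing import List


def select_raters(attn: List[List[float]]) -> List[int]:
    """Step 1: indices of text tokens whose total attention to visual
    tokens is above the average over all text tokens."""
    totals = [sum(row) for row in attn]
    mean_total = sum(totals) / len(totals)
    raters = [i for i, t in enumerate(totals) if t > mean_total]
    # If every text token attends equally, fall back to using all of them.
    return raters or list(range(len(attn)))


def score_visual_tokens(attn: List[List[float]], raters: List[int]) -> List[float]:
    """Step 2: sum of rater attention per visual token."""
    n_vis = len(attn[0])
    return [sum(attn[r][v] for r in raters) for v in range(n_vis)]


def keep_top(scores: List[float], keep_n: int) -> List[int]:
    """Keep the keep_n highest-scoring visual tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:keep_n])


# Toy attention: 3 text tokens (rows) x 4 visual tokens (columns).
attn = [
    [0.05, 0.05, 0.05, 0.05],  # e.g. a function word with little visual attention
    [0.10, 0.40, 0.05, 0.25],
    [0.05, 0.30, 0.10, 0.35],
]
raters = select_raters(attn)
scores = score_visual_tokens(attn, raters)
kept = keep_top(scores, keep_n=2)
print("raters:", raters, "kept visual tokens:", kept)
```

Returning the kept indices in their original order preserves the spatial ordering of the surviving visual tokens, which matters for position-dependent attention.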

Three-layer optimisation stack:

  • Layer 1 — Triton sparse attention kernel + sketch rank (15-50× faster than SVD)
  • Layer 2 — FlashAttention varlen, variable-length packing (no padding waste)
  • Layer 3 — CUDA graph bucketing (zero kernel-launch overhead)
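Layers 2 and 3 both deal with the fact that pruned sequences have different lengths. The sketch below shows the two bookkeeping ideas in isolation: cumulative sequence offsets of the kind FlashAttention's varlen interface consumes, and rounding a length up to a fixed bucket so that one captured CUDA graph can be reused per bucket. The function names and the bucket sizes are hypothetical and are not taken from sparsevlm.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List, Sequence


def cumulative_seqlens(seq_lens: Sequence[int]) -> List[int]:
    """Start offset of each sequence in a packed, padding-free batch.

    Varlen attention kernels typically take these cumulative lengths,
    starting at 0, instead of a padded [batch, max_len] layout.
    """
    return [0, *accumulate(seq_lens)]


def pick_bucket(seq_len: int, buckets: Sequence[int]) -> int:
    """Smallest bucket that fits seq_len (buckets must be sorted ascending)."""
    i = bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence length {seq_len} exceeds the largest bucket")
    return buckets[i]


# After pruning, sequences in a batch end up with different lengths.
seq_lens = [96, 131, 70]
cu_seqlens = cumulative_seqlens(seq_lens)
print(cu_seqlens)

# Hypothetical bucket sizes; one CUDA graph would be captured per bucket.
BUCKETS = [64, 128, 192, 256]
chosen = [pick_bucket(n, BUCKETS) for n in seq_lens]
print(chosen)
```

Fewer, coarser buckets mean fewer graphs to capture but more padding per sequence; the right trade-off depends on how widely your pruned lengths vary.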

Configuration

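# Assumed to be exported at the package top level, like sparsevlm_generate
from sparsevlm import apply_sparsevlm

state = apply_sparsevlm(
    model,
    n_vis=256,          # visual tokens per image
    target_layers=None, # default: every 4th layer from layer 2
    min_keep=32,        # never prune below this
    tau=0.5,            # recycling fraction
    theta=0.5,          # cluster ratio
)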
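To get a feel for what these settings imply, the sketch below works through the arithmetic. It rests on assumptions that are not confirmed by this README: that "every 4th layer from layer 2" uses 0-based indices, that tau is the fraction of pruned tokens that are recycled, and that theta is the number of clusters per recycled token. Both function names are hypothetical.

```python
def default_target_layers(num_layers: int, start: int = 2, stride: int = 4) -> list[int]:
    """Every `stride`-th layer starting at `start` (0-based indexing assumed)."""
    return list(range(start, num_layers, stride))


def recycling_budget(
    n_vis: int, n_kept: int, tau: float, theta: float, min_keep: int = 32
) -> tuple[int, int, int]:
    """Return (tokens kept, tokens recycled, clusters) under the assumed
    reading of tau and theta."""
    # Never keep fewer than min_keep or more than the tokens available.
    n_kept = min(max(n_kept, min_keep), n_vis)
    n_pruned = n_vis - n_kept
    n_recycled = int(tau * n_pruned)       # assumed: fraction of pruned tokens
    n_clusters = int(theta * n_recycled)   # assumed: clusters per recycled token
    return n_kept, n_recycled, n_clusters


print(default_target_layers(28))
print(recycling_budget(n_vis=256, n_kept=64, tau=0.5, theta=0.5))
print(recycling_budget(n_vis=256, n_kept=16, tau=0.5, theta=0.5))  # min_keep applies
```

Under these assumptions, keeping 64 of 256 tokens with the defaults would recycle 96 of the 192 pruned tokens into 48 compact representations.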

Citation

@inproceedings{zhang2024sparsevlm,
  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and
          Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and
          Nakata, Yohei and Keutzer, Kurt and Zhang, Shanghang},
  booktitle={ICML},
  year={2025}
}

License

Apache 2.0
