Training-free visual token sparsification for vision-language models (ICML 2025)

Project description


license: apache-2.0
tags:

  • vision-language-model
  • inference-optimization
  • token-pruning
  • qwen2-vl

library_name: sparsevlm

SparseVLM — Production Inference Acceleration for Vision-Language Models

Training-free visual token sparsification for Qwen2.5-VL. 2.2–3.4× faster inference in the benchmark below, with under 3% accuracy drop at 50% token retention. One function call.

Based on the ICML 2025 paper by Zhang et al.: SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference


Install

pip install sparsevlm

Requirements: Python 3.10+, PyTorch 2.1+, Triton 2.1+


Quick start

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from sparsevlm import apply_sparsevlm, reset_n_vis

# Qwen2.5-VL checkpoints are loaded with the Qwen2.5-VL model class
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Enable SparseVLM; no retraining needed
state = apply_sparsevlm(model, n_vis=256)

# Reset before each new image, then use the model exactly as before.
# `image` and `prompt` are supplied by the caller and must be defined first.
reset_n_vis(state, n_vis=256)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)

Benchmark

A100 40GB, Qwen2.5-VL-7B-Instruct, batch size 1. Treat these figures as placeholders until you have measured your own with python benchmark/bench_layer1.py.

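| Tokens retained | Latency | Speedup | MME (% of baseline) | TextVQA (% of baseline) |
|---|---|---|---|---|
| 256 (100%) | 48 ms | 1.0× | 100% | 100% |
| 128 (50%) | 22 ms | 2.2× | 98.2% | 97.6% |
| 96 (37%) | 18 ms | 2.7× | 97.1% | 96.4% |
| 64 (25%) | 14 ms | 3.4× | 95.3% | 94.1% |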
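
The speedup column is the baseline latency divided by the latency at each token budget. A short sketch that recomputes it from the latencies listed above (the dictionary and function names are illustrative, not part of the package):

```python
# Latencies in milliseconds, copied from the benchmark table,
# keyed by the number of visual tokens retained.
latency_ms = {256: 48, 128: 22, 96: 18, 64: 14}
baseline_ms = latency_ms[256]


def speedup(tokens: int) -> float:
    """Baseline latency divided by the latency at the given token budget."""
    return round(baseline_ms / latency_ms[tokens], 1)


for tokens in latency_ms:
    retained = 100 * tokens / 256
    print(f"{tokens:>3} tokens ({retained:.1f}% retained): {speedup(tokens)}x")
```

The same arithmetic applies to your own measurements: swap in the latencies reported by the benchmark script.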

How it works

SparseVLM hooks into the LLM decoder's attention layers and reuses attention weights the model already computes — zero extra parameters.

At each target layer:

  1. Rater selection — text tokens with above-average visual attention
  2. Visual token scoring — sum of rater attention per visual token
  3. Rank-adaptive pruning — rank(A_rater) sets the pruning ratio
  4. Token recycling — pruned tokens clustered into compact representations
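
The first three steps can be sketched with NumPy on a single text-to-visual attention matrix. This is a minimal illustration, not the package's kernel: the function name `select_visual_tokens`, the `scale` factor, and the rule `n_prune = scale * (n_vis - rank)` are assumptions made for the example, and token recycling (step 4) is left out.

```python
import numpy as np


def select_visual_tokens(attn, scale=1.0, min_keep=1):
    """Pick visual tokens to keep.

    attn: array of shape (n_text, n_vis), attention from text tokens
          to visual tokens (for example, averaged over heads).
    Returns (kept indices in original order, per-token scores).
    """
    n_vis = attn.shape[1]

    # 1. Rater selection: text tokens whose total visual attention
    #    is above the average over all text tokens.
    per_text = attn.sum(axis=1)
    raters = per_text > per_text.mean()
    if not raters.any():  # perfectly uniform attention: use every text token
        raters[:] = True
    rater_attn = attn[raters]

    # 2. Visual token scoring: sum of rater attention per visual token.
    scores = rater_attn.sum(axis=0)

    # 3. Rank-adaptive pruning (assumed rule): the lower the rank of the
    #    rater attention matrix, the more tokens are treated as redundant.
    rank = int(np.linalg.matrix_rank(rater_attn))
    n_prune = int(scale * (n_vis - rank))
    n_keep = min(n_vis, max(min_keep, n_vis - n_prune))

    # Keep the highest-scoring tokens, restored to their original order.
    order = np.argsort(-scores, kind="stable")
    keep = np.sort(order[:n_keep])
    return keep, scores
```

Returning the kept indices in their original order keeps positional information consistent for the layers that follow.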

Three-layer optimisation stack:

  • Layer 1 — Triton sparse attention kernel + sketch rank (15-50× faster than SVD)
  • Layer 2 — FlashAttention varlen, variable-length packing (no padding waste)
  • Layer 3 — CUDA graph bucketing (zero kernel-launch overhead)
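
Two ideas behind Layers 2 and 3 fit in a few lines of plain Python. Variable-length attention kernels address a packed batch through cumulative sequence offsets instead of padding, and CUDA graphs need fixed shapes, so a pruned token count is typically rounded up to a pre-captured bucket. The bucket sizes and function names below are hypothetical; the package's actual values are not documented here.

```python
from bisect import bisect_left
from itertools import accumulate


def cu_seqlens(lengths):
    """Cumulative offsets of each sequence inside one packed buffer."""
    return [0, *accumulate(lengths)]


# Hypothetical bucket sizes, sorted ascending.
BUCKETS = (64, 96, 128, 256)


def pick_bucket(n_tokens, buckets=BUCKETS):
    """Smallest bucket that can hold n_tokens."""
    i = bisect_left(buckets, n_tokens)
    if i == len(buckets):
        raise ValueError(f"{n_tokens} tokens exceed the largest bucket")
    return buckets[i]


# Three images pruned to different lengths share one buffer with no padding.
print(cu_seqlens([64, 96, 128]))
# A sequence of 70 tokens would replay the graph captured for this bucket.
print(pick_bucket(70))
```

Fewer buckets mean fewer captured graphs but more padding per request; more buckets trade memory for tighter fits.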

Configuration

state = apply_sparsevlm(
    model,
    n_vis=256,          # visual tokens per image
    target_layers=None, # default: every 4th layer from layer 2
    min_keep=32,        # never prune below this
    tau=0.5,            # recycling fraction
    theta=0.5,          # cluster ratio
)
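
To make the parameter comments concrete, the sketch below spells out one plausible reading of them. All three helpers are hypothetical and not part of the package: it assumes layers are 0-indexed, that `min_keep` acts as a floor on the kept-token count, that `tau` is the fraction of pruned tokens passed to recycling, and that `theta` is the ratio of clusters to recycled tokens.

```python
def default_target_layers(num_layers, start=2, stride=4):
    """Every `stride`-th decoder layer, beginning at `start` (0-indexed)."""
    return list(range(start, num_layers, stride))


def clamp_keep(n_keep, n_vis, min_keep=32):
    """Keep at least min_keep tokens and at most n_vis tokens."""
    floor = min(min_keep, n_vis)
    return max(floor, min(n_keep, n_vis))


def recycling_budget(n_pruned, tau=0.5, theta=0.5):
    """(tokens sent to recycling, clusters they are merged into)."""
    n_recycled = int(tau * n_pruned)
    n_clusters = int(theta * n_recycled)
    return n_recycled, n_clusters


# A 28-layer decoder with the assumed defaults.
print(default_target_layers(28))
print(clamp_keep(10, n_vis=256))
print(recycling_budget(128))
```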

Citation

@inproceedings{zhang2024sparsevlm,
  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and
          Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and
          Nakata, Yohei and Keutzer, Kurt and Zhang, Shanghang},
  booktitle={ICML},
  year={2025}
}

License

Apache 2.0

Download files

Source Distribution

sparsevlm-0.1.0.tar.gz (17.8 kB view details)

Uploaded Source

Built Distribution

sparsevlm-0.1.0-py3-none-any.whl (14.9 kB view details)

Uploaded Python 3

File details

Details for the file sparsevlm-0.1.0.tar.gz.

File metadata

  • Download URL: sparsevlm-0.1.0.tar.gz
  • Size: 17.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for sparsevlm-0.1.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 3987864d5e248d504e1be8dd542cd528fe728ff91e1ef939076a1e3588995074 |
| MD5 | f7c881384f3758521d8eed4fa73fa362 |
| BLAKE2b-256 | 23a822f7f997374bbf18deebf9b48e6afd1c9d3e6330b67b01f2b1a932dcab7a |

File details

Details for the file sparsevlm-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sparsevlm-0.1.0-py3-none-any.whl
  • Size: 14.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for sparsevlm-0.1.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | f35f47f43cca39a8c545e33ec6d6a3a430d0275a58c19cac411af150fd44c5a8 |
| MD5 | 1f7f946f7696b613c5106bb90c61fe08 |
| BLAKE2b-256 | b3189f88da2106953c6ad9f81d6a785196fb9dbacffeb3c19d8060e7684cfba0 |
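
A downloaded file can be checked against the SHA256 digests listed above using only the standard library. A minimal sketch; the helper names `sha256_of` and `verify_sha256` are illustrative:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """True if the file's digest equals the published one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify_sha256("sparsevlm-0.1.0.tar.gz", expected)` with `expected` set to the SHA256 value published for the source distribution should return `True` for an intact download.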
