vlmbench

Single-file, drop-in VLM benchmark CLI for your agents.


Benchmark any vision-language model on your own hardware with a single command. vlmbench auto-detects your platform, starts the right backend, and gives you reproducible results as JSON.

  • Ollama on macOS: auto-starts, zero config
  • vLLM on Linux: via Docker (--gpus all, auto-pulls) or native vLLM
  • SGLang on Linux: coming soon

Quick Start

No install needed — just run with uvx:

# Local images/PDFs (macOS Ollama)
uvx vlmbench run -m qwen3-vl:2b -i ./images/

# Linux + vLLM Docker (auto-starts with --gpus all)
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -i ./images/

# HuggingFace dataset (images)
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct \
  -d hf://vlm-run/FineVision-vlmbench-mini --max-samples 64

# HuggingFace dataset (text-only — use a column as the prompt)
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-prompts --dataset-text-col prompt --prompt ""

# Concurrency sweep
uvx vlmbench run -m Qwen/Qwen3-VL-8B-Instruct -i ./images/ \
  --concurrency 4,8,16,32,64

# Use a model profile (custom serve args + setup)
uvx vlmbench run --profile deepseek-ocr -i ./images/

# Cloud / remote API (model auto-detected from server)
uvx vlmbench run -i ./images/ \
  --base-url https://my-server.example.com/v1 --api-key $API_KEY

# Cloud API with explicit model
uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -i ./images/ \
  --base-url https://api.openai.com/v1 --api-key $OPENAI_API_KEY

Or install it: pip install vlmbench

Example Run

uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct \
  -d hf://vlm-run/FineVision-vlmbench-mini --max-samples 64 \
  --prompt "Describe this image in 80 words or less" \
  --concurrency 4,8,16 --backend vllm
╭─ Configuration ──────────────────────────────────────────────────────────────╮
│                                                                              │
│  model        Qwen/Qwen3-VL-2B-Instruct                                      │
│  revision     main                                                           │
│  backend      vLLM 0.11.2                                                    │
│  endpoint     http://localhost:8000/v1                                       │
│                                                                              │
│  gpu          NVIDIA RTX PRO 6000 Blackwell Workstation Edition              │
│  vram         97,887 MiB                                                     │
│  driver       580.126.09                                                     │
│                                                                              │
│  dataset      hf://vlm-run/FineVision-vlmbench-mini                          │
│  images       64 (mixed)                                                     │
│                                                                              │
│  max_tokens   2048                                                           │
│  runs         3                                                              │
│  concurrency  8                                                              │
│                                                                              │
│  monitor      tmux attach -t vlmbench-vllm                                   │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

╭─ Results ────────────────────────────────────────────────────────────────────╮
│                                                                              │
│  Metric                Value              p50        p95        p99          │
│  Throughput            13.33 img/s         —          —          —           │
│  Tokens/sec            1168 tok/s          —          —          —           │
│  Workers               8                   —          —          —           │
│  TTFT                  58 ms           51 ms     114 ms     140 ms           │
│  TPOT                  5.3 ms         5.0 ms     7.3 ms     7.4 ms           │
│  Latency (per worker)  0.54 s/img     0.46 s     0.92 s     1.36 s           │
│                                                                              │
│  Tokens (avg)          prompt 2,077  •  completion 88                        │
│  Token ranges          prompt 180–8,545  •  completion 55–190                │
│  Images                144  •  avg 964×867 (0.93 MP)                         │
│  Resolution            min 338×266  •  median 1024×768  •  max 2048×1755     │
│  VRAM peak             69.7 GB                                               │
│  Reliability           192/192 ok  •  14.4s total                            │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

Leaderboard

Best peak throughput per model on NVIDIA RTX PRO 6000 Blackwell (vLLM v0.15.1, 39 runs across concurrency sweeps):

| # | Model | Best Tok/s | Workers | TTFT | TPOT |
|---|-------|------------|---------|------|------|
| 1 | lightonai/LightOnOCR-2-1B | 2,439.8 | 32 | 1,439 ms | 22.1 ms |
| 2 | Qwen/Qwen3-VL-2B-Instruct | 2,409.3 | 64 | 440 ms | 14.3 ms |
| 3 | PaddlePaddle/PaddleOCR-VL | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | deepseek-ai/DeepSeek-OCR | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | Qwen/Qwen3-VL-8B-Instruct | 953.8 | 64 | 448 ms | 25.7 ms |

Compare your own results:

uvx vlmbench compare                       # auto-discovers ~/.vlmbench/benchmarks/
uvx vlmbench compare results/*.json        # or pass files explicitly

See MODELS.md for all tested models and their required --serve-args.
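
Under the hood, `compare` works over the saved JSON result files. A rough sketch of the same idea — grouping runs by model and keeping each model's best tokens-per-second result — is shown below; the `model` and `tokens_per_sec` field names are illustrative assumptions, not vlmbench's actual result schema:

```python
import json
from pathlib import Path

def best_runs(result_dir: str) -> dict[str, dict]:
    """Group result files by model and keep the run with the highest
    tokens/sec. Field names are assumptions, not vlmbench's real schema."""
    best: dict[str, dict] = {}
    for path in sorted(Path(result_dir).glob("*.json")):
        run = json.loads(path.read_text())
        model = run["model"]
        if model not in best or run["tokens_per_sec"] > best[model]["tokens_per_sec"]:
            best[model] = run
    return best
```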

Profiles

Some models need custom Docker images, extra pip installs, or special serve args. Profiles bundle all of this into a single YAML file — just pass --profile and vlmbench handles the rest.

uvx vlmbench profiles                                  # list available profiles
uvx vlmbench run --profile deepseek-ocr -i ./images/   # run with a profile

When you use --profile, it sets --model, --prompt, --serve-args, and (for Docker builds) the base image and setup commands. You can still override any flag explicitly.
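
A profile might look roughly like the sketch below. The key names here are illustrative only — consult the shipped `vlmbench/profiles/*.yaml` files for the real schema:

```yaml
# Hypothetical profile sketch -- key names are illustrative, not the real schema.
model: deepseek-ai/DeepSeek-OCR
prompt: "Extract all text from this image."
base_image: vllm/vllm-openai:v0.15.1
serve_args:
  - --no-enable-prefix-caching
setup:
  - pip install some-extra-dep   # placeholder for model-specific installs
```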

| Profile | Model | Base Image | Custom Setup |
|---------|-------|------------|--------------|
| glm-ocr | zai-org/GLM-OCR | vllm/vllm-openai:nightly | vLLM nightly + transformers >= 5.1.0, MTP speculative decoding |
| deepseek-ocr | deepseek-ai/DeepSeek-OCR | vllm/vllm-openai:v0.15.1 | Custom logits processor, no prefix caching |
| paddleocr-vl | PaddlePaddle/PaddleOCR-VL | vllm/vllm-openai:v0.15.1 | |
| qwen3-vl-2b | Qwen/Qwen3-VL-2B-Instruct | vllm/vllm-openai:v0.15.1 | |
| qwen3-vl-8b | Qwen/Qwen3-VL-8B-Instruct | vllm/vllm-openai:v0.15.1 | |

Profiles live in vlmbench/profiles/*.yaml and ship with the package. For local Docker workflows:

make build PROFILE=glm-ocr        # generates Dockerfile + docker build
make serve PROFILE=glm-ocr        # start server in tmux
make benchmark PROFILE=glm-ocr    # run benchmark against the server

CLI Reference

| Flag | Default | Description |
|------|---------|-------------|
| --model / -m | auto-detect | Model ID. Auto-detected from server if omitted; required only with --serve. |
| --profile | none | Model profile (e.g. glm-ocr). Sets model, prompt, serve-args. See vlmbench profiles. |
| --input / -i | sample URL | File, directory, or URL (images, PDFs, videos) |
| --dataset / -d | none | HuggingFace dataset (e.g. hf://vlm-run/FineVision-vlmbench-mini) |
| --dataset-image-col | auto-detect | Image column name in HF dataset |
| --dataset-text-col | none | Text column name in HF dataset to use as prompt/document input |
| --dataset-split | train | Dataset split to load |
| --base-url | auto-detect | OpenAI-compatible base URL |
| --api-key | no-key | API key (also reads OPENAI_API_KEY env) |
| --prompt | "Extract all text..." | Prompt/instruction sent with each input. Pass "" to use the text column as the full message. |
| --max-tokens | 2048 | Max completion tokens |
| --runs | 3 | Timed runs per input |
| --warmup | 1 | Warmup runs (not recorded, fail-fast on errors) |
| --concurrency | 8 | Single value or comma-separated sweep (e.g. 4,8,16,32,64) |
| --max-samples | all | Limit number of input samples (useful for dry-runs) |
| --output-directory | ~/.vlmbench/benchmarks/ | Output directory |
| --tag | none | Custom label (used in result filename and metadata) |
| --upload | off | Upload results to HuggingFace (requires HF_TOKEN) |
| --upload-repo | vlm-run/vlmbench-results | HuggingFace dataset repo for uploads |
| --backend | auto | auto, ollama, vllm, vllm-openai:<tag>, sglang:<tag> |
| --serve/--no-serve | --serve | Auto-start server if none detected |
| --serve-args | none | Extra args passed to server |
| --quant | auto | Quantization metadata: fp16, bf16, q4_K_M, etc. |
| --revision | main | Model revision metadata |

Backends

| --backend | Resolves to | Serving |
|-----------|-------------|---------|
| auto | ollama on macOS, vllm-openai:latest on Linux | Native / Docker |
| ollama | Ollama native | ollama serve in tmux |
| vllm | Native vLLM | vllm serve in tmux |
| vllm-openai:latest | vllm/vllm-openai:latest | docker run --gpus all |
| vllm-openai:nightly | vllm/vllm-openai:nightly | docker run --gpus all |
| sglang:latest | lmsysorg/sglang:latest | docker run --gpus all (coming soon) |

All Docker backends run with --gpus all --ipc=host and a deterministic container name for easy log access.

Input Types

| Type | Source | Processing |
|------|--------|------------|
| Image | --input (.png, .jpg, .jpeg, .webp, .tiff, .bmp) | Base64 encode |
| PDF | --input (.pdf) | pypdfium2 per-page → base64 |
| Video | --input (.mp4, .mov, .avi, .mkv, .webm) | ffmpeg 1 fps → frames → base64 |
| HF image dataset | --dataset hf://... | Auto-detect image column, base64 encode |
| HF text dataset | --dataset hf://... --dataset-text-col <col> | Each row's value sent as a text content block |

Directories are processed recursively, sorted alphabetically.
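
The local-image path can be sketched as follows — recursively collecting supported image files in alphabetical order and base64-encoding each into an OpenAI-style content block. This is an illustrative sketch of the documented behavior, not vlmbench's actual code:

```python
import base64
from pathlib import Path

# Extensions from the Input Types table above.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".tiff", ".bmp"}

def collect_images(root: str) -> list[Path]:
    """Recursively gather supported image files, sorted alphabetically."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in IMAGE_EXTS)

def to_image_block(path: Path) -> dict:
    """Base64-encode one image as an OpenAI chat-completions content block."""
    b64 = base64.b64encode(path.read_bytes()).decode()
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/{path.suffix.lstrip('.')};base64,{b64}"},
    }
```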

Text-only benchmarks

For LLM (non-vision) benchmarks, use an HF dataset with a text column:

# Each row's "prompt" column is the full message (--prompt "" = no instruction appended)
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-prompts --dataset-text-col prompt --prompt ""

# Each row's "text" column is the document; --prompt is the instruction
uvx vlmbench run -m meta-llama/Llama-3.1-8B-Instruct \
  -d hf://my-org/my-docs --dataset-text-col text \
  --prompt "Summarize the above in one sentence."

Auto-detection falls back to text columns (named text, prompt, input, content, query, question, instruction) when no image column is found.

Output

Results are saved as JSON to ~/.vlmbench/benchmarks/ with model metadata, environment info, benchmark stats (TTFT, TPOT, throughput, latency percentiles), and raw per-run data. Each concurrency level produces a separate file.

Upload results to HuggingFace with --upload:

uvx vlmbench run -m Qwen/Qwen3-VL-2B-Instruct -d hf://vlm-run/FineVision-vlmbench-mini \
  --concurrency 4,8,16,32,64 --upload

Browse uploaded results at vlm-run/vlmbench-results.

How It Works

When you run vlmbench run, here's what happens:

  1. Detects your platform — macOS routes to Ollama, Linux to vLLM Docker
  2. Pulls the Docker image — docker pull vllm/vllm-openai:latest (cached after first run)
  3. Starts the server in tmux — docker run --gpus all in a named session (vlmbench-vllm)
  4. Launches a GPU monitor — nvitop (Linux) or macmon (macOS) in a split pane
  5. Waits for the server — polls /v1/models until ready (up to 600s)
  6. Runs warmup requests — fail-fast validation before timed runs
  7. Benchmarks with concurrency — streams completions via the OpenAI API, measures TTFT/TPOT/throughput
  8. Saves results as JSON — one file per concurrency level in ~/.vlmbench/benchmarks/

Attach to the live session anytime: tmux attach -t vlmbench-vllm

tmux session capture — server logs + GPU monitor side by side

Top pane — vLLM server logs:

(APIServer pid=1) INFO 02-07 15:44:24 non-default args: {
  'model': 'lightonai/LightOnOCR-2-1B',
  'enable_prefix_caching': False,
  'limit_mm_per_prompt': {'image': 1},
  'mm_processor_cache_gb': 0.0
}
(APIServer pid=1) INFO 02-07 15:44:34 Resolved architecture: LightOnOCRForConditionalGeneration
(APIServer pid=1) INFO 02-07 15:44:34 Using max model len 16384
(EngineCore pid=272) INFO 02-07 15:44:44 Initializing a V1 LLM engine (v0.15.1) with config:
  model='lightonai/LightOnOCR-2-1B', dtype=torch.bfloat16, max_seq_len=16384,
  tensor_parallel_size=1, quantization=None
(EngineCore pid=272) INFO 02-07 15:45:41 Loading weights took 0.49 seconds
(EngineCore pid=272) INFO 02-07 15:45:42 Model loading took 1.88 GiB memory and 22.15 seconds
(EngineCore pid=272) INFO 02-07 15:46:11 Available KV cache memory: 77.94 GiB
(EngineCore pid=272) INFO 02-07 15:46:11 Maximum concurrency for 16,384 tokens per request: 44.53x
Capturing CUDA graphs (decode, FULL): 100% |██████████| 51/51
(APIServer pid=1) INFO Started server process [1]
(APIServer pid=1) INFO Application startup complete.
(APIServer pid=1) INFO 172.17.0.1 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Bottom pane — nvitop GPU monitor:

NVITOP 1.6.2      Driver Version: 580.126.09      CUDA Driver Version: 13.0
╒═══════════════════════════════╤══════════════════════╤══════════════════════╕
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╡
│   0  GeForce RTX 2080 Ti  Off │ 00000000:21:00.0 Off │                  N/A │
│ 27%   42C   P8     17W / 250W │  107.2MiB / 11264MiB │      0%      Default │
├───────────────────────────────┼──────────────────────┼──────────────────────┤
│   1  RTX PRO 6000         Off │ 00000000:4B:00.0 Off │                  N/A │
│ 30%   33C   P1     66W / 600W │  86.54GiB / 95.59GiB │      0%      Default │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╛
  MEM: ███████████████████████████████████████████████████████████▏ 90.5%
  Load Average: 4.14  2.73  1.65

Claude Code Installation

Install vlmbench as a Claude Code plugin:

# 1. Register the marketplace
/plugin marketplace add vlm-run/vlmbench

# 2. Install the skill
/plugin install vlmbench@vlm-run/vlmbench

After restarting Claude Code, the vlmbench skill will be available. Mention it directly in your instructions to benchmark models, compare results, or debug server issues.

Requirements

  • Python >= 3.11, uv recommended
  • Linux: Docker + NVIDIA GPU support (or native vLLM via uv pip install vllm)
  • Monitoring: tmux, nvitop (Linux) or macmon (macOS)
  • Optional: ffmpeg (video frame extraction)

Download files

Source Distribution

vlmbench-0.5.5.tar.gz (53.4 kB)

Built Distribution

vlmbench-0.5.5-py3-none-any.whl (47.6 kB)

File details

Details for the file vlmbench-0.5.5.tar.gz.

File metadata

  • Size: 53.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | c241096bcd85018e02b6b62d07740ba3107e6df754f57945b6e82fa3df383c92 |
| MD5 | e1a29aec134e3764b8c52164300f6fca |
| BLAKE2b-256 | b774750dc41ec1d7fb2ea55d6b1bab780f119c18cd90ced16c03839b3d948c4c |

File details

Details for the file vlmbench-0.5.5-py3-none-any.whl.

File metadata

  • Size: 47.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 1918fd210d33317819c37272fbc45e8c3db0668acda83c7996e4cc83c31f8b25 |
| MD5 | 34909cba98d48974e324f25752b1adb4 |
| BLAKE2b-256 | 7e8c571dc2279f66cb075a770e22f7e2054191efc9462234e1aaa2c4ebd6c768 |
