vLLM Config Wizard
A CLI tool for vLLM configuration generation and GPU sizing. Given your model, hardware, and workload requirements, vLLM Wizard generates optimized configurations with VRAM feasibility analysis and approximate performance estimates.
Features
- VRAM Feasibility Analysis: Calculate if your model fits in GPU memory with detailed breakdowns
- Configuration Generation: Generate optimized `vllm serve` commands, docker-compose files, and YAML profiles
- Performance Estimation: Get approximate throughput and latency estimates (clearly labeled as heuristic)
- GPU Detection: Auto-detect NVIDIA GPUs via nvidia-smi
- Profile Support: Save and load configurations as YAML profiles
Installation
```bash
# Install from source
pip install -e .

# With development dependencies
pip install -e ".[dev]"

# With web UI support (optional)
pip install -e ".[web]"
```
Quick Start
Detect GPUs
```bash
# Detect available GPUs
vllm-wizard detect

# JSON output
vllm-wizard detect --json
```
Plan Configuration
```bash
# Basic planning with auto GPU detection
vllm-wizard plan --model meta-llama/Llama-2-7b-hf

# Specify hardware manually
vllm-wizard plan --model meta-llama/Llama-2-7b-hf \
  --gpu "RTX 4090" \
  --gpus 1 \
  --vram-gb 24

# With workload parameters
vllm-wizard plan --model meta-llama/Llama-2-7b-hf \
  --gpu "A100 80GB" \
  --prompt-tokens 1024 \
  --gen-tokens 512 \
  --concurrency 8

# JSON output for scripting
vllm-wizard plan --model meta-llama/Llama-2-7b-hf --json

# Include explanations for each parameter
vllm-wizard plan --model meta-llama/Llama-2-7b-hf --explain
```
Generate Artifacts
```bash
# Generate serve command and profile
vllm-wizard generate \
  --output-dir ./vllm-config \
  --model meta-llama/Llama-2-7b-hf \
  --gpu "A100 80GB"

# Include docker-compose
vllm-wizard generate \
  --output-dir ./vllm-config \
  --model meta-llama/Llama-2-7b-hf \
  --emit command,profile,compose
```
Using Profiles
```bash
# Plan from a profile
vllm-wizard plan --profile ./my-config.yaml

# Load a profile and regenerate artifacts
vllm-wizard generate --output-dir ./output --profile ./my-config.yaml
```
Command Reference
vllm-wizard detect
Detect and display available NVIDIA GPUs.
| Option | Description |
|---|---|
| `--json` | Output as JSON |
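Detection shells out to nvidia-smi, so any machine where `nvidia-smi` works should be detectable. For scripting outside the tool, a minimal Python sketch of the same query (illustrative only; `detect_gpus` is a hypothetical helper, not this tool's API):

```python
import subprocess

def detect_gpus():
    """Query GPU name and total VRAM (MiB) via nvidia-smi's CSV output."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        name, mem_mib = (field.strip() for field in line.split(",", 1))
        gpus.append({"name": name, "vram_gb": round(float(mem_mib) / 1024, 1)})
    return gpus
```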
vllm-wizard plan
Compute feasibility, recommendations, and performance estimates.
Model Options:
| Option | Description | Default |
|---|---|---|
| `--model`, `-m` | HuggingFace model ID or local path | Required |
| `--revision` | Model revision/branch | None |
| `--dtype` | Weight dtype (auto, fp16, bf16, fp32) | auto |
| `--quantization`, `-q` | Quantization (none, awq, gptq, int8, fp8) | none |
| `--kv-cache-dtype` | KV cache dtype | auto |
| `--max-model-len` | Target context length | Model max |
| `--params-b` | Model parameters in billions (override) | Auto |
Hardware Options:
| Option | Description | Default |
|---|---|---|
| `--gpu` | GPU name or "auto" | auto |
| `--gpus` | Number of GPUs | 1 |
| `--vram-gb` | VRAM per GPU in GB | Auto |
| `--tensor-parallel-size`, `--tp` | Tensor parallel size | Auto |
| `--interconnect` | GPU interconnect (pcie, nvlink) | unknown |
Workload Options:
| Option | Description | Default |
|---|---|---|
| `--prompt-tokens` | Typical prompt length | 512 |
| `--gen-tokens` | Typical generation length | 256 |
| `--concurrency`, `-c` | Concurrent sequences | 1 |
| `--batching-mode` | throughput, latency, balanced | balanced |
Policy Options:
| Option | Description | Default |
|---|---|---|
| `--gpu-memory-utilization` | GPU memory target (0.5-0.98) | 0.90 |
| `--overhead-gb` | Fixed overhead in GB | Auto |
| `--fragmentation-factor` | KV cache fragmentation factor | 1.15 |
| `--headroom-gb` | Minimum headroom in GB | 1.0 |
Output Options:
| Option | Description |
|---|---|
| `--profile`, `-p` | Load from YAML profile |
| `--json` | Output as JSON |
| `--explain` | Include parameter explanations |
vllm-wizard generate
Generate configuration artifacts to disk.
All options from `plan`, plus:

| Option | Description | Default |
|---|---|---|
| `--output-dir`, `-o` | Output directory | Required |
| `--emit` | Artifacts to emit (comma-separated) | command,profile |
Emit options: `command`, `profile`, `compose`, `k8s`
Understanding the Output
VRAM Breakdown
The VRAM breakdown shows how GPU memory is allocated:
- Model Weights: Memory for model parameters (depends on dtype/quantization)
- KV Cache: Memory for attention key-value cache (scales with context × concurrency)
- Overhead: Framework overhead and communication buffers
- Headroom: Available buffer for runtime allocations
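These components sum against the usable fraction of VRAM set by `--gpu-memory-utilization`. A back-of-the-envelope sketch (illustrative only; `headroom_gib` is a hypothetical helper, all figures in GiB):

```python
def headroom_gib(vram, weights, kv_cache, overhead,
                 gpu_memory_utilization=0.90):
    """Headroom = usable VRAM minus weights, KV cache, and overhead."""
    usable = vram * gpu_memory_utilization
    return usable - (weights + kv_cache + overhead)
```

For example, a 24 GiB card at 0.90 utilization with 13 GiB of weights, 2.3 GiB of KV cache, and 2 GiB of overhead leaves about 4.3 GiB of headroom.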
OOM Risk Levels
- LOW: >= 2 GiB headroom, safe to run
- MEDIUM: 0-2 GiB headroom, may work but monitor closely
- HIGH: Negative headroom, likely OOM; consider quantization or reducing context length
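Continuing the sketch above, these thresholds map directly to code:

```python
def oom_risk(headroom: float) -> str:
    """Classify headroom (GiB) using the thresholds listed above."""
    if headroom >= 2.0:
        return "LOW"
    if headroom >= 0.0:
        return "MEDIUM"
    return "HIGH"
```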
Performance Estimates
Important: Performance estimates are heuristic approximations based on:
- GPU baseline performance tables
- Scaling factors for model size, tensor parallelism, and context length
- Quantization speedup factors
These are NOT benchmarks. Actual performance depends on:
- vLLM version and kernel selection
- CUDA/driver versions
- Batch sizes and request patterns
- Prompt/generation ratio
Always benchmark your specific workload before production deployment.
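To make the "heuristic" label concrete, such an estimate has roughly the shape sketched below. Every number and factor here is invented for illustration; the tool's actual baseline tables and scaling factors are internal and will differ.

```python
# Illustrative only: all baselines and factors below are made up.
BASELINE_TOK_S = {"A100 80GB": 3000.0, "RTX 4090": 1500.0}  # hypothetical per-GPU decode rates
QUANT_SPEEDUP = {"none": 1.0, "awq": 1.3, "gptq": 1.2}      # hypothetical speedup factors

def rough_decode_tok_s(gpu, params_b, tp=1, quant="none"):
    """Baseline rate scaled by model size, tensor parallelism, quantization."""
    size_scale = 7.0 / params_b                 # crude: throughput falls with model size
    tp_scale = tp * (0.85 if tp > 1 else 1.0)   # TP scales imperfectly across GPUs
    return BASELINE_TOK_S[gpu] * size_scale * tp_scale * QUANT_SPEEDUP[quant]
```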
Memory Model
Weights Memory
```
weights_bytes = parameters × bytes_per_param
```
Bytes per parameter:
- FP32: 4.0
- FP16/BF16: 2.0
- INT8: 1.0
- AWQ/GPTQ (4-bit): ~0.55 (includes overhead)
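As a sketch of the formula above (`weights_gib` is a hypothetical helper, not this tool's API):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
                   "int8": 1.0, "awq": 0.55, "gptq": 0.55}

def weights_gib(params_b: float, dtype: str) -> float:
    """weights_bytes = parameters × bytes_per_param, converted to GiB."""
    return params_b * 1e9 * BYTES_PER_PARAM[dtype] / 2**30
```

For a 7B model in FP16 this gives about 13 GiB.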
KV Cache Memory
```
kv_per_token_per_layer = 2 × num_kv_heads × head_dim × dtype_bytes
kv_cache = kv_per_token_per_layer × num_layers × context_len × concurrency × fragmentation_factor
```
With GQA (grouped-query attention), num_kv_heads is typically smaller than num_attention_heads, significantly reducing KV cache size.
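The same formula as a Python sketch (parameter names follow the formula above, not the tool's internals):

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, dtype_bytes,
                 context_len, concurrency, fragmentation_factor=1.15):
    """Direct translation of the KV cache formula above, in GiB."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * dtype_bytes  # K and V
    total = (per_token_per_layer * num_layers * context_len
             * concurrency * fragmentation_factor)
    return total / 2**30
```

For Llama-2-7B (32 layers, 32 KV heads, head_dim 128) at FP16 with a 4096-token context and concurrency 1, this comes to roughly 2.3 GiB.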
Profile Format
Profiles use YAML with this schema:
```yaml
profile_version: 1
model:
  id: "meta-llama/Llama-2-7b-hf"
  dtype: "auto"
  quantization: "none"
  max_model_len: 4096
hardware:
  gpu_name: "A100 80GB"
  gpus: 1
  interconnect: "unknown"
workload:
  prompt_tokens: 512
  gen_tokens: 256
  concurrency: 4
  mode: "balanced"
policy:
  gpu_memory_utilization: 0.90
  fragmentation_factor: 1.15
  headroom_gb: 1.0
```
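Because profiles are plain YAML, they are easy to inspect or generate programmatically. A minimal sketch, assuming PyYAML is installed (`load_profile` is a hypothetical helper; key names follow the schema above):

```python
import yaml

def load_profile(path: str) -> dict:
    """Load and minimally validate a profile against the schema above."""
    with open(path) as f:
        profile = yaml.safe_load(f)
    if profile.get("profile_version") != 1:
        raise ValueError("unsupported profile_version")
    return profile

profile = load_profile("my-config.yaml")
print(profile["model"]["id"], profile["hardware"]["gpu_name"])
```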
Examples
Single GPU Configuration
```bash
# LLaMA 2 7B on RTX 4090
vllm-wizard plan \
  --model meta-llama/Llama-2-7b-hf \
  --gpu "RTX 4090" \
  --max-model-len 4096 \
  --concurrency 2
```
Multi-GPU with Tensor Parallelism
```bash
# LLaMA 2 70B on 4x A100 80GB
vllm-wizard plan \
  --model meta-llama/Llama-2-70b-hf \
  --gpu "A100 80GB" \
  --gpus 4 \
  --tensor-parallel-size 4 \
  --interconnect nvlink \
  --max-model-len 4096
```
Quantized Model
```bash
# 70B model with AWQ on single GPU
vllm-wizard plan \
  --model TheBloke/Llama-2-70B-AWQ \
  --gpu "RTX 4090" \
  --quantization awq \
  --max-model-len 2048
```
Development
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=vllm_wizard

# Lint
ruff check src/
```
License
Apache 2.0
Disclaimer
This tool provides estimates and recommendations, not guarantees. Always:
- Test configurations on your actual hardware
- Monitor VRAM usage during model loading
- Benchmark throughput/latency for your specific workload
- Start with conservative settings and adjust based on results
Performance estimates are heuristic approximations and should not be used for capacity planning without real benchmarks.