vLLM Config Wizard
A CLI tool for vLLM configuration generation and GPU sizing. Given your model, hardware, and workload requirements, vLLM Wizard generates optimized configurations with VRAM feasibility analysis and approximate performance estimates.
Features
- VRAM Feasibility Analysis: Calculate if your model fits in GPU memory with detailed breakdowns
- Configuration Generation: Generate optimized `vllm serve` commands, docker-compose files, and YAML profiles
- Performance Estimation: Get approximate throughput and latency estimates (clearly labeled as heuristic)
- GPU Detection: Auto-detect NVIDIA GPUs via nvidia-smi
- Profile Support: Save and load configurations as YAML profiles
Installation
```bash
# Install from source
pip install -e .

# With development dependencies
pip install -e ".[dev]"

# With web UI support (optional)
pip install -e ".[web]"
```
Quick Start
Detect GPUs
```bash
# Detect available GPUs
vllm-wizard detect

# JSON output
vllm-wizard detect --json
```
Plan Configuration
```bash
# Basic planning with auto GPU detection
vllm-wizard plan --model meta-llama/Llama-2-7b-hf

# Specify hardware manually
vllm-wizard plan --model meta-llama/Llama-2-7b-hf \
  --gpu "RTX 4090" \
  --gpus 1 \
  --vram-gb 24

# With workload parameters
vllm-wizard plan --model meta-llama/Llama-2-7b-hf \
  --gpu "A100 80GB" \
  --prompt-tokens 1024 \
  --gen-tokens 512 \
  --concurrency 8

# JSON output for scripting
vllm-wizard plan --model meta-llama/Llama-2-7b-hf --json

# Include explanations for each parameter
vllm-wizard plan --model meta-llama/Llama-2-7b-hf --explain
```
Generate Artifacts
```bash
# Generate serve command and profile
vllm-wizard generate \
  --output-dir ./vllm-config \
  --model meta-llama/Llama-2-7b-hf \
  --gpu "A100 80GB"

# Include docker-compose
vllm-wizard generate \
  --output-dir ./vllm-config \
  --model meta-llama/Llama-2-7b-hf \
  --emit command,profile,compose
```
Using Profiles
```bash
# Plan from a profile
vllm-wizard plan --profile ./my-config.yaml

# Load a profile and regenerate artifacts
vllm-wizard generate --output-dir ./output --profile ./my-config.yaml
```
Command Reference
vllm-wizard detect
Detect and display available NVIDIA GPUs.
| Option | Description |
|---|---|
| `--json` | Output as JSON |
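Detection shells out to nvidia-smi, so any machine where `nvidia-smi` works should be detectable. For scripting outside the tool, a minimal Python sketch of the same query (illustrative only; `detect_gpus` is a hypothetical helper, not this tool's API):

```python
import subprocess

def detect_gpus():
    """Query GPU name and total VRAM (MiB) via nvidia-smi's CSV output."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        name, mem_mib = (field.strip() for field in line.split(",", 1))
        gpus.append({"name": name, "vram_gb": round(float(mem_mib) / 1024, 1)})
    return gpus
```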
vllm-wizard plan
Compute feasibility, recommendations, and performance estimates.
Model Options:
| Option | Description | Default |
|---|---|---|
| `--model`, `-m` | HuggingFace model ID or local path | Required |
| `--revision` | Model revision/branch | None |
| `--dtype` | Weight dtype (auto, fp16, bf16, fp32) | auto |
| `--quantization`, `-q` | Quantization (none, awq, gptq, int8, fp8) | none |
| `--kv-cache-dtype` | KV cache dtype | auto |
| `--max-model-len` | Target context length | Model max |
| `--params-b` | Model parameters in billions (override) | Auto |
Hardware Options:
| Option | Description | Default |
|---|---|---|
| `--gpu` | GPU name or "auto" | auto |
| `--gpus` | Number of GPUs | 1 |
| `--vram-gb` | VRAM per GPU in GB | Auto |
| `--tensor-parallel-size`, `--tp` | Tensor parallel size | Auto |
| `--interconnect` | GPU interconnect (pcie, nvlink) | unknown |
Workload Options:
| Option | Description | Default |
|---|---|---|
| `--prompt-tokens` | Typical prompt length | 512 |
| `--gen-tokens` | Typical generation length | 256 |
| `--concurrency`, `-c` | Concurrent sequences | 1 |
| `--batching-mode` | throughput, latency, balanced | balanced |
Policy Options:
| Option | Description | Default |
|---|---|---|
| `--gpu-memory-utilization` | GPU memory target (0.5-0.98) | 0.90 |
| `--overhead-gb` | Fixed overhead in GB | Auto |
| `--fragmentation-factor` | KV cache fragmentation factor | 1.15 |
| `--headroom-gb` | Minimum headroom in GB | 1.0 |
Output Options:
| Option | Description |
|---|---|
| `--profile`, `-p` | Load from YAML profile |
| `--json` | Output as JSON |
| `--explain` | Include parameter explanations |
vllm-wizard generate
Generate configuration artifacts to disk.
All options from `plan`, plus:

| Option | Description | Default |
|---|---|---|
| `--output-dir`, `-o` | Output directory | Required |
| `--emit` | Artifacts to emit (comma-separated) | command,profile |
Emit options: `command`, `profile`, `compose`, `k8s`
Understanding the Output
VRAM Breakdown
The VRAM breakdown shows how GPU memory is allocated:
- Model Weights: Memory for model parameters (depends on dtype/quantization)
- KV Cache: Memory for attention key-value cache (scales with context × concurrency)
- Overhead: Framework overhead and communication buffers
- Headroom: Available buffer for runtime allocations
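These components sum against the usable fraction of VRAM set by `--gpu-memory-utilization`. A back-of-the-envelope sketch (illustrative only; `headroom_gib` is a hypothetical helper, all figures in GiB):

```python
def headroom_gib(vram, weights, kv_cache, overhead,
                 gpu_memory_utilization=0.90):
    """Headroom = usable VRAM minus weights, KV cache, and overhead."""
    usable = vram * gpu_memory_utilization
    return usable - (weights + kv_cache + overhead)
```

For example, a 24 GiB card at 0.90 utilization with 13 GiB of weights, 2.3 GiB of KV cache, and 2 GiB of overhead leaves about 4.3 GiB of headroom.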
OOM Risk Levels
- LOW: >= 2 GiB headroom, safe to run
- MEDIUM: 0-2 GiB headroom, may work but monitor closely
- HIGH: Negative headroom, likely OOM; consider quantization or reducing context length
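Continuing the sketch above, these thresholds map directly to code:

```python
def oom_risk(headroom: float) -> str:
    """Classify headroom (GiB) using the thresholds listed above."""
    if headroom >= 2.0:
        return "LOW"
    if headroom >= 0.0:
        return "MEDIUM"
    return "HIGH"
```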
Performance Estimates
Important: Performance estimates are heuristic approximations based on:
- GPU baseline performance tables
- Scaling factors for model size, tensor parallelism, and context length
- Quantization speedup factors
These are NOT benchmarks. Actual performance depends on:
- vLLM version and kernel selection
- CUDA/driver versions
- Batch sizes and request patterns
- Prompt/generation ratio
Always benchmark your specific workload before production deployment.
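To make the "heuristic" label concrete, such an estimate has roughly the shape sketched below. Every number and factor here is invented for illustration; the tool's actual baseline tables and scaling factors are internal and will differ.

```python
# Illustrative only: all baselines and factors below are made up.
BASELINE_TOK_S = {"A100 80GB": 3000.0, "RTX 4090": 1500.0}  # hypothetical per-GPU decode rates
QUANT_SPEEDUP = {"none": 1.0, "awq": 1.3, "gptq": 1.2}      # hypothetical speedup factors

def rough_decode_tok_s(gpu, params_b, tp=1, quant="none"):
    """Baseline rate scaled by model size, tensor parallelism, quantization."""
    size_scale = 7.0 / params_b                 # crude: throughput falls with model size
    tp_scale = tp * (0.85 if tp > 1 else 1.0)   # TP scales imperfectly across GPUs
    return BASELINE_TOK_S[gpu] * size_scale * tp_scale * QUANT_SPEEDUP[quant]
```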
Memory Model
Weights Memory
```
weights_bytes = parameters × bytes_per_param
```
Bytes per parameter:
- FP32: 4.0
- FP16/BF16: 2.0
- INT8: 1.0
- AWQ/GPTQ (4-bit): ~0.55 (includes overhead)
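As a sketch of the formula above (`weights_gib` is a hypothetical helper, not this tool's API):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
                   "int8": 1.0, "awq": 0.55, "gptq": 0.55}

def weights_gib(params_b: float, dtype: str) -> float:
    """weights_bytes = parameters × bytes_per_param, converted to GiB."""
    return params_b * 1e9 * BYTES_PER_PARAM[dtype] / 2**30
```

For a 7B model in FP16 this gives about 13 GiB.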
KV Cache Memory
```
kv_per_token_per_layer = 2 × num_kv_heads × head_dim × dtype_bytes
kv_cache = kv_per_token_per_layer × num_layers × context_len × concurrency × fragmentation_factor
```
With GQA (grouped-query attention), num_kv_heads is typically smaller than num_attention_heads, significantly reducing KV cache size.
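The same formula as a Python sketch (parameter names follow the formula above, not the tool's internals):

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, dtype_bytes,
                 context_len, concurrency, fragmentation_factor=1.15):
    """Direct translation of the KV cache formula above, in GiB."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * dtype_bytes  # K and V
    total = (per_token_per_layer * num_layers * context_len
             * concurrency * fragmentation_factor)
    return total / 2**30
```

For Llama-2-7B (32 layers, 32 KV heads, head_dim 128) at FP16 with a 4096-token context and concurrency 1, this comes to roughly 2.3 GiB.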
Profile Format
Profiles use YAML with this schema:
```yaml
profile_version: 1
model:
  id: "meta-llama/Llama-2-7b-hf"
  dtype: "auto"
  quantization: "none"
  max_model_len: 4096
hardware:
  gpu_name: "A100 80GB"
  gpus: 1
  interconnect: "unknown"
workload:
  prompt_tokens: 512
  gen_tokens: 256
  concurrency: 4
  mode: "balanced"
policy:
  gpu_memory_utilization: 0.90
  fragmentation_factor: 1.15
  headroom_gb: 1.0
```
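Because profiles are plain YAML, they are easy to inspect or generate programmatically. A minimal sketch, assuming PyYAML is installed (`load_profile` is a hypothetical helper; key names follow the schema above):

```python
import yaml

def load_profile(path: str) -> dict:
    """Load and minimally validate a profile against the schema above."""
    with open(path) as f:
        profile = yaml.safe_load(f)
    if profile.get("profile_version") != 1:
        raise ValueError("unsupported profile_version")
    return profile

profile = load_profile("my-config.yaml")
print(profile["model"]["id"], profile["hardware"]["gpu_name"])
```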
Examples
Single GPU Configuration
```bash
# LLaMA 2 7B on RTX 4090
vllm-wizard plan \
  --model meta-llama/Llama-2-7b-hf \
  --gpu "RTX 4090" \
  --max-model-len 4096 \
  --concurrency 2
```
Multi-GPU with Tensor Parallelism
```bash
# LLaMA 2 70B on 4x A100 80GB
vllm-wizard plan \
  --model meta-llama/Llama-2-70b-hf \
  --gpu "A100 80GB" \
  --gpus 4 \
  --tensor-parallel-size 4 \
  --interconnect nvlink \
  --max-model-len 4096
```
Quantized Model
```bash
# 70B model with AWQ on single GPU
vllm-wizard plan \
  --model TheBloke/Llama-2-70B-AWQ \
  --gpu "RTX 4090" \
  --quantization awq \
  --max-model-len 2048
```
Development
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=vllm_wizard

# Lint
ruff check src/
```
License
Apache 2.0
Disclaimer
This tool provides estimates and recommendations, not guarantees. Always:
- Test configurations on your actual hardware
- Monitor VRAM usage during model loading
- Benchmark throughput/latency for your specific workload
- Start with conservative settings and adjust based on results
Performance estimates are heuristic approximations and should not be used for capacity planning without real benchmarks.