# vllm-autoconfig

Automatic configuration planner for vLLM. Eliminate the guesswork of configuring vLLM by automatically determining optimal parameters based on your GPU hardware and model requirements.
## Features

- **Zero-configuration vLLM setup**: Automatically calculates optimal `max_model_len`, `gpu_memory_utilization`, and other vLLM parameters
- **Hardware-aware planning**: Probes GPU memory and capabilities using PyTorch to ensure configurations fit your hardware
- **Model-specific optimizations**: Applies model-family-specific settings (Mistral, Llama, Qwen, etc.)
- **KV cache sizing**: Calculates memory requirements for attention key-value caches
- **Configuration caching**: Saves computed plans to avoid redundant calculations
- **Performance modes**: Choose between `throughput` and `latency` optimization strategies
- **FP8 KV cache support**: Automatically enables FP8 quantization for KV caches when beneficial
- **Simple API**: Just specify your model name and desired context length; everything else is handled automatically
## Installation

```bash
pip install vllm-speculative-autoconfig
```

Requirements:

- Python >= 3.10
- PyTorch with CUDA support
- vLLM
- Access to CUDA-capable GPU(s)
## Quick Start

### Python API

```python
from vllm_autoconfig import AutoVLLMClient, SamplingConfig

# Initialize with your model and desired context length
client = AutoVLLMClient(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    context_len=1024,  # The only parameter you need to set!
)

# Prepare your prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"id": 1},
    }
]

# Run inference: chat format
results = client.run_batch_chat(
    prompts,
    SamplingConfig(max_tokens=100, temperature=0.7)
)

# Alternatively, run standard text generation
results = client.run_batch_raw(
    prompts,
    SamplingConfig(max_tokens=100, temperature=0.7)
)

print(results)
client.close()
```
### Advanced Usage

```python
from vllm_autoconfig import AutoVLLMClient, SamplingConfig

# Fine-tune the configuration
client = AutoVLLMClient(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    context_len=2048,
    perf_mode="latency",        # or "throughput" (default)
    prefer_fp8_kv_cache=True,   # Enable FP8 KV cache if supported
    trust_remote_code=False,    # For models requiring custom code
    debug=True,                 # Enable detailed logging
)

# Check the computed plan
print(f"Plan cache key: {client.plan.cache_key}")
print(f"vLLM kwargs: {client.plan.vllm_kwargs}")
print(f"Notes: {client.plan.notes}")

# Run inference with custom sampling
sampling = SamplingConfig(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
    stop=["###", "\n\n"]
)
results = client.run_batch_chat(prompts, sampling)
client.close()
```
### Embedding Generation

```python
from vllm_autoconfig import AutoVLLMEmbedding
import numpy as np

# Initialize embedding client - only specify model and max length!
# GPU settings are automatically configured
client = AutoVLLMEmbedding(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=512,  # Typical for embeddings
)

# Generate embeddings for texts
texts = [
    "The cat sits on the mat",
    "A feline rests on the carpet",
    "Dogs are playing in the park",
]
embeddings = client.embed(texts, normalize=True)
print(f"Embeddings shape: {embeddings.shape}")  # (3, embedding_dim)

# Compute cosine similarity (embeddings are already normalized)
def cosine_similarity(a, b):
    return np.dot(a, b)

sim = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity between 'cat' and 'feline': {sim:.4f}")
client.close()
```
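Because the embeddings come back L2-normalized, ranking a corpus against a query is just a matrix-vector product of cosine similarities. The sketch below illustrates this with random unit vectors standing in for real `client.embed(...)` output (no GPU needed); the retrieval math is the same either way:

```python
import numpy as np

# Random vectors stand in for real embeddings from client.embed(...).
rng = np.random.default_rng(0)
corpus = rng.normal(size=(3, 128))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # normalize rows

# A query vector close to corpus[0]
query = corpus[0] + 0.1 * rng.normal(size=128)
query /= np.linalg.norm(query)

scores = corpus @ query             # cosine similarities, shape (3,)
ranking = np.argsort(scores)[::-1]  # best match first
print(ranking[0])                   # 0: the nearest corpus item
```

With real embeddings you would replace `corpus` and `query` with the arrays returned by `embed(texts, normalize=True)`.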
## How It Works

1. **GPU Probing**: Detects available GPU memory and capabilities (BF16 support, compute capability)
2. **Model Analysis**: Downloads the model configuration from the HuggingFace Hub and analyzes the architecture
3. **Weight Calculation**: Computes the actual model weight size from checkpoint files
4. **Memory Planning**: Calculates KV cache memory requirements based on context length and batch size
5. **Configuration Generation**: Produces optimal vLLM initialization parameters within hardware constraints
6. **Caching**: Saves the computed plan for reuse with the same configuration
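To make the memory-planning step concrete, here is a simplified back-of-the-envelope version of the KV cache arithmetic. This is an illustration, not the actual logic in `kv_math.py` (which may additionally account for block allocation, overheads, etc.); the Llama-like shape parameters below are assumptions for the example:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size: two tensors (K and V) per layer, each of
    shape (batch, context_len, num_kv_heads, head_dim), at the given
    element width (2 bytes for bf16/fp16, 1 byte for FP8)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)

# A Llama-3.1-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128
per_token = kv_cache_bytes(32, 8, 128, context_len=1)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token at bf16
```

Halving `bytes_per_elem` to 1 (FP8) halves the cache footprint, which is why the planner can enable FP8 KV caches when memory is tight.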
## Configuration Parameters

The `AutoVLLMClient` automatically configures:

- `model`: Model name/path
- `max_model_len`: Maximum sequence length
- `gpu_memory_utilization`: GPU memory usage fraction
- `dtype`: Weight precision (bfloat16 or float16)
- `kv_cache_dtype`: KV cache precision (including FP8 when beneficial)
- `enforce_eager`: Whether to use eager mode (affects compilation)
- `trust_remote_code`: Whether to trust remote code execution
- Model-specific parameters (e.g., `tokenizer_mode` and `load_format` for Mistral)
## API Reference

### AutoVLLMClient

```python
AutoVLLMClient(
    model_name: str,                       # HuggingFace model name or local path
    context_len: int,                      # Desired context length
    logits_processors: List[Type] = None,  # Custom logits processor classes
    device_index: int = 0,                 # GPU device index
    perf_mode: str = "throughput",         # "throughput" or "latency"
    trust_remote_code: bool = False,
    prefer_fp8_kv_cache: bool = False,
    enforce_eager: bool = False,
    local_files_only: bool = False,
    cache_plan: bool = True,               # Cache computed plans
    debug: bool = False,                   # Enable debug logging
    vllm_logging_level: str = None,        # vLLM logging level
    **vllm_kwargs: Any,                    # Additional vLLM engine parameters
)
```
### SamplingConfig

```python
SamplingConfig(
    temperature: float = 0.0,          # Sampling temperature
    top_p: float = 1.0,                # Nucleus sampling threshold
    max_tokens: int = 32,              # Maximum tokens to generate
    stop: List[str] = None,            # Stop sequences
    extra_args: Dict[str, Any] = None, # Extra arguments for vLLM (e.g., gate_enabled)
)
```
### Methods

- `run_batch(prompts, sampling, output_field="output")`: Run inference on a batch of prompts
- `close()`: Clean up resources and free GPU memory
### Logits Processors

`AutoVLLMClient` supports custom logits processors for advanced use cases such as token filtering, boosting, or custom sampling logic.

```python
from vllm_autoconfig import AutoVLLMClient, ExampleLogitsProcessor

client = AutoVLLMClient(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    context_len=1024,
    logits_processors=[ExampleLogitsProcessor],  # Pass processor classes
)
```

Key points:

- Pass processor classes (not instances) in a list
- vLLM automatically instantiates them with the proper configuration
- Processors receive `vllm_config`, `device`, and `is_pin_memory` during init
- Configure processors via environment variables
- See `examples/logits_processor_example.py` for a complete example
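To show the shape such a processor might take, here is a pure-Python sketch of a token-banning processor. The constructor mirrors the `vllm_config`/`device`/`is_pin_memory` arguments listed above and reads its configuration from an environment variable, as suggested; the `apply` method name and the plain-list logits are simplifications for illustration, not vLLM's real processor interface (see `examples/logits_processor_example.py` for the actual contract):

```python
import math
import os

class BanTokensProcessor:
    """Illustrative sketch: mask out banned token ids so they can
    never be sampled. Not vLLM's real base class or method names."""

    def __init__(self, vllm_config, device, is_pin_memory):
        # Configuration via environment variable, e.g. "1,3"
        ids = os.environ.get("BANNED_TOKEN_IDS", "")
        self.banned = {int(t) for t in ids.split(",") if t}

    def apply(self, logits):
        # Set banned token logits to -inf; softmax then gives them zero mass
        return [-math.inf if i in self.banned else x
                for i, x in enumerate(logits)]

os.environ["BANNED_TOKEN_IDS"] = "1,3"
proc = BanTokensProcessor(vllm_config=None, device="cuda:0", is_pin_memory=False)
out = proc.apply([0.5, 2.0, 1.0, 3.0])
print(out)  # [0.5, -inf, 1.0, -inf]
```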
### Additional vLLM Kwargs

You can pass additional keyword arguments directly to the vLLM engine:

```python
client = AutoVLLMClient(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    context_len=1024,
    # Any additional vLLM-specific parameters
    swap_space=4,
    disable_custom_all_reduce=True,
)
```

These kwargs are passed through to vLLM's `LLM` initialization.
### AutoVLLMEmbedding

```python
AutoVLLMEmbedding(
    model_name: str,               # HuggingFace model name or local path
    max_model_len: int = 512,      # Maximum sequence length (embeddings typically need less)
    pooling_type: str = "MEAN",    # "MEAN", "CLS", or "LAST"
    normalize: bool = False,       # Normalize in vLLM (default: False, normalized manually)
    device_index: int = 0,         # GPU device index
    perf_mode: str = "throughput", # "throughput" or "latency"
    trust_remote_code: bool = False,
    enforce_eager: bool = True,    # Better compatibility for embeddings
    local_files_only: bool = False,
    cache_plan: bool = True,       # Cache computed plans
    debug: bool = False,           # Enable debug logging
    vllm_logging_level: str = None,
)
```

Note: GPU memory utilization, tensor parallelism, dtype, and other hardware-specific settings are configured automatically by the planner based on your GPU and model; you don't need to specify them.
### Methods

- `embed(texts, normalize=True)`: Generate embeddings for a list of texts; returns a numpy array of shape (N, D)
- `embed_batch(texts, normalize=True)`: Alias for `embed()`
- `close()`: Clean up resources and free GPU memory
## Project Structure

```
vllm-autoconfig/
├── src/vllm_autoconfig/
│   ├── __init__.py              # Package exports
│   ├── client.py                # AutoVLLMClient implementation (text generation)
│   ├── embedding.py             # AutoVLLMEmbedding implementation (embeddings)
│   ├── planner.py               # Configuration planning logic
│   ├── gpu_probe.py             # GPU detection and probing
│   ├── model_probe.py           # Model analysis utilities
│   ├── kv_math.py               # KV cache memory calculations
│   └── cache.py                 # Plan caching utilities
├── examples/
│   ├── simple_run.py            # Text generation example
│   ├── embedding_simple.py      # Basic embedding example
│   └── embedding_similarity.py  # Advanced embedding similarity analysis
└── pyproject.toml
```
## Contributing

Contributions are welcome! Please feel free to submit a pull request. For major changes, please open an issue first to discuss what you would like to change.
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Acknowledgments

- Built on top of vLLM, the high-performance LLM inference engine
- Uses HuggingFace Transformers for model configuration
## Citation

If you use vllm-autoconfig in your research or production systems, please cite:

```bibtex
@software{vllm_speculative_autoconfig,
  title  = {vllm-autoconfig: Automatic Configuration Planning for vLLM},
  author = {Benaya Trabelsi},
  year   = {2025},
  url    = {https://github.com/benayat/vllm-speculative-init}
}
```
## Issues and Support

For issues, questions, or feature requests, please open an issue on GitHub Issues.