vllm-autoconfig

Automatic configuration planner for vLLM - Eliminate the guesswork of configuring vLLM by automatically determining optimal parameters based on your GPU hardware and model requirements.

🚀 Features

  • Zero-configuration vLLM setup: Automatically calculates optimal max_model_len, gpu_memory_utilization, and other vLLM parameters
  • Hardware-aware planning: Probes GPU memory and capabilities using PyTorch to ensure configurations fit your hardware
  • Model-specific optimizations: Applies model-family-specific settings (Mistral, Llama, Qwen, etc.)
  • KV cache sizing: Intelligently calculates memory requirements for attention key-value caches
  • Configuration caching: Saves computed plans to avoid redundant calculations
  • Performance modes: Choose between throughput and latency optimization strategies
  • FP8 KV cache support: Automatically enables FP8 quantization for KV caches when beneficial
  • Simple API: Just specify your model name and desired context length - everything else is handled automatically

📦 Installation

pip install vllm-autoconfig

Requirements:

  • Python >= 3.10
  • PyTorch with CUDA support
  • vLLM
  • Access to CUDA-capable GPU(s)

🎯 Quick Start

Python API

from vllm_autoconfig import AutoVLLMClient, SamplingConfig

# Initialize with your model and desired context length
client = AutoVLLMClient(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    context_len=1024,  # The only parameter you need to set besides the model name
)

# Prepare your prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"id": 1},
    }
]

# Run inference
results = client.run_batch(
    prompts, 
    SamplingConfig(max_tokens=100, temperature=0.7)
)

print(results)
client.close()

Advanced Usage

from vllm_autoconfig import AutoVLLMClient, SamplingConfig

# Fine-tune the configuration
client = AutoVLLMClient(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    context_len=2048,
    perf_mode="latency",           # or "throughput" (default)
    prefer_fp8_kv_cache=True,      # Enable FP8 KV cache when supported
    trust_remote_code=False,       # Set True for models that require custom code
    debug=True,                    # Enable detailed logging
)

# Check the computed plan
print(f"Plan cache key: {client.plan.cache_key}")
print(f"vLLM kwargs: {client.plan.vllm_kwargs}")
print(f"Notes: {client.plan.notes}")

# Run inference with custom sampling
sampling = SamplingConfig(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
    stop=["###", "\n\n"]
)

# Run inference (reusing the `prompts` list from the Quick Start example)
results = client.run_batch(prompts, sampling)
client.close()

๐Ÿ› ๏ธ How It Works

  1. GPU Probing: Detects available GPU memory and capabilities (BF16 support, compute capability)
  2. Model Analysis: Downloads model configuration from HuggingFace Hub and analyzes architecture
  3. Weight Calculation: Computes actual model weight size from checkpoint files
  4. Memory Planning: Calculates KV cache memory requirements based on context length and batch size (see the sketch after this list)
  5. Configuration Generation: Produces optimal vLLM initialization parameters within hardware constraints
  6. Caching: Saves the computed plan for reuse with the same configuration
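
As a rough illustration of step 4, the dominant KV cache term scales linearly with layer count, KV-head count, head dimension, context length, and batch size. The sketch below captures the arithmetic only; it is not the library's actual kv_math module, and every name in it is hypothetical.

def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    max_batch_size: int,
    dtype_bytes: int = 2,  # 2 for bf16/fp16, 1 for FP8
) -> int:
    """Approximate KV cache footprint: two tensors (K and V) per layer,
    each storing num_kv_heads * head_dim values per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * context_len * max_batch_size

# Llama-3.1-8B (32 layers, 8 KV heads, head_dim 128) at context_len=1024,
# batch size 1, bf16 cache: 2 * 32 * 8 * 128 * 2 * 1024 ≈ 134 MB.
# Switching the cache to FP8 (dtype_bytes=1) halves that footprint.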

📊 Configuration Parameters

The AutoVLLMClient automatically configures:

  • model: Model name/path
  • max_model_len: Maximum sequence length
  • gpu_memory_utilization: GPU memory usage fraction
  • dtype: Weight precision (bfloat16 or float16)
  • kv_cache_dtype: KV cache precision (including FP8 when beneficial)
  • enforce_eager: Whether to use eager mode (affects compilation)
  • trust_remote_code: Whether to trust remote code execution
  • Model-specific parameters (e.g., tokenizer_mode, load_format for Mistral)
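
For a concrete sense of the output, here is a hypothetical vllm_kwargs dictionary a plan might produce for an 8B model on a single GPU. The keys match the parameters listed above, but the values are illustrative assumptions; the planner computes them per GPU and model, so yours will differ.

# Hypothetical computed plan for illustration only:
example_vllm_kwargs = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "max_model_len": 1024,
    "gpu_memory_utilization": 0.90,
    "dtype": "bfloat16",
    "kv_cache_dtype": "fp8",        # only when FP8 is supported and beneficial
    "enforce_eager": False,
    "trust_remote_code": False,
}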

๐ŸŽ›๏ธ API Reference

AutoVLLMClient

AutoVLLMClient(
    model_name: str,                     # HuggingFace model name or local path
    context_len: int,                    # Desired context length
    device_index: int = 0,               # GPU device index
    perf_mode: str = "throughput",       # "throughput" or "latency"
    trust_remote_code: bool = False,
    prefer_fp8_kv_cache: bool = False,
    enforce_eager: bool = False,
    local_files_only: bool = False,
    cache_plan: bool = True,             # Cache computed plans
    debug: bool = False,                 # Enable debug logging
    vllm_logging_level: str | None = None,  # vLLM logging level
)

SamplingConfig

SamplingConfig(
    temperature: float = 0.0,        # Sampling temperature
    top_p: float = 1.0,              # Nucleus sampling threshold
    max_tokens: int = 32,            # Maximum tokens to generate
    stop: List[str] | None = None,   # Stop sequences
)

Methods

  • run_batch(prompts, sampling, output_field="output"): Run inference on a batch of prompts
  • close(): Clean up resources and free GPU memory
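
The output_field argument suggests that each input prompt dict is returned with the generated text attached under that key. The shape below is an assumption for illustration, not a documented guarantee; consult the source for the exact structure.

# Hypothetical run_batch result for the Quick Start prompt:
[
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "metadata": {"id": 1},
        "output": "The capital of France is Paris.",
    }
]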

๐Ÿ—๏ธ Project Structure

vllm-autoconfig/
├── src/vllm_autoconfig/
│   ├── __init__.py          # Package exports
│   ├── client.py            # AutoVLLMClient implementation
│   ├── planner.py           # Configuration planning logic
│   ├── gpu_probe.py         # GPU detection and probing
│   ├── model_probe.py       # Model analysis utilities
│   ├── kv_math.py           # KV cache memory calculations
│   └── cache.py             # Plan caching utilities
├── examples/
│   └── simple_run.py        # Usage examples
└── pyproject.toml

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

📚 Citation

If you use vllm-autoconfig in your research or production systems, please cite:

@software{vllm_speculative_autoconfig,
  title = {vllm-autoconfig: Automatic Configuration Planning for vLLM},
  author = {Benaya Trabelsi},
  year = {2025},
  url = {https://github.com/benayat/vllm-speculative-init}
}

๐Ÿ› Issues and Support

For issues, questions, or feature requests, please open an issue on the project's GitHub repository.
