vllm-autoconfig

An automatic configuration planner for vLLM. It removes the guesswork from configuring vLLM by deriving suitable parameters from your GPU hardware and model requirements.

🚀 Features

Zero-configuration vLLM setup: Automatically calculates optimal max_model_len, gpu_memory_utilization, and other vLLM parameters
Hardware-aware planning: Probes GPU memory and capabilities using PyTorch to ensure configurations fit your hardware
Model-specific optimizations: Applies model-family-specific settings (Mistral, Llama, Qwen, etc.)
KV cache sizing: Intelligently calculates memory requirements for attention key-value caches
Configuration caching: Saves computed plans to avoid redundant calculations
Performance modes: Choose between throughput and latency optimization strategies
FP8 KV cache support: Automatically enables FP8 quantization for KV caches when beneficial
Simple API: Just specify your model name and desired context length - everything else is handled automatically
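
To make the KV cache sizing and FP8 features concrete, here is a minimal sketch of the standard per-token KV cache arithmetic. It is not taken from the package's source; the function name and the example model shape are assumptions chosen for illustration.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_element: int) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Illustrative shape (assumed for this example, not read from a real config):
# 32 layers, 8 KV heads, head dimension 128.
bf16 = kv_cache_bytes_per_token(32, 8, 128, bytes_per_element=2)
fp8 = kv_cache_bytes_per_token(32, 8, 128, bytes_per_element=1)

print(bf16)         # 131072 bytes (128 KiB) per token
print(fp8)          # 65536 bytes (64 KiB) per token
print(bf16 * 1024)  # 134217728 bytes (128 MiB) for one 1024-token sequence
```

Storing the cache in FP8 halves the per-token footprint compared with a 16-bit cache, which is why it can allow a longer context or more concurrent sequences in the same memory budget.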

📦 Installation

pip install vllm-speculative-autoconfig

Requirements:

Python >= 3.10
PyTorch with CUDA support
vLLM
Access to CUDA-capable GPU(s)
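
Before installing, it can help to check the requirements above. The following is a small sketch, not part of the package; `check_environment` is a hypothetical helper name. It only uses `importlib`, `sys`, and `torch.cuda.is_available()`.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report which of the listed requirements appear to be satisfied."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "vllm_installed": importlib.util.find_spec("vllm") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # imported lazily so the check runs without PyTorch
            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is treated the same as no CUDA.
            report["cuda_available"] = False
    return report


print(check_environment())
```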

🎯 Quick Start

Python API

from vllm_autoconfig import AutoVLLMClient, SamplingConfig

# Initialize with your model and desired context length
client = AutoVLLMClient(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    context_len=1024,  # The only sizing parameter you need to set
)

# Prepare your prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"id": 1},
    }
]

# Run inference: chat format
results = client.run_batch_chat(
    prompts,
    SamplingConfig(max_tokens=100, temperature=0.7)
)

# Alternatively, run standard text generation
results = client.run_batch_raw(
    prompts,
    SamplingConfig(max_tokens=100, temperature=0.7)
)
print(results)
client.close()
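
When the inputs are plain strings, a small helper can wrap them in the messages/metadata layout used above. This is a sketch only: `build_chat_prompts` and its optional `system` argument are hypothetical and not part of the package.

```python
from typing import Dict, List, Optional


def build_chat_prompts(questions: List[str],
                       system: Optional[str] = None) -> List[Dict]:
    """Wrap plain strings in the chat prompt layout shown in the Quick Start."""
    prompts = []
    for idx, question in enumerate(questions, start=1):
        messages = []
        if system is not None:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": question})
        # The metadata id mirrors the 1-based id used in the example above.
        prompts.append({"messages": messages, "metadata": {"id": idx}})
    return prompts


prompts = build_chat_prompts(["What is the capital of France?"])
print(prompts)
```

The resulting list has the same shape as the hand-written `prompts` in the Quick Start, so it can be passed to `run_batch_chat` in the same way.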

Advanced Usage

from vllm_autoconfig import AutoVLLMClient, SamplingConfig

# Fine-tune the configuration
client = AutoVLLMClient(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    context_len=2048,
    perf_mode="latency",          # or "throughput" (default)
    prefer_fp8_kv_cache=True,     # Enable FP8 KV cache if supported
    trust_remote_code=False,      # Set to True only for models requiring custom code
    debug=True,                    # Enable detailed logging
)

# Check the computed plan
print(f"Plan cache key: {client.plan.cache_key}")
print(f"vLLM kwargs: {client.plan.vllm_kwargs}")
print(f"Notes: {client.plan.notes}")

# Run inference with custom sampling
sampling = SamplingConfig(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
    stop=["###", "\n\n"]
)

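# Chat-format prompts, in the same layout as the Quick Start example
prompts = [
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "metadata": {"id": 1},
    }
]

results = client.run_batch_chat(prompts, sampling)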
client.close()
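
The example above prints `client.plan.cache_key`. How the package derives that key is not documented here; one common approach, sketched below under that assumption, is to hash a canonical JSON encoding of the inputs that affect the plan. The function name and the `gpu_name` and `total_memory_bytes` fields are hypothetical.

```python
import hashlib
import json


def plan_cache_key(model_name: str, context_len: int, gpu_name: str,
                   total_memory_bytes: int, perf_mode: str = "throughput",
                   prefer_fp8_kv_cache: bool = False) -> str:
    """Derive a stable key from every input that could change the plan."""
    payload = {
        "model_name": model_name,
        "context_len": context_len,
        "gpu_name": gpu_name,
        "total_memory_bytes": total_memory_bytes,
        "perf_mode": perf_mode,
        "prefer_fp8_kv_cache": prefer_fp8_kv_cache,
    }
    # sort_keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


key_a = plan_cache_key("mistralai/Mistral-7B-Instruct-v0.3", 2048, "example-gpu", 24 * 1024**3)
key_b = plan_cache_key("mistralai/Mistral-7B-Instruct-v0.3", 4096, "example-gpu", 24 * 1024**3)
print(key_a != key_b)  # True: a different context length yields a different key
```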

Embedding Generation

from vllm_autoconfig import AutoVLLMEmbedding
import numpy as np

# Initialize embedding client - only specify model and max length!
# GPU settings are automatically configured
client = AutoVLLMEmbedding(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=512,  # Typical for embeddings
)

# Generate embeddings for texts
texts = [
    "The cat sits on the mat",
    "A feline rests on the carpet",
    "Dogs are playing in the park",
]

embeddings = client.embed(texts, normalize=True)
print(f"Embeddings shape: {embeddings.shape}")  # (3, embedding_dim)

# Compute cosine similarity (embeddings are already normalized)
def cosine_similarity(a, b):
    return np.dot(a, b)

sim = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity between 'cat' and 'feline': {sim:.4f}")

client.close()

🛠️ How It Works

GPU Probing: Detects available GPU memory and capabilities (BF16 support, compute capability)
Model Analysis: Downloads model configuration from HuggingFace Hub and analyzes architecture
Weight Calculation: Computes actual model weight size from checkpoint files
Memory Planning: Calculates KV cache memory requirements based on context length and batch size
Configuration Generation: Produces optimal vLLM initialization parameters within hardware constraints
Caching: Saves the computed plan for reuse with the same configuration
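
The weight calculation and memory planning steps can be sketched as follows. This is an illustration under stated assumptions, not the package's planner: the function names, the fixed 2 GiB overhead, the 0.95 cap, and the rounding rule are all hypothetical.

```python
import math
from pathlib import Path

GIB = 1024 ** 3


def checkpoint_bytes(model_dir: str) -> int:
    """Sum the sizes of weight shards in a local model directory."""
    root = Path(model_dir)
    # Prefer safetensors; fall back to .bin so weights are not counted twice.
    for pattern in ("*.safetensors", "*.bin"):
        sizes = [p.stat().st_size for p in root.glob(pattern)]
        if sizes:
            return sum(sizes)
    return 0


def plan_gpu_memory_utilization(total_bytes: int, weight_bytes: int,
                                kv_bytes_per_token: int, context_len: int,
                                max_num_seqs: int,
                                overhead_bytes: int = 2 * GIB,
                                cap: float = 0.95) -> float:
    """Return the fraction of GPU memory needed for weights, KV cache, and overhead."""
    kv_bytes = kv_bytes_per_token * context_len * max_num_seqs
    fraction = (weight_bytes + kv_bytes + overhead_bytes) / total_bytes
    if fraction > cap:
        raise ValueError("requested context and batch size do not fit on this GPU")
    # Round up to two decimals so the budget is never under-provisioned.
    return min(cap, math.ceil(fraction * 100) / 100)


# 24 GiB GPU, 15 GiB of weights, 128 KiB of KV cache per token,
# 1024-token context, 16 concurrent sequences (all values illustrative).
util = plan_gpu_memory_utilization(24 * GIB, 15 * GIB, 131072, 1024, 16)
print(util)  # 0.8
```

Here the KV cache needs 2 GiB, so weights, cache, and overhead total 19 GiB of 24 GiB, which rounds up to 0.8.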

📊 Configuration Parameters

The AutoVLLMClient automatically configures:

model: Model name/path
max_model_len: Maximum sequence length
gpu_memory_utilization: GPU memory usage fraction
dtype: Weight precision (bfloat16 or float16)
kv_cache_dtype: KV cache precision (including FP8 when beneficial)
enforce_eager: Whether to run in eager mode, which disables CUDA graph capture
trust_remote_code: Whether to trust remote code execution
Model-specific parameters (e.g., tokenizer_mode, load_format for Mistral)
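
As a rough sketch of how these parameters could come together, the snippet below picks a weight dtype from the GPU's compute capability and assembles a kwargs dictionary. The helper names are hypothetical and the selection rules are assumptions; the dictionary keys are existing `vllm.LLM` arguments.

```python
from typing import Dict, Tuple


def choose_dtype(compute_capability: Tuple[int, int]) -> str:
    """Use bfloat16 on compute capability 8.0 or newer, else float16."""
    return "bfloat16" if compute_capability >= (8, 0) else "float16"


def build_vllm_kwargs(model: str, max_model_len: int,
                      gpu_memory_utilization: float,
                      compute_capability: Tuple[int, int],
                      use_fp8_kv_cache: bool = False,
                      enforce_eager: bool = False,
                      trust_remote_code: bool = False) -> Dict:
    """Assemble keyword arguments in the shape vllm.LLM(...) accepts."""
    return {
        "model": model,
        "max_model_len": max_model_len,
        "gpu_memory_utilization": gpu_memory_utilization,
        "dtype": choose_dtype(compute_capability),
        # "auto" lets vLLM store the KV cache in the model dtype.
        "kv_cache_dtype": "fp8" if use_fp8_kv_cache else "auto",
        "enforce_eager": enforce_eager,
        "trust_remote_code": trust_remote_code,
    }


kwargs = build_vllm_kwargs("meta-llama/Llama-3.1-8B-Instruct", 1024, 0.8, (8, 6))
print(kwargs["dtype"])           # bfloat16
print(kwargs["kv_cache_dtype"])  # auto
```

A dictionary of this shape is what one would expect to see in `client.plan.vllm_kwargs`, although the exact keys the planner emits may differ.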

🎛️ API Reference

`AutoVLLMClient`

AutoVLLMClient(
    model_name: str,              # HuggingFace model name or local path
    context_len: int,             # Desired context length
    device_index: int = 0,        # GPU device index
    perf_mode: str = "throughput", # "throughput" or "latency"
    trust_remote_code: bool = False,
    prefer_fp8_kv_cache: bool = False,
    enforce_eager: bool = False,
    local_files_only: bool = False,
    cache_plan: bool = True,      # Cache computed plans
    debug: bool = False,          # Enable debug logging
    vllm_logging_level: Optional[str] = None, # vLLM logging level
)

`SamplingConfig`

SamplingConfig(
    temperature: float = 0.0,     # Sampling temperature
    top_p: float = 1.0,           # Nucleus sampling threshold
    max_tokens: int = 32,         # Maximum tokens to generate
    stop: Optional[List[str]] = None, # Stop sequences
)

Methods

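run_batch(prompts, sampling, output_field="output"): Run inference on a batch of prompts. The Quick Start examples call run_batch_chat and run_batch_raw with the same prompts and sampling arguments.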
close(): Clean up resources and free GPU memory

`AutoVLLMEmbedding`

AutoVLLMEmbedding(
    model_name: str,              # HuggingFace model name or local path
    max_model_len: int = 512,     # Maximum sequence length (embeddings typically need less)
    pooling_type: str = "MEAN",   # "MEAN", "CLS", or "LAST"
    normalize: bool = False,      # Normalize inside vLLM (default False; embed() normalizes instead)
    device_index: int = 0,        # GPU device index
    perf_mode: str = "throughput", # "throughput" or "latency"
    trust_remote_code: bool = False,
    enforce_eager: bool = True,   # Better compatibility for embeddings
    local_files_only: bool = False,
    cache_plan: bool = True,      # Cache computed plans
    debug: bool = False,          # Enable debug logging
    vllm_logging_level: Optional[str] = None,
)

Note: GPU memory utilization, tensor parallelism, dtype, and other hardware-specific settings are automatically configured by the planner based on your GPU and model. You don't need to specify them!
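
To illustrate what the `pooling_type` options usually mean, here is a pure-Python sketch that reduces per-token vectors to one embedding. It assumes the conventional meanings (MEAN averages all tokens, CLS takes the first token, LAST takes the last token) and is not how vLLM or this package implements pooling internally.

```python
import math
from typing import List


def pool(token_vectors: List[List[float]], pooling_type: str = "MEAN") -> List[float]:
    """Reduce a sequence of per-token vectors to a single vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    kind = pooling_type.upper()
    if kind == "CLS":
        return list(token_vectors[0])
    if kind == "LAST":
        return list(token_vectors[-1])
    if kind == "MEAN":
        n = len(token_vectors)
        # zip(*...) iterates over dimensions, so each column is averaged.
        return [sum(column) / n for column in zip(*token_vectors)]
    raise ValueError(f"unknown pooling_type: {pooling_type}")


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


tokens = [[1.0, 0.0], [3.0, 4.0]]
print(pool(tokens, "MEAN"))                # [2.0, 2.0]
print(pool(tokens, "CLS"))                 # [1.0, 0.0]
print(l2_normalize(pool(tokens, "LAST")))  # [0.6, 0.8]
```

Once vectors are normalized to unit length, their dot product equals their cosine similarity, which is what the embedding example relies on.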

Methods

embed(texts, normalize=True): Generate embeddings for a list of texts, returns numpy array (N, D)
embed_batch(texts, normalize=True): Alias for embed()
close(): Clean up resources and free GPU memory
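
Because both clients expose `close()`, the standard library's `contextlib.closing` can guarantee cleanup even when inference raises. The sketch below uses a hypothetical `DummyClient` stand-in so that it runs without a GPU; whether the real clients also implement the context-manager protocol themselves is not stated here.

```python
from contextlib import closing


class DummyClient:
    """Hypothetical stand-in exposing embed() and close() like the real clients."""

    def __init__(self) -> None:
        self.closed = False

    def embed(self, texts):
        # Placeholder output: one single-element vector per input text.
        return [[0.0] for _ in texts]

    def close(self) -> None:
        self.closed = True


# closing() calls client.close() when the block exits, including on exceptions.
with closing(DummyClient()) as client:
    vectors = client.embed(["a", "b"])

print(len(vectors))   # 2
print(client.closed)  # True
```

With a real client, the same pattern would wrap `AutoVLLMClient(...)` or `AutoVLLMEmbedding(...)` in place of `DummyClient()`.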

🏗️ Project Structure

vllm-autoconfig/
├── src/vllm_autoconfig/
│   ├── __init__.py          # Package exports
│   ├── client.py            # AutoVLLMClient implementation (text generation)
│   ├── embedding.py         # AutoVLLMEmbedding implementation (embeddings)
│   ├── planner.py           # Configuration planning logic
│   ├── gpu_probe.py         # GPU detection and probing
│   ├── model_probe.py       # Model analysis utilities
│   ├── kv_math.py           # KV cache memory calculations
│   └── cache.py             # Plan caching utilities
├── examples/
│   ├── simple_run.py        # Text generation example
│   ├── embedding_simple.py  # Basic embedding example
│   └── embedding_similarity.py  # Advanced embedding similarity analysis
└── pyproject.toml

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built on top of vLLM - the high-performance LLM inference engine
Uses HuggingFace Transformers for model configuration

📚 Citation

If you use vllm-autoconfig in your research or production systems, please cite:

@software{vllm_speculative_autoconfig,
  title = {vllm-autoconfig: Automatic Configuration Planning for vLLM},
  author = {Benaya Trabelsi},
  year = {2025},
  url = {https://github.com/benayat/vllm-speculative-init}
}

🐛 Issues and Support

For issues, questions, or feature requests, please open an issue on GitHub.


