vLLM Tuner

A Python package for tuning vLLM hyperparameters.

📖 About

Automated hyperparameter tuning for vLLM inference servers. Uses Optuna to search over vLLM serving parameters (e.g. gpu_memory_utilization, max_num_seqs, max_num_batched_tokens) and finds configurations that maximize throughput, minimize latency, or balance both — with optional cost analysis.

Key features:

  • YAML config system with built-in presets (high throughput, low latency, balanced, cost-optimized)
  • GPU (NVIDIA) and TPU accelerator support
  • Local and Ray distributed execution backends
  • Benchmark providers: GuideLLM, HTTP (httpx), vLLM built-in
  • HTML / JSON / YAML reports, Helm values export
  • Kubernetes-ready with GPU and TPU job manifests
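
For a concrete picture, a minimal study config might look like the sketch below. The section names are assumptions modeled on the Pydantic config models shown in the architecture diagram (StudyConfig, BenchmarkConfig, HardwareConfig, ParameterConfig) and the parameters named above; the bundled configs/*.yaml files are the authoritative examples.

# Illustrative sketch: key names and ranges are assumptions, not the
# authoritative schema. See the bundled configs/*.yaml for real examples.
study:
  name: my-study
  preset: high_throughput
  n_trials: 50
benchmark:
  provider: guidellm
hardware:
  device: gpu
parameters:
  gpu_memory_utilization: {low: 0.70, high: 0.95}
  max_num_seqs: {low: 64, high: 512}
  max_num_batched_tokens: {low: 2048, high: 16384}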

🏗️ Architecture

High-Level Overview

graph TB
    subgraph CLI["CLI Layer"]
        run[run]
        resume[resume]
        report[report]
        export[export]
        validate[validate]
        recommend[recommend]
        list[list]
    end

    subgraph Config["Configuration"]
        yaml[YAML Config / Preset]
        pydantic[Pydantic Models<br/>StudyConfig · BenchmarkConfig<br/>HardwareConfig · ParameterConfig]
    end

    subgraph Engine["Optimization Engine"]
        optuna[Optuna Study<br/>TPE · NSGA-II · Random]
        study[StudyController]
        optimizer[Optimizer<br/>Objective Computation]
    end

    subgraph Execution["Execution Layer"]
        local[Local Backend<br/>Sequential · Subprocess]
        ray[Ray Backend<br/>Distributed · KubeRay]
    end

    subgraph Trial["Trial Lifecycle"]
        launcher[VLLMLauncher<br/>Process Group Management]
        trial_runner[TrialRunner<br/>Orchestrates 6-Step Lifecycle]
        benchmark[Benchmark Provider<br/>GuideLLM · HTTP · vLLM]
        hw_monitor[Hardware Monitor<br/>NVML · TPU · Null]
        telemetry[vLLM Telemetry Parser<br/>OOM · KV Cache · Preemption]
    end

    subgraph Output["Output Layer"]
        reports[Reports<br/>HTML · JSON · YAML]
        config_export[Config Export<br/>YAML · Helm Values]
        dashboard[Live Dashboard<br/>Rich Terminal UI]
    end

    run --> yaml --> pydantic --> study
    study --> optuna
    study --> local & ray
    local & ray --> trial_runner
    trial_runner --> launcher
    trial_runner --> benchmark
    trial_runner --> hw_monitor
    trial_runner --> telemetry
    optimizer --> study
    study --> reports & config_export
    trial_runner -.-> dashboard
    launcher -.-> dashboard

Trial Execution Flow

Each trial follows a 6-step lifecycle managed by TrialRunner:

flowchart LR
    A[Suggest<br/>Parameters] --> B[Start vLLM<br/>Server]
    B --> C{Health<br/>Check}
    C -- Ready --> D[Run<br/>Benchmark]
    C -- Failed --> G[Parse Logs<br/>OOM Detection]
    C -- Timeout --> G
    D --> E[Collect<br/>Metrics]
    E --> F[Stop Server<br/>& Cleanup]
    G --> F
    F --> H[Return<br/>TrialResult]

    style A fill:#4a9eff,color:#fff
    style D fill:#2ecc71,color:#fff
    style G fill:#e74c3c,color:#fff
    style H fill:#9b59b6,color:#fff

Optimization Loop

sequenceDiagram
    participant CLI
    participant Dashboard as Live Dashboard
    participant Baseline as BaselineRunner
    participant Study as StudyController
    participant Optuna
    participant Backend as Execution Backend
    participant Trial as TrialRunner
    participant vLLM as vLLM Server
    participant Bench as Benchmark Provider

    CLI->>Dashboard: start()
    CLI->>Baseline: run_baseline()
    Baseline->>Trial: run_trial(default params)
    Trial->>vLLM: start & health check
    Trial->>Bench: run(server_url)
    Bench-->>Trial: BenchmarkResult
    Trial-->>Baseline: baseline metrics
    Baseline-->>Dashboard: on_baseline_complete()

    CLI->>Study: optimize()
    loop N trials
        Study->>Optuna: suggest parameters
        Optuna-->>Study: trial params
        Study->>Backend: submit_trial(config)
        Backend->>Trial: run_trial(config)
        Trial->>vLLM: start(params)
        Trial->>vLLM: wait_until_ready()
        vLLM-->>Trial: ready
        Trial->>Bench: run(server_url)
        Bench-->>Trial: BenchmarkResult
        Trial->>Trial: collect telemetry & HW stats
        Trial->>vLLM: stop()
        Trial-->>Backend: TrialResult
        Backend-->>Study: results
        Study->>Optuna: report objective values
        Study-->>Dashboard: on_trial_complete()
    end
    Study-->>CLI: best config + report

Component Map

graph LR
    subgraph "vllm_tuner"
        cli["cli/<br/>main.py · rich_ui.py"]
        core["core/<br/>study_controller.py<br/>trial.py · models.py<br/>optimizer.py"]
        benchmarks["benchmarks/<br/>guidellm.py<br/>http_client.py<br/>vllm_benchmark.py"]
        execution["execution/<br/>local.py<br/>ray_backend.py"]
        vllm_mod["vllm/<br/>launcher.py<br/>telemetry.py"]
        hardware["hardware/<br/>nvml.py · null.py"]
        reporting["reporting/<br/>live_dashboard.py<br/>html.py · json_report.py"]
        baseline["baseline/<br/>runner.py"]
        config["config/<br/>loader.py · presets.py"]
    end

    cli --> core
    cli --> config
    core --> execution
    core --> benchmarks
    core --> vllm_mod
    core --> hardware
    execution --> core
    baseline --> core
    baseline --> vllm_mod
    cli --> baseline
    cli --> reporting
    core --> reporting

🚀 Quick Start

Prerequisites

[!IMPORTANT]

  • Python >= 3.11, < 3.14
  • uv package manager (installation guide)
  • vLLM installed on the target machine
  • Hugging Face account with API token for gated model access

Installation

# Install vLLM
uv pip install vllm

# Install vllm-tuner from PyPI
uv pip install vllm-tuner

# With all optional extras (GPU monitoring, GuideLLM, HTTP, Ray)
uv pip install "vllm-tuner[all]"

Running a Tuning Study

Minimal — single model, default settings

python -m vllm_tuner.cli.main run --model "Qwen/Qwen2.5-0.5B-Instruct"

This runs 50 trials with the high_throughput preset on a local GPU using the GuideLLM benchmark provider.

With a config file

Use one of the bundled base configs, or create your own:

| Config | Description |
| --- | --- |
| configs/high_throughput_gpu.yaml | Single GPU, maximize tokens/s |
| configs/low_latency_gpu.yaml | Single GPU, minimize p95 latency |
| configs/balanced_gpu.yaml | Single GPU, multi-objective (Pareto) |
| configs/high_throughput_tpu.yaml | TPU (GKE), maximize tokens/s |
| configs/multi_gpu.yaml | 4× GPU with tensor parallelism |

python -m vllm_tuner.cli.main run --config configs/high_throughput_gpu.yaml

# Override model from CLI
python -m vllm_tuner.cli.main run --config configs/low_latency_gpu.yaml --model "Qwen/Qwen2.5-7B-Instruct"

Low-latency optimization

python -m vllm_tuner.cli.main run \
    --model "meta-llama/Llama-3.1-8B-Instruct" \
    --preset low_latency \
    --n_trials 30 \
    --output_dir ./results

Available Presets

| Preset | Objective | Description |
| --- | --- | --- |
| high_throughput | maximize tokens/s | Best for batch inference workloads |
| low_latency | minimize p95 latency | Best for real-time applications |
| balanced | multi-objective | Trade-off between throughput & latency |
| cost_optimized | maximize throughput per $ | Best for cost-sensitive deployments |

CLI Commands

| Command | Description |
| --- | --- |
| run | Start a new tuning study |
| resume | Resume an interrupted study |
| report | Generate reports from a completed study |
| export | Export the optimal config (YAML/JSON/Helm) |
| list | List available presets, backends, or benchmark providers |
| validate | Validate a configuration file |
| recommend | Recommend vLLM parameters for a model |

# Resume an interrupted study
python -m vllm_tuner.cli.main resume --study_name my-study --storage sqlite:///study.db

# Generate an HTML report from a completed study
python -m vllm_tuner.cli.main report --study_name my-study --output_dir ./results

# Export the best config as YAML (also supports --helm for Helm values)
python -m vllm_tuner.cli.main export --study_name my-study --format yaml --output best.yaml

# Recommend vLLM parameters based on model and hardware (GPU)
python -m vllm_tuner.cli.main recommend --model "meta-llama/Llama-3.1-8B-Instruct" --vram 24 --num_gpus 1

# Recommend vLLM parameters for TPU (v6e, 8 chips per host by default)
python -m vllm_tuner.cli.main recommend --model "meta-llama/Llama-3.1-8B-Instruct" --device tpu --chip_type v6e

# Recommend for a specific number of TPU chips
python -m vllm_tuner.cli.main recommend --model "meta-llama/Llama-3.1-70B-Instruct" --device tpu --chip_type v5p --num_chips 8

# List available presets, backends, or benchmark providers
python -m vllm_tuner.cli.main list --what presets

# Validate a config file
python -m vllm_tuner.cli.main validate --config my_study.yaml
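
For orientation, the YAML written by the export command above carries the best trial's serving parameters. A rough, hypothetical sketch follows; the keys are the parameters the tuner searches over, and the values are placeholders, not tuned recommendations:

# Hypothetical sketch of an exported best config; the actual output
# schema is defined by the tool, and these values are placeholders.
model: meta-llama/Llama-3.1-8B-Instruct
gpu_memory_utilization: 0.92
max_num_seqs: 256
max_num_batched_tokens: 8192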

⚙️ Development Environment Setup

[!WARNING] Development targets Python 3.13 and uses uv for dependency management.

  1. Clone the repository:

    git clone <repository-url>
    cd <your-repo-name>
    
  2. Install uv following the official documentation.

  3. Create a virtual environment:

    uv venv --python 3.13
    
  4. Activate the environment:

    source .venv/bin/activate
    
  5. Install dependencies:

    uv sync --all-extras --no-install-project
    
  6. Setup pre-commit hooks:

    pre-commit install
    

📝 Contributing

  1. Fork the repository and create a feature branch.
  2. Follow existing code style (enforced by pre-commit hooks).
  3. Add tests for new functionality.
  4. Submit a PR for review.
