vLLM Tuner

A Python package for tuning vLLM hyperparameters.

📖 About

Automated hyperparameter tuning for vLLM inference servers. Uses Optuna to search over vLLM serving parameters (e.g. gpu_memory_utilization, max_num_seqs, max_num_batched_tokens) and finds configurations that maximize throughput, minimize latency, or balance both — with optional cost analysis.

Key features:

  • YAML config system with built-in presets (high throughput, low latency, balanced, cost-optimized)
  • GPU (NVIDIA) and TPU accelerator support
  • Local and Ray distributed execution backends
  • Benchmark providers: GuideLLM, HTTP (httpx), vLLM built-in
  • HTML / JSON / YAML reports, Helm values export
  • Kubernetes-ready with GPU and TPU job manifests
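
For a concrete picture, a minimal study config might look like the sketch below. The section names are assumptions modeled on the Pydantic config models shown in the architecture diagram (StudyConfig, BenchmarkConfig, HardwareConfig, ParameterConfig) and the parameters named above; the bundled configs/*.yaml files are the authoritative examples.

# Illustrative sketch: key names and ranges are assumptions, not the
# authoritative schema. See the bundled configs/*.yaml for real examples.
study:
  name: my-study
  preset: high_throughput
  n_trials: 50
benchmark:
  provider: guidellm
hardware:
  device: gpu
parameters:
  gpu_memory_utilization: {low: 0.70, high: 0.95}
  max_num_seqs: {low: 64, high: 512}
  max_num_batched_tokens: {low: 2048, high: 16384}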

🏗️ Architecture

High-Level Overview

graph TB
    subgraph CLI["CLI Layer"]
        run[run]
        resume[resume]
        report[report]
        export[export]
        validate[validate]
        recommend[recommend]
        list[list]
    end

    subgraph Config["Configuration"]
        yaml[YAML Config / Preset]
        pydantic[Pydantic Models<br/>StudyConfig · BenchmarkConfig<br/>HardwareConfig · ParameterConfig]
    end

    subgraph Engine["Optimization Engine"]
        optuna[Optuna Study<br/>TPE · NSGA-II · Random]
        study[StudyController]
        optimizer[Optimizer<br/>Objective Computation]
    end

    subgraph Execution["Execution Layer"]
        local[Local Backend<br/>Sequential · Subprocess]
        ray[Ray Backend<br/>Distributed · KubeRay]
    end

    subgraph Trial["Trial Lifecycle"]
        launcher[VLLMLauncher<br/>Process Group Management]
        trial_runner[TrialRunner<br/>Orchestrates 6-Step Lifecycle]
        benchmark[Benchmark Provider<br/>GuideLLM · HTTP · vLLM]
        hw_monitor[Hardware Monitor<br/>NVML · TPU · Null]
        telemetry[vLLM Telemetry Parser<br/>OOM · KV Cache · Preemption]
    end

    subgraph Output["Output Layer"]
        reports[Reports<br/>HTML · JSON · YAML]
        config_export[Config Export<br/>YAML · Helm Values]
        dashboard[Live Dashboard<br/>Rich Terminal UI]
    end

    run --> yaml --> pydantic --> study
    study --> optuna
    study --> local & ray
    local & ray --> trial_runner
    trial_runner --> launcher
    trial_runner --> benchmark
    trial_runner --> hw_monitor
    trial_runner --> telemetry
    optimizer --> study
    study --> reports & config_export
    trial_runner -.-> dashboard
    launcher -.-> dashboard

Trial Execution Flow

Each trial follows a 6-step lifecycle managed by TrialRunner:

flowchart LR
    A[Suggest<br/>Parameters] --> B[Start vLLM<br/>Server]
    B --> C{Health<br/>Check}
    C -- Ready --> D[Run<br/>Benchmark]
    C -- Failed --> G[Parse Logs<br/>OOM Detection]
    C -- Timeout --> G
    D --> E[Collect<br/>Metrics]
    E --> F[Stop Server<br/>& Cleanup]
    G --> F
    F --> H[Return<br/>TrialResult]

    style A fill:#4a9eff,color:#fff
    style D fill:#2ecc71,color:#fff
    style G fill:#e74c3c,color:#fff
    style H fill:#9b59b6,color:#fff

Optimization Loop

sequenceDiagram
    participant CLI
    participant Dashboard as Live Dashboard
    participant Baseline as BaselineRunner
    participant Study as StudyController
    participant Optuna
    participant Backend as Execution Backend
    participant Trial as TrialRunner
    participant vLLM as vLLM Server
    participant Bench as Benchmark Provider

    CLI->>Dashboard: start()
    CLI->>Baseline: run_baseline()
    Baseline->>Trial: run_trial(default params)
    Trial->>vLLM: start & health check
    Trial->>Bench: run(server_url)
    Bench-->>Trial: BenchmarkResult
    Trial-->>Baseline: baseline metrics
    Baseline-->>Dashboard: on_baseline_complete()

    CLI->>Study: optimize()
    loop N trials
        Study->>Optuna: suggest parameters
        Optuna-->>Study: trial params
        Study->>Backend: submit_trial(config)
        Backend->>Trial: run_trial(config)
        Trial->>vLLM: start(params)
        Trial->>vLLM: wait_until_ready()
        vLLM-->>Trial: ready
        Trial->>Bench: run(server_url)
        Bench-->>Trial: BenchmarkResult
        Trial->>Trial: collect telemetry & HW stats
        Trial->>vLLM: stop()
        Trial-->>Backend: TrialResult
        Backend-->>Study: results
        Study->>Optuna: report objective values
        Study-->>Dashboard: on_trial_complete()
    end
    Study-->>CLI: best config + report

Component Map

graph LR
    subgraph "vllm_tuner"
        cli["cli/<br/>main.py · rich_ui.py"]
        core["core/<br/>study_controller.py<br/>trial.py · models.py<br/>optimizer.py"]
        benchmarks["benchmarks/<br/>guidellm.py<br/>http_client.py<br/>vllm_benchmark.py"]
        execution["execution/<br/>local.py<br/>ray_backend.py"]
        vllm_mod["vllm/<br/>launcher.py<br/>telemetry.py"]
        hardware["hardware/<br/>nvml.py · null.py"]
        reporting["reporting/<br/>live_dashboard.py<br/>html.py · json_report.py"]
        baseline["baseline/<br/>runner.py"]
        config["config/<br/>loader.py · presets.py"]
    end

    cli --> core
    cli --> config
    core --> execution
    core --> benchmarks
    core --> vllm_mod
    core --> hardware
    execution --> core
    baseline --> core
    baseline --> vllm_mod
    cli --> baseline
    cli --> reporting
    core --> reporting

🚀 Quick Start

Prerequisites

[!IMPORTANT]

  • Python >= 3.11, < 3.14
  • uv package manager (installation guide)
  • vLLM installed on the target machine
  • Hugging Face account with API token for gated model access

Installation

# Install vLLM
uv pip install vllm

# Install vllm-tuner from PyPI
uv pip install vllm-tuner

# With all optional extras (GPU monitoring, GuideLLM, HTTP, Ray)
uv pip install "vllm-tuner[all]"

Running a Tuning Study

Minimal — single model, default settings

python -m vllm_tuner.cli.main run --model "Qwen/Qwen2.5-0.5B-Instruct"

This runs 50 trials with the high_throughput preset on a local GPU using the GuideLLM benchmark provider.

With a config file

Use one of the bundled base configs, or create your own:

| Config | Description |
| --- | --- |
| configs/high_throughput_gpu.yaml | Single GPU, maximize tokens/s |
| configs/low_latency_gpu.yaml | Single GPU, minimize p95 latency |
| configs/balanced_gpu.yaml | Single GPU, multi-objective (Pareto) |
| configs/high_throughput_tpu.yaml | TPU (GKE), maximize tokens/s |
| configs/multi_gpu.yaml | 4× GPU with tensor parallelism |

python -m vllm_tuner.cli.main run --config configs/high_throughput_gpu.yaml

# Override model from CLI
python -m vllm_tuner.cli.main run --config configs/low_latency_gpu.yaml --model "Qwen/Qwen2.5-7B-Instruct"

Low-latency optimization

python -m vllm_tuner.cli.main run \
    --model "meta-llama/Llama-3.1-8B-Instruct" \
    --preset low_latency \
    --n_trials 30 \
    --output_dir ./results

Available Presets

| Preset | Objective | Description |
| --- | --- | --- |
| high_throughput | maximize tokens/s | Best for batch inference workloads |
| low_latency | minimize p95 latency | Best for real-time applications |
| balanced | multi-objective | Trade-off between throughput & latency |
| cost_optimized | maximize throughput per $ | Best for cost-sensitive deployments |

CLI Commands

| Command | Description |
| --- | --- |
| run | Start a new tuning study |
| resume | Resume an interrupted study |
| report | Generate reports from a completed study |
| export | Export the optimal config (YAML/JSON/Helm) |
| list | List available presets, backends, or benchmark providers |
| validate | Validate a configuration file |
| recommend | Recommend vLLM parameters for a model |

# Resume an interrupted study
python -m vllm_tuner.cli.main resume --study_name my-study --storage sqlite:///study.db

# Generate an HTML report from a completed study
python -m vllm_tuner.cli.main report --study_name my-study --output_dir ./results

# Export the best config as YAML (also supports --helm for Helm values)
python -m vllm_tuner.cli.main export --study_name my-study --format yaml --output best.yaml

# Recommend vLLM parameters based on model and hardware (GPU)
python -m vllm_tuner.cli.main recommend --model "meta-llama/Llama-3.1-8B-Instruct" --vram 24 --num_gpus 1

# Recommend vLLM parameters for TPU (v6e, 8 chips per host by default)
python -m vllm_tuner.cli.main recommend --model "meta-llama/Llama-3.1-8B-Instruct" --device tpu --chip_type v6e

# Recommend for a specific number of TPU chips
python -m vllm_tuner.cli.main recommend --model "meta-llama/Llama-3.1-70B-Instruct" --device tpu --chip_type v5p --num_chips 8

# List available presets, backends, or benchmark providers
python -m vllm_tuner.cli.main list --what presets

# Validate a config file
python -m vllm_tuner.cli.main validate --config my_study.yaml
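
For orientation, the YAML written by the export command above carries the best trial's serving parameters. A rough, hypothetical sketch follows; the keys are the parameters the tuner searches over, and the values are placeholders, not tuned recommendations:

# Hypothetical sketch of an exported best config; the actual output
# schema is defined by the tool, and these values are placeholders.
model: meta-llama/Llama-3.1-8B-Instruct
gpu_memory_utilization: 0.92
max_num_seqs: 256
max_num_batched_tokens: 8192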

⚙️ Development Environment Setup

[!WARNING] Development targets Python 3.13 and uses uv for dependency management.

  1. Clone the repository:

    git clone <repository-url>
    cd <your-repo-name>
    
  2. Install uv following the official documentation.

  3. Create a virtual environment:

    uv venv --python 3.13
    
  4. Activate the environment:

    source .venv/bin/activate
    
  5. Install dependencies:

    uv sync --all-extras --no-install-project
    
  6. Setup pre-commit hooks:

    pre-commit install
    

📝 Contributing

  1. Fork the repository and create a feature branch.
  2. Follow existing code style (enforced by pre-commit hooks).
  3. Add tests for new functionality.
  4. Submit a PR for review.
