Skip to main content

Profile vLLM inference under RL-style rollout workloads.

Project description

hotpath

Profiler for LLM inference. Kernel timing, request lifecycle tracing, and disaggregation analysis for vLLM and SGLang.

What it does

Profile a live vLLM or SGLang endpoint with real traffic: capture CUDA kernel timing, Prometheus server metrics, and per-request latency breakdowns.

Analyze the results: prefill vs decode phase breakdown, KV cache efficiency, prefix sharing patterns, queue depth over time, TTFT and decode-per-token distributions.

Advise on disaggregation: an analytical M/G/1 queueing model estimates whether splitting prefill and decode onto separate GPU pools improves throughput. If recommended, hotpath generates ready-to-use deployment configs for vLLM, llm-d, and Dynamo.

Install

pip install hotpath

Quick start

Profile a live vLLM server:

hotpath serve-profile \
  --endpoint http://localhost:8000 \
  --traffic prompts.jsonl \
  --concurrency 4 \
  --duration 300 \
  --output .hotpath/run

View results:

hotpath serve-report .hotpath/run/serve_profile.db

Generate disaggregation deployment configs:

hotpath disagg-config .hotpath/run/serve_profile.db --format all

For full server-side timing (queue wait, prefill, decode phases), start vLLM with debug logging and pass the log file:

VLLM_LOGGING_LEVEL=DEBUG vllm serve <model> 2>vllm.log &

hotpath serve-profile \
  --endpoint http://localhost:8000 \
  --traffic prompts.jsonl \
  --server-log vllm.log \
  --concurrency 4 \
  --duration 300

For kernel-level GPU phase breakdown, add --nsys:

hotpath serve-profile --endpoint http://localhost:8000 --traffic prompts.jsonl --nsys

Traffic file format

JSONL, one request per line:

{"prompt": "Explain KV cache eviction policy.", "max_tokens": 256}
{"prompt": "Write a Python retry decorator with exponential backoff.", "max_tokens": 400}

ShareGPT format is also accepted.

Commands

Command Description
serve-profile Profile a live vLLM/SGLang server with traffic replay
serve-report Print a serving analysis report
disagg-config Generate deployment configs for disaggregated serving
profile GPU kernel profiling under RL-style rollout workloads
report View a saved kernel profile
diff Compare two kernel profiles
bench Benchmark individual GPU kernel implementations
export Export profile data to JSON, CSV, or OTLP
doctor Check local profiling environment
lock-clocks Lock GPU clocks for reproducible measurements

System requirements

  • Linux
  • NVIDIA GPU with CUDA driver
  • nsys (for kernel profiling; not required for serving analysis)
  • vLLM or SGLang (for serving analysis)

Build from source

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel
ctest --test-dir build --output-on-failure

Install from source:

python3 -m venv .venv && . .venv/bin/activate
pip install .

Requirements: CMake 3.28+, C++20 compiler, SQLite3.

How it works

hotpath is a single C++ binary with no runtime dependencies beyond SQLite3.

Data is collected from three sources:

  1. Kernel traces -- nsys captures GPU kernel execution. hotpath parses the output, categorizes kernels (GEMM, attention, MoE, etc.), and classifies them as prefill or decode phase by timing correlation with server events.

  2. Server metrics -- Prometheus metrics from vLLM or SGLang /metrics endpoints are polled at 1 Hz. Batch size, queue depth, KV cache utilization, and preemption counts are tracked over the profiling window.

  3. Request lifecycle -- vLLM debug logs are parsed to extract per-request timestamps: arrival, queue wait, prefill start, decode start, completion. These are stored as structured traces and can be exported as OpenTelemetry spans.

The disaggregation advisor uses a simplified M/G/1 queueing model to estimate whether splitting prefill and decode onto separate GPU pools would improve throughput. It searches over P:D ratios and accounts for KV transfer overhead to produce a concrete recommendation with estimated throughput improvement.

All data is stored in SQLite databases for offline analysis and comparison across runs.

Release notes

See CHANGELOG.md.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hotpath-0.2.3.tar.gz (208.2 kB view details)

Uploaded Source

File details

Details for the file hotpath-0.2.3.tar.gz.

File metadata

  • Download URL: hotpath-0.2.3.tar.gz
  • Upload date:
  • Size: 208.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for hotpath-0.2.3.tar.gz
Algorithm Hash digest
SHA256 0eddbd1a305bfea01b79f5a413eb1fce5cddd0edc8081ecfacb84d8455737bb1
MD5 1cd9146d695f47a766a4447220ebaf7b
BLAKE2b-256 e65ac9293448bc14dd541d30ec3197ec83742a95a85599c4558d27b5cefa7a33

See more details on using hashes here.

Provenance

The following attestation bundles were made for hotpath-0.2.3.tar.gz:

Publisher: release.yml on alityb/hotpath

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page