hotpath
Profiler for LLM inference under RL-style rollout workloads: kernel timing, request lifecycle tracing, and disaggregation analysis for vLLM and SGLang.
What it does
Profile a live vLLM or SGLang endpoint with real traffic: capture CUDA kernel timing, Prometheus server metrics, and per-request latency breakdowns.
Analyze the results: prefill vs decode phase breakdown, KV cache efficiency, prefix sharing patterns, queue depth over time, TTFT and decode-per-token distributions.
Advise on disaggregation: an analytical M/G/1 queueing model estimates whether splitting prefill and decode onto separate GPU pools improves throughput. If recommended, hotpath generates ready-to-use deployment configs for vLLM, llm-d, and Dynamo.
Install
pip install hotpath
Quick start
Profile a live vLLM server:
hotpath serve-profile \
--endpoint http://localhost:8000 \
--traffic prompts.jsonl \
--concurrency 4 \
--duration 300 \
--output .hotpath/run
View results:
hotpath serve-report .hotpath/run/serve_profile.db
Generate disaggregation deployment configs:
hotpath disagg-config .hotpath/run/serve_profile.db --format all
For full server-side timing (queue wait, prefill, decode phases), start vLLM with debug logging and pass the log file:
VLLM_LOGGING_LEVEL=DEBUG vllm serve <model> 2>vllm.log &
hotpath serve-profile \
--endpoint http://localhost:8000 \
--traffic prompts.jsonl \
--server-log vllm.log \
--concurrency 4 \
--duration 300
For kernel-level GPU phase breakdown, add --nsys:
hotpath serve-profile --endpoint http://localhost:8000 --traffic prompts.jsonl --nsys
Traffic file format
JSONL, one request per line:
{"prompt": "Explain KV cache eviction policy.", "max_tokens": 256}
{"prompt": "Write a Python retry decorator with exponential backoff.", "max_tokens": 400}
ShareGPT format is also accepted.
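A traffic file can also be generated programmatically. A minimal sketch that writes the JSONL format shown above, using only the two fields from the examples (prompt, max_tokens):

```python
import json

# Requests taken from the format examples above; add your own as needed.
requests = [
    {"prompt": "Explain KV cache eviction policy.", "max_tokens": 256},
    {"prompt": "Write a Python retry decorator with exponential backoff.", "max_tokens": 400},
]

# One JSON object per line, as hotpath expects.
with open("prompts.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```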
Commands
| Command | Description |
|---|---|
| serve-profile | Profile a live vLLM/SGLang server with traffic replay |
| serve-report | Print a serving analysis report |
| disagg-config | Generate deployment configs for disaggregated serving |
| profile | GPU kernel profiling under RL-style rollout workloads |
| report | View a saved kernel profile |
| diff | Compare two kernel profiles |
| bench | Benchmark individual GPU kernel implementations |
| export | Export profile data to JSON, CSV, or OTLP |
| doctor | Check local profiling environment |
| lock-clocks | Lock GPU clocks for reproducible measurements |
System requirements
- Linux
- NVIDIA GPU with CUDA driver
- nsys (for kernel profiling; not required for serving analysis)
- vLLM or SGLang (for serving analysis)
Build from source
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel
ctest --test-dir build --output-on-failure
Install from source:
python3 -m venv .venv && . .venv/bin/activate
pip install .
Requirements: CMake 3.28+, C++20 compiler, SQLite3.
How it works
hotpath is a single C++ binary with no runtime dependencies beyond SQLite3.
Data is collected from three sources:
- Kernel traces -- nsys captures GPU kernel execution. hotpath parses the output, categorizes kernels (GEMM, attention, MoE, etc.), and classifies them as prefill or decode phase by timing correlation with server events.
- Server metrics -- Prometheus metrics from vLLM or SGLang /metrics endpoints are polled at 1 Hz. Batch size, queue depth, KV cache utilization, and preemption counts are tracked over the profiling window.
- Request lifecycle -- vLLM debug logs are parsed to extract per-request timestamps: arrival, queue wait, prefill start, decode start, completion. These are stored as structured traces and can be exported as OpenTelemetry spans.
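The timing-correlation idea above can be sketched as follows. This is illustrative only, not hotpath's implementation; the phase-window representation and matching rule are assumptions:

```python
def classify_kernel(kernel_start_ns, phases):
    """Assign a GPU kernel to a request phase by timestamp.

    kernel_start_ns -- kernel launch time from the trace (ns)
    phases          -- list of (phase_name, start_ns, end_ns) windows
                       derived from request lifecycle events

    Returns the name of the phase whose window contains the kernel
    start, or "unknown" if no window matches.
    """
    for name, start, end in phases:
        if start <= kernel_start_ns < end:
            return name
    return "unknown"
```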
The disaggregation advisor uses a simplified M/G/1 queueing model to estimate whether splitting prefill and decode onto separate GPU pools would improve throughput. It searches over P:D ratios and accounts for KV transfer overhead to produce a concrete recommendation with estimated throughput improvement.
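For intuition on the queueing model, the textbook Pollaczek-Khinchine formula gives the mean queueing delay of an M/G/1 queue. A minimal sketch (this is the standard formula, not hotpath's actual advisor):

```python
def mg1_wait(lam, mean_s, second_moment_s):
    """Mean wait in queue for an M/G/1 queue (Pollaczek-Khinchine).

    lam             -- Poisson arrival rate (requests/s)
    mean_s          -- mean service time E[S] (s)
    second_moment_s -- second moment of service time E[S^2] (s^2)
    """
    rho = lam * mean_s  # server utilization
    assert rho < 1, "queue is unstable (rho >= 1)"
    return lam * second_moment_s / (2.0 * (1.0 - rho))

# Example: exponentially distributed service with mean 50 ms
# (so E[S^2] = 2 * 0.05**2) at 15 req/s gives rho = 0.75 and a
# mean queueing delay of 0.15 s.
```

Splitting prefill and decode onto separate pools changes both the service-time distribution and the utilization of each queue, which is why such a model can predict a win (or a loss, once KV transfer overhead is added back in).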
All data is stored in SQLite databases for offline analysis and comparison across runs.
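Because the output is ordinary SQLite, the databases can be inspected with any SQLite client. A small helper that lists whatever tables a profile contains, without assuming anything about hotpath's schema:

```python
import sqlite3

def list_tables(db_path):
    """Return the table names in a SQLite database, e.g. a hotpath
    serve_profile.db, queried via the standard sqlite_master catalog."""
    con = sqlite3.connect(db_path)
    try:
        return [name for (name,) in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    finally:
        con.close()
```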
Release notes
See CHANGELOG.md.