In-kernel %globaltimer profiler for warp-specialized CUDA kernels (Perfetto/Chrome timelines).
Project description
warpscope
English | 中文
Credits: the idea and design all come from 侯博涵 (Hou Bohan)'s write-up (zhihu); the wire format and the host-side decode / Perfetto export are adapted from Apache TVM TIRx
CudaProfiler(bench.py, docs). The implementation in this repo was written by Claude Opus (Anthropic).
In-kernel %globaltimer profiler for warp-specialized CUDA kernels. Bracket the
logical stages inside a kernel (TMA load, MMA, softmax, epilogue, ...) with
start/end markers; one leader thread per logical group stamps the GPU global timer
into a buffer you pass as an ordinary kernel argument. Decode it on the host into
per-(block, group) durations or a Perfetto / Chrome trace to see how the
producer and consumer warp-groups actually overlap — something total launch time and
SM-level counters can't show.
It is not zero cost (a timer read + a global store + a block fence per event), so it is a debugging/analysis tool. Build with the profiler disabled for production.
Layout
warpscope/
include/
warpscope.cuh # device header (header-only, NVRTC-safe)
warpscope_host.hpp # host decoder + Chrome-trace writer (header-only, pure C++)
*.py # Python: Profiler buffer mgmt, decode, trace export
examples/ # toy CUDA program (pure C++ path) + python driver
tests/ # wire-format + decode tests
Install
pip install -e . # core (numpy only)
pip install -e ".[torch]" # + GPU buffer allocation
pip install -e ".[dev]" # + pytest
Device side (CUDA C++)
#include <warpscope.cuh>
enum : uint32_t { EvWait = 0, EvWork = 1 };
__global__ void k(..., uint64_t* prof, uint32_t stride,
uint32_t num_groups, uint32_t num_blocks, uint32_t max_rec) {
ws::Profiler<true> p; // <false> compiles to a no-op
const uint32_t warp = threadIdx.x / 32, lane = threadIdx.x % 32;
if (warp == 0) { // e.g. TMA producer = group 0
p.init(prof, stride, /*group=*/0, num_groups, num_blocks,
/*leader=*/lane == 0, max_rec);
{ WS_REGION(p, EvWait); /* barrier wait */ } // RAII start/end
{ WS_REGION(p, EvWork); /* issue work */ }
p.finalize();
}
// ... other warp-groups: init with their own group id + one leader each ...
}
Build: nvcc -I"$(warpscope --include)" -arch=sm_100a my.cu
Host side — pick one
Pure C++ (header-only, no Python):
#include <warpscope_host.hpp>
std::vector<uint64_t> h(slots); // cudaMemcpy buffer back into h
ws::write_chrome_trace(h.data(), h.size(),
/*events*/ {"wait", "work"},
/*groups*/ {"tma", "umma", "utccp", "epilogue"},
"trace.json"); // open in chrome://tracing or perfetto
Python:
import warpscope as ws
prof = ws.Profiler(num_blocks=num_sms, num_groups=4, max_records_per_lane=64)
launch(..., prof.ptr) # pass the device pointer
torch.cuda.synchronize()
res = prof.decode(event_names={0: "wait", 1: "work"},
group_names={0: "tma", 1: "umma", 2: "utccp", 3: "epilogue"})
res.print_durations()
res.to_perfetto("trace.json") # Chrome JSON; opens in ui.perfetto.dev too
Output
The raw output is a uint64 buffer. Both host paths turn it into a Chrome Trace
Event JSON file (pid = block, tid = group, ts/dur in microseconds) that opens
directly in chrome://tracing and https://ui.perfetto.dev. A native
.perfetto-trace writer is available via the optional tg4perfetto dependency.
Wire format (v1, shared ABI)
record = (globaltimer_lo32 << 32) | tag32
tag32 = (block_group << 12) | (event_id << 2) | event_type
block_group = block_idx * num_groups + group_id
event_type : 0=begin 1=end 2=instant 3=finalize
buf[0] header = (num_groups << 32) | num_blocks
Identical to the format used by TIRx/flashinfer, so traces are cross-tool compatible.
Caveats
- Zero the buffer before launch (the decoder treats 0 as empty).
- Exactly one leader thread per
(block, group)lane (two writers clobber the cursor). %globaltimer_lois 32-bit ns: ~tens-of-ns resolution and a ~4.29 s wrap.- Persistent grids stream records — cap with
max_records_per_lane(host) which is also enforced device-side viainit(..., max_records_per_lane=...). - The fence + store perturb tight pipelines; keep events coarse and compare against an
unprofiled (
ws::Profiler<false>) build.
Credits & License
Licensed under the MIT License (see LICENSE).
- 侯博涵 (Hou Bohan) — original idea and write-up: https://zhuanlan.zhihu.com/p/2054305616391304228
- Apache TVM TIRx
CudaProfiler(Apache-2.0) — wire format + host decode/Perfetto export are adapted from it: https://github.com/apache/tvm/blob/main/python/tvm/tirx/bench.py · https://tvm.apache.org/docs/tirx/native_basics/cuda/profiling.html - The CUDA/Python implementation in this repository was written by Claude Opus (Anthropic).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file warpscope-0.1.0.tar.gz.
File metadata
- Download URL: warpscope-0.1.0.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9a40dc7fa29d355d1f18c5d495c61c67a4386c89c384d0a3bf665b85fddb225a
|
|
| MD5 |
13d667ff5142f359ef51ed9bf2acd3a2
|
|
| BLAKE2b-256 |
a81f22fc6892fa2ab0a848423fb98a3d7f2e915890763921fd276edc42d6fa05
|
File details
Details for the file warpscope-0.1.0-py3-none-any.whl.
File metadata
- Download URL: warpscope-0.1.0-py3-none-any.whl
- Upload date:
- Size: 17.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5a23bfd8e15dd596a640e32ad2986002106085b96a7a25a8a6e958ea54de1950
|
|
| MD5 |
ab84cf4f0b721aa92c3b66f9786aacb1
|
|
| BLAKE2b-256 |
0871dc846986788bd82b056a244bba2eda86a637267ca8ddf6872da0fe66cc06
|