In-kernel %globaltimer profiler for warp-specialized CUDA kernels (Perfetto/Chrome timelines).

These details have not been verified by PyPI

Project links

Homepage

Project description

warpscope

English | 中文

Credits: the idea and design all come from 侯博涵 (Hou Bohan)'s write-up (zhihu); the wire format and the host-side decode / Perfetto export are adapted from Apache TVM TIRx CudaProfiler (bench.py, docs). The implementation in this repo was written by Claude Opus (Anthropic).

In-kernel %globaltimer profiler for warp-specialized CUDA kernels. Bracket the logical stages inside a kernel (TMA load, MMA, softmax, epilogue, ...) with start/end markers; one leader thread per logical group stamps the GPU global timer into a buffer you pass as an ordinary kernel argument. Decode it on the host into per-(block, group) durations or a Perfetto / Chrome trace to see how the producer and consumer warp-groups actually overlap — something total launch time and SM-level counters can't show.

It is not zero cost (a timer read + a global store + a block fence per event), so it is a debugging/analysis tool. Build with the profiler disabled for production.

Layout

warpscope/
  include/
    warpscope.cuh         # device header (header-only, NVRTC-safe)
    warpscope_host.hpp    # host decoder + Chrome-trace writer (header-only, pure C++)
  *.py                    # Python: Profiler buffer mgmt, decode, trace export
examples/                 # toy CUDA program (pure C++ path) + python driver
tests/                    # wire-format + decode tests

Install

pip install -e .            # core (numpy only)
pip install -e ".[torch]"   # + GPU buffer allocation
pip install -e ".[dev]"     # + pytest

Device side (CUDA C++)

#include <warpscope.cuh>
enum : uint32_t { EvWait = 0, EvWork = 1 };

__global__ void k(..., uint64_t* prof, uint32_t stride,
                  uint32_t num_groups, uint32_t num_blocks, uint32_t max_rec) {
    ws::Profiler<true> p;     // <false> compiles to a no-op
    const uint32_t warp = threadIdx.x / 32, lane = threadIdx.x % 32;

    if (warp == 0) {                                  // e.g. TMA producer = group 0
        p.init(prof, stride, /*group=*/0, num_groups, num_blocks,
               /*leader=*/lane == 0, max_rec);
        { WS_REGION(p, EvWait); /* barrier wait */ }  // RAII start/end
        { WS_REGION(p, EvWork); /* issue work  */ }
        p.finalize();
    }
    // ... other warp-groups: init with their own group id + one leader each ...
}

Build: nvcc -I"$(warpscope --include)" -arch=sm_100a my.cu

Host side — pick one

Pure C++ (header-only, no Python):

#include <warpscope_host.hpp>
std::vector<uint64_t> h(slots);                 // cudaMemcpy buffer back into h
ws::write_chrome_trace(h.data(), h.size(),
    /*events*/ {"wait", "work"},
    /*groups*/ {"tma", "umma", "utccp", "epilogue"},
    "trace.json");                              // open in chrome://tracing or perfetto

Python:

import warpscope as ws
prof = ws.Profiler(num_blocks=num_sms, num_groups=4, max_records_per_lane=64)
launch(..., prof.ptr)        # pass the device pointer
torch.cuda.synchronize()
res = prof.decode(event_names={0: "wait", 1: "work"},
                  group_names={0: "tma", 1: "umma", 2: "utccp", 3: "epilogue"})
res.print_durations()
res.to_perfetto("trace.json")   # Chrome JSON; opens in ui.perfetto.dev too

Output

The raw output is a uint64 buffer. Both host paths turn it into a Chrome Trace Event JSON file (pid = block, tid = group, ts/dur in microseconds) that opens directly in chrome://tracing and https://ui.perfetto.dev. A native .perfetto-trace writer is available via the optional tg4perfetto dependency.

Wire format (v1, shared ABI)

record = (globaltimer_lo32 << 32) | tag32
tag32  = (block_group << 12) | (event_id << 2) | event_type
block_group = block_idx * num_groups + group_id
event_type : 0=begin 1=end 2=instant 3=finalize
buf[0] header = (num_groups << 32) | num_blocks

Identical to the format used by TIRx/flashinfer, so traces are cross-tool compatible.

Caveats

Zero the buffer before launch (the decoder treats 0 as empty).
Exactly one leader thread per (block, group) lane (two writers clobber the cursor).
%globaltimer_lo is 32-bit ns: ~tens-of-ns resolution and a ~4.29 s wrap.
Persistent grids stream records — cap with max_records_per_lane (host) which is also enforced device-side via init(..., max_records_per_lane=...).
The fence + store perturb tight pipelines; keep events coarse and compare against an unprofiled (ws::Profiler<false>) build.

Credits & License

Licensed under the MIT License (see LICENSE).

侯博涵 (Hou Bohan) — original idea and write-up: https://zhuanlan.zhihu.com/p/2054305616391304228
Apache TVM TIRx CudaProfiler (Apache-2.0) — wire format + host decode/Perfetto export are adapted from it: https://github.com/apache/tvm/blob/main/python/tvm/tirx/bench.py · https://tvm.apache.org/docs/tirx/native_basics/cuda/profiling.html
The CUDA/Python implementation in this repository was written by Claude Opus (Anthropic).

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

0.1.0

Jun 27, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

warpscope-0.1.0.tar.gz (16.2 kB view details)

Uploaded Jun 27, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

warpscope-0.1.0-py3-none-any.whl (17.0 kB view details)

Uploaded Jun 27, 2026 Python 3

File details

Details for the file warpscope-0.1.0.tar.gz.

File metadata

Download URL: warpscope-0.1.0.tar.gz
Upload date: Jun 27, 2026
Size: 16.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for warpscope-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9a40dc7fa29d355d1f18c5d495c61c67a4386c89c384d0a3bf665b85fddb225a`
MD5	`13d667ff5142f359ef51ed9bf2acd3a2`
BLAKE2b-256	`a81f22fc6892fa2ab0a848423fb98a3d7f2e915890763921fd276edc42d6fa05`

See more details on using hashes here.

File details

Details for the file warpscope-0.1.0-py3-none-any.whl.

File metadata

Download URL: warpscope-0.1.0-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 17.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for warpscope-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5a23bfd8e15dd596a640e32ad2986002106085b96a7a25a8a6e958ea54de1950`
MD5	`ab84cf4f0b721aa92c3b66f9786aacb1`
BLAKE2b-256	`0871dc846986788bd82b056a244bba2eda86a637267ca8ddf6872da0fe66cc06`

See more details on using hashes here.

warpscope 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

warpscope

Layout

Install

Device side (CUDA C++)

Host side — pick one

Output

Wire format (v1, shared ABI)

Caveats

Credits & License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes