Skip to main content

In-kernel %globaltimer profiler for warp-specialized CUDA kernels (Perfetto/Chrome timelines).

Project description

warpscope

English | 中文

Credits: the idea and design all come from 侯博涵 (Hou Bohan)'s write-up (zhihu); the wire format and the host-side decode / Perfetto export are adapted from Apache TVM TIRx CudaProfiler (bench.py, docs). The implementation in this repo was written by Claude Opus (Anthropic).

In-kernel %globaltimer profiler for warp-specialized CUDA kernels. Bracket the logical stages inside a kernel (TMA load, MMA, softmax, epilogue, ...) with start/end markers; one leader thread per logical group stamps the GPU global timer into a buffer you pass as an ordinary kernel argument. Decode it on the host into per-(block, group) durations or a Perfetto / Chrome trace to see how the producer and consumer warp-groups actually overlap — something total launch time and SM-level counters can't show.

It is not zero cost (a timer read + a global store + a block fence per event), so it is a debugging/analysis tool. Build with the profiler disabled for production.

Layout

warpscope/
  include/
    warpscope.cuh         # device header (header-only, NVRTC-safe)
    warpscope_host.hpp    # host decoder + Chrome-trace writer (header-only, pure C++)
  *.py                    # Python: Profiler buffer mgmt, decode, trace export
examples/                 # toy CUDA program (pure C++ path) + python driver
tests/                    # wire-format + decode tests

Install

pip install -e .            # core (numpy only)
pip install -e ".[torch]"   # + GPU buffer allocation
pip install -e ".[dev]"     # + pytest

Device side (CUDA C++)

#include <warpscope.cuh>
enum : uint32_t { EvWait = 0, EvWork = 1 };

__global__ void k(..., uint64_t* prof, uint32_t stride,
                  uint32_t num_groups, uint32_t num_blocks, uint32_t max_rec) {
    ws::Profiler<true> p;     // <false> compiles to a no-op
    const uint32_t warp = threadIdx.x / 32, lane = threadIdx.x % 32;

    if (warp == 0) {                                  // e.g. TMA producer = group 0
        p.init(prof, stride, /*group=*/0, num_groups, num_blocks,
               /*leader=*/lane == 0, max_rec);
        { WS_REGION(p, EvWait); /* barrier wait */ }  // RAII start/end
        { WS_REGION(p, EvWork); /* issue work  */ }
        p.finalize();
    }
    // ... other warp-groups: init with their own group id + one leader each ...
}

Build: nvcc -I"$(warpscope --include)" -arch=sm_100a my.cu

Host side — pick one

Pure C++ (header-only, no Python):

#include <warpscope_host.hpp>
std::vector<uint64_t> h(slots);                 // cudaMemcpy buffer back into h
ws::write_chrome_trace(h.data(), h.size(),
    /*events*/ {"wait", "work"},
    /*groups*/ {"tma", "umma", "utccp", "epilogue"},
    "trace.json");                              // open in chrome://tracing or perfetto

Python:

import warpscope as ws
prof = ws.Profiler(num_blocks=num_sms, num_groups=4, max_records_per_lane=64)
launch(..., prof.ptr)        # pass the device pointer
torch.cuda.synchronize()
res = prof.decode(event_names={0: "wait", 1: "work"},
                  group_names={0: "tma", 1: "umma", 2: "utccp", 3: "epilogue"})
res.print_durations()
res.to_perfetto("trace.json")   # Chrome JSON; opens in ui.perfetto.dev too

Output

The raw output is a uint64 buffer. Both host paths turn it into a Chrome Trace Event JSON file (pid = block, tid = group, ts/dur in microseconds) that opens directly in chrome://tracing and https://ui.perfetto.dev. A native .perfetto-trace writer is available via the optional tg4perfetto dependency.

Wire format (v1, shared ABI)

record = (globaltimer_lo32 << 32) | tag32
tag32  = (block_group << 12) | (event_id << 2) | event_type
block_group = block_idx * num_groups + group_id
event_type : 0=begin 1=end 2=instant 3=finalize
buf[0] header = (num_groups << 32) | num_blocks

Identical to the format used by TIRx/flashinfer, so traces are cross-tool compatible.

Caveats

  • Zero the buffer before launch (the decoder treats 0 as empty).
  • Exactly one leader thread per (block, group) lane (two writers clobber the cursor).
  • %globaltimer_lo is 32-bit ns: ~tens-of-ns resolution and a ~4.29 s wrap.
  • Persistent grids stream records — cap with max_records_per_lane (host) which is also enforced device-side via init(..., max_records_per_lane=...).
  • The fence + store perturb tight pipelines; keep events coarse and compare against an unprofiled (ws::Profiler<false>) build.

Credits & License

Licensed under the MIT License (see LICENSE).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

warpscope-0.1.0.tar.gz (16.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

warpscope-0.1.0-py3-none-any.whl (17.0 kB view details)

Uploaded Python 3

File details

Details for the file warpscope-0.1.0.tar.gz.

File metadata

  • Download URL: warpscope-0.1.0.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for warpscope-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9a40dc7fa29d355d1f18c5d495c61c67a4386c89c384d0a3bf665b85fddb225a
MD5 13d667ff5142f359ef51ed9bf2acd3a2
BLAKE2b-256 a81f22fc6892fa2ab0a848423fb98a3d7f2e915890763921fd276edc42d6fa05

See more details on using hashes here.

File details

Details for the file warpscope-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: warpscope-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 17.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for warpscope-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5a23bfd8e15dd596a640e32ad2986002106085b96a7a25a8a6e958ea54de1950
MD5 ab84cf4f0b721aa92c3b66f9786aacb1
BLAKE2b-256 0871dc846986788bd82b056a244bba2eda86a637267ca8ddf6872da0fe66cc06

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page