Skip to main content

Query every CUDA device attribute without profiling a kernel, and benchmark your kernels against the hardware's speed of light.

Project description

kernelmeter

PyPI CI Python License: MIT

Small tools for one question: is my GPU kernel actually good, and if not, what exactly is holding it back? All in one package with zero required dependencies.

  • kernelmeter info prints every device attribute your GPU driver knows, plus the card's theoretical peak bandwidth and FP32 throughput. No CUDA toolkit, no torch, no kernel launch. It reads straight from libcuda, which is part of the NVIDIA driver.
  • kernelmeter bench times your kernel, checks the output against a reference, and scores it against the roofline: the best your card could possibly do for that kernel's mix of math and memory traffic. 240 GB/s means nothing on its own; "76% of attainable" tells you how much room is left. While the kernel runs it also samples the real clocks, power and temperature through NVML, and re-scores against the ceiling the card actually held.
  • kernelmeter roofline draws your card's roofline in the terminal and shows where a kernel sits on it.
  • kernelmeter occupancy answers "why is my occupancy 50%?" from block size, registers and shared memory, and shows which block sizes fix it.
  • kernelmeter ceiling measures what the card really delivers (STREAM-style bandwidth tests plus a big FP32 matmul), because spec sheet numbers are never fully reachable.

Install

pip install kernelmeter           # info only, no dependencies
pip install "kernelmeter[bench]"  # adds torch for the bench harness

Or from source:

git clone https://github.com/nuemaan/kernelmeter
cd kernelmeter
pip install -e ".[bench]"

Querying your GPU

kernelmeter info

Output from a Tesla T4:

CUDA driver version : 13.0

Device 0: Tesla T4 (14.6 GiB)
  compute capability        : 7.5
  theoretical mem bandwidth : 320.1 GB/s
  theoretical FP32 peak     : 8.14 TFLOP/s
  theoretical fp16 tensor   : 65.13 TFLOP/s (dense)
  architecture (nvml)       : Turing, 2560 CUDA cores
  pcie link (nvml)          : gen1/3 x8/16
  memory in use (nvml)      : 450 / 15360 MiB
  ecc (nvml)                : on
  vbios (nvml)              : 90.04.96.00.02

  attribute                                        value
  ------------------------------------------------ ------------
  max_threads_per_block                            1024
  max_block_dim_x                                  1024
  max_shared_memory_per_block                      49152
  warp_size                                        32
  clock_rate_khz                                   1590000
  ...                                              (148 attributes total)

The attribute table is read straight from the driver via cuDeviceGetAttribute, the same values Nsight Compute shows as device__attribute_*, but you don't need to profile a kernel to see them. Every id is probed live, so the output matches the machine you run it on; ids newer than the bundled name table show up as attribute_<id>.

The (nvml) lines come from a second source: NVML, the library behind nvidia-smi, also shipped with the driver. They surface facts the driver attribute enum doesn't have (architecture name, real CUDA core count, PCIe link, live memory use, ECC, VBIOS) and are skipped silently if NVML isn't present. (The gen1/3 x8/16 above is the live link: an idle T4 drops to a lower PCIe state and ramps up under load.) Add --json for machine-readable output; the NVML block lands under devices[].nvml.

Benchmarking a kernel

Three steps.

1. Write your kernel in a file and decorate it. Anything callable from Python works: Triton kernels, custom CUDA extensions, torch.compile output, CuPy. Here is a complete file you can copy:

# mybench.py
import torch
import kernelmeter as km

N = 1 << 26  # work on big inputs so you measure memory, not cache

def make_args():
    return (torch.randn(N, device="cuda"), torch.randn(N, device="cuda"))

@km.benchmark(
    "my_add",
    args=make_args,                 # builds fresh inputs for the run
    ref=torch.add,                  # trusted implementation to compare with
    bytes_per_call=lambda x, y: 3 * x.numel() * x.element_size(),
)
def my_add(x, y):
    return x + y                    # <- replace with your kernel

bytes_per_call is how much memory the algorithm has to move (here: read x, read y, write the result). The tool divides it by measured time to get your effective bandwidth.

2. Run it.

kernelmeter bench mybench.py

3. Read the result. From a T4, with the add written as a Triton kernel:

kernel                    median ms      GB/s   TFLOP/s  bound    %roof   vs ref  correct
------------------------------------------------------------------------------------------
my_add                       3.2725     246.1         -    mem    76.9%    1.03x     PASS
  • correct - your output matched the reference. If this says FAIL, nothing else on the line matters.
  • bound - whether the memory system (mem) or the ALUs (comp) limit this kernel, decided by its arithmetic intensity (flops per byte).
  • %roof - how close you are to the best this card could possibly do for that intensity. This is the score to improve. Above ~80% there is little left to win.
  • vs ref - speedup over the reference implementation.

Pass flops_per_call too and the roofline model places your kernel precisely; pass peak_tflops=... if your kernel runs on tensor cores so it gets judged against the right ceiling (kernelmeter info prints the derived fp16/tf32 tensor peaks for your card). Raw %peak bw and %fp32 numbers are always in the --json output.

When NVML is available (it ships with the driver) a second table follows with what the card was doing during each measurement:

telemetry                    sm MHz   mem MHz   temp   power  %roof@clk
-----------------------------------------------------------------------
my_add                    1062/1590      5000    42C   53.1W      76.9%

%roof@clk is the same roofline score, but against the ceiling at the clocks the card actually held. If %roof looks bad but %roof@clk is high, your kernel is fine: the card is thermal or power limited, and no amount of kernel work will change that. A real example, cuBLAS fp32 matmul on a 70 W T4:

kernel                    median ms      GB/s   TFLOP/s  bound   %roof   vs ref  correct
----------------------------------------------------------------------------------------
fp32_matmul                 32.0354       6.3      4.29   comp   52.7%        -        -

telemetry                    sm MHz   mem MHz   temp   power  %roof@clk
-----------------------------------------------------------------------
fp32_matmul                877/1590      5000    46C   70.4W      95.5%

53% of peak looks like a kernel problem. The telemetry shows it is not: the card hit its 70 W power limit and dropped to 877 MHz, and at those clocks the kernel was at 95.5% of what the silicon could deliver. cuBLAS was never the problem.

Timing uses CUDA events with warmup, and the L2 cache is flushed between iterations so small workloads can't fake huge bandwidth numbers from cache hits. Pass --no-flush-l2 if you want cache-hot numbers.

The examples folder has ready-to-run starting points: two Triton kernels (vector add, fused softmax) and a compute-bound matmul.

Seeing the roofline

kernelmeter roofline --ai 0.33        # mark a kernel at 0.33 flop/byte
  peak bandwidth : 320.0 GB/s
  peak compute   : 8.14 TFLOP/s
  ridge point    : 25.4 flop/byte

8.14 TF/s |                                      **x*****************
          |                                   ***
          |                               ****
          |                            ***
          |                        ****
          |                    ****
          |                 ***
          |             ****
          |          ***
          |      *o**
          |  ****
          |**
          +----------------------------------------------------------
           2^-3            2^0            2^3             2^6

at 0.33 flop/byte the kernel is memory-bound; attainable: 0.11 TFLOP/s

The o is your kernel, the x is the ridge point. Left of the ridge, more FLOPs are free: the memory traffic is the bill you are paying anyway. That is the whole argument for kernel fusion, in one picture. No GPU around? --peak-bw and --peak-tflops let you draw any card. --tensor swaps in the fp16 tensor-core roof, which moves the ridge point far to the left; that picture explains why tensor-core kernels are almost always memory-bound.

Why is my occupancy low?

Feed it what ptxas -v or Nsight Compute tells you about your kernel:

kernelmeter occupancy --block 256 --regs 64 --smem 8192 --cc 8.6
occupancy for compute capability 8.6
  block=256 regs/thread=64 smem/block=8192

  occupancy    : 66.7% (32/48 warps per SM)
  blocks per SM: 4
  limited by   : registers

  block size      64    128    192    256    384    512    768   1024
  occupancy      46%    67%    62%    67%    50%    67%    50%    67%

It names the resource that is capping you and sweeps block sizes so you can see if a different launch shape helps. Works with no GPU present: pass --cc for any architecture from 7.0 (Volta) to 12.x (Blackwell).

What can the card really do?

Theoretical peaks assume the max boost clock, which the card cannot hold. Measure the real ceilings once and judge your kernels against those:

kernelmeter ceiling

This runs the four STREAM kernels (copy, scale, add, triad) and a large TF32-disabled matmul. On the same T4:

test            median ms      GB/s   TFLOP/s  % of theoretical
---------------------------------------------------------------
copy               1.1495     233.5         -             73.0%
scale              1.1674     230.0         -             71.8%
add                1.6903     238.2         -             74.4%
triad              1.6878     238.6         -             74.5%
fp32 matmul        3.5563         -      4.83             59.3%

measured bandwidth ceiling: 238.6 GB/s (use this as the honest 100%
for memory-bound kernels)

This reframes the bench results above: the vector add that scored "75.3% of theoretical" was moving 241 GB/s on a card whose memory system tops out at 238.6 GB/s in practice. It was already saturated. Without the measured ceiling you would have kept optimizing a finished kernel.

Catching regressions

kernelmeter bench mykernels.py --save baseline.json
# ...edit your kernels...
kernelmeter bench mykernels.py --compare baseline.json

The compare run prints a delta column per kernel and exits non-zero if anything got more than 5% slower, so it slots straight into CI.

A workflow that works

If you are learning CUDA (say, working through the PMPP book) and wondering whether your kernels are any good:

  1. Run kernelmeter info and kernelmeter ceiling once. Now you know your card's real limits.
  2. Benchmark your kernel with bytes_per_call and flops_per_call set. The bound column tells you which resource you are fighting.
  3. %roof under ~60%? If the kernel is memory-bound, check occupancy first: too few warps in flight cannot hide memory latency. Then open Nsight Compute. Now you know what you are looking for, instead of staring at forty unfamiliar counters.
  4. %roof above ~80%? Stop optimizing this kernel. The next win is algorithmic (fuse it with a neighbor, move less data), and the roofline chart shows why: left of the ridge, FLOPs are free.

Caveats

  • Theoretical peaks are computed from the max boost clock the driver reports. Sustained clocks under load are lower; the telemetry table and kernelmeter ceiling both show what you can actually reach.
  • The tensor-core peaks are dense rates with fp16 accumulate. GeForce cards run tensor cores at half rate when accumulating in fp32, and sparse rates are double; pass peak_tflops=... when those apply.
  • The occupancy command implements the standard calculator model. Real occupancy can differ (launch bounds, driver decisions); confirm with Nsight Compute when it matters.
  • Attribute names above id 143 are best-effort against the CUDA 12.x headers. Values are always read live from your driver. PRs that extend the name table are welcome.

Development

pip install -e ".[dev]"
pytest

The tests fake the driver, so they run anywhere, no GPU needed. CI runs them on plain GitHub runners. For an end-to-end check on a real GPU there is a Modal script: modal run scripts/modal_gpu_test.py. The numbers in this README come from that script on a T4.

Releases are tag-driven: bump the version in pyproject.toml, add a CHANGELOG.md entry, push a v* tag. CI tests, builds and publishes to PyPI through trusted publishing.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kernelmeter-0.4.0.tar.gz (39.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kernelmeter-0.4.0-py3-none-any.whl (31.9 kB view details)

Uploaded Python 3

File details

Details for the file kernelmeter-0.4.0.tar.gz.

File metadata

  • Download URL: kernelmeter-0.4.0.tar.gz
  • Upload date:
  • Size: 39.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for kernelmeter-0.4.0.tar.gz
Algorithm Hash digest
SHA256 43d85ba3a7b97dcdfe9ae7460b8ffe1c3600dfb8b12c80dc1aba1e1af0f98fa7
MD5 2dadbc6c1b450d3fd310ef74bdece3c8
BLAKE2b-256 9eda389c4a1e8fefdf5a8c5f4d906873e4c56dbc9bbd8c4c4880885e30b29897

See more details on using hashes here.

Provenance

The following attestation bundles were made for kernelmeter-0.4.0.tar.gz:

Publisher: release.yml on nuemaan/kernelmeter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kernelmeter-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: kernelmeter-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 31.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for kernelmeter-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6b25483ed29880691b05b94ef53d24a409322298b40ace2ccfce43c643be08c7
MD5 e7cfb659141c5492277bc5df574ec588
BLAKE2b-256 a2b0697caff1e887a44d4b1605b4f1351c21db12bd4dea733e6571675c5404d1

See more details on using hashes here.

Provenance

The following attestation bundles were made for kernelmeter-0.4.0-py3-none-any.whl:

Publisher: release.yml on nuemaan/kernelmeter

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page