Query every CUDA device attribute without profiling a kernel, and benchmark your kernels against the hardware's speed of light.
kernelmeter
Small tools for one question: is my GPU kernel actually good, and if not, what exactly is holding it back? All in one package with zero required dependencies.
- kernelmeter info prints every device attribute your GPU driver knows, plus the card's theoretical peak bandwidth and FP32 throughput. No CUDA toolkit, no torch, no kernel launch. It reads straight from libcuda, which is part of the NVIDIA driver.
- kernelmeter bench times your kernel, checks the output against a reference, and scores it against the roofline: the best your card could possibly do for that kernel's mix of math and memory traffic. 240 GB/s means nothing on its own; "76% of attainable" tells you how much room is left. While the kernel runs it also samples the real clocks, power and temperature through NVML, and re-scores against the ceiling the card actually held.
- kernelmeter roofline draws your card's roofline in the terminal and shows where a kernel sits on it.
- kernelmeter occupancy answers "why is my occupancy 50%?" from block size, registers and shared memory, and shows which block sizes fix it.
- kernelmeter ceiling measures what the card really delivers (STREAM-style bandwidth tests plus a big FP32 matmul), because spec sheet numbers are never fully reachable.
Install
pip install kernelmeter # info only, no dependencies
pip install "kernelmeter[bench]" # adds torch for the bench harness
Or from source:
git clone https://github.com/nuemaan/kernelmeter
cd kernelmeter
pip install -e ".[bench]"
Querying your GPU
kernelmeter info
Output from a Tesla T4:
CUDA driver version : 13.0
Device 0: Tesla T4 (14.6 GiB)
compute capability : 7.5
theoretical mem bandwidth : 320.1 GB/s
theoretical FP32 peak : 8.14 TFLOP/s
theoretical fp16 tensor : 65.13 TFLOP/s (dense)
architecture (nvml) : Turing, 2560 CUDA cores
pcie link (nvml) : gen1/3 x8/16
memory in use (nvml) : 450 / 15360 MiB
ecc (nvml) : on
vbios (nvml) : 90.04.96.00.02
attribute value
------------------------------------------------ ------------
max_threads_per_block 1024
max_block_dim_x 1024
max_shared_memory_per_block 49152
warp_size 32
clock_rate_khz 1590000
... (148 attributes total)
The attribute table is read straight from the driver via
cuDeviceGetAttribute, the same values Nsight Compute shows as
device__attribute_*, but you don't need to profile a kernel to see them.
Every id is probed live, so the output matches the machine you run it on;
ids newer than the bundled name table show up as attribute_<id>.
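As a rough illustration of what reading attributes straight from the driver can look like, the sketch below calls cuDeviceGetAttribute through ctypes. It is not kernelmeter's own code: the function name query_attributes and the Linux library name libcuda.so.1 are assumptions made for this example, and the function returns None when no NVIDIA driver is installed.

```python
import ctypes

# Attribute ids from the CUDA driver API enum CUdevice_attribute.
MAX_THREADS_PER_BLOCK = 1
WARP_SIZE = 10
CLOCK_RATE_KHZ = 13


def query_attributes(attr_ids, device_index=0, libname="libcuda.so.1"):
    """Return {attribute id: value}, or None if the driver is unavailable."""
    try:
        cuda = ctypes.CDLL(libname)
    except OSError:
        return None  # no NVIDIA driver on this machine
    if cuda.cuInit(0) != 0:  # 0 is CUDA_SUCCESS
        return None
    device = ctypes.c_int()
    if cuda.cuDeviceGet(ctypes.byref(device), device_index) != 0:
        return None
    values = {}
    for attr in attr_ids:
        out = ctypes.c_int()
        status = cuda.cuDeviceGetAttribute(ctypes.byref(out), attr, device)
        if status == 0:  # ids this driver does not know are skipped
            values[attr] = out.value
    return values


if __name__ == "__main__":
    print(query_attributes([MAX_THREADS_PER_BLOCK, WARP_SIZE, CLOCK_RATE_KHZ]))
```

Probing ids one at a time and skipping the ones that return an error is one way to get output that matches whatever driver is installed, without needing the CUDA toolkit headers.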
The (nvml) lines come from a second source: NVML, the library behind
nvidia-smi, also shipped with the driver. They surface facts the driver
attribute enum doesn't have (architecture name, real CUDA core count,
PCIe link, live memory use, ECC, VBIOS) and are skipped silently if NVML
isn't present. (The gen1/3 x8/16 above is the live link: an idle T4
drops to a lower PCIe state and ramps up under load.) Add --json for
machine-readable output; the NVML block lands under devices[].nvml.
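The two theoretical peaks in the T4 output can be checked by hand. The sketch below uses the textbook formulas, which may differ in detail from what kernelmeter does internally. The core count and clock rate come from the output above, the 5000 MHz memory clock from the telemetry table later in this README, and the 256-bit bus width is an assumption taken from the T4's public specification. The function names are made up for this example.

```python
def peak_fp32_tflops(cuda_cores, boost_clock_khz):
    # One fused multiply-add per core per cycle counts as 2 FLOPs.
    return cuda_cores * 2 * boost_clock_khz * 1e3 / 1e12


def peak_bandwidth_gbs(mem_clock_mhz, bus_width_bits, transfers_per_clock=2):
    # Double data rate: two transfers per memory clock, each bus_width wide.
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_clock * mem_clock_mhz * 1e6 * bytes_per_transfer / 1e9


print(round(peak_fp32_tflops(2560, 1_590_000), 2))  # 8.14
print(round(peak_bandwidth_gbs(5000, 256), 1))      # 320.0
```

The bandwidth result lands within rounding of the 320.1 GB/s that kernelmeter info prints.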
Benchmarking a kernel
Three steps.
1. Write your kernel in a file and decorate it. Anything callable from
Python works: Triton kernels, custom CUDA extensions, torch.compile
output, CuPy. Here is a complete file you can copy:
# mybench.py
import torch
import kernelmeter as km

N = 1 << 26  # work on big inputs so you measure memory, not cache

def make_args():
    return (torch.randn(N, device="cuda"), torch.randn(N, device="cuda"))

@km.benchmark(
    "my_add",
    args=make_args,  # builds fresh inputs for the run
    ref=torch.add,   # trusted implementation to compare with
    bytes_per_call=lambda x, y: 3 * x.numel() * x.element_size(),
)
def my_add(x, y):
    return x + y  # <- replace with your kernel
bytes_per_call is how much memory the algorithm has to move (here: read
x, read y, write the result). The tool divides it by measured time to get
your effective bandwidth.
2. Run it.
kernelmeter bench mybench.py
3. Read the result. From a T4, with the add written as a Triton kernel:
kernel median ms GB/s TFLOP/s bound %roof vs ref correct
------------------------------------------------------------------------------------------
my_add 3.2725 246.1 - mem 76.9% 1.03x PASS
- correct - your output matched the reference. If this says FAIL, nothing else on the line matters.
- bound - whether the memory system (mem) or the ALUs (comp) limit this kernel, decided by its arithmetic intensity (flops per byte).
- %roof - how close you are to the best this card could possibly do for that intensity. This is the score to improve. Above ~80% there is little left to win.
- vs ref - speedup over the reference implementation.
Pass flops_per_call too and the roofline model places your kernel
precisely; pass peak_tflops=... if your kernel runs on tensor cores so
it gets judged against the right ceiling (kernelmeter info prints the
derived fp16/tf32 tensor peaks for your card). Raw %peak bw and
%fp32 numbers are always in the --json output.
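The GB/s and %roof figures in the my_add row can be reproduced from bytes_per_call and the median time. This is a back-of-the-envelope sketch with a made-up helper name, not the tool's code. It assumes float32 inputs (the torch.randn default) and uses the 320.1 GB/s theoretical bandwidth from the info output. For a memory-bound kernel the roofline score reduces to achieved bandwidth over peak bandwidth.

```python
N = 1 << 26
ELEMENT_SIZE = 4  # bytes per float32 element
bytes_per_call = 3 * N * ELEMENT_SIZE  # read x, read y, write the result


def effective_bandwidth_gbs(bytes_moved, median_ms):
    # bytes / seconds, expressed in GB/s (1 GB = 1e9 bytes)
    return bytes_moved / (median_ms * 1e-3) / 1e9


bw = effective_bandwidth_gbs(bytes_per_call, 3.2725)
print(f"{bw:.1f} GB/s")                # 246.1 GB/s
print(f"{bw / 320.1:.1%} of the roof")  # 76.9% of the roof
```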
When NVML is available (it ships with the driver) a second table follows with what the card was doing during each measurement:
telemetry sm MHz mem MHz temp power %roof@clk
-----------------------------------------------------------------------
my_add 1062/1590 5000 42C 53.1W 76.9%
%roof@clk is the same roofline score, but against the ceiling at the
clocks the card actually held. If %roof looks bad but %roof@clk is
high, your kernel is fine: the card is thermal or power limited, and no
amount of kernel work will change that. A real example, cuBLAS fp32
matmul on a 70 W T4:
kernel median ms GB/s TFLOP/s bound %roof vs ref correct
----------------------------------------------------------------------------------------
fp32_matmul 32.0354 6.3 4.29 comp 52.7% - -
telemetry sm MHz mem MHz temp power %roof@clk
-----------------------------------------------------------------------
fp32_matmul 877/1590 5000 46C 70.4W 95.5%
53% of peak looks like a kernel problem. The telemetry shows it is not: the card hit its 70 W power limit and dropped to 877 MHz, and at those clocks the kernel was at 95.5% of what the silicon could deliver. cuBLAS was never the problem.
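The two scores for fp32_matmul can be approximated by hand, assuming the compute ceiling scales linearly with the SM clock. That scaling is an assumption about how %roof@clk is derived, but it reproduces the table to within rounding. The helper name is made up for this example.

```python
def ceiling_at_clock(peak_tflops, held_mhz, boost_mhz):
    # Scale the boost-clock peak down to the clock the card actually held.
    return peak_tflops * held_mhz / boost_mhz


achieved = 4.29      # TFLOP/s, from the bench table
peak_at_boost = 8.14  # TFLOP/s at 1590 MHz, from kernelmeter info

roof = achieved / peak_at_boost
roof_at_clk = achieved / ceiling_at_clock(peak_at_boost, held_mhz=877, boost_mhz=1590)

print(f"%roof     = {roof:.1%}")         # 52.7%
print(f"%roof@clk = {roof_at_clk:.3f}")  # about 0.955
```

The my_add row shows the other case: the memory clock stayed at 5000 MHz, so the bandwidth ceiling did not move and both scores read 76.9%.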
Timing uses CUDA events with warmup, and the L2 cache is flushed between
iterations so small workloads can't fake huge bandwidth numbers from
cache hits. Pass --no-flush-l2 if you want cache-hot numbers.
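For readers who want to see the timing technique itself, here is a minimal sketch of CUDA-event timing with warmup and an L2 flush, written with torch. It is an illustration under stated assumptions, not kernelmeter's harness: the function name, the default iteration counts and the 256 MiB scratch buffer size are all choices made for this example.

```python
import statistics


def time_kernel(fn, args, warmup=10, iters=50, flush_l2=True):
    """Median time of fn(*args) in milliseconds, measured with CUDA events."""
    import torch  # imported lazily so this file loads without torch

    # Overwriting a buffer larger than L2 evicts the kernel's inputs.
    # 256 MiB is an assumed size chosen to exceed the L2 of current GPUs.
    scratch = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="cuda")

    for _ in range(warmup):  # let clocks ramp and lazy compilation finish
        fn(*args)
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(iters):
        if flush_l2:
            scratch.zero_()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        end.synchronize()  # wait until the kernel has finished
        times_ms.append(start.elapsed_time(end))
    return statistics.median(times_ms)
```

The flush is issued before the start event is recorded, so its cost stays out of the measured interval.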
The examples folder has ready-to-run starting points: two Triton kernels (vector add, fused softmax) and a compute-bound matmul.
Seeing the roofline
kernelmeter roofline --ai 0.33 # mark a kernel at 0.33 flop/byte
peak bandwidth : 320.0 GB/s
peak compute : 8.14 TFLOP/s
ridge point : 25.4 flop/byte
8.14 TF/s | **x*****************
| ***
| ****
| ***
| ****
| ****
| ***
| ****
| ***
| *o**
| ****
|**
+----------------------------------------------------------
2^-3 2^0 2^3 2^6
at 0.33 flop/byte the kernel is memory-bound; attainable: 0.11 TFLOP/s
The o is your kernel, the x is the ridge point. Left of the ridge,
more FLOPs are free: the memory traffic is the bill you are paying
anyway. That is the whole argument for kernel fusion, in one picture.
No GPU around? --peak-bw and --peak-tflops let you draw any card.
--tensor swaps in the fp16 tensor-core roof, which moves the ridge
point far to the right; that picture explains why tensor-core kernels
are almost always memory-bound.
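The numbers printed with the chart follow from the standard roofline model: attainable throughput is the smaller of the compute peak and arithmetic intensity times peak bandwidth. The sketch below reproduces them. The function names are made up for this example and are not kernelmeter's API.

```python
def ridge_point(peak_tflops, peak_bw_gbs):
    # Intensity (flop/byte) at which the memory roof meets the compute roof.
    return peak_tflops * 1e12 / (peak_bw_gbs * 1e9)


def attainable_tflops(ai, peak_tflops, peak_bw_gbs):
    memory_roof = ai * peak_bw_gbs * 1e9 / 1e12  # TFLOP/s if bandwidth-limited
    return min(peak_tflops, memory_roof)


def bound(ai, peak_tflops, peak_bw_gbs):
    return "mem" if ai < ridge_point(peak_tflops, peak_bw_gbs) else "comp"


print(round(ridge_point(8.14, 320.0), 1))              # 25.4
print(round(attainable_tflops(0.33, 8.14, 320.0), 2))  # 0.11
print(bound(0.33, 8.14, 320.0))                        # mem
```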
Why is my occupancy low?
Feed it what ptxas -v or Nsight Compute tells you about your kernel:
kernelmeter occupancy --block 256 --regs 64 --smem 8192 --cc 8.6
occupancy for compute capability 8.6
block=256 regs/thread=64 smem/block=8192
occupancy : 66.7% (32/48 warps per SM)
blocks per SM: 4
limited by : registers
block size 64 128 192 256 384 512 768 1024
occupancy 46% 67% 62% 67% 50% 67% 50% 67%
It names the resource that is capping you and sweeps block sizes so you
can see if a different launch shape helps. Works with no GPU present:
pass --cc for any architecture from 7.0 (Volta) to 12.x (Blackwell).
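A simplified version of the calculator model fits in a few lines. The sketch below is an illustration, not the tool's implementation. The per-SM limits for compute capability 8.6 (48 warps, 16 blocks, 65536 registers, 100 KiB of shared memory with 1 KiB reserved per block, registers allocated per warp in units of 256) are assumptions taken from NVIDIA's published occupancy tables, and finer allocation rules are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Assumed per-SM limits for compute capability 8.6.
    max_warps: int = 48
    max_blocks: int = 16
    regs_per_sm: int = 65536
    smem_per_sm: int = 100 * 1024
    smem_reserved_per_block: int = 1024
    reg_alloc_unit: int = 256
    warp_size: int = 32


def occupancy(block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Return (occupancy fraction, blocks per SM, limiting resource)."""
    warps_per_block = -(-block // sm.warp_size)  # ceiling division
    regs_per_warp = max(regs_per_thread, 1) * sm.warp_size
    # Round the per-warp register count up to the allocation unit.
    regs_per_warp = -(-regs_per_warp // sm.reg_alloc_unit) * sm.reg_alloc_unit
    blocks_allowed = {
        "warps": sm.max_warps // warps_per_block,
        "blocks": sm.max_blocks,
        "registers": (sm.regs_per_sm // regs_per_warp) // warps_per_block,
        "shared memory": sm.smem_per_sm
        // (smem_per_block + sm.smem_reserved_per_block),
    }
    limiter = min(blocks_allowed, key=blocks_allowed.get)
    blocks = blocks_allowed[limiter]
    return blocks * warps_per_block / sm.max_warps, blocks, limiter


occ, blocks, limiter = occupancy(block=256, regs_per_thread=64, smem_per_block=8192)
print(f"{occ:.1%}, {blocks} blocks per SM, limited by {limiter}")
# 66.7%, 4 blocks per SM, limited by registers
```

With these assumed limits the model also gives the 46% shown for a block size of 64, where shared memory rather than registers becomes the cap.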
What can the card really do?
Theoretical peaks assume the max boost clock, which the card cannot hold. Measure the real ceilings once and judge your kernels against those:
kernelmeter ceiling
This runs the four STREAM kernels (copy, scale, add, triad) and a large TF32-disabled matmul. On the same T4:
test median ms GB/s TFLOP/s % of theoretical
---------------------------------------------------------------
copy 1.1495 233.5 - 73.0%
scale 1.1674 230.0 - 71.8%
add 1.6903 238.2 - 74.4%
triad 1.6878 238.6 - 74.5%
fp32 matmul 3.5563 - 4.83 59.3%
measured bandwidth ceiling: 238.6 GB/s (use this as the honest 100%
for memory-bound kernels)
This reframes the bench results above: the vector add that scored 76.9% of the roofline was moving 246.1 GB/s on a card whose memory system tops out at 238.6 GB/s in practice. It was already saturated. Without the measured ceiling you would have kept optimizing a finished kernel.
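Re-scoring a kernel against the measured ceiling is a single division. The sketch below uses the STREAM figures from the table above and the 246.1 GB/s from the my_add bench row. Taking the ceiling as the best of the four STREAM results is an assumption, though it is consistent with the reported 238.6 GB/s, which is the triad figure. The helper names are made up for this example.

```python
def measured_ceiling_gbs(stream_results):
    # Assumed rule: the best STREAM result is the practical ceiling.
    return max(stream_results.values())


def score_against(achieved_gbs, ceiling_gbs):
    return achieved_gbs / ceiling_gbs


stream = {"copy": 233.5, "scale": 230.0, "add": 238.2, "triad": 238.6}
ceiling = measured_ceiling_gbs(stream)

print(ceiling)                                 # 238.6
print(f"{score_against(246.1, ceiling):.1%}")  # 103.1%
```

A score at or slightly above 100% means the kernel is as fast as anything measured on this card, so further tuning is unlikely to pay off.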
Catching regressions
kernelmeter bench mykernels.py --save baseline.json
# ...edit your kernels...
kernelmeter bench mykernels.py --compare baseline.json
The compare run prints a delta column per kernel and exits non-zero if anything got more than 5% slower, so it slots straight into CI.
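The comparison rule itself is simple. The sketch below shows one way to apply a 5% slowdown threshold to two sets of median timings. The dictionary layout (kernel name mapped to median milliseconds) and the function name are hypothetical and do not describe the format of baseline.json.

```python
def find_regressions(baseline_ms, current_ms, threshold=0.05):
    """Return {kernel: relative slowdown} for kernels slower than threshold."""
    regressions = {}
    for name, old in baseline_ms.items():
        new = current_ms.get(name)
        if new is None:
            continue  # kernel missing from the new run; nothing to compare
        slowdown = (new - old) / old
        if slowdown > threshold:
            regressions[name] = slowdown
    return regressions


baseline = {"my_add": 3.27, "fp32_matmul": 32.0}
current = {"my_add": 3.30, "fp32_matmul": 35.2}

slower = find_regressions(baseline, current)
print(sorted(slower))  # ['fp32_matmul']
# In CI, a script would end with: sys.exit(1 if slower else 0)
```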
A workflow that works
If you are learning CUDA (say, working through the PMPP book) and wondering whether your kernels are any good:
- Run kernelmeter info and kernelmeter ceiling once. Now you know your card's real limits.
- Benchmark your kernel with bytes_per_call and flops_per_call set. The bound column tells you which resource you are fighting.
- %roof under ~60%? If the kernel is memory-bound, check occupancy first: too few warps in flight cannot hide memory latency. Then open Nsight Compute. Now you know what you are looking for, instead of staring at forty unfamiliar counters.
- %roof above ~80%? Stop optimizing this kernel. The next win is algorithmic (fuse it with a neighbor, move less data), and the roofline chart shows why: left of the ridge, FLOPs are free.
Caveats
- Theoretical peaks are computed from the max boost clock the driver reports. Sustained clocks under load are lower; the telemetry table and kernelmeter ceiling both show what you can actually reach.
- The tensor-core peaks are dense rates with fp16 accumulate. GeForce cards run tensor cores at half rate when accumulating in fp32, and sparse rates are double; pass peak_tflops=... when those apply.
- The occupancy command implements the standard calculator model. Real occupancy can differ (launch bounds, driver decisions); confirm with Nsight Compute when it matters.
- Attribute names above id 143 are best-effort against the CUDA 12.x headers. Values are always read live from your driver. PRs that extend the name table are welcome.
Development
pip install -e ".[dev]"
pytest
The tests fake the driver, so they run anywhere, no GPU needed. CI runs
them on plain GitHub runners. For an end-to-end check on a real GPU there
is a Modal script: modal run scripts/modal_gpu_test.py.
The numbers in this README come from that script on a T4.
Releases are tag-driven: bump the version in pyproject.toml, add a
CHANGELOG.md entry, push a v* tag. CI tests, builds and
publishes to PyPI through trusted publishing.
License
MIT