GPU kernel benchmarking utilities

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Environment
- GPU :: NVIDIA CUDA :: 13
Operating System
- POSIX :: Linux
Programming Language
- C++
- Python :: 3

Project description

PyGPUBench

Utilities for benchmarking low-latency CUDA kernels in an adversarial setting. Unlike many existing benchmarking tools, which generally assume a cooperative kernel that can be tested and benchmarked independently, this library tries to defend against kernels that exploit benchmarking flaws to obtain better scores.

Usage

To benchmark a kernel, two ingredients are needed:

- The qualified name of the kernel function. The testing script itself must not import the kernel function, because importing it would execute untrusted code.
- A function that generates test/benchmark inputs. It takes the configuration parameters as keyword arguments, plus the reserved keyword argument seed, which randomizes the problem. It returns two tuples: the first contains the inputs for the kernel and is used to call the kernel function; the second contains the expected output and the required absolute and relative tolerance.

import torch
import pygpubench

def generate_input(**kwargs):
    ...

def reference_kernel(args):
    ...

def generate_test_case(*, seed, **kwargs):
    x, y = generate_input(**kwargs, seed=seed)
    expected = torch.empty_like(y)
    reference_kernel((expected, x))
    return (y, x), (expected, 1e-6, 1e-6)


res = pygpubench.do_bench_isolated("submission.kernel", generate_test_case, {"size": 1024}, 100, 5, discard=True)
print("❌" if res.errors else "✅", pygpubench.basic_stats(res.time_us))

For the full example see grayscale.py
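
To make the role of the two tolerances concrete, here is a minimal pure-Python sketch of an elementwise acceptance test. It assumes the convention used by torch.allclose, |actual - expected| <= atol + rtol * |expected|; the rule PyGPUBench actually applies is not stated here, and the helper name within_tolerance is hypothetical.

```python
import math


def within_tolerance(actual, expected, atol, rtol):
    """Return True if every element satisfies |a - e| <= atol + rtol * |e|.

    Assumption: this mirrors the torch.allclose convention; it is not
    taken from PyGPUBench itself.
    """
    if len(actual) != len(expected):
        return False
    return all(
        # Non-finite outputs (NaN, inf) are rejected outright.
        math.isfinite(a) and abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


expected = [1.0, 2.0, 1000.0]
close = [1.0000005, 2.000001, 1000.0005]  # small errors that scale with magnitude
far = [1.0, 2.0, 1000.1]                  # last element is off by 0.1

close_ok = within_tolerance(close, expected, atol=1e-6, rtol=1e-6)
far_ok = within_tolerance(far, expected, atol=1e-6, rtol=1e-6)
print(close_ok, far_ok)
```

The relative term lets larger values carry proportionally larger errors, while the absolute term keeps the test meaningful for values near zero.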

Implementation

Unfortunately, any benchmarking tool written in Python is inherently vulnerable to monkeypatching and inspect-based manipulation of its variables by its callees. Therefore, PyGPUBench implements its main benchmarking logic in a compiled C++ extension. This still leaves vulnerabilities (the code is running in the same address space, after all), but it makes attacks require much more sophistication. Running the benchmarking logic in a separate process from the kernel fundamentally clashes with the desire to benchmark very short kernels: CUDA events must be recorded in the same process as the kernel. Fortunately, we can assume that a reward-hacking LLM is still rather unlikely to produce a compiled extension that runs sophisticated low-level exploits.
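
The kind of attack described above can be illustrated with a deliberately naive pure-Python timer. This is a hypothetical sketch, not PyGPUBench code: naive_bench, honest_kernel and cheating_kernel are invented names, and the "kernels" are plain CPU functions.

```python
import inspect
import time


def naive_bench(kernel):
    """A naive harness: its clock and its locals are reachable by the callee."""
    start = time.perf_counter()
    kernel()
    return time.perf_counter() - start


def honest_kernel():
    sum(range(100_000))


def cheating_kernel():
    sum(range(100_000))
    # inspect-based manipulation: read the harness's local variable `start`
    # from the caller's stack frame.
    start = inspect.currentframe().f_back.f_locals["start"]
    # Monkeypatching: replace the clock so the harness's next reading
    # equals `start`, making the measured duration exactly zero.
    time.perf_counter = lambda: start


real_clock = time.perf_counter
try:
    honest_time = naive_bench(honest_kernel)
    cheated_time = naive_bench(cheating_kernel)
finally:
    time.perf_counter = real_clock  # undo the patch for the rest of the program

print(f"honest: {honest_time:.6f}s, cheating: {cheated_time:.6f}s")
```

Nothing in the harness can prevent this as long as the timing logic and its state live in Python objects that the callee can reach.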

Note that as soon as any user code is executed, the entire Python runtime becomes untrustworthy. Consequently, benchmark results are not returned to Python but are instead written to a file. The name of this file is passed as an argument to the benchmarking function, and the file is unlinked before the user code is called, making it impossible to reopen this file. The do_bench_isolated function is designed to streamline this process: it automates creating the temporary file, spawning a new Python process to handle benchmarking, and reading the results back into the original, untainted Python process.
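
The unlink step relies on standard POSIX behavior: a file that has been unlinked stays usable through descriptors that were already open, but it can no longer be opened by its path. The following sketch shows only that behavior, using the Python standard library on Linux; the file prefix and the payload are made up, and how PyGPUBench manages its descriptor internally is not shown.

```python
import os
import tempfile

# Create a results file and keep the open descriptor.
fd, path = tempfile.mkstemp(prefix="bench-results-")

# Remove the directory entry before any untrusted code would run.
os.unlink(path)
path_exists = os.path.exists(path)

# Code that only knows the path can no longer open the file.
try:
    open(path, "rb").close()
    reopen_failed = False
except FileNotFoundError:
    reopen_failed = True

# The holder of the descriptor can still write and read the data.
os.write(fd, b"12.5\n13.1\n")
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 1024)
os.close(fd)

print(path_exists, reopen_failed, data)
```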

Thus, the library provides two main benchmarking interfaces: do_bench_impl runs the benchmark directly in the current process, while do_bench_isolated runs it in a separate process and automatically handles I/O through a temporary file.
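
The isolated pattern can be sketched with the standard library alone. The example below is an assumption-laden illustration rather than the library's implementation: run_isolated and CHILD_CODE are hypothetical names, the child writes fixed fake timings instead of benchmarking anything, and the results travel through an inherited file descriptor of an already-unlinked temporary file.

```python
import json
import subprocess
import sys
import tempfile

# Stand-in for the benchmarking process: it receives a descriptor number
# and writes its (fake) timings there instead of returning them.
CHILD_CODE = """
import json, os, sys
fd = int(sys.argv[1])
timings_us = [1.5, 2.5, 3.5]
os.write(fd, json.dumps(timings_us).encode())
"""


def run_isolated():
    # On POSIX, TemporaryFile has no visible name; buffering=0 gives a raw
    # file object, so seek/read act directly on the shared descriptor.
    with tempfile.TemporaryFile(buffering=0) as results:
        fd = results.fileno()
        subprocess.run(
            [sys.executable, "-c", CHILD_CODE, str(fd)],
            pass_fds=(fd,),  # keep this descriptor open in the child
            check=True,
        )
        results.seek(0)  # the child advanced the shared file offset
        return json.loads(results.read())


timings = run_isolated()
print(timings)
```

The parent process never imports or runs the untrusted code, so the values it reads back were parsed by an interpreter the submission had no chance to tamper with.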

As an additional measure against benchmark cheating, benchmark inputs are generated before any benchmark is run and then moved to a GPU memory location unknown to torch (allocated directly with cudaMalloc in C++). Only immediately before the actual kernel is launched do we copy the inputs back to their original locations. Problematically, this copy would place the inputs in L2 cache, which we want to avoid. Therefore, between the copy and the kernel launch, another kernel has to clear the L2 cache, which opens a window of opportunity for cheating. To minimize the duration of this vulnerability, we put a small fraction of random canaries into the input data; that is, a subset of memory locations contains wrong data. Only after the L2 clearing do we fix up these values. This pulls them into L2 cache, but since they make up less than 1% of the total data, we consider this an acceptable tradeoff.
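
The canary idea can be simulated on the CPU with a plain list standing in for the GPU input buffer. This is only a sketch under stated assumptions: plant_canaries and restore_canaries are hypothetical helpers, the 0.5% fraction is an arbitrary value below the 1% bound mentioned above, and "wrong data" is modeled as the original value plus one.

```python
import random


def plant_canaries(data, fraction, rng):
    """Overwrite a random subset of `data` with wrong values.

    Returns a mapping from index to the original value so the
    corruption can be undone later.
    """
    count = max(1, int(len(data) * fraction))
    indices = rng.sample(range(len(data)), count)
    saved = {i: data[i] for i in indices}
    for i in indices:
        data[i] = saved[i] + 1.0  # any value different from the original
    return saved


def restore_canaries(data, saved):
    """Fix up the canary locations; this is the step done after L2 clearing."""
    for i, value in saved.items():
        data[i] = value


rng = random.Random(0)
original = [float(i) for i in range(10_000)]
buffer = list(original)

saved = plant_canaries(buffer, fraction=0.005, rng=rng)
# A computation started during the vulnerable window would see these errors.
corrupted = sum(a != b for a, b in zip(buffer, original))

restore_canaries(buffer, saved)
print(corrupted, buffer == original)
```

A kernel that precomputes its result while the canaries are still in place works on wrong inputs, so its output should fail the later correctness check.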

Similarly, once the kernel has finished, we directly launch the testing kernel using a programmatic dependent launch, again to minimize the window of opportunity for cheating by writing results from a different stream. This could have a small effect on performance, because during the tail of the user kernel, blocks of the test kernel are already placed on the SMs and generate memory traffic. In the checking kernel, the order in which blocks are checked is randomized, so that writing only the later blocks of the result from an unsynchronized stream is not a viable strategy.
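
The randomized checking order can be sketched sequentially in Python. This is a CPU stand-in for what the text describes as a GPU checking kernel: check_in_random_order is a hypothetical helper, blocks are visited one after another rather than in parallel, and exact equality replaces the tolerance-based comparison.

```python
import random


def check_in_random_order(result, expected, block_size, rng):
    """Compare `result` to `expected` block by block in a shuffled order.

    Returns the visiting order and whether every block matched.
    """
    if len(result) != len(expected):
        return [], False
    num_blocks = (len(expected) + block_size - 1) // block_size
    order = list(range(num_blocks))
    rng.shuffle(order)  # the submission cannot predict which block comes first
    for block in order:
        lo = block * block_size
        hi = min(lo + block_size, len(expected))
        if result[lo:hi] != expected[lo:hi]:
            return order, False
    return order, True


expected = list(range(1000))
complete = list(expected)
# A result whose last block has not been written yet.
incomplete = expected[:900] + [0] * 100

order, complete_ok = check_in_random_order(complete, expected, 100, random.Random(1))
_, incomplete_ok = check_in_random_order(incomplete, expected, 100, random.Random(1))
print(order, complete_ok, incomplete_ok)
```

With a fixed front-to-back order, a cheater could rely on the last blocks being inspected last; with a shuffled order, any unfinished block may be the first one examined.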

Release history

- 0.0.3 (this version): Apr 30, 2026
- 0.0.2: Feb 21, 2026
- 0.0.1: Feb 21, 2026

Download files

Source Distribution

pygpubench-0.0.3.tar.gz (15.8 MB view details)

Uploaded Apr 30, 2026 Source

Built Distribution

pygpubench-0.0.3-cp312-abi3-manylinux_2_38_x86_64.whl (2.3 MB view details)

Uploaded Apr 30, 2026 CPython 3.12+manylinux: glibc 2.38+ x86-64

File details

Details for the file pygpubench-0.0.3.tar.gz.

File metadata

Download URL: pygpubench-0.0.3.tar.gz
Upload date: Apr 30, 2026
Size: 15.8 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for pygpubench-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`7fe4a53bf4079e7a6821866052eb9270b3c682a602e1f7db70de8b107a39a9c5`
MD5	`ecfc46c0eb74bfcfd097c1a588bc5e7d`
BLAKE2b-256	`5882c5483b665e22faf59eeed81911e38cabbef59f3b4b7f12357a37cabd109a`

File details

Details for the file pygpubench-0.0.3-cp312-abi3-manylinux_2_38_x86_64.whl.

File metadata

Download URL: pygpubench-0.0.3-cp312-abi3-manylinux_2_38_x86_64.whl
Upload date: Apr 30, 2026
Size: 2.3 MB
Tags: CPython 3.12+, manylinux: glibc 2.38+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for pygpubench-0.0.3-cp312-abi3-manylinux_2_38_x86_64.whl
Algorithm	Hash digest
SHA256	`d7cd396182f7cc62346352bd4be129520fd72516aad34e045243b2749690b120`
MD5	`c55d8cf3f39affb9bc04908f427eae41`
BLAKE2b-256	`314111ce06abf99087f7c4626872172a0b6e719c92cbacfc3ff92bf056ababf3`
