GPU kernel benchmarking utilities

PyGPUBench

Utilities for benchmarking low-latency CUDA kernels in an adversarial setting. Unlike many existing benchmarking tools, which generally assume a cooperative kernel that can be tested and benchmarked independently, this library tries to defend against kernels that exploit benchmarking flaws to receive higher scores.

Usage

To benchmark a kernel, two ingredients are needed:

  1. The qualified name of the kernel function. It is important that the testing script itself does not import the kernel function, as this implies executing untrusted code.
  2. A function that generates test/benchmark inputs. This function takes the configuration parameters as keyword arguments, plus the reserved keyword argument seed, which randomizes the problem. It returns two tuples: the first contains the inputs for the kernel and is used to call the kernel function; the second contains the expected output followed by the required absolute and relative tolerances.
import torch
import pygpubench

def generate_input(**kwargs):
    ...

def reference_kernel(args):
    ...

def generate_test_case(*, seed, **kwargs):
    x, y = generate_input(**kwargs, seed=seed)
    expected = torch.empty_like(y)
    reference_kernel((expected, x))
    return (y, x), (expected, 1e-6, 1e-6)


res = pygpubench.do_bench_isolated("submission.kernel", generate_test_case, {"size": 1024}, 100, 5, discard=True)
print("❌" if res.errors else "✅", pygpubench.basic_stats(res.time_us))

For the full example, see grayscale.py.
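How the two tolerances in the second tuple are applied is not spelled out above. The following is a minimal sketch, assuming the common convention |actual - expected| <= atol + rtol * |expected| (the one used by torch.allclose). The helper within_tolerance is hypothetical and not part of pygpubench; the library's actual check may differ.

```python
def within_tolerance(actual, expected, atol, rtol):
    """Elementwise closeness test (ASSUMED convention, as in torch.allclose).

    An element passes if |a - e| <= atol + rtol * |e|.
    """
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


if __name__ == "__main__":
    expected = [1.0, 2.0]
    print(within_tolerance([1.0, 2.0000001], expected, 1e-6, 1e-6))
    print(within_tolerance([1.0, 2.1], expected, 1e-6, 1e-6))
```

With the tolerances from the usage example (1e-6 each), a deviation of 1e-7 passes while a deviation of 0.1 does not.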

Implementation

Unfortunately, any benchmarking tool written in Python is inherently vulnerable to monkeypatching and inspect-based manipulation of its variables by its callees. Therefore, PyGPUBench implements its main benchmarking logic in a compiled C++ extension. This still leaves vulnerabilities, since the code runs in the same address space as the kernel, but it makes attacks require much more sophistication. Moving the measurement into a separate process is not an option, because it fundamentally clashes with the goal of benchmarking very short kernels: CUDA events must be recorded in the same process as the kernel. Fortunately, we can assume that a reward-hacking LLM is still rather unlikely to produce a compiled extension that runs sophisticated low-level exploits.
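To illustrate why a pure-Python harness cannot be trusted, here is a small sketch of both attack styles against a deliberately naive timer. All names (naive_bench, honest_kernel, cheating_kernel, demo) are made up for this illustration and have nothing to do with PyGPUBench's real code.

```python
import inspect
import time


def naive_bench(fn):
    """A naive pure-Python harness: wall-clock time around a single call."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def honest_kernel():
    time.sleep(0.005)  # stands in for real work


def cheating_kernel():
    # inspect-based attack: read the harness's local variable `start`
    # from the caller's stack frame.
    caller = inspect.currentframe().f_back
    start = caller.f_locals["start"]
    # Monkeypatch the timer so the harness's second reading equals `start`.
    time.perf_counter = lambda: start
    time.sleep(0.005)  # the work still takes just as long


def demo():
    original = time.perf_counter
    try:
        honest = naive_bench(honest_kernel)
        cheated = naive_bench(cheating_kernel)
    finally:
        time.perf_counter = original  # undo the patch
    return honest, cheated


if __name__ == "__main__":
    honest, cheated = demo()
    print(f"honest: {honest:.6f} s, cheated: {cheated:.6f} s")
```

The cheating callee reports an elapsed time of exactly zero although it sleeps just as long as the honest one. A timer and result buffer that live in compiled code are not reachable this easily.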

Note that, as soon as any user code is executed, the entire Python runtime becomes untrustworthy. Consequently, benchmark results are not returned to Python but written to a file. The name of this file is passed as an argument to the benchmarking function, and the file is unlinked before the user code is called, making it impossible to reopen this file. The do_bench_isolated function is designed to streamline this process: it creates the temporary file, spawns a new Python process to handle benchmarking, and reads the results back into the original, untainted Python process.
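The unlink-before-use idea can be sketched with plain POSIX file descriptors. This is only an illustration under assumptions: both roles run in one process for brevity, the result format "time_us=..." is invented, and benchmark_side, untrusted_kernel and run_isolated_sketch are hypothetical names, not PyGPUBench functions.

```python
import os
import tempfile


def benchmark_side(path, untrusted):
    """Stands in for the compiled benchmark: open the file, then unlink it."""
    out = os.open(path, os.O_WRONLY)
    os.unlink(path)  # the name is gone; only already-open descriptors remain
    try:
        untrusted(path)  # user code runs only after the unlink
        os.write(out, b"time_us=12.5\n")  # placeholder for the real result
    finally:
        os.close(out)


def untrusted_kernel(path):
    # Tries to forge a result via the file name. This creates a NEW,
    # unrelated file, because the original one no longer has a name.
    with open(path, "w") as f:
        f.write("time_us=0.0\n")


def run_isolated_sketch():
    fd, path = tempfile.mkstemp()  # trusted side keeps its own descriptor
    try:
        benchmark_side(path, untrusted_kernel)
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 4096).decode()
    finally:
        os.close(fd)
        if os.path.exists(path):  # clean up the forged file
            os.unlink(path)


if __name__ == "__main__":
    print(run_isolated_sketch())
```

The trusted side reads through a descriptor it opened before the unlink, so it sees only what the benchmark wrote; the forged file written by name is a different inode and is never read. This relies on POSIX unlink semantics.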

Thus, the library provides two main benchmarking interfaces: do_bench_impl runs the benchmark directly in the current process, while do_bench_isolated runs it in a separate process and automatically handles I/O through a temporary file.

Additional measures mitigate benchmark cheating. Benchmark inputs are generated before any benchmark is run, but are then moved to a GPU memory location unknown to torch (allocated directly with cudaMalloc in C++). Only immediately before the actual kernel is launched do we copy the inputs back to their original locations. Problematically, this copy would put the inputs into L2 cache, which we want to avoid. Between the copy and the kernel launch, there therefore has to be another kernel that clears the L2 cache, which opens a window of opportunity for cheating. To minimize the duration of this vulnerability, we place a small fraction of random canaries in the input data; that is, a subset of memory locations contains wrong data. Only after the L2 clearing do we fix up these values. This pulls them into L2 cache, but since they make up less than 1% of the total data, we consider this an acceptable tradeoff.
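The canary mechanism can be modelled on the host with a plain list. This is a toy sketch, not the library's CUDA implementation: plant_canaries and fix_canaries are hypothetical helpers, and corrupting a value by adding 1.0 is an arbitrary choice for the illustration.

```python
import random


def plant_canaries(data, fraction, rng):
    """Overwrite a random subset of `data` with wrong values.

    Returns a dict {index: original_value} needed to repair the data later.
    """
    count = max(1, int(len(data) * fraction))
    indices = rng.sample(range(len(data)), count)
    saved = {i: data[i] for i in indices}
    for i in indices:
        data[i] = data[i] + 1.0  # any value that differs from the original
    return saved


def fix_canaries(data, saved):
    """Restore the original values just before the real kernel launch."""
    for i, value in saved.items():
        data[i] = value


if __name__ == "__main__":
    rng = random.Random(0)
    data = [float(i) for i in range(1000)]
    expected = sum(data)

    saved = plant_canaries(data, 0.005, rng)  # 5 of 1000 values, under 1%
    early = sum(data)   # a cheater reading inputs during the L2-clear window
    fix_canaries(data, saved)
    late = sum(data)    # a kernel that runs when it is supposed to

    print(early == expected, late == expected)
```

A result computed from the inputs while the canaries are still in place is wrong, so reading the data early during the cache-clearing window does not pay off.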

Similarly, after the kernel has finished, we directly launch the testing kernel with a programmatic dependent launch, again to minimize the window of opportunity for cheating by writing results from a different stream. This could have a small effect on performance, because during the tail of the user kernel, blocks of the test kernel are already placed on the SMs and generate memory traffic. In the checking kernel, the order in which blocks are checked is randomized, so that it is not a viable strategy to write only the later blocks of the result from an unsynchronized stream.
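Why a randomized check order helps can be seen in a toy timing model. The model is an assumption made purely for illustration: the checker inspects one block per tick, and a cheating background stream has written one block before the check starts and writes one more block per tick, in index order. The function passes_check is hypothetical.

```python
import random


def passes_check(order, written_at_start):
    """Toy model of a race between a checker and a late writer.

    At tick t the checker inspects block order[t]; by then the cheater has
    written blocks [0, written_at_start + t). The check fails as soon as an
    unwritten block is inspected.
    """
    for tick, block in enumerate(order):
        if block >= written_at_start + tick:
            return False
    return True


if __name__ == "__main__":
    n_blocks = 64
    sequential = list(range(n_blocks))
    shuffled = sequential[:]
    random.Random(0).shuffle(shuffled)

    # In-order checking: the late writer always stays one block ahead.
    print(passes_check(sequential, written_at_start=1))
    # Randomized checking: an unwritten block is hit almost immediately.
    print(passes_check(shuffled, written_at_start=1))
```

In this model the late writer survives an in-order check, but it survives a shuffled check only if the permutation happens to be the identity.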

Download files

Source distribution: pygpubench-0.0.3.tar.gz (15.8 MB)

Built distribution: pygpubench-0.0.3-cp312-abi3-manylinux_2_38_x86_64.whl (2.3 MB), for CPython 3.12+ on manylinux (glibc 2.38+), x86-64
