Python client for gpuemu GPU-less validation
gpuemu
Catch silently-wrong GPU kernels from Python — before they reach production.
This package is the Python client for gpuemu,
a GPU-less correctness oracle for deep-learning kernels. It plugs into PyTorch, JAX, and
TensorFlow and validates your CUDA/Triton kernels against a high-precision fp64 reference
using op-schema-aware, adversarial inputs, finding the silent numerical bugs that
torch.allclose misses.
The problem
The industry-standard correctness check for a GPU kernel is one line:
torch.allclose(my_kernel(x), reference(x), atol=1e-5, rtol=1e-2)
One shape, one dtype, one seed. In a measured 26-op corpus that oracle accepts 9/9 LLM-style buggy kernels — tail-mask leaks, accumulator-scale bugs, missing normalisation, online-softmax rescale errors — as "correct". Those kernels then ship and run at scale: GPU-hours wasted on broken work, quality regressions that survive months of green CI.
gpuemu replaces that one-line check with an operator-aware regime that caught
100% of those bugs across 5 GPU classes with zero false positives on controls (P1).
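To make the failure mode concrete, here is a toy sketch in pure Python of how a single tile-aligned shape can hide a tail-mask bug. It is not gpuemu code: `buggy_blocked_sum`, `reference_sum`, `check`, and `BLOCK` are hypothetical names invented for this illustration, and `math.isclose` stands in for `torch.allclose` (its tolerance formula differs slightly).

```python
import math

BLOCK = 4  # toy tile width; an assumption chosen for illustration


def reference_sum(xs):
    """High-precision reference: exactly rounded float64 sum."""
    return math.fsum(xs)


def buggy_blocked_sum(xs):
    """Sums full tiles only and silently drops the ragged tail."""
    full = len(xs) - len(xs) % BLOCK
    return math.fsum(xs[:full])


def check(n):
    """Compare the buggy kernel to the reference on one input length."""
    xs = [float(i + 1) for i in range(n)]
    return math.isclose(
        buggy_blocked_sum(xs), reference_sum(xs), rel_tol=1e-2, abs_tol=1e-5
    )


# One shape: a tile-aligned length makes the buggy kernel look correct.
single_check_passes = check(16)

# Sweeping lengths, including ragged ones, exposes the dropped tail.
failing_lengths = [n for n in range(1, 21) if not check(n)]

print(single_check_passes)
print(failing_lengths)
```

The single aligned check passes, while the sweep flags every length that is not a multiple of `BLOCK`. Varying shapes per operator is the idea behind the schema-aware regime described above, although the real tool's shape selection is not shown here.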
Install
pip install gpuemu # core client
pip install "gpuemu[torch]" # + PyTorch adapter
pip install "gpuemu[jax]" # + JAX adapter
pip install "gpuemu[tensorflow]" # + TensorFlow adapter
pip install "gpuemu[all]" # everything
The client talks to the gpuemu daemon over IPC and will start one on demand. To run the
daemon yourself, install the CLI: cargo install gpuemu.
Quick start
from gpuemu import Client

client = Client()

# Fuzz with op-schema-aware inputs and an fp64 reference oracle.
# my_flash_attn is your own kernel wrapper.
results = client.fuzz_op_client_side(
    "flash_attention",
    run_op=lambda inputs: my_flash_attn(inputs["q"], inputs["k"], inputs["v"]),
    iterations=100,
    value_distribution="adversarial",  # the P3 default: 99% bug recall
)
print(f"Passed: {results.passed}/{results.total}")
A failure reports the seed, dtype, shape, and a base64 snapshot of the failing input —
re-run it byte-for-byte from any machine, with or without a GPU. The client's SeededRng is
bit-identical to the Rust daemon, so reproduction is exact across languages.
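The reproduction story rests on two ideas: inputs derived purely from a seed, and a byte-exact snapshot of the failing input. The sketch below illustrates both with the standard library only. It does not use gpuemu's `SeededRng` or its snapshot format; Python's `random.Random` is not bit-identical to the Rust daemon, and `generate_inputs`, `snapshot`, and `restore` are hypothetical helpers.

```python
import base64
import random
import struct


def generate_inputs(seed, n):
    """Derive a test input purely from the seed (hypothetical generator)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def snapshot(values):
    """Pack values as little-endian float64 and base64-encode the bytes."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def restore(encoded):
    """Decode a snapshot back into the original float64 values."""
    raw = base64.b64decode(encoded)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


original = generate_inputs(seed=1234, n=8)
regenerated = generate_inputs(seed=1234, n=8)  # same seed, same input
roundtrip = restore(snapshot(original))        # byte-exact round trip

print(original == regenerated, roundtrip == original)
```

Because the snapshot stores the raw float64 bytes, the restored input equals the original exactly rather than approximately, which is what makes a failure replayable on a machine without a GPU.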
Execution modes
import torch

# 1. Client-side (recommended): your code runs the GPU op; gpuemu validates.
results = client.fuzz_op_client_side(
    "matmul",
    run_op=lambda i: torch.matmul(i["a"], i["b"]),
    iterations=100,
)

# 2. Daemon-orchestrated: fetch cases, run them yourself, submit outputs.
for case in client.get_test_batch("my_op", count=50):
    out = my_gpu_op(case["inputs"])
    client.submit_output("my_op", case["inputs"], out, case["seed"])

# 3. Reproduce / minimise a known failure from its seed
#    (the seed comes from a failure report).
repro = client.reproduce(seed)
small = client.minimize(seed)
What you get
| Feature | What it does |
|---|---|
| fp64 reference oracle | Validates kernel output against a high-precision CPU reference per dtype |
| Op-schema-aware fuzzing | Boundary + regular + adversarial input distributions, per op |
| Calibrated tolerances | calibrate_tolerance() / get_recommended_tolerance() — p95-of-controls × 1.5 envelope (P2: 65% → 82% recall) |
| Deterministic RNG | SeededRng reproduces failures byte-for-byte, identical to the Rust daemon |
| Framework adapters | PyTorch, JAX, TensorFlow — from gpuemu.frameworks.pytorch import validate_pytorch |
| Static lint | client.lint_kernel(...) surfaces PTX/SASS register pressure and spills |
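The "p95-of-controls × 1.5" envelope in the table can be read as: measure the error of known-good control kernels against the reference, take the 95th percentile, and multiply by a safety factor of 1.5. The following is a minimal sketch of that arithmetic, not the implementation behind `calibrate_tolerance()`. The nearest-rank percentile method, the helper names, and the sample error values are all assumptions made for illustration.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q an integer in 1..100."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # integer ceil(q * n / 100)
    return ordered[max(rank, 1) - 1]


def calibrated_tolerance(control_errors, q=95, safety=1.5):
    """Tolerance envelope: q-th percentile of control errors times a safety factor."""
    return percentile(control_errors, q) * safety


# Made-up absolute errors of 20 correct control kernels: 1e-6, 2e-6, ..., 2e-5.
control_errors = [i * 1e-6 for i in range(1, 21)]

tol = calibrated_tolerance(control_errors)  # 19e-6 * 1.5, about 2.85e-5
print(tol)

# A kernel under test is accepted only if its error fits inside the envelope.
accepted = 2.5e-5 <= tol
rejected = 4.0e-5 <= tol
print(accepted, rejected)
```

Deriving the tolerance from controls keeps it tight enough to flag a buggy kernel's larger error while still admitting the rounding noise that correct kernels exhibit.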
The research backing (P1–P4)
Each default is anchored to a measured study — fp64 oracle (P1: 9/9 bugs caught, 0 false positives), calibrated tolerances (P2: +23 pp recall), adversarial fuzzing (P3: 99% recall), and PTX lint (P4). See The Evidence.
Documentation
- Quick start: 5-minute first validation
- Project docs: docs.skelfresearch.com/gpuemu
- Source & issues: github.com/Skelf-Research/gpuemu
Development
pip install -e ".[dev]"
pytest -v # 11 tests, +7 daemon-live tests
License
Dual-licensed under MIT or Apache 2.0 at your option.