PyGPUkit — Lightweight GPU Runtime for Python
A minimal, modular GPU runtime with Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.
Overview
PyGPUkit is a lightweight GPU runtime for Python that provides:
- Single-binary distribution — works with just GPU drivers, no CUDA Toolkit needed
- Rust-powered scheduler with admission control, QoS, and resource partitioning
- NVRTC JIT (optional) for custom kernel compilation
- A NumPy-like GPUArray type
- Kubernetes-inspired GPU scheduling (bandwidth + memory guarantees)
PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and ideal for research, inference tooling, DSP, and real-time systems.
Note: PyGPUkit is NOT a PyTorch/CuPy replacement—it's a lightweight runtime for custom GPU workloads where full ML frameworks are overkill.
What's New in v0.2.5
FP16 / BF16 Support
| Feature | Description |
|---|---|
| FP16 (float16) | Half-precision floating point |
| BF16 (bfloat16) | Brain floating point (better dynamic range) |
| FP32 Accumulation | Numerical stability via FP32 intermediate |
| Type Conversion | astype() for seamless dtype conversion |
```python
import pygpukit as gpk
import numpy as np

# FP16 operations
a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
b = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
c = a @ b  # FP16 matmul

# BF16 operations
arr = np.random.randn(1024, 1024).astype(np.float32)
a_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
b_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
c_bf16 = a_bf16 @ b_bf16             # BF16 matmul
result = c_bf16.astype(gpk.float32)  # Convert back to FP32
```
Reduction Operations
| Operation | Description |
|---|---|
| gpk.sum(a) | Sum of all elements |
| gpk.mean(a) | Mean of all elements |
| gpk.max(a) | Maximum element |
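These reductions follow NumPy semantics: each collapses the whole array to a scalar. A plain-Python sketch of the expected results (the gpk calls noted in the comments are from the table above):

```python
# A small 2x2 input, flattened the way a full reduction sees it
data = [[1.0, 2.0], [3.0, 4.0]]
flat = [v for row in data for v in row]

total = sum(flat)           # gpk.sum(a)  -> 10.0
mean = total / len(flat)    # gpk.mean(a) -> 2.5
largest = max(flat)         # gpk.max(a)  -> 4.0
print(total, mean, largest)
```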
Operator Overloads
```python
c = a + b  # Element-wise add
c = a - b  # Element-wise subtract
c = a * b  # Element-wise multiply
c = a / b  # Element-wise divide
c = a @ b  # Matrix multiplication
```
What's New in v0.2.4
Single-Binary Distribution
| Feature | Description |
|---|---|
| Driver-only mode | Only nvcuda.dll (GPU driver) required |
| Dynamic NVRTC | JIT loaded at runtime, optional |
| No cudart dependency | Eliminated CUDA Runtime dependency |
| Smaller wheel | No bundled DLLs |
```python
import pygpukit as gp

# Works with just GPU drivers!
print(f"CUDA: {gp.is_cuda_available()}")     # True (if GPU driver installed)
print(f"NVRTC: {gp.is_nvrtc_available()}")   # True (if CUDA Toolkit installed)
print(f"NVRTC Path: {gp.get_nvrtc_path()}")  # Path to NVRTC DLL (if available)
```
TF32 TensorCore GEMM
| Feature | Description |
|---|---|
| PTX mma.sync | Direct TensorCore access via inline PTX assembly |
| cp.async Pipeline | Double-buffered async memory transfers |
| TF32 Precision | 10-bit mantissa, 19 significant bits total (vs FP32's 23-bit mantissa); ~0.1% per-op error |
| SM 80+ Required | Ampere architecture (RTX 30 series or newer) |
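TF32 keeps FP32's 8-bit exponent but only a 10-bit mantissa, which is where the per-op error bound comes from. One way to see the effect is to emulate TF32 rounding by clearing the low 13 mantissa bits of a float32 (real hardware rounds to nearest; truncation as done here slightly overstates the error):

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 by truncating a float32 mantissa to 10 bits.

    TF32 keeps the FP32 exponent (8 bits) but only 10 mantissa bits,
    so the low 13 bits of the 23-bit FP32 mantissa are dropped.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~0x1FFF  # clear the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.0001
err = abs(to_tf32(x) - x) / x
print(f"relative error: {err:.2e}")  # small: on the order of 1e-4 for this input
```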
Performance
Benchmark Comparison (RTX 3090 Ti, 8192×8192)
| Library | FP32 | TF32 | Requirements |
|---|---|---|---|
| NumPy (OpenBLAS) | ~0.8 TFLOPS | — | CPU only |
| cuBLAS | ~21 TFLOPS | ~59 TFLOPS | CUDA Toolkit |
| PyGPUkit | 16.7 TFLOPS | 29.7 TFLOPS | GPU drivers only |
Built-in matmul kernels are pre-compiled. Driver-Only and Full (JIT) modes have identical matmul performance. JIT is only needed for custom kernels.
PyGPUkit Performance by Matrix Size
| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|---|---|---|---|---|
| 2048×2048 | 9.6 TFLOPS | 13.2 TFLOPS | 2.4 TFLOPS | 2.4 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22.8 TFLOPS | 2.4 TFLOPS | 2.3 TFLOPS |
| 8192×8192 | 16.7 TFLOPS | 29.7 TFLOPS | 2.3 TFLOPS | 2.3 TFLOPS |
Note: FP16/BF16 matmul uses simple kernels with FP32 accumulation. TensorCore optimization planned for future releases (see Issue #60).
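For reference, the TFLOPS figures follow from the standard 2·N³ FLOP count for an N×N GEMM. For example, the 16.7 TFLOPS FP32 entry corresponds to an 8192×8192 matmul finishing in roughly 66 ms (the timing here is back-computed for illustration, not a measured value):

```python
def matmul_tflops(n: int, seconds: float) -> float:
    """FLOP rate for an n x n GEMM: 2*n^3 multiply-adds' worth of work."""
    return 2 * n**3 / seconds / 1e12

# An 8192^2 GEMM in ~65.8 ms works out to ~16.7 TFLOPS (the FP32 row above)
print(round(matmul_tflops(8192, 0.0658), 1))
```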
Installation
```shell
pip install pygpukit
```
From source:
```shell
git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .
```
Requirements
- Python 3.10+
- NVIDIA GPU with drivers installed
- Optional: CUDA Toolkit (for JIT compilation of custom kernels)
Note: NVRTC (NVIDIA Runtime Compiler) is included in CUDA Toolkit. Pre-compiled GPU operations (matmul, add, mul, etc.) work with just GPU drivers.
Supported GPUs
- RTX 30XX series (Ampere, SM 80+) and above
- Older GPUs (RTX 20XX, GTX 10XX, etc.) are NOT supported (SM < 80)
Runtime Modes
| Mode | Requirements | Features |
|---|---|---|
| Full JIT | GPU drivers + CUDA Toolkit | All features including custom kernels |
| Pre-compiled | GPU drivers only | Built-in ops (matmul, add, mul) |
| CPU simulation | None | Testing/development without GPU |
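The mode selection above reduces to two capability probes. A minimal sketch of the decision logic; in PyGPUkit the probe results would come from gp.is_cuda_available() and gp.is_nvrtc_available(), while the runtime_mode helper and its return strings are illustrative, not part of the API:

```python
def runtime_mode(cuda_available: bool, nvrtc_available: bool) -> str:
    """Map the two capability probes to the runtime-mode table."""
    if cuda_available and nvrtc_available:
        return "full-jit"        # GPU drivers + CUDA Toolkit: custom kernels work
    if cuda_available:
        return "precompiled"     # drivers only: built-in ops (matmul, add, mul)
    return "cpu-simulation"      # no GPU: testing/development fallback

print(runtime_mode(True, False))  # precompiled
```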
Quick Start
Basic Operations
```python
import pygpukit as gp

# Allocate arrays
x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")

# Operations
z = gp.add(x, y)
w = gp.matmul(x, y)

# CPU <-> GPU transfer
arr = z.to_numpy()
garr = gp.from_numpy(arr)
```
Custom JIT Kernel (requires CUDA Toolkit)
```python
src = '''
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;
}
'''

if gp.is_nvrtc_available():
    kernel = gp.jit(src, func="scale")
    kernel(x, factor=0.5, n=x.size)
else:
    print("JIT not available. Using pre-compiled ops.")
```
Rust Scheduler
```python
import _pygpukit_rust as rust

# Memory Pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)

# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8 * 1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256 * 1024 * 1024)
result = evaluator.evaluate(task)

# GPU Partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8 * 1024**3))
manager.create_partition("inference", "Inference",
                         rust.PartitionLimits().memory(4 * 1024**3).compute(0.5))
```
Features
Core Infrastructure (Rust)
| Feature | Description |
|---|---|
| Memory Pool | LRU eviction, size-class free lists |
| Scheduler | Priority queue, memory reservation |
| Transfer Engine | Separate H2D/D2H streams, priority |
| Kernel Dispatch | Per-stream limits, lifecycle tracking |
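As a rough illustration of the memory-pool row, here is a conceptual pure-Python sketch of quota-bounded allocation with LRU eviction. The real pool is implemented in Rust and additionally maintains size-class free lists; this class is illustrative only:

```python
from collections import OrderedDict

class LruPool:
    """Toy quota-bounded pool: evicts least-recently-used blocks on pressure."""

    def __init__(self, quota: int):
        self.quota = quota
        self.used = 0
        self.blocks = OrderedDict()  # block_id -> size, oldest first
        self._next_id = 0

    def allocate(self, size: int) -> int:
        # Evict LRU blocks until the request fits under the quota
        while self.used + size > self.quota and self.blocks:
            _, freed = self.blocks.popitem(last=False)
            self.used -= freed
        if self.used + size > self.quota:
            raise MemoryError("request exceeds pool quota")
        bid = self._next_id
        self._next_id += 1
        self.blocks[bid] = size
        self.used += size
        return bid

    def touch(self, bid: int) -> None:
        self.blocks.move_to_end(bid)  # mark as most recently used
```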
Advanced Scheduler
| Feature | Description |
|---|---|
| Admission Control | Deterministic admission, quota enforcement |
| QoS Policy | Guaranteed/Burstable/BestEffort tiers |
| Kernel Pacing | Bandwidth-based throttling per stream |
| GPU Partitioning | Resource isolation, multi-tenant support |
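A conceptual sketch of how tiered QoS admission can work: Guaranteed tasks reserve memory up-front and are rejected if the guarantee cannot be met, while Burstable/BestEffort tasks are admitted opportunistically. This mirrors the tier idea only; it is not the rust.QosPolicyEvaluator API:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    tier: str            # "guaranteed" | "burstable" | "besteffort"
    memory_request: int  # bytes

class TieredAdmission:
    """Toy admission control: hard reservations for Guaranteed only."""

    def __init__(self, total_memory: int):
        self.free = total_memory

    def evaluate(self, task: Task) -> bool:
        if task.tier == "guaranteed":
            if task.memory_request > self.free:
                return False  # reject: the guarantee cannot be reserved
            self.free -= task.memory_request
            return True
        # Burstable/BestEffort: admit now, throttle or evict under pressure
        return True
```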
Project Goals
- Provide the smallest usable GPU runtime for Python
- Expose GPU scheduling (bandwidth, memory, partitioning)
- Make writing custom GPU kernels easy
- Serve as a building block for inference engines, DSP systems, and real-time workloads
Project Structure
```
PyGPUkit/
  src/pygpukit/        # Python API (NumPy-compatible)
  native/              # C++ backend (CUDA Driver API, NVRTC)
  rust/                # Rust backend (memory pool, scheduler)
    pygpukit-core/     # Pure Rust core logic
    pygpukit-python/   # PyO3 bindings
  examples/            # Demo scripts
  tests/               # Test suite
```
Roadmap
Released
| Version | Highlights |
|---|---|
| v0.1 | GPUArray, NVRTC JIT, add/mul/matmul, wheels |
| v0.2.0 | Rust scheduler (QoS, partitioning), memory pool (LRU), 106 tests |
| v0.2.1 | API stabilization, error propagation |
| v0.2.2 | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
| v0.2.3 | TF32 TensorCore (PTX mma.sync), 28 TFLOPS |
| v0.2.4 | Single-binary distribution, dynamic NVRTC, driver-only mode |
| v0.2.5 | FP16/BF16 support, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |
Planned
| Version | Goals |
|---|---|
| v0.2.6 | FP16/BF16 TensorCore optimization, Multi-GPU detection |
| v0.2.7 | Full API review, documentation, backward compatibility |
| v0.3 | Triton backend, advanced ops (softmax, layernorm), MPS/MIG |
Contributing
Contributions and discussions are welcome! Please open Issues for feature requests, bugs, or design proposals.
License
MIT License
Acknowledgements
Inspired by: CUDA Runtime, NVRTC, PyCUDA, CuPy, Triton
PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.