A lightweight GPU runtime for Python with Rust-powered scheduler, NVRTC JIT compilation, and NumPy-like API
PyGPUkit — Lightweight GPU Runtime for Python
A minimal, modular GPU runtime with Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.
Overview
PyGPUkit is a lightweight GPU runtime for Python that provides:
- Rust-powered scheduler with admission control, QoS, and resource partitioning
- NVRTC-based JIT kernel compilation
- A NumPy-like GPUArray type
- Kubernetes-inspired GPU scheduling (bandwidth + memory guarantees)
- Extensible operator set (add/mul/matmul, custom kernels)
- Minimal dependencies and embeddable runtime
PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and ideal for research, inference tooling, DSP, and real-time systems.
v0.2.2 Features (NEW)
Ampere-Optimized SGEMM
| Feature | Description |
|---|---|
| cp.async Pipeline | 4-stage software pipeline with async memory transfers |
| Vectorized Loads | float4 (16-byte) loads for A and B matrices |
| Shared Memory Tiling | BM=128, BN=128, BK=16 with 8x8 thread tiles |
| SM 80+ Required | Ampere architecture (RTX 30XX+) required |
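The tile parameters in the table fix the launch shape and the shared-memory footprint of each thread block. The arithmetic below is a back-of-the-envelope sketch, assuming FP32 operands, no padding, and one A tile (BM x BK) plus one B tile (BK x BN) buffered per pipeline stage; the actual kernel may lay out shared memory differently.

```python
# Back-of-the-envelope tile arithmetic for the SGEMM configuration above.
# Assumption: each pipeline stage holds one A tile (BM x BK) and one
# B tile (BK x BN) of 4-byte floats, with no padding.
BM, BN, BK = 128, 128, 16    # block tile sizes from the table
TM, TN = 8, 8                # per-thread tile (8x8 outputs per thread)
STAGES = 4                   # cp.async pipeline depth
BYTES_PER_FLOAT = 4

threads_per_block = (BM // TM) * (BN // TN)
floats_per_stage = BM * BK + BK * BN
smem_per_stage = floats_per_stage * BYTES_PER_FLOAT
smem_total = smem_per_stage * STAGES
float4_loads_per_stage = floats_per_stage // 4   # one float4 moves 4 floats

print(threads_per_block)        # 256
print(smem_per_stage)           # 16384
print(smem_total)               # 65536
print(float4_loads_per_stage)   # 1024
```

Under these assumptions the four stages need 64 KiB per block, which is more than the 48 KiB CUDA grants by default, so a kernel of this shape would have to opt in to a larger dynamic shared-memory limit.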
Performance (RTX 3090 Ti)
| Matrix Size | TFLOPS | Efficiency (% of FP32 peak) | Speedup vs NumPy |
|---|---|---|---|
| 2048x2048 | 7.6 | 19% | 10x |
| 4096x4096 | 13.2 | 33% | 16x |
| 8192x8192 | 18.2 | 46% | 22x |
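The TFLOPS and efficiency columns are related by simple arithmetic. The sketch below assumes that efficiency is measured against an FP32 peak of about 40 TFLOPS for the RTX 3090 Ti (the source does not state the peak it used) and that an n x n matmul costs 2n^3 floating-point operations. The function names are illustrative and not part of PyGPUkit.

```python
# Relating the table's TFLOPS and efficiency columns.
# Assumption: the reference peak is about 40 TFLOPS (FP32, RTX 3090 Ti).
ASSUMED_PEAK_TFLOPS = 40.0


def sgemm_flops(n: int) -> int:
    """Floating-point operations for an n x n x n matmul (one mul + one add)."""
    return 2 * n ** 3


def efficiency_percent(tflops: float, peak: float = ASSUMED_PEAK_TFLOPS) -> float:
    """Achieved throughput as a percentage of the assumed peak."""
    return 100.0 * tflops / peak


def runtime_ms(n: int, tflops: float) -> float:
    """Time one n x n matmul takes at the given sustained throughput."""
    return sgemm_flops(n) / (tflops * 1e12) * 1e3


for n, tflops in [(2048, 7.6), (4096, 13.2), (8192, 18.2)]:
    print(n, f"{efficiency_percent(tflops):.1f}%", f"{runtime_ms(n, tflops):.2f} ms")
```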
Core Infrastructure (Rust)
| Feature | Description |
|---|---|
| Memory Pool | LRU eviction, size-class free lists |
| Scheduler | Priority queue, memory reservation |
| Transfer Engine | Separate H2D/D2H streams, priority |
| Kernel Dispatch | Per-stream limits, lifecycle tracking |
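To make "size-class free lists" and "LRU eviction" concrete, here is a minimal pure-Python model of such a pool. It is a sketch under stated assumptions, not the Rust implementation: `ToyMemoryPool`, `size_class`, the power-of-two classes, and the 256-byte minimum are all hypothetical, and the model only tracks byte counts.

```python
from collections import OrderedDict, defaultdict


def size_class(nbytes: int, min_class: int = 256) -> int:
    """Round a request up to the next power-of-two size class (assumed scheme)."""
    cls = min_class
    while cls < nbytes:
        cls *= 2
    return cls


class ToyMemoryPool:
    """Hypothetical model: tracks byte counts only, never touches a GPU."""

    def __init__(self, quota: int):
        self.quota = quota
        self.used = 0                        # bytes held by live and cached blocks
        self.free_lists = defaultdict(list)  # size class -> cached block ids
        self.live = OrderedDict()            # block id -> size class, oldest first
        self.evicted = []                    # ids of live blocks spilled by eviction
        self._next_id = 0

    def allocate(self, nbytes: int) -> int:
        cls = size_class(nbytes)
        if cls > self.quota:
            raise MemoryError("request exceeds pool quota")
        if self.free_lists[cls]:
            block = self.free_lists[cls].pop()   # reuse a cached block
        else:
            while self.used + cls > self.quota:  # make room before growing
                self._evict_one()
            block = self._next_id
            self._next_id += 1
            self.used += cls
        self.live[block] = cls                   # newest entry = most recently used
        return block

    def touch(self, block: int) -> None:
        self.live.move_to_end(block)             # mark as most recently used

    def free(self, block: int) -> None:
        cls = self.live.pop(block)
        self.free_lists[cls].append(block)       # keep the memory cached for reuse

    def _evict_one(self) -> None:
        for cls, blocks in self.free_lists.items():
            if blocks:                           # cheapest: drop a cached free block
                blocks.pop()
                self.used -= cls
                return
        block, cls = self.live.popitem(last=False)  # else spill the LRU live block
        self.evicted.append(block)
        self.used -= cls
```

Freed blocks stay in their class's free list so that a later request of the same class is served without a new allocation; eviction drops cached blocks before it touches live ones.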
Advanced Features (Rust)
| Feature | Description |
|---|---|
| Admission Control | Deterministic admission, quota enforcement |
| QoS Policy | Guaranteed/Burstable/BestEffort tiers |
| Kernel Pacing | Bandwidth-based throttling per stream |
| Micro-Slicing | Kernel splitting, round-robin fairness |
| Pinned Memory | Page-locked host memory with pooling |
| Kernel Cache | PTX caching, LRU eviction, TTL |
| GPU Partitioning | Resource isolation, multi-tenant support |
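The kernel cache row combines three ideas: keying compiled PTX by its source, evicting the least recently used entry when full, and expiring entries after a time-to-live. The sketch below illustrates that combination in pure Python; `ToyKernelCache` and its methods are hypothetical names, the SHA-256 key is an assumption, and the clock is passed in explicitly so the behaviour is deterministic.

```python
import hashlib
from collections import OrderedDict


class ToyKernelCache:
    """Hypothetical model of a PTX cache with LRU eviction and a TTL."""

    def __init__(self, capacity: int, ttl_seconds: float):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._entries = OrderedDict()    # key -> (ptx, time stored), oldest first

    @staticmethod
    def key(source: str, options: tuple = ()) -> str:
        """Hash the kernel source together with its compile options."""
        digest = hashlib.sha256(source.encode())
        for opt in options:
            digest.update(b"\0" + opt.encode())
        return digest.hexdigest()

    def get(self, key: str, now: float):
        entry = self._entries.get(key)
        if entry is None:
            return None
        ptx, stored_at = entry
        if now - stored_at > self.ttl:   # expired: drop it and report a miss
            del self._entries[key]
            return None
        self._entries.move_to_end(key)   # refresh LRU position on a hit
        return ptx

    def put(self, key: str, ptx: str, now: float) -> None:
        self._entries[key] = (ptx, now)
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict least recently used


cache = ToyKernelCache(capacity=2, ttl_seconds=60.0)
k = cache.key('extern "C" __global__ void f() {}', ("--gpu-architecture=compute_80",))
cache.put(k, "<ptx text>", now=0.0)
print(cache.get(k, now=1.0) is not None)
```

Including the compile options in the key matters because the same source compiled for a different architecture yields different PTX.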
Features
- Lightweight — no PyTorch/CuPy overhead
- Modular — runtime / memory / scheduler / JIT / ops
- Rust Backend — memory pool, scheduler, dispatch in Rust
- GPUArray with NumPy interop
- NVRTC JIT for CUDA kernels
- Advanced Scheduler with memory & bandwidth guarantees
- 106 Rust tests for core components
Installation
pip install pygpukit
From source:
git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .
Requirements:
- Python 3.10+
- CUDA 11+
- NVRTC available
- NVIDIA GPU
Supported GPUs:
- RTX 30XX series (Ampere) and above
- Kernels are tuned for GPUs with a large L2 cache (6 MB or more)
- Older GPUs (RTX 20XX, GTX 10XX, etc.) are NOT tuned and may have suboptimal performance
Project Goals
- Provide the smallest usable GPU runtime for Python
- Expose GPU scheduling (bandwidth, memory, partitioning)
- Make writing custom GPU kernels easy
- Serve as a building block for inference engines, DSP systems, and real-time workloads
Usage Examples
Allocate Arrays
import pygpukit as gp
x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")
Basic Operations
z = gp.add(x, y)
w = gp.matmul(x, y)
CPU <-> GPU Transfer
arr = z.to_numpy()
garr = gp.from_numpy(arr)
Custom NVRTC Kernel
src = r'''
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;
}
'''
# Compile the CUDA source with NVRTC, then launch it on the array allocated earlier.
kernel = gp.jit(src, func="scale")
kernel(x, factor=0.5, n=x.size)
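The index computation in the `scale` kernel assumes a one-dimensional launch with enough threads to cover all `n` elements. The CPU-only sketch below emulates that geometry so the bounds guard is easy to see. The block size of 256 is an assumption (the source does not say how `gp.jit` chooses launch dimensions), and `grid_size` and `scale_reference` are hypothetical helpers.

```python
def grid_size(n: int, block_dim: int = 256) -> int:
    """Blocks needed so that grid * block_dim >= n (ceiling division)."""
    return (n + block_dim - 1) // block_dim


def scale_reference(x: list, factor: float, block_dim: int = 256) -> None:
    """CPU emulation of the `scale` kernel: one loop iteration per GPU thread."""
    n = len(x)
    for block_idx in range(grid_size(n, block_dim)):
        for thread_idx in range(block_dim):
            idx = block_idx * block_dim + thread_idx
            if idx < n:          # same bounds guard as the CUDA kernel
                x[idx] *= factor


data = [float(i) for i in range(1000)]
scale_reference(data, 0.5)
print(grid_size(1000), data[:3], data[-1])
```

With 1000 elements and 256 threads per block, the last block has 24 surplus threads, which is exactly what the `idx < n` check guards against.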
Rust Scheduler (v0.2)
import _pygpukit_rust as rust
# Memory Pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)
# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8*1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256*1024*1024)
result = evaluator.evaluate(task)
# GPU Partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8*1024**3))
manager.create_partition("inference", "Inference",
    rust.PartitionLimits().memory(4*1024**3).compute(0.5))
Scheduler — Kubernetes-Inspired GPU Orchestration
PyGPUkit includes an experimental scheduler that treats a single GPU as a multi-tenant compute node, similar to how Kubernetes orchestrates CPU workloads. The goal is to provide resource isolation, guarantees, and fair sharing across multiple GPU tasks.
Core Capabilities
1. GPU Memory Reservation
Tasks may request a guaranteed block of GPU memory.
- Hard guarantees -> task is rejected if memory cannot be allocated
- Soft guarantees -> best-effort allocation
- Overcommit strategies (evict to host when pressure is high)
- Reclaim policies (LRU GPUArray eviction)
Example:
task = scheduler.submit(
    fn,
    memory="512MB",
)
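The difference between hard and soft guarantees can be sketched in a few lines. This is an illustration only: `parse_memory` and `reserve` are hypothetical helpers, and treating "MB" as 1024**2 bytes is an assumption (the Rust example earlier in this README writes sizes in binary multiples).

```python
import re

# Assumption: units are binary multiples, so "512MB" means 512 * 1024**2 bytes.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_memory(text: str) -> int:
    """Convert strings such as "512MB" to a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized memory size: {text!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def reserve(request: str, free_bytes: int, hard: bool = True) -> int:
    """Return the bytes granted; a hard request that cannot fit is rejected."""
    wanted = parse_memory(request)
    if wanted <= free_bytes:
        return wanted
    if hard:
        raise MemoryError(f"cannot guarantee {request}")
    return free_bytes    # soft guarantee: best effort, grant what is left


print(reserve("512MB", free_bytes=1024 ** 3))
print(reserve("1GB", free_bytes=512 * 1024 ** 2, hard=False))
```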
2. GPU Bandwidth Guarantees / Throttling
Tasks may request a specific percentage of GPU compute bandwidth.
Bandwidth control is implemented via:
- Stream priority
- Kernel pacing (launch intervals)
- Micro-slicing large kernels
- Cooperative time-quantized scheduling
- Persistent dispatcher kernels (planned)
Example:
task = scheduler.submit(
    fn,
    bandwidth=0.20,  # 20% GPU compute share
)
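Kernel pacing by launch interval comes down to a duty-cycle calculation: if a kernel runs for `d` milliseconds and the task is entitled to a share `s` of the GPU, consecutive launches should be at least `d / s` apart. The sketch below shows that arithmetic; the function names are hypothetical, and the real pacing engine may use a different model.

```python
def min_launch_interval_ms(kernel_ms: float, share: float) -> float:
    """Smallest spacing between launches that keeps the duty cycle at `share`."""
    if not 0.0 < share <= 1.0:
        raise ValueError("share must be in (0, 1]")
    return kernel_ms / share


def paced_launch_times(kernel_ms: float, share: float, count: int,
                       start_ms: float = 0.0) -> list:
    """Earliest permitted launch times for `count` back-to-back kernels."""
    interval = min_launch_interval_ms(kernel_ms, share)
    return [start_ms + i * interval for i in range(count)]


# A 2 ms kernel with a 20% share may launch roughly once every 10 ms.
print(min_launch_interval_ms(2.0, 0.20))
print(paced_launch_times(2.0, 0.5, count=3))
```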
3. Logical GPU Partitioning
PyGPUkit implements software-defined GPU slicing, similar in spirit to Kubernetes device plugin resource partitioning.
Slices may define:
- Memory quota
- Bandwidth share
- Stream priority band
- Isolation level
Useful for:
- Multi-tenant inference servers
- Real-time audio/DSP workloads
- Background/foreground GPU task separation
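The bookkeeping behind software-defined slices can be modelled as quota accounting: a new partition is accepted only if its memory quota and compute share still fit in what is left. The sketch below is a hypothetical model (`Partition` and `ToyPartitionManager` are not the Rust `PartitionManager` API) and covers only memory and compute share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    memory_bytes: int
    compute_share: float    # fraction of GPU compute, 0.0 to 1.0


class ToyPartitionManager:
    """Hypothetical model: accepts partitions while quotas remain."""

    def __init__(self, total_memory: int):
        self.total_memory = total_memory
        self.partitions = {}

    def remaining_memory(self) -> int:
        return self.total_memory - sum(p.memory_bytes for p in self.partitions.values())

    def remaining_compute(self) -> float:
        return 1.0 - sum(p.compute_share for p in self.partitions.values())

    def create_partition(self, name: str, memory_bytes: int,
                         compute_share: float) -> Partition:
        if name in self.partitions:
            raise ValueError(f"partition {name!r} already exists")
        if memory_bytes > self.remaining_memory():
            raise ValueError("memory quota exceeds what is left")
        if compute_share > self.remaining_compute() + 1e-9:  # float tolerance
            raise ValueError("compute share exceeds what is left")
        part = Partition(name, memory_bytes, compute_share)
        self.partitions[name] = part
        return part


manager = ToyPartitionManager(total_memory=8 * 1024 ** 3)
manager.create_partition("inference", 4 * 1024 ** 3, 0.5)
print(manager.remaining_memory(), manager.remaining_compute())
```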
4. Scheduling Policies
The scheduler supports multiple policies:
- Guaranteed — exclusive reservation, strict QoS
- Burstable — partial guarantees, opportunistic bandwidth
- BestEffort — uses leftover GPU cycles
- Priority scheduling
- Deadline scheduling (planned)
- Weighted fair sharing
Example:
task = scheduler.submit(
    fn,
    policy="guaranteed",
    memory="1GB",
    bandwidth=0.10,
)
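One way to combine the QoS tiers with priority scheduling is a single heap keyed by tier first, then priority, then arrival order. The sketch below shows that ordering. `ToyRunQueue` is a hypothetical name, and the policy strings other than "guaranteed" are assumed spellings; the actual scheduler may order tasks differently.

```python
import heapq
import itertools

# Assumption: lower rank runs first; only "guaranteed" appears in the source.
TIER_RANK = {"guaranteed": 0, "burstable": 1, "besteffort": 2}


class ToyRunQueue:
    """Hypothetical run queue: tier, then priority, then FIFO."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()    # tie-breaker that preserves arrival order

    def push(self, task_id: str, policy: str, priority: int = 0) -> None:
        # Negate priority so that larger numbers are popped first.
        key = (TIER_RANK[policy], -priority, next(self._seq))
        heapq.heappush(self._heap, (key, task_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)


queue = ToyRunQueue()
queue.push("cleanup", "besteffort")
queue.push("batch", "burstable")
queue.push("audio", "guaranteed")
queue.push("inference", "guaranteed", priority=5)
print([queue.pop() for _ in range(len(queue))])
```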
5. Admission Control
Before executing a task, the scheduler performs:
- Resource validation
- Quota check
- QoS matching
- Scheduling feasibility
Each task then ends up in one of three states:
- admitted
- queued
- rejected
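The four checks and three outcomes above can be sketched as a single decision function. This is one possible interpretation, not the scheduler's actual rules: `Request` and `admit` are hypothetical, and the choice to reject (instead of queue) a guaranteed task that does not fit follows the hard-guarantee behaviour described under memory reservation.

```python
from dataclasses import dataclass

POLICIES = ("guaranteed", "burstable", "besteffort")   # assumed spellings


@dataclass(frozen=True)
class Request:
    memory_bytes: int
    bandwidth: float              # requested compute share, 0.0 to 1.0
    policy: str = "besteffort"


def admit(req: Request, free_memory: int, free_bandwidth: float,
          quota_memory: int) -> str:
    """Return "admitted", "queued", or "rejected" for one request."""
    # 1. Resource validation: malformed requests never enter the queue.
    if req.memory_bytes < 0 or not 0.0 <= req.bandwidth <= 1.0:
        return "rejected"
    # 2. Quota check: a request above the tenant's quota can never run.
    if req.memory_bytes > quota_memory:
        return "rejected"
    # 3. QoS matching: the policy must name a known tier.
    if req.policy not in POLICIES:
        return "rejected"
    # 4. Scheduling feasibility: does it fit right now?
    if req.memory_bytes <= free_memory and req.bandwidth <= free_bandwidth:
        return "admitted"
    # Hard guarantees fail fast; other tiers wait for resources to free up.
    return "rejected" if req.policy == "guaranteed" else "queued"


GIB = 1024 ** 3
print(admit(Request(GIB, 0.10, "guaranteed"), 2 * GIB, 0.5, 4 * GIB))
print(admit(Request(3 * GIB, 0.10, "burstable"), 2 * GIB, 0.5, 4 * GIB))
```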
6. Monitoring & Introspection
PyGPUkit exposes live metrics:
- Memory usage per task
- SM occupancy and GPU utilization
- Throttling / pacing logs
- Queue position / execution state
- Reclaim/eviction count
Example:
stats = scheduler.stats(task_id)
7. Soft Isolation Model
Although this is not OS-level isolation, each GPU task is given:
- Dedicated stream groups
- Guaranteed memory pools
- Kernel pacing to enforce bandwidth
- Optional sandboxed GPUArray region
This provides practical multi-tenant safety without MIG/MPS.
Project Structure
PyGPUkit/
  src/pygpukit/        # Python API (NumPy-compatible)
  native/              # C++ backend (CUDA Driver/Runtime/NVRTC)
  rust/                # Rust backend (memory pool, scheduler, dispatch)
    pygpukit-core/     # Pure Rust core logic
    pygpukit-python/   # PyO3 bindings
  examples/            # Demo scripts
  tests/               # Test suite
Roadmap
v0.1 (Released)
- GPUArray
- NVRTC JIT
- add/mul/matmul ops
- Basic stream manager
- Packaging + wheels
v0.2.0 (Released)
- Rust Memory Pool (LRU, size-class)
- Rust Scheduler (priority, memory reservation)
- Rust Transfer Engine (async H2D/D2H)
- Rust Kernel Dispatch Controller
- Admission Control
- QoS Policy Framework (Guaranteed/Burstable/BestEffort)
- Kernel Pacing Engine
- Micro-Slicing Framework
- Pinned Memory Support
- Kernel Cache (PTX caching)
- GPU Partitioning
- Tiled Matmul (shared memory)
- 106 Rust tests
v0.2.1 — Stabilization Phase (Released)
- Admission / QoS spec finalization
- Python API inconsistency fixes
- Rust error propagation unification
v0.2.2 — Performance Phase (Released)
- Ampere-optimized SGEMM with cp.async pipeline
- 4-stage software pipelining for latency hiding
- float4 vectorized memory loads
- 18.2 TFLOPS on RTX 3090 Ti (46% efficiency)
- SM 80+ (Ampere) architecture requirement
v0.2.3 — Reliability Phase
- Kernel cache LRU completion
- Driver-only mode stabilization
- Windows/Linux full support
- Large GPU memory test (16GB continuous alloc/free)
v0.2.4 — Distributed Phase
- Multi-GPU Detection
- NCCL / peer-to-peer preliminary support
- Scheduler multi-device support
v0.2.5 — Pre-v0.3 Finalization
- Full API review
- Backward compatibility policy
- JIT build options, safety measures, env vars cleanup
- Documentation
v0.3 (Planned)
- Triton optional backend
- Advanced ops (softmax, layernorm)
- Inference-oriented plugin system
- MPS/MIG integration
Contributing
Contributions and discussions are welcome! Please open Issues for feature requests, bugs, or design proposals.
License
MIT License
Acknowledgements
Inspired by:
- CUDA Runtime
- NVRTC
- PyCUDA
- CuPy
- Triton
PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.