A lightweight GPU runtime for Python with a Rust-powered scheduler, NVRTC JIT compilation, and a NumPy-like API

Maintainers

m96-chan


Project description

PyGPUkit — Lightweight GPU Runtime for Python

A minimal, modular GPU runtime with a Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.

Overview

PyGPUkit is a lightweight GPU runtime for Python that provides:

Rust-powered scheduler with admission control, QoS, and resource partitioning
NVRTC-based JIT kernel compilation
A NumPy-like GPUArray type
Kubernetes-inspired GPU scheduling (bandwidth + memory guarantees)
Extensible operator set (add/mul/matmul, custom kernels)
Minimal dependencies and embeddable runtime

PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and ideal for research, inference tooling, DSP, and real-time systems.

Goal Statement

PyGPUkit aims to simplify GPU development by reducing dependency on complex CUDA Toolkit installations and fragile GPU environments. Its goal is to make GPU programming feel like using a standard Python library: installable via pip with minimal setup. PyGPUkit provides high-performance GPU kernels, memory management, and scheduling through a NumPy-like API and a Kubernetes-inspired resource model, allowing developers to use GPUs explicitly, predictably, and productively.

Note: PyGPUkit currently requires CUDA drivers and NVRTC. It is NOT a PyTorch/CuPy replacement—it's a lightweight runtime for custom GPU workloads, research, and real-time systems where full ML frameworks are overkill.

v0.2.3 Features (NEW)

TF32 TensorCore GEMM

Feature	Description
PTX mma.sync	Direct TensorCore access via inline PTX assembly
cp.async Pipeline	Double-buffered async memory transfers
TF32 Precision	19-bit format with a 10-bit mantissa (vs FP32's 23-bit), ~0.1% per-op error
SM 80+ Required	Ampere architecture (RTX 30XX+) required
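
The precision figure in the table can be checked on the CPU. The sketch below is an illustration only, not PyGPUkit code: it emulates TF32 by clearing the low 13 of FP32's 23 mantissa bits, which leaves the 10 explicit mantissa bits that TF32 stores. Truncation is an assumption made here for simplicity (hardware may round instead), and the helper names `to_tf32` and `relative_error` are hypothetical.

```python
import struct


def to_tf32(x: float) -> float:
    """Truncate a value to TF32 precision (sign, 8 exponent bits, 10 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the low 13 mantissa bits of the FP32 encoding
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relative_error(x: float) -> float:
    """Relative error of the TF32 truncation, measured against the FP32 value."""
    fp32 = struct.unpack("<f", struct.pack("<f", x))[0]
    return abs(to_tf32(x) - fp32) / abs(fp32)


if __name__ == "__main__":
    for value in (3.14159, 1234.5678, 0.001):
        print(value, to_tf32(value), relative_error(value))
```

With truncation, the relative error of a single conversion of a normal number stays below 2^-10 (about 0.098%), which matches the "~0.1% per-op" figure above.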

Benchmark Comparison (RTX 3090 Ti, 8192×8192×8192)

Library	FP32	TF32	Notes
NumPy (OpenBLAS)	~0.8 TFLOPS	—	CPU baseline
cuBLAS	~21 TFLOPS	~59 TFLOPS	NVIDIA benchmark
PyGPUkit	18 TFLOPS (86%)	27 TFLOPS (46%)	Custom kernels

FP32 reaches about 86% of cuBLAS throughput at this size. TF32 reaches about 46%; optimization is ongoing.
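
For readers who want to reproduce the arithmetic behind these numbers, here is a minimal sketch. It assumes the usual GEMM convention of 2·M·N·K floating-point operations (one multiply and one add per inner-product term); the README does not state which convention was used, and the function names are hypothetical.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    return 2 * m * n * k


def tflops(flops: int, seconds: float) -> float:
    """Convert an operation count and a wall-clock time into TFLOPS."""
    return flops / seconds / 1e12


flops = gemm_flops(8192, 8192, 8192)
# Time a single 8192-cubed GEMM would take at the 18 TFLOPS FP32 rate from the table.
seconds_at_18 = flops / 18e12

if __name__ == "__main__":
    print(f"{flops:.3e} FLOPs, {seconds_at_18 * 1e3:.1f} ms per GEMM at 18 TFLOPS")
    print(f"FP32 vs cuBLAS: {18 / 21:.0%}, TF32 vs cuBLAS: {27 / 59:.0%}")
```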

PyGPUkit Performance by Size

Matrix Size	FP32	TF32
2048×2048	7.6 TFLOPS	10.2 TFLOPS
4096×4096	13.2 TFLOPS	19.5 TFLOPS
8192×8192	18.2 TFLOPS	27.5 TFLOPS

Core Infrastructure (Rust)

Feature	Description
Memory Pool	LRU eviction, size-class free lists
Scheduler	Priority queue, memory reservation
Transfer Engine	Separate H2D/D2H streams, priority
Kernel Dispatch	Per-stream limits, lifecycle tracking
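
To make "size-class free lists" concrete, the following pure-Python sketch shows the general technique. It is not the Rust implementation: the power-of-two rounding policy, the 256-byte minimum class, and the `FreeListPool` class are all assumptions chosen for illustration, and block ids stand in for device pointers.

```python
from collections import defaultdict


def size_class(nbytes: int, min_class: int = 256) -> int:
    """Round a request up to the next power-of-two size class (assumed policy)."""
    size = min_class
    while size < nbytes:
        size *= 2
    return size


class FreeListPool:
    """Toy pool that reuses freed blocks of the same size class before creating new ones."""

    def __init__(self, quota: int):
        self.quota = quota
        self.used = 0                  # bytes currently handed out
        self.free = defaultdict(list)  # size class -> reusable block ids
        self.next_id = 0

    def allocate(self, nbytes: int):
        cls = size_class(nbytes)
        if self.used + cls > self.quota:
            raise MemoryError("quota exceeded")
        self.used += cls
        if self.free[cls]:
            return self.free[cls].pop(), cls  # reuse a cached block
        self.next_id += 1
        return self.next_id, cls              # stand-in for a fresh device allocation

    def release(self, block) -> None:
        block_id, cls = block
        self.used -= cls
        self.free[cls].append(block_id)
```

Rounding to size classes wastes some memory per block, but it lets a released block satisfy any later request in the same class without going back to the driver.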

Advanced Features (Rust)

Feature	Description
Admission Control	Deterministic admission, quota enforcement
QoS Policy	Guaranteed/Burstable/BestEffort tiers
Kernel Pacing	Bandwidth-based throttling per stream
Micro-Slicing	Kernel splitting, round-robin fairness
Pinned Memory	Page-locked host memory with pooling
Kernel Cache	PTX caching, LRU eviction, TTL
GPU Partitioning	Resource isolation, multi-tenant support
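
The kernel cache row combines two eviction rules, LRU and TTL. A minimal sketch of how they can interact is shown below, using only the standard library. The `KernelCache` class, its parameters, and the injectable clock are hypothetical and do not describe PyGPUkit's actual cache interface.

```python
import time
from collections import OrderedDict


class KernelCache:
    """Toy PTX cache: evicts the least recently used entry and expires entries by age."""

    def __init__(self, capacity: int, ttl_seconds: float, clock=time.monotonic):
        self._capacity = capacity
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, insertion time)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stamp = entry
        if self._clock() - stamp > self._ttl:
            del self._entries[key]       # expired: treat as a miss
            return None
        self._entries.move_to_end(key)   # mark as most recently used
        return value

    def put(self, key, value) -> None:
        self._entries[key] = (value, self._clock())
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry
```

A natural cache key would be a hash of the kernel source plus compile options, so that a repeated JIT request can skip NVRTC compilation; that keying scheme is also an assumption here.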

Features

Lightweight — smaller footprint than PyTorch/CuPy (not a replacement)
Modular — runtime / memory / scheduler / JIT / ops
Rust Backend — memory pool, scheduler, dispatch in Rust
GPUArray with NumPy interop
NVRTC JIT for CUDA kernels
Advanced Scheduler with memory & bandwidth guarantees
106 Rust tests for core components

Installation

pip install pygpukit

From source:

git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .

Requirements:

Python 3.10+
CUDA 11+
NVRTC available
NVIDIA GPU

Supported GPUs:

RTX 30XX series (Ampere) and above
Performance tuning is optimized for GPUs with large L2 cache (6MB+)
Older GPUs (RTX 20XX, GTX 10XX, etc.) are NOT tuned and may have suboptimal performance

Project Goals

Provide the smallest usable GPU runtime for Python
Expose GPU scheduling (bandwidth, memory, partitioning)
Make writing custom GPU kernels easy
Serve as a building block for inference engines, DSP systems, and real-time workloads

Usage Examples

Allocate Arrays

import pygpukit as gp

x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")

Basic Operations

z = gp.add(x, y)
w = gp.matmul(x, y)

CPU <-> GPU Transfer

arr = z.to_numpy()
garr = gp.from_numpy(arr)

Custom NVRTC Kernel

# CUDA C source, compiled at runtime by NVRTC
src = r"""
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;  // guard threads past the end of the array
}
"""

kernel = gp.jit(src, func="scale")
kernel(x, factor=0.5, n=x.size)
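
The indexing in the `scale` kernel can be traced on the CPU. The sketch below emulates a 1-D launch in plain Python to show why the `idx < n` guard is needed when `n` is not a multiple of the block size. The block size of 256 and the function `launch_scale` are assumptions for illustration; the README does not say how PyGPUkit chooses launch dimensions.

```python
def launch_scale(x: list, factor: float, n: int, block_dim: int = 256) -> int:
    """Emulate a 1-D grid launch of the scale kernel and return the grid size used."""
    grid_dim = (n + block_dim - 1) // block_dim  # ceiling division: enough blocks to cover n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            idx = block_idx * block_dim + thread_idx
            if idx < n:  # threads in the last block may fall past the end
                x[idx] *= factor
    return grid_dim


if __name__ == "__main__":
    data = [1.0] * 1000
    blocks = launch_scale(data, 0.5, len(data))
    print(blocks, data[:3], data[-1])
```

With 1000 elements and 256 threads per block, four blocks are launched, so 24 threads in the last block have no element to process and must be masked out by the guard.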

Rust Scheduler (v0.2)

import _pygpukit_rust as rust

# Memory Pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)

# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8*1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256*1024*1024)
result = evaluator.evaluate(task)

# GPU Partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8*1024**3))
manager.create_partition("inference", "Inference",
    rust.PartitionLimits().memory(4*1024**3).compute(0.5))

Scheduler — Kubernetes-Inspired GPU Orchestration

PyGPUkit includes an experimental scheduler that treats a single GPU as a multi-tenant compute node, similar to how Kubernetes orchestrates CPU workloads. The goal is to provide resource isolation, guarantees, and fair sharing across multiple GPU tasks.

Core Capabilities

1. GPU Memory Reservation

Tasks may request a guaranteed block of GPU memory.

Hard guarantees -> task is rejected if memory cannot be allocated
Soft guarantees -> best-effort allocation
Overcommit strategies (evict to host when pressure is high)
Reclaim policies (LRU GPUArray eviction)

Example:

task = scheduler.submit(
    fn,
    memory="512MB",
)
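
As an illustration of the hard versus soft guarantee distinction, the sketch below parses a size string such as "512MB" and applies either policy. It is a sketch under stated assumptions, not the scheduler's code: binary units (1 MB = 1024 × 1024 bytes) are assumed, and `parse_memory` and `reserve` are hypothetical helpers.

```python
import re

_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_memory(spec: str) -> int:
    """Convert a string such as '512MB' into a byte count (binary units assumed)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)\s*", spec, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized memory spec: {spec!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit.upper()])


def reserve(free_bytes: int, requested: int, hard: bool) -> int:
    """Return the number of bytes granted under a hard or soft guarantee."""
    if requested <= free_bytes:
        return requested
    if hard:
        # Hard guarantee: the task is rejected rather than given less than it asked for.
        raise MemoryError("hard reservation cannot be satisfied")
    return free_bytes  # soft guarantee: best-effort, grant what is available
```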

2. GPU Bandwidth Guarantees / Throttling

Tasks may request a specific percentage of GPU compute bandwidth.

Bandwidth control is implemented via:

Stream priority
Kernel pacing (launch intervals)
Micro-slicing large kernels
Cooperative time-quantized scheduling
Persistent dispatcher kernels (planned)

Example:

task = scheduler.submit(
    fn,
    bandwidth=0.20,   # 20% GPU compute share
)
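
One way to read "kernel pacing (launch intervals)" is that a task with compute share s that runs a kernel for d seconds must wait d / s seconds before its next launch. The sketch below implements that rule on an abstract timeline. It is an assumed model for illustration; the `KernelPacer` class is hypothetical and the real scheduler may pace differently.

```python
class KernelPacer:
    """Spaces out kernel launches so a task uses at most `share` of GPU time."""

    def __init__(self, share: float):
        if not 0.0 < share <= 1.0:
            raise ValueError("share must be in (0, 1]")
        self.share = share
        self.next_allowed = 0.0  # earliest time the next launch may start

    def schedule(self, now: float, kernel_seconds: float) -> float:
        """Return the launch time for a kernel and book its slice of the timeline."""
        start = max(now, self.next_allowed)
        # A kernel that runs for d seconds at share s occupies d / s seconds of wall time.
        self.next_allowed = start + kernel_seconds / self.share
        return start


if __name__ == "__main__":
    pacer = KernelPacer(share=0.20)
    now = 0.0
    for _ in range(3):
        start = pacer.schedule(now, kernel_seconds=0.001)
        print(f"launch at {start * 1e3:.1f} ms")
        now = start + 0.001  # the kernel finishes 1 ms after it starts
```

In this model, 1 ms kernels at a 20% share are launched 5 ms apart, so the task occupies one fifth of the timeline.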

3. Logical GPU Partitioning

PyGPUkit implements software-defined GPU slicing, similar in spirit to Kubernetes device plugin resource partitioning.

Slices may define:

Memory quota
Bandwidth share
Stream priority band
Isolation level

Useful for:

Multi-tenant inference servers
Real-time audio/DSP workloads
Background/foreground GPU task separation

4. Scheduling Policies

The scheduler supports multiple policies:

Guaranteed — exclusive reservation, strict QoS
Burstable — partial guarantees, opportunistic bandwidth
BestEffort — uses leftover GPU cycles
Priority scheduling
Deadline scheduling (planned)
Weighted fair sharing

Example:

task = scheduler.submit(
    fn,
    policy="guaranteed",
    memory="1GB",
    bandwidth=0.10,
)
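
The interaction between QoS tiers and priority scheduling can be sketched as a single priority queue. The ordering used below (tier first, then priority, then submission order) is an assumption made for illustration, as are the `PolicyQueue` class and the lowercase tier names; the README does not specify how the policies are combined.

```python
import heapq
import itertools

# Lower rank is dispatched first (assumed ordering of the three tiers).
TIER_RANK = {"guaranteed": 0, "burstable": 1, "besteffort": 2}


class PolicyQueue:
    """Orders tasks by QoS tier, then by descending priority, then by arrival."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves FIFO order

    def push(self, name: str, policy: str, priority: int = 0) -> None:
        key = (TIER_RANK[policy], -priority, next(self._counter))
        heapq.heappush(self._heap, (key, name))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)
```

Because the arrival counter is unique, two tasks never compare equal on the key, so the heap never has to compare task names.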

5. Admission Control

Before executing a task, the scheduler performs:

Resource validation
Quota check
QoS matching
Scheduling feasibility

Each task then receives one of three outcomes:

admitted
queued
rejected
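
The three outcomes can be illustrated with a small decision function. This is a simplified sketch, not the scheduler's admission logic: it folds the checks listed above into validation, a quota test, and a feasibility test, and the `Request` type, the `admit` function, and the rule that over-quota requests are rejected outright are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    memory: int       # bytes requested
    bandwidth: float  # requested compute share in (0, 1]


def admit(req: Request, quota: int, free_memory: int, free_bandwidth: float) -> str:
    """Classify a request as 'admitted', 'queued', or 'rejected'."""
    # Resource validation: malformed requests can never run.
    if req.memory <= 0 or not 0.0 < req.bandwidth <= 1.0:
        return "rejected"
    # Quota check: a request above the tenant's quota will not fit even on an idle GPU.
    if req.memory > quota:
        return "rejected"
    # Feasibility now: admit if the resources are free at this moment.
    if req.memory <= free_memory and req.bandwidth <= free_bandwidth:
        return "admitted"
    # Valid and within quota, but the GPU is busy: wait for resources to free up.
    return "queued"
```

Keeping the function pure (inputs in, decision out) is what makes admission deterministic: the same request against the same resource state always yields the same answer.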

6. Monitoring & Introspection

PyGPUkit exposes live metrics:

Memory usage per task
SM occupancy and GPU utilization
Throttling / pacing logs
Queue position / execution state
Reclaim/eviction count

Example:

stats = scheduler.stats(task_id)

7. Soft Isolation Model

Although this is not OS-level isolation, each GPU task is given:

Dedicated stream groups
Guaranteed memory pools
Kernel pacing to enforce bandwidth
Optional sandboxed GPUArray region

This provides practical multi-tenant safety without MIG/MPS.

Project Structure

PyGPUkit/
  src/pygpukit/    # Python API (NumPy-compatible)
  native/          # C++ backend (CUDA Driver/Runtime/NVRTC)
  rust/            # Rust backend (memory pool, scheduler, dispatch)
    pygpukit-core/   # Pure Rust core logic
    pygpukit-python/ # PyO3 bindings
  examples/        # Demo scripts
  tests/           # Test suite

Roadmap

v0.1 — v0.2.3 (Released)

Version	Highlights
v0.1	GPUArray, NVRTC JIT, add/mul/matmul, wheels
v0.2.0	Rust scheduler (QoS, admission control, partitioning), memory pool (LRU), kernel cache, 106 Rust tests
v0.2.1	API stabilization, error propagation
v0.2.2	Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32
v0.2.3	TF32 TensorCore (PTX mma.sync), 27.5 TFLOPS

v0.2.4 — Benchmark & Reliability Phase

Actual PyTorch/NumPy comparison benchmarks
Kernel cache LRU completion
Driver-only mode stabilization
Windows/Linux full support
Large GPU memory test (16GB continuous alloc/free)

v0.2.5 — Distributed Phase

Multi-GPU Detection
NCCL / peer-to-peer preliminary support
Scheduler multi-device support

v0.2.6 — Pre-v0.3 Finalization

Full API review
Backward compatibility policy
JIT build options, safety measures, env vars cleanup
Documentation

v0.3 (Planned)

Triton optional backend
Advanced ops (softmax, layernorm)
Inference-oriented plugin system
MPS/MIG integration

Contributing

Contributions and discussions are welcome! Please open Issues for feature requests, bugs, or design proposals.

License

MIT License

Acknowledgements

Inspired by:

CUDA Runtime
NVRTC
PyCUDA
CuPy
Triton

PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.


Release history

0.2.19

Jan 1, 2026

0.2.18

Dec 30, 2025

0.2.17

Dec 28, 2025

0.2.16

Dec 28, 2025

0.2.15

Dec 26, 2025

0.2.14

Dec 23, 2025

0.2.13

Dec 23, 2025

0.2.12

Dec 22, 2025

0.2.11

Dec 22, 2025

0.2.10

Dec 18, 2025

0.2.9

Dec 16, 2025

0.2.8

Dec 15, 2025

0.2.7

Dec 15, 2025

0.2.6

Dec 15, 2025

0.2.5

Dec 15, 2025

0.2.4

Dec 14, 2025

0.2.3

Dec 14, 2025

0.2.2

Dec 13, 2025

0.2.0

Dec 12, 2025

0.1.3

Dec 12, 2025

0.1.1

Dec 12, 2025

0.1.0

Dec 12, 2025

Download files

Source Distribution

pygpukit-0.2.3.tar.gz (190.4 kB)

Uploaded Dec 14, 2025; Source

Built Distributions

pygpukit-0.2.3-cp312-cp312-win_amd64.whl (882.9 kB)

Uploaded Dec 14, 2025; CPython 3.12; Windows x86-64

pygpukit-0.2.3-cp312-cp312-manylinux_2_34_x86_64.manylinux_2_35_x86_64.whl (913.8 kB)

Uploaded Dec 14, 2025; CPython 3.12; manylinux: glibc 2.34+ x86-64; manylinux: glibc 2.35+ x86-64

File details

Details for the file pygpukit-0.2.3.tar.gz.

File metadata

Download URL: pygpukit-0.2.3.tar.gz
Upload date: Dec 14, 2025
Size: 190.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pygpukit-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`2ff478259e75033f7006174f33f4c4e01b46d825711b114ac6880a5c52090b72`
MD5	`a1ddced3463e99987f352645cdb8db2f`
BLAKE2b-256	`11a7a4f06e03ce80042b60ad9fa341a32a0ec147b07fad5ad09b55179e15be3f`

Provenance

The following attestation bundles were made for pygpukit-0.2.3.tar.gz:

Publisher: release.yml on m96-chan/PyGPUkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pygpukit-0.2.3.tar.gz
- Subject digest: 2ff478259e75033f7006174f33f4c4e01b46d825711b114ac6880a5c52090b72
- Sigstore transparency entry: 763742623
- Sigstore integration time: Dec 14, 2025
Source repository:
- Permalink: m96-chan/PyGPUkit@04cfa536ca03bc6e9a394fbe73a36c6fd41df509
- Branch / Tag: refs/tags/v0.2.3
- Owner: https://github.com/m96-chan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@04cfa536ca03bc6e9a394fbe73a36c6fd41df509
- Trigger Event: push

File details

Details for the file pygpukit-0.2.3-cp312-cp312-win_amd64.whl.

File metadata

Download URL: pygpukit-0.2.3-cp312-cp312-win_amd64.whl
Upload date: Dec 14, 2025
Size: 882.9 kB
Tags: CPython 3.12, Windows x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pygpukit-0.2.3-cp312-cp312-win_amd64.whl
Algorithm	Hash digest
SHA256	`0d7cc3670f225d9fb46031faf27a23aeb2ef7dfad344b61df6a7ed35e189548e`
MD5	`c41fd2f3d60c38b9cb845f0123bc05b6`
BLAKE2b-256	`b51fcbda4057c061183662b13fbf5350bec998224a81d0548fa1214ab31ea616`

Provenance

The following attestation bundles were made for pygpukit-0.2.3-cp312-cp312-win_amd64.whl:

Publisher: release.yml on m96-chan/PyGPUkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pygpukit-0.2.3-cp312-cp312-win_amd64.whl
- Subject digest: 0d7cc3670f225d9fb46031faf27a23aeb2ef7dfad344b61df6a7ed35e189548e
- Sigstore transparency entry: 763742626
- Sigstore integration time: Dec 14, 2025
Source repository:
- Permalink: m96-chan/PyGPUkit@04cfa536ca03bc6e9a394fbe73a36c6fd41df509
- Branch / Tag: refs/tags/v0.2.3
- Owner: https://github.com/m96-chan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@04cfa536ca03bc6e9a394fbe73a36c6fd41df509
- Trigger Event: push

File details

Details for the file pygpukit-0.2.3-cp312-cp312-manylinux_2_34_x86_64.manylinux_2_35_x86_64.whl.

File metadata

Download URL: pygpukit-0.2.3-cp312-cp312-manylinux_2_34_x86_64.manylinux_2_35_x86_64.whl
Upload date: Dec 14, 2025
Size: 913.8 kB
Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64, manylinux: glibc 2.35+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pygpukit-0.2.3-cp312-cp312-manylinux_2_34_x86_64.manylinux_2_35_x86_64.whl
Algorithm	Hash digest
SHA256	`68359086f29e360a0f5adcaf136f766e6d5ef2eec9310174e223dc553bc04bd2`
MD5	`049433b60194d9d74bffbbb9e1e2cebf`
BLAKE2b-256	`cd62ff5a61bc1cb3e59f2a8355847ca74c5a7f23d47920c9f38147397a2ae177`

Provenance

The following attestation bundles were made for pygpukit-0.2.3-cp312-cp312-manylinux_2_34_x86_64.manylinux_2_35_x86_64.whl:

Publisher: release.yml on m96-chan/PyGPUkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pygpukit-0.2.3-cp312-cp312-manylinux_2_34_x86_64.manylinux_2_35_x86_64.whl
- Subject digest: 68359086f29e360a0f5adcaf136f766e6d5ef2eec9310174e223dc553bc04bd2
- Sigstore transparency entry: 763742627
- Sigstore integration time: Dec 14, 2025
Source repository:
- Permalink: m96-chan/PyGPUkit@04cfa536ca03bc6e9a394fbe73a36c6fd41df509
- Branch / Tag: refs/tags/v0.2.3
- Owner: https://github.com/m96-chan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@04cfa536ca03bc6e9a394fbe73a36c6fd41df509
- Trigger Event: push
