A lightweight GPU runtime for Python with a Rust-powered scheduler, NVRTC JIT compilation, and a NumPy-like API

PyGPUkit — Lightweight GPU Runtime for Python

A minimal, modular GPU runtime with a Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.


Overview

PyGPUkit is a lightweight GPU runtime for Python that provides:

  • Rust-powered scheduler with admission control, QoS, and resource partitioning
  • NVRTC-based JIT kernel compilation
  • A NumPy-like GPUArray type
  • Kubernetes-inspired GPU scheduling (bandwidth + memory guarantees)
  • Extensible operator set (add/mul/matmul, custom kernels)
  • Minimal dependencies and embeddable runtime

PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and ideal for research, inference tooling, DSP, and real-time systems.


v0.2 Features (NEW)

Core Infrastructure (Rust)

Feature          Description
Memory Pool      LRU eviction, size-class free lists
Scheduler        Priority queue, memory reservation
Transfer Engine  Separate H2D/D2H streams, priority
Kernel Dispatch  Per-stream limits, lifecycle tracking

Advanced Features (Rust)

Feature            Description
Admission Control  Deterministic admission, quota enforcement
QoS Policy         Guaranteed/Burstable/BestEffort tiers
Kernel Pacing      Bandwidth-based throttling per stream
Micro-Slicing      Kernel splitting, round-robin fairness
Pinned Memory      Page-locked host memory with pooling
Kernel Cache       PTX caching, LRU eviction, TTL
GPU Partitioning   Resource isolation, multi-tenant support
Tiled Matmul       Shared memory + double buffering

Performance (RTX 3090 Ti)

Matrix Size  Performance  vs NumPy
512x512      1262 GFLOPS  11.6x
1024x1024    1350 GFLOPS  2.2x
2048x2048    4417 GFLOPS  6.1x
4096x4096    6555 GFLOPS  7.9x
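
Throughput figures like these are commonly derived by counting 2·N³ floating-point operations for an N×N matmul and dividing by the elapsed time. The source does not say how its numbers were measured, so the sketch below simply assumes that convention; both helper names are hypothetical and not part of PyGPUkit.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """GFLOPS for an n x n matmul, assuming 2 * n**3 floating-point operations."""
    return 2 * n**3 / seconds / 1e9


def seconds_at(n: int, gflops: float) -> float:
    """Elapsed time implied by a GFLOPS figure under the same assumption."""
    return 2 * n**3 / (gflops * 1e9)


# Time per 4096x4096 matmul implied by the 6555 GFLOPS row above
t = seconds_at(4096, 6555)
print(f"{t * 1e3:.1f} ms")
```

Under this assumption, the 4096x4096 row corresponds to roughly 21 ms per multiplication.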

Features

  • Lightweight — no PyTorch/CuPy overhead
  • Modular — runtime / memory / scheduler / JIT / ops
  • Rust Backend — memory pool, scheduler, dispatch in Rust
  • GPUArray with NumPy interop
  • NVRTC JIT for CUDA kernels
  • Advanced Scheduler with memory & bandwidth guarantees
  • 106 Rust tests for core components

Installation

pip install pygpukit

From source:

git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .

Requirements:

  • Python 3.10+
  • CUDA 11+
  • NVRTC available
  • NVIDIA GPU

Supported GPUs:

  • RTX 30XX series (Ampere) and newer
  • Performance is tuned for GPUs with a large L2 cache (6 MB or more)
  • Older GPUs (RTX 20XX, GTX 10XX, etc.) are NOT tuned and may perform suboptimally

Project Goals

  1. Provide the smallest usable GPU runtime for Python
  2. Expose GPU scheduling (bandwidth, memory, partitioning)
  3. Make writing custom GPU kernels easy
  4. Serve as a building block for inference engines, DSP systems, and real-time workloads

Usage Examples

Allocate Arrays

import pygpukit as gp

x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")

Basic Operations

z = gp.add(x, y)
w = gp.matmul(x, y)

CPU <-> GPU Transfer

arr = z.to_numpy()
garr = gp.from_numpy(arr)

Custom NVRTC Kernel

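# CUDA C source, compiled at runtime by NVRTC
src = r'''
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;
}
'''

# Compile the "scale" entry point, then launch it on the GPUArray x
kernel = gp.jit(src, func="scale")
kernel(x, factor=0.5, n=x.size)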
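
The `if (idx < n)` guard in the kernel matters because a launch normally rounds the number of blocks up, so the last block may contain threads past the end of the array. The pure-Python sketch below emulates that index arithmetic on the CPU. The block size of 256 and both function names are assumptions for illustration; the source does not state how `gp.jit` kernels choose their launch dimensions.

```python
def grid_size(n: int, threads_per_block: int = 256) -> int:
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + threads_per_block - 1) // threads_per_block


def scale_reference(x: list[float], factor: float,
                    threads_per_block: int = 256) -> list[float]:
    """CPU emulation of the `scale` kernel: one inner iteration per GPU thread."""
    n = len(x)
    out = list(x)
    for block_idx in range(grid_size(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            idx = block_idx * threads_per_block + thread_idx
            if idx < n:  # same bounds guard as the kernel
                out[idx] *= factor
    return out


print(grid_size(1000))  # 4 blocks of 256 threads cover 1000 elements
```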

Rust Scheduler (v0.2)

import _pygpukit_rust as rust

# Memory Pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)

# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8*1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256*1024*1024)
result = evaluator.evaluate(task)

# GPU Partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8*1024**3))
manager.create_partition("inference", "Inference",
    rust.PartitionLimits().memory(4*1024**3).compute(0.5))
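
To illustrate what a quota combined with LRU eviction means, here is a toy pool in pure Python. It only tracks block sizes and is not the Rust `MemoryPool`; the class and its methods are hypothetical, and the real pool's eviction behavior may differ.

```python
from collections import OrderedDict


class LruPool:
    """Toy pool: tracks block sizes under a byte quota, evicting the least recently used."""

    def __init__(self, quota: int) -> None:
        self.quota = quota
        self.used = 0
        self._blocks: OrderedDict[int, int] = OrderedDict()  # block id -> size
        self._next_id = 0

    def allocate(self, size: int) -> int:
        if size > self.quota:
            raise MemoryError("request exceeds pool quota")
        # Evict the oldest blocks until the new one fits under the quota
        while self.used + size > self.quota:
            _, evicted_size = self._blocks.popitem(last=False)
            self.used -= evicted_size
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = size
        self.used += size
        return block_id

    def touch(self, block_id: int) -> None:
        """Mark a block as recently used so it is evicted last."""
        self._blocks.move_to_end(block_id)

    def __contains__(self, block_id: int) -> bool:
        return block_id in self._blocks


pool = LruPool(quota=100)
a = pool.allocate(40)
b = pool.allocate(40)
pool.touch(a)          # a is now more recent than b
c = pool.allocate(40)  # does not fit, so b (least recently used) is evicted
```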

Scheduler — Kubernetes-Inspired GPU Orchestration

PyGPUkit includes an experimental scheduler that treats a single GPU as a multi-tenant compute node, similar to how Kubernetes orchestrates CPU workloads. The goal is to provide resource isolation, guarantees, and fair sharing across multiple GPU tasks.

Core Capabilities


1. GPU Memory Reservation

Tasks may request a guaranteed block of GPU memory.

  • Hard guarantees -> task is rejected if memory cannot be allocated
  • Soft guarantees -> best-effort allocation
  • Overcommit strategies (evict to host when pressure is high)
  • Reclaim policies (LRU GPUArray eviction)

Example:

task = scheduler.submit(
    fn,
    memory="512MB",
)
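
The difference between hard and soft guarantees can be sketched as follows. This is an illustration only: the helper names are hypothetical, binary units are assumed for strings such as "512MB", and "best effort" is interpreted here as granting whatever is free, which the source does not specify.

```python
import re

_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text: str) -> int:
    """Convert a string such as "512MB" to bytes (binary units assumed)."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", text.upper())
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2)])


def reserve(free_bytes: int, request: str, hard: bool) -> int:
    """Return the number of bytes granted for a reservation request."""
    want = parse_size(request)
    if want <= free_bytes:
        return want
    if hard:
        # Hard guarantee: reject the task rather than under-allocate
        raise MemoryError(f"cannot reserve {request}")
    # Soft guarantee: grant what is available (assumed interpretation)
    return free_bytes


granted = reserve(free_bytes=1024**3, request="512MB", hard=True)
```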

2. GPU Bandwidth Guarantees / Throttling

Tasks may request a specific percentage of GPU compute bandwidth.

Bandwidth control is implemented via:

  • Stream priority
  • Kernel pacing (launch intervals)
  • Micro-slicing large kernels
  • Cooperative time-quantized scheduling
  • Persistent dispatcher kernels (planned)

Example:

task = scheduler.submit(
    fn,
    bandwidth=0.20,   # 20% GPU compute share
)
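
One way to read "kernel pacing (launch intervals)": if a kernel runs for d seconds and the task is entitled to a share s of the GPU, spacing launches at least d / s apart keeps the task's busy time at or below s of wall-clock time. The sketch below shows that arithmetic; the function names are hypothetical and the actual pacing engine may use a different rule.

```python
def launch_interval(kernel_seconds: float, share: float) -> float:
    """Minimum spacing between launch starts so busy time / wall time <= share."""
    if not 0.0 < share <= 1.0:
        raise ValueError("share must be in (0, 1]")
    return kernel_seconds / share


def paced_start_times(kernel_seconds: float, share: float, count: int) -> list[float]:
    """Start times (in seconds) for `count` launches paced to the given share."""
    interval = launch_interval(kernel_seconds, share)
    return [i * interval for i in range(count)]


# A 2 ms kernel limited to a 20% share is launched about every 10 ms
starts = paced_start_times(kernel_seconds=0.002, share=0.20, count=3)
```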

3. Logical GPU Partitioning

PyGPUkit implements software-defined GPU slicing, similar in spirit to Kubernetes device plugin resource partitioning.

Slices may define:

  • Memory quota
  • Bandwidth share
  • Stream priority band
  • Isolation level

Useful for:

  • Multi-tenant inference servers
  • Real-time audio/DSP workloads
  • Background/foreground GPU task separation
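
As a rough model of the slice attributes listed above, a slice can be represented as a small record, with a check that the slices together do not oversubscribe the device. The `Slice` type, its field names, and `fits` are hypothetical and are not the `PartitionManager` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    name: str
    memory_quota: int          # bytes
    bandwidth_share: float     # fraction of GPU compute, 0..1
    priority_band: int = 0     # stream priority band
    isolation: str = "soft"    # isolation level


def fits(slices: list[Slice], total_memory: int) -> bool:
    """True if the slices together stay within device memory and 100% bandwidth."""
    memory = sum(s.memory_quota for s in slices)
    bandwidth = sum(s.bandwidth_share for s in slices)
    return memory <= total_memory and bandwidth <= 1.0


GIB = 1024**3
slices = [
    Slice("inference", memory_quota=4 * GIB, bandwidth_share=0.5, priority_band=1),
    Slice("audio", memory_quota=1 * GIB, bandwidth_share=0.25, priority_band=2),
]
ok = fits(slices, total_memory=8 * GIB)
```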

4. Scheduling Policies

The scheduler supports multiple policies:

  • Guaranteed — exclusive reservation, strict QoS
  • Burstable — partial guarantees, opportunistic bandwidth
  • BestEffort — uses leftover GPU cycles
  • Priority scheduling
  • Deadline scheduling (planned)
  • Weighted fair sharing

Example:

task = scheduler.submit(
    fn,
    policy="guaranteed",
    memory="1GB",
    bandwidth=0.10,
)
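
A minimal sketch of how the three QoS tiers could order a run queue, with priority breaking ties inside a tier and submission order breaking ties after that. The `Tier` and `RunQueue` names are hypothetical; the Rust scheduler's actual ordering rules are not described in this document.

```python
import heapq
from enum import IntEnum
from itertools import count


class Tier(IntEnum):
    GUARANTEED = 0
    BURSTABLE = 1
    BEST_EFFORT = 2


class RunQueue:
    """Pops tasks by tier first, then by higher priority, then FIFO."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, int, str]] = []
        self._seq = count()

    def push(self, name: str, tier: Tier, priority: int = 0) -> None:
        # Smaller tuples pop first, so priority is negated
        heapq.heappush(self._heap, (int(tier), -priority, next(self._seq), name))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[3]


queue = RunQueue()
queue.push("background", Tier.BEST_EFFORT)
queue.push("realtime", Tier.GUARANTEED)
queue.push("batch", Tier.BURSTABLE, priority=5)
order = [queue.pop() for _ in range(3)]
```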

5. Admission Control

Before executing a task, the scheduler performs:

  • Resource validation
  • Quota check
  • QoS matching
  • Scheduling feasibility

The outcome is one of:

  • admitted
  • queued
  • rejected
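
The three outcomes can be illustrated with a memory-only decision function. This is a simplified sketch under assumed rules (a request above the quota can never run, a request above current free memory must wait); the names are hypothetical and the real checks also cover QoS matching and scheduling feasibility.

```python
from enum import Enum


class Decision(Enum):
    ADMITTED = "admitted"
    QUEUED = "queued"
    REJECTED = "rejected"


def admit(request_bytes: int, quota_bytes: int, free_bytes: int) -> Decision:
    """Decide admission from the memory request alone."""
    if request_bytes > quota_bytes:
        return Decision.REJECTED  # exceeds quota, can never be satisfied
    if request_bytes > free_bytes:
        return Decision.QUEUED    # within quota, wait for memory to free up
    return Decision.ADMITTED


MIB = 1024**2
decision = admit(request_bytes=256 * MIB, quota_bytes=512 * MIB, free_bytes=128 * MIB)
```

Because the function depends only on its inputs, the same request against the same state always yields the same decision.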

6. Monitoring & Introspection

PyGPUkit exposes live metrics:

  • Memory usage per task
  • SM occupancy and GPU utilization
  • Throttling / pacing logs
  • Queue position / execution state
  • Reclaim/eviction count

Example:

stats = scheduler.stats(task_id)

7. Soft Isolation Model

Although this is not OS-level isolation, each GPU task is given:

  • Dedicated stream groups
  • Guaranteed memory pools
  • Kernel pacing to enforce bandwidth
  • Optional sandboxed GPUArray region

This provides practical multi-tenant safety without MIG/MPS.


Project Structure

PyGPUkit/
  src/pygpukit/    # Python API (NumPy-compatible)
  native/          # C++ backend (CUDA Driver/Runtime/NVRTC)
  rust/            # Rust backend (memory pool, scheduler, dispatch)
    pygpukit-core/   # Pure Rust core logic
    pygpukit-python/ # PyO3 bindings
  examples/        # Demo scripts
  tests/           # Test suite

Roadmap

v0.1 (Released)

  • GPUArray
  • NVRTC JIT
  • add/mul/matmul ops
  • Basic stream manager
  • Packaging + wheels

v0.2 (Released)

  • Rust Memory Pool (LRU, size-class)
  • Rust Scheduler (priority, memory reservation)
  • Rust Transfer Engine (async H2D/D2H)
  • Rust Kernel Dispatch Controller
  • Admission Control
  • QoS Policy Framework (Guaranteed/Burstable/BestEffort)
  • Kernel Pacing Engine
  • Micro-Slicing Framework
  • Pinned Memory Support
  • Kernel Cache (PTX caching)
  • GPU Partitioning
  • Tiled Matmul (shared memory)
  • 106 Rust tests

v0.3 (Planned)

  • Triton optional backend
  • Advanced ops (softmax, layernorm)
  • Inference-oriented plugin system
  • MPS/MIG integration

Contributing

Contributions and discussions are welcome! Please open an issue for feature requests, bug reports, or design proposals.


License

MIT License


Acknowledgements

Inspired by:

  • CUDA Runtime
  • NVRTC
  • PyCUDA
  • CuPy
  • Triton

PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.
