
PyGPUkit — Lightweight GPU Runtime for Python

A minimal, modular GPU runtime with Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.



Overview

PyGPUkit is a lightweight GPU runtime for Python that provides:

  • Single-binary distribution — works with just GPU drivers, no CUDA Toolkit needed
  • Rust-powered scheduler with admission control, QoS, and resource partitioning
  • NVRTC JIT (optional) for custom kernel compilation
  • A NumPy-like GPUArray type
  • Kubernetes-inspired GPU scheduling (bandwidth + memory guarantees)

PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and ideal for research, inference tooling, DSP, and real-time systems.

Note: PyGPUkit is NOT a PyTorch/CuPy replacement—it's a lightweight runtime for custom GPU workloads where full ML frameworks are overkill.


What's New in v0.2.5

FP16 / BF16 Support

| Feature | Description |
| --- | --- |
| FP16 (float16) | Half-precision floating point |
| BF16 (bfloat16) | Brain floating point (better dynamic range) |
| FP32 Accumulation | Numerical stability via FP32 intermediate |
| Type Conversion | astype() for seamless dtype conversion |
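A quick CPU-side illustration (NumPy only, no GPU needed) of why BF16's dynamic range matters: BF16 keeps FP32's 8-bit exponent, while FP16's 5-bit exponent overflows just above 65504.

```python
import numpy as np

# FP16 tops out near 65504, so 1e5 overflows to infinity.
x = np.float32(1e5)
print(np.float16(x))  # inf
# BF16's maximum is ~3.4e38 (the FP32 exponent range), so 1e5 fits easily,
# at the cost of a shorter mantissa than FP16.
```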
```python
import pygpukit as gpk
import numpy as np

# FP16 operations
a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
b = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
c = a @ b  # FP16 matmul

# BF16 operations
arr = np.random.randn(1024, 1024).astype(np.float32)
a_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
b_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
c_bf16 = a_bf16 @ b_bf16  # BF16 matmul
result = c_bf16.astype(gpk.float32)  # Convert back to FP32
```
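A minimal CPU-side sketch (NumPy only, illustrative) of why FP16 matmul benefits from FP32 accumulation: accumulating a long FP16 dot product in FP16 loses precision as the partial sums grow, while an FP32 accumulator stays close to the reference.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10000).astype(np.float16)
y = rng.standard_normal(10000).astype(np.float16)

# float64 reference over the same quantized FP16 inputs
ref = np.dot(x.astype(np.float64), y.astype(np.float64))

# FP32 accumulation (what the kernels use internally)
acc_fp32 = np.dot(x.astype(np.float32), y.astype(np.float32))

# Naive FP16 accumulation: every partial sum is rounded back to FP16
acc_fp16 = np.float16(0.0)
for xi, yi in zip(x, y):
    acc_fp16 = np.float16(acc_fp16 + xi * yi)

err_fp32 = abs(float(acc_fp32) - ref)
err_fp16 = abs(float(acc_fp16) - ref)
print(f"fp16-accumulate error {err_fp16:.3g}, fp32-accumulate error {err_fp32:.3g}")
```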

Reduction Operations

| Operation | Description |
| --- | --- |
| gpk.sum(a) | Sum of all elements |
| gpk.mean(a) | Mean of all elements |
| gpk.max(a) | Maximum element |
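Each reduction collapses the whole array to a single scalar. A CPU reference for the same semantics (gpk.sum(a) should agree with a.to_numpy().sum(), and likewise for mean and max):

```python
import numpy as np

# NumPy reference for full-array reductions to a scalar
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
print(arr.sum())   # 15.0
print(arr.mean())  # 2.5
print(arr.max())   # 5.0
```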

Operator Overloads

```python
c = a + b   # Element-wise add
c = a - b   # Element-wise subtract
c = a * b   # Element-wise multiply
c = a / b   # Element-wise divide
c = a @ b   # Matrix multiplication
```

What's New in v0.2.4

Single-Binary Distribution

| Feature | Description |
| --- | --- |
| Driver-only mode | Only nvcuda.dll (GPU driver) required |
| Dynamic NVRTC | JIT loaded at runtime, optional |
| No cudart dependency | Eliminated CUDA Runtime dependency |
| Smaller wheel | No bundled DLLs |
```python
import pygpukit as gp

# Works with just GPU drivers!
print(f"CUDA: {gp.is_cuda_available()}")      # True (if GPU driver installed)
print(f"NVRTC: {gp.is_nvrtc_available()}")    # True (if CUDA Toolkit installed)
print(f"NVRTC Path: {gp.get_nvrtc_path()}")   # Path to NVRTC DLL (if available)
```

TF32 TensorCore GEMM

| Feature | Description |
| --- | --- |
| PTX mma.sync | Direct TensorCore access via inline PTX assembly |
| cp.async pipeline | Double-buffered async memory transfers |
| TF32 precision | 10-bit mantissa (vs FP32's 23-bit), ~0.1% per-op error |
| SM 80+ required | Ampere architecture (RTX 30XX+) or newer |
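A hypothetical CPU-side sketch of what TF32's reduced mantissa costs: emulate its 10-bit mantissa by zeroing the 13 low mantissa bits of an FP32 value, then check that one rounding step stays below roughly 0.1% relative error.

```python
import numpy as np

def to_tf32(x):
    """Emulate TF32 rounding by truncating FP32's mantissa to 10 bits."""
    bits = np.float32(x).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.float32(1.2345678)
rel_err = abs(float(to_tf32(x)) - float(x)) / float(x)
print(rel_err < 2**-10)  # True: one TF32 rounding step is under ~0.1%
```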

Performance

Benchmark Comparison (RTX 3090 Ti, 8192×8192)

| Library | FP32 | TF32 | Requirements |
| --- | --- | --- | --- |
| NumPy (OpenBLAS) | ~0.8 TFLOPS | — | CPU only |
| cuBLAS | ~21 TFLOPS | ~59 TFLOPS | CUDA Toolkit |
| PyGPUkit | 16.7 TFLOPS | 29.7 TFLOPS | GPU drivers only |

Built-in matmul kernels are pre-compiled. Driver-Only and Full (JIT) modes have identical matmul performance. JIT is only needed for custom kernels.

PyGPUkit Performance by Matrix Size

| Matrix Size | FP32 | TF32 | FP16 | BF16 |
| --- | --- | --- | --- | --- |
| 2048×2048 | 9.6 TFLOPS | 13.2 TFLOPS | 2.4 TFLOPS | 2.4 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22.8 TFLOPS | 2.4 TFLOPS | 2.3 TFLOPS |
| 8192×8192 | 16.7 TFLOPS | 29.7 TFLOPS | 2.3 TFLOPS | 2.3 TFLOPS |
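How figures like these are derived: an N×N by N×N GEMM performs 2·N³ floating-point operations (one multiply and one add per product term), so TFLOPS is that count divided by the kernel time. The elapsed time below is a hypothetical value chosen only to illustrate the arithmetic.

```python
# TFLOPS = (2 * N^3 floating-point ops) / elapsed seconds / 1e12
N = 8192
flops = 2 * N**3           # 2·N³ ops for an N×N GEMM
elapsed_s = 0.0659         # hypothetical kernel time, for illustration
tflops = flops / elapsed_s / 1e12
print(round(tflops, 1))    # 16.7
```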

Note: FP16/BF16 matmul uses simple kernels with FP32 accumulation. TensorCore optimization planned for future releases (see Issue #60).


Installation

```
pip install pygpukit
```

From source:

```
git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .
```

Requirements

  • Python 3.10+
  • NVIDIA GPU with drivers installed
  • Optional: CUDA Toolkit (for JIT compilation of custom kernels)

Note: NVRTC (NVIDIA Runtime Compiler) is included in CUDA Toolkit. Pre-compiled GPU operations (matmul, add, mul, etc.) work with just GPU drivers.

Supported GPUs

  • RTX 30XX series (Ampere, SM 80+) and above
  • Older GPUs (RTX 20XX, GTX 10XX, etc.) are NOT supported (SM < 80)

Runtime Modes

| Mode | Requirements | Features |
| --- | --- | --- |
| Full JIT | GPU drivers + CUDA Toolkit | All features, including custom kernels |
| Pre-compiled | GPU drivers only | Built-in ops (matmul, add, mul) |
| CPU simulation | None | Testing/development without GPU |

Quick Start

Basic Operations

```python
import pygpukit as gp

# Allocate arrays
x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")

# Operations
z = gp.add(x, y)
w = gp.matmul(x, y)

# CPU <-> GPU transfer
arr = z.to_numpy()
garr = gp.from_numpy(arr)
```

Custom JIT Kernel (requires CUDA Toolkit)

```python
src = '''
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;
}
'''

if gp.is_nvrtc_available():
    kernel = gp.jit(src, func="scale")
    kernel(x, factor=0.5, n=x.size)
else:
    print("JIT not available. Using pre-compiled ops.")
```

Rust Scheduler

```python
import _pygpukit_rust as rust

# Memory Pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)

# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8*1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256*1024*1024)
result = evaluator.evaluate(task)

# GPU Partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8*1024**3))
manager.create_partition("inference", "Inference",
    rust.PartitionLimits().memory(4*1024**3).compute(0.5))
```

Features

Core Infrastructure (Rust)

| Feature | Description |
| --- | --- |
| Memory Pool | LRU eviction, size-class free lists |
| Scheduler | Priority queue, memory reservation |
| Transfer Engine | Separate H2D/D2H streams, priority |
| Kernel Dispatch | Per-stream limits, lifecycle tracking |
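A conceptual sketch of LRU eviction in a quota-bounded pool (pure Python, illustration only; the real pool lives in Rust and manages device memory): when an allocation would exceed the quota, the least-recently-used blocks are evicted first.

```python
from collections import OrderedDict

class LruPool:
    """Toy quota-bounded pool that evicts least-recently-used blocks."""

    def __init__(self, quota):
        self.quota = quota
        self.used = 0
        self.blocks = OrderedDict()  # block_id -> size, oldest first

    def allocate(self, block_id, size):
        # Evict LRU blocks until the new allocation fits the quota
        while self.used + size > self.quota and self.blocks:
            _, evicted_size = self.blocks.popitem(last=False)
            self.used -= evicted_size
        self.blocks[block_id] = size
        self.used += size

    def touch(self, block_id):
        self.blocks.move_to_end(block_id)  # mark as recently used

pool = LruPool(quota=100)
pool.allocate("a", 60)
pool.allocate("b", 30)
pool.touch("a")           # "a" is now most recently used
pool.allocate("c", 20)    # evicts "b" (the LRU block), not "a"
print(list(pool.blocks))  # ['a', 'c']
```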

Advanced Scheduler

| Feature | Description |
| --- | --- |
| Admission Control | Deterministic admission, quota enforcement |
| QoS Policy | Guaranteed/Burstable/BestEffort tiers |
| Kernel Pacing | Bandwidth-based throttling per stream |
| GPU Partitioning | Resource isolation, multi-tenant support |
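A conceptual sketch of quota-based admission control (pure Python, not the actual Rust API): a task is admitted only if its guaranteed memory reservation still fits within the remaining quota, so admission decisions are deterministic.

```python
class AdmissionController:
    """Toy deterministic admission control against a fixed memory quota."""

    def __init__(self, total_memory):
        self.total = total_memory
        self.reserved = 0

    def try_admit(self, request):
        if self.reserved + request > self.total:
            return False  # rejected: would exceed the quota
        self.reserved += request
        return True

ctrl = AdmissionController(total_memory=8 * 1024**3)
first = ctrl.try_admit(6 * 1024**3)   # fits within 8 GiB
second = ctrl.try_admit(4 * 1024**3)  # would push the total to 10 GiB
print(first, second)  # True False
```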

Project Goals

  1. Provide the smallest usable GPU runtime for Python
  2. Expose GPU scheduling (bandwidth, memory, partitioning)
  3. Make writing custom GPU kernels easy
  4. Serve as a building block for inference engines, DSP systems, and real-time workloads

Project Structure

```
PyGPUkit/
  src/pygpukit/    # Python API (NumPy-compatible)
  native/          # C++ backend (CUDA Driver API, NVRTC)
  rust/            # Rust backend (memory pool, scheduler)
    pygpukit-core/   # Pure Rust core logic
    pygpukit-python/ # PyO3 bindings
  examples/        # Demo scripts
  tests/           # Test suite
```

Roadmap

Released

| Version | Highlights |
| --- | --- |
| v0.1 | GPUArray, NVRTC JIT, add/mul/matmul, wheels |
| v0.2.0 | Rust scheduler (QoS, partitioning), memory pool (LRU), 106 tests |
| v0.2.1 | API stabilization, error propagation |
| v0.2.2 | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
| v0.2.3 | TF32 TensorCore (PTX mma.sync), 28 TFLOPS |
| v0.2.4 | Single-binary distribution, dynamic NVRTC, driver-only mode |
| v0.2.5 | FP16/BF16 support, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |

Planned

| Version | Goals |
| --- | --- |
| v0.2.6 | FP16/BF16 TensorCore optimization, multi-GPU detection |
| v0.2.7 | Full API review, documentation, backward compatibility |
| v0.3 | Triton backend, advanced ops (softmax, layernorm), MPS/MIG |

Contributing

Contributions and discussions are welcome! Please open Issues for feature requests, bugs, or design proposals.


License

MIT License


Acknowledgements

Inspired by: CUDA Runtime, NVRTC, PyCUDA, CuPy, Triton

PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.
