# PyGPUkit — Lightweight GPU Runtime for Python
A minimal, modular GPU runtime with Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.
## Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation, quick start, basic usage |
| API Reference | Complete API documentation with examples |
| LLM Guide | SafeTensors, GPT-2/LLaMA/Qwen3 inference |
| Performance Tuning | TF32, FP16, CUTLASS optimization |
| Scheduler Guide | Multi-LLM concurrent execution |
## Overview
PyGPUkit is a lightweight GPU runtime for Python that provides:
- Single-binary distribution — works with just GPU drivers, no CUDA Toolkit needed
- Rust-powered scheduler with admission control, QoS, and resource partitioning
- NVRTC JIT (optional) for custom kernel compilation
- A NumPy-like `GPUArray` type
- Kubernetes-inspired GPU scheduling (bandwidth + memory guarantees)
PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and ideal for research, inference tooling, DSP, and real-time systems.
Note: PyGPUkit is NOT a PyTorch/CuPy replacement—it's a lightweight runtime for custom GPU workloads where full ML frameworks are overkill.
## What's New in v0.2.15

### FP8 I/O GEMM (SM120)
Pure FP8 input/output GEMM for FP8 model inference (Llama 3.1 FP8, Qwen FP8, etc.):
| Function | Description |
|---|---|
| `matmul_fp8_fp8_sm120` | FP8 E4M3 input -> FP8 E4M3 output (unity scaling) |
| `matmul_fp8_fp8_blockwise_sm120` | FP8 with block-wise scale_A / scale_B |
| `fp8_fp8_get_scale_sizes` | Get required scale factor sizes for (M, N, K) |
| `fp8_fp8_sm120_available` | Check SM120 FP8 I/O availability |
```python
import pygpukit as gpk
import numpy as np

M, N, K = 4096, 4096, 4096

# Check availability
if gpk.fp8_fp8_sm120_available():
    # Get scale sizes for blockwise scaling
    sfa_size, sfb_size = gpk.fp8_fp8_get_scale_sizes(M, N, K)

    # FP8 E4M3 payloads are stored as uint8 (placeholder data; see
    # "uint8/int8 NumPy Support" below)
    A_fp8 = gpk.from_numpy(np.zeros((M, K), dtype=np.uint8))
    B_fp8 = gpk.from_numpy(np.zeros((K, N), dtype=np.uint8))

    # Blockwise scaled FP8 GEMM (for real FP8 models)
    scale_a = gpk.from_numpy(np.ones(sfa_size, dtype=np.float32))
    scale_b = gpk.from_numpy(np.ones(sfb_size, dtype=np.float32))
    C = gpk.matmul_fp8_fp8_blockwise_sm120(A_fp8, B_fp8, scale_a, scale_b)
```
### Pure NVF4 GEMM (446 TFLOPS)
GPU-side BF16->NVF4 quantization with 3-stage pipeline for maximum throughput:
| Matrix Size | TFLOPS | Notes |
|---|---|---|
| 8192x8192 | 320 | Branchless vectorized loads |
| 12288x12288 | 400 | 3-stage async pipeline |
| 16384x16384 | 446 | Direct write to user buffer |
### New Math Operations
Extended math operations for GPU computing:
| Category | Operations |
|---|---|
| Trigonometric | sin, cos |
| Power/Root | sqrt, rsqrt |
| Sign | abs, neg |
| Comparison | clamp, where |
| Activation | sigmoid, tanh |
| Reduction | argmax, min, sum_axis |
```python
import pygpukit as gpk
import numpy as np

x = gpk.from_numpy(np.random.rand(128, 64).astype(np.float32))
cond = gpk.from_numpy((np.random.rand(128, 64) > 0.5).astype(np.float32))  # illustrative mask

# Trigonometric
y = gpk.sin(x)
y = gpk.cos(x)

# Power operations
y = gpk.sqrt(x)
y = gpk.rsqrt(x)  # 1/sqrt(x)

# Element-wise comparison
y = gpk.clamp(x, min_val=-1.0, max_val=1.0)
y = gpk.where(cond, x, y)  # cond ? x : y

# New activations
y = gpk.sigmoid(x)
y = gpk.tanh(x)

# New reductions
idx = gpk.argmax(x)      # Index of maximum
val = gpk.min(x)         # Minimum value
y = gpk.sum_axis(x, 1)   # Sum along axis 1
```
### uint8/int8 NumPy Support

`from_numpy` now supports uint8 and int8 arrays for FP8 data handling:
```python
import pygpukit as gpk
import numpy as np

# FP8 data stored as uint8
fp8_data = np.array([...], dtype=np.uint8)
gpu_fp8 = gpk.from_numpy(fp8_data)
```
## What's New in v0.2.14

### Packaging Fixes
v0.2.13 and v0.2.14 fix wheel RECORD file issues that caused PyPI deprecation warnings.
| Version | Issue | Fix |
|---|---|---|
| v0.2.14 | Windows wheel missing `licenses/LICENSE` in RECORD | Added `-Recurse` to scan dist-info subdirectories |
| v0.2.13 | Hardcoded version in release workflow | Dynamic dist-info folder detection |
Recommended: Use v0.2.15 or later.
```bash
pip install "pygpukit>=0.2.15"
```
## What's New in v0.2.12

### GPU Audio Processing (Driver-Only)
Comprehensive audio processing operations with custom Radix-2 FFT - no cuFFT dependency.
| Category | Operations |
|---|---|
| Time-Frequency | stft, istft, griffin_lim |
| Spectral Features | spectral_centroid, spectral_bandwidth, spectral_rolloff, spectral_flatness, spectral_contrast |
| Pitch Detection | detect_pitch_yin, detect_pitch_yin_frames, autocorrelation |
| Music Analysis | cqt, chroma_stft, chroma_cqt, zero_crossing_rate |
| Source Separation | hpss, harmonic, percussive |
| Time/Pitch | time_stretch, pitch_shift |
```python
from pygpukit.ops import audio
import numpy as np

# Load audio
samples = np.random.randn(16000).astype(np.float32)  # 1 sec @ 16kHz
buf = audio.from_pcm(samples, sample_rate=16000)

# STFT -> Magnitude -> ISTFT roundtrip
stft_out = audio.stft(buf, n_fft=512, hop_length=160)
mag = audio.magnitude_spectrum(stft_out)
reconstructed = audio.griffin_lim(mag, n_iter=32)

# Spectral features
centroid = audio.spectral_centroid(mag, sample_rate=16000)
flatness = audio.spectral_flatness(mag)

# HPSS (Harmonic-Percussive Source Separation)
harmonic, percussive = audio.hpss(mag, kernel_size=17)

# Time stretch (slow down to half speed)
slow = audio.time_stretch(buf, rate=0.5)

# Pitch shift (+12 semitones = 1 octave up)
higher = audio.pitch_shift(buf, sample_rate=16000, n_steps=12)
```
### Previous Audio Features (v0.2.11)
| Feature | Description |
|---|---|
| STFT | Custom Radix-2 FFT (no cuFFT) |
| Mel Filterbank | Whisper-compatible preprocessing |
| MFCC | DCT-II based extraction |
| VAD | Voice Activity Detection |
| Streaming | Ring buffer, windowing |
## What's New in v0.2.11

### Batch Decode Support
Batch decoding enables processing multiple tokens in parallel, achieving near-linear speedup with TensorCore utilization.
| Batch Size | Per Token (us) | Throughput | Speedup |
|---|---|---|---|
| 1 | 381,303 | 2.6 tok/s | 1.00x |
| 2 | 205,030 | 4.9 tok/s | 1.86x |
| 4 | 108,521 | 9.2 tok/s | 3.51x |
| 8 | 55,845 | 17.9 tok/s | 6.83x |
### Decode Strategy Framework
Modular decode strategies for different use cases:
```python
from pygpukit.llm import DecodeM1, DecodeM1Graph, DecodeBatch, DecodeJacobi

# model: a loaded CausalTransformerModel (see the LLM sections below)

# Standard single-token decode
m1 = DecodeM1()
m1.bind(model)

# CUDA Graph accelerated decode
m1_graph = DecodeM1Graph()
m1_graph.bind(model)
m1_graph.init_graph(max_seq_len=512)

# Batch decode for high throughput
batch = DecodeBatch(batch_size=8)
batch.bind(model)
```
| Strategy | Throughput | Use Case |
|---|---|---|
| DecodeM1 | 3.2 tok/s | Simple, low memory |
| DecodeM1Graph | 2.2 tok/s | Reduced kernel launch overhead |
| DecodeBatch (batch=8) | 19.6 tok/s | High throughput |
### CUDA Graph Improvements

- Volatile reads for proper graph replay (attention, embedding, KV cache kernels)
- Separate `DecodeM1Graph` strategy for cleaner architecture
- Fixed stream handling for RoPE and SDPA operations
### Driver API Async Memory Operations
New async memory transfer functions using CUDA Driver API:
```python
from pygpukit.core import memcpy_host_to_device_async, pinned_malloc, pinned_free

# Pinned host memory for faster transfers
# (size is illustrative; device_ptr and stream come from your own setup)
size_bytes = 1 << 20
pinned_ptr = pinned_malloc(size_bytes)
memcpy_host_to_device_async(device_ptr, pinned_ptr, size_bytes, stream)
pinned_free(pinned_ptr)  # in real code, synchronize the stream before freeing
```
### CUDA 13.x Required
Starting from v0.2.15, PyGPUkit requires CUDA 13.0+ for SM120 (Blackwell) support:
| Module | CUDA Version | SM Support |
|---|---|---|
| `_pygpukit_native_cu131` | CUDA 13.1 | SM 80-120 (Blackwell) |
Note: CUDA 12.x builds have been discontinued. SM120 features (FP8 I/O GEMM, NVF4 GEMM) require CUDA 13.0+.
### RTX 5090 Support
Full support for NVIDIA Blackwell consumer GPUs (SM120) via CUDA 13.x build.
### Qwen2 Architecture Support

Added `QWEN2_SPEC` for the Qwen2/Qwen2.5 model family:
```python
from pygpukit.llm import detect_model_spec, QWEN2_SPEC

# tensor_names comes from a loaded safetensors file
spec = detect_model_spec(tensor_names)  # Auto-detects Qwen2
# Or explicitly: spec = QWEN2_SPEC
```
## What's New in v0.2.10

### Dynamic cuBLASLt Loading
cuBLASLt is now loaded dynamically at runtime, enabling true driver-only deployment. No CUDA Toolkit installation required on target machines.
| Feature | Description |
|---|---|
| Dynamic Loading | LoadLibrary/dlopen for cuBLASLt DLL |
| Descriptor Caching | GEMM descriptors cached per (M, N, K, dtype) |
| 2.67x Faster | 224 matmuls: 395ms → 148ms |
```python
import pygpukit as gk
import numpy as np

# Works with just GPU drivers - no CUDA Toolkit needed
A = gk.from_numpy(np.random.randn(512, 512).astype(np.float32))
B = gk.from_numpy(np.random.randn(512, 512).astype(np.float32))
C = A @ B  # Uses dynamically-loaded cuBLASLt for small batch sizes
```
### CUDA Graph Optimizations

- Eliminated GPU allocations in position/random buffer updates
- Direct `copy_from_numpy` for H2D transfers during graph replay
### Performance (Qwen3-8B, RTX 3090 Ti)
| Mode | Throughput |
|---|---|
| Standard decode | 1.85 tok/s |
| CUDA Graph | 2.12 tok/s |
## What's New in v0.2.9

### Unified LLM Interface

A single `CausalTransformerModel` now supports multiple architectures through the `ModelSpec` abstraction.
| Architecture | Features | Status |
|---|---|---|
| GPT-2 | LayerNorm, GELU, Position Embedding | ✅ Tested |
| LLaMA 2/3 | RMSNorm, SiLU, RoPE, GQA | ✅ Tested |
| Qwen2/2.5 | RMSNorm, SiLU, RoPE, GQA | ✅ Tested |
| Qwen3 | RMSNorm, SiLU, RoPE, GQA, QK-Norm | ✅ Tested |
```python
from pygpukit.llm import load_model_from_safetensors, detect_model_spec, load_safetensors

# Auto-detect and load any supported model
st = load_safetensors("model.safetensors")
spec = detect_model_spec(st.tensor_names)  # Returns GPT2_SPEC, LLAMA_SPEC, or QWEN3_SPEC
model = load_model_from_safetensors("model.safetensors", dtype="float16", spec=spec)

# Generate with KV-cache
input_ids = [1, 2, 3, 4]  # token IDs from your tokenizer
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    use_cache=True,  # KV-cache for efficient generation
)
```
### Hybrid Attention Execution
Automatic CPU/GPU switching for optimal performance:
| Phase | Backend | Reason |
|---|---|---|
| Prefill (seq_len > 1) | GPU SDPA | Parallelizable |
| Decode (seq_len = 1) | CPU | Avoids kernel launch overhead |
### New LLM Operations

| Operation | Description |
|---|---|
| `gpk.sdpa_causal(q, k, v)` | Scaled Dot-Product Attention with causal mask |
| `gpk.rope_inplace(x, freqs)` | Rotary Position Embedding (in-place) |
| `gpk.silu(x)` | SiLU/Swish activation |
| `gpk.rmsnorm(x, weight, eps)` | RMS Layer Normalization |
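A minimal sketch chaining these ops together. The signatures are taken from the table above; the tensor shapes and the `(heads, seq, head_dim)` attention layout are illustrative assumptions (see the API Reference for exact conventions):

```python
import pygpukit as gpk
import numpy as np

def randf(*shape):
    return gpk.from_numpy(np.random.randn(*shape).astype(np.float32))

# Attention inputs; the (heads, seq, head_dim) layout is an assumption
n_heads, seq_len, head_dim, hidden = 8, 64, 64, 512
q = randf(n_heads, seq_len, head_dim)
k = randf(n_heads, seq_len, head_dim)
v = randf(n_heads, seq_len, head_dim)
attn = gpk.sdpa_causal(q, k, v)   # causal scaled dot-product attention

# Normalization + activation on a hidden-state tensor
x = randf(seq_len, hidden)
weight = randf(hidden)
h = gpk.rmsnorm(x, weight, 1e-6)  # RMS layer normalization
h = gpk.silu(h)                   # SiLU/Swish activation
```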
### Sharded Model Support
Load large models split across multiple safetensors files:
```python
from pygpukit.llm import load_safetensors

# Automatically handles sharded models
st = load_safetensors("model.safetensors.index.json")  # Returns ShardedSafeTensorsFile
print(f"Shards: {len(st._shard_files)}, Tensors: {st.num_tensors}")
```
## What's New in v0.2.7

### CUTLASS Epilogue Fusion
Fused Linear + Bias + GELU operations using CUTLASS epilogue fusion for improved performance in transformer workloads.
```python
import pygpukit as gpk
import numpy as np

# Create tensors
batch, in_feat, out_feat = 512, 768, 3072
x = gpk.from_numpy(np.random.randn(batch, in_feat).astype(np.float32))
weight = gpk.from_numpy(np.random.randn(out_feat, in_feat).astype(np.float32))
bias = gpk.from_numpy(np.random.randn(out_feat).astype(np.float32))

# Fused linear + bias + GELU (single kernel, no intermediate memory)
output = gpk.linear_bias_gelu(x, weight, bias)
```
### Multi-SM CUTLASS Kernels
Runtime SM detection with architecture-optimized kernel variants:
| Architecture | GPU Examples | Pipeline | Features |
|---|---|---|---|
| SM80 | A100 | 4-stage | 48KB shared memory |
| SM86 | RTX 3090, RTX 3080 | 5-stage | 100KB shared memory |
| SM89 | RTX 4090, RTX 4080 | 6-stage | Ada Lovelace optimizations |
| SM90 | H100 | CUTLASS 3.x | WGMMA/TMA instructions |
| SM100/120 | Blackwell (B100, B200) | CUTLASS 3.x | Next-gen TensorCore |
Note: SM100+ (Blackwell) requires CUDA 13.x. Windows wheels include SM100/120 support.
### New Operations

| Operation | Description |
|---|---|
| `gpk.transpose(a)` | GPU-native matrix transpose |
| `gpk.bias_add_inplace(out, bias)` | In-place bias addition |
| `gpk.linear_bias_gelu(x, w, b)` | Fused linear + bias + GELU |
### API Improvements

- Complete public API exports (all operations accessible via `gpk.*`)
- Consistent snake_case naming convention
- Full docstrings for all public functions
## LLM Support
PyGPUkit includes built-in support for loading and running LLM models. See the LLM Guide for detailed documentation.
Important: PyGPUkit's core responsibility is GPU execution, not tokenization.
- The model API expects token IDs as input, not raw text
- For production tokenization, use HuggingFace tokenizers
- The built-in `Tokenizer` class is experimental and intended for demos only
```python
from pygpukit.llm import SafeTensorsFile, load_model_from_safetensors, detect_model_spec

# Load safetensors (memory-mapped, zero-copy)
st = SafeTensorsFile("model.safetensors")
print(f"Tensors: {st.num_tensors}, Size: {st.file_size / 1e9:.2f} GB")

# Load model with automatic architecture detection
spec = detect_model_spec(st.tensor_names)
model = load_model_from_safetensors("model.safetensors", dtype="float16", spec=spec)

# Generate with token IDs (use HuggingFace tokenizers for production)
input_ids = [1, 2, 3, 4]  # Your tokenizer's output
output_ids = model.generate(input_ids, max_new_tokens=32)
```
| Component | Description |
|---|---|
| `SafeTensorsFile` | Memory-mapped .safetensors loading |
| `CausalTransformerModel` | Unified model for GPT-2, LLaMA, Qwen3 |
| `load_model_from_safetensors` | Load model with auto-detection |
| `detect_model_spec` | Auto-detect model architecture |
| `Tokenizer` | Experimental BPE tokenizer (demos only) |
## What's New in v0.2.6

### CUTLASS Backend (Default)
NVIDIA CUTLASS v4.3.0 is now the default GEMM backend, delivering optimized TensorCore performance out of the box.
| Feature | Description |
|---|---|
| TF32 TensorCore | 31+ TFLOPS for FP32 inputs (automatic) |
| FP16 TensorCore | 63 TFLOPS |
| BF16 TensorCore | 63 TFLOPS |
| Zero Config | No environment variables needed |
```python
import pygpukit as gpk
import numpy as np

# CUTLASS TF32 is automatic for FP32 (31+ TFLOPS)
a = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float32))
b = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float32))
c = a @ b  # Uses CUTLASS TF32 TensorCore

# For full FP32 precision (no TF32), set:
# PYGPUKIT_NO_TF32=1
```
### Multi-LLM Concurrent Execution
Run multiple AI models (LLM, TTS, Vision) concurrently on a single GPU with independent CUDA streams and VRAM budgets.
| Feature | Description |
|---|---|
| Execution Control | User controls execution order |
| Stream Isolation | No implicit sync between streams |
| VRAM Budgeting | Safe memory sharing per model |
| Concurrent Safety | "Running simultaneously doesn't break" |
| asyncio Integration | Native Python async/await support |
Note: On a single GPU, Multi-LLM scheduling enables concurrent execution, not faster execution, for compute-bound workloads. Speedup benefits apply to I/O-bound workloads or multi-GPU setups.
```python
import asyncio
from pygpukit.scheduler import (
    create_context, context_session, GB, initialize
)

# Create execution contexts with VRAM budgets
initialize(device_id=0)
llm_ctx = create_context("llm", max_vram=4 * GB)
tts_ctx = create_context("tts", max_vram=2 * GB)

async def run_parallel():
    async with context_session(llm_ctx), context_session(tts_ctx):
        # Run models concurrently with asyncio.gather
        # (run_llm_inference / run_tts_synthesis are user-defined coroutines)
        llm_task = asyncio.create_task(run_llm_inference())
        tts_task = asyncio.create_task(run_tts_synthesis())
        text, audio = await asyncio.gather(llm_task, tts_task)
        return text, audio

result = asyncio.run(run_parallel())
```
### FP16/BF16 TensorCore (via CUTLASS)
| Feature | Description |
|---|---|
| FP16 TensorCore | 63 TFLOPS (automatic via CUTLASS) |
| BF16 TensorCore | 63 TFLOPS (automatic via CUTLASS) |
| FP32 Accumulation | Numerical stability maintained |
```python
import pygpukit as gpk
import numpy as np

# FP16 TensorCore matmul (63 TFLOPS on RTX 3090 Ti)
# No environment variable needed - CUTLASS is automatic
a = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float16))
b = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float16))
c = a @ b  # Uses CUTLASS TensorCore
```
Note: CUTLASS requires matrix dimensions divisible by 16.
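For shapes that are not 16-aligned, one portable workaround is to zero-pad on the host before upload and crop afterwards. A sketch in plain NumPy (whether PyGPUkit falls back to a non-CUTLASS kernel for such shapes is not specified here):

```python
import numpy as np
import pygpukit as gpk

def pad_to_16(a: np.ndarray) -> np.ndarray:
    """Zero-pad a 2-D matrix so both dimensions are multiples of 16."""
    rows = (-a.shape[0]) % 16
    cols = (-a.shape[1]) % 16
    return np.pad(a, ((0, rows), (0, cols)))

a = np.random.randn(1000, 1000).astype(np.float16)
b = np.random.randn(1000, 1000).astype(np.float16)

# Zero padding along K contributes nothing to the product, so cropping
# the result back to 1000x1000 recovers the exact un-padded GEMM.
c_gpu = gpk.from_numpy(pad_to_16(a)) @ gpk.from_numpy(pad_to_16(b))
c = c_gpu.to_numpy()[:1000, :1000]
```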
## What's New in v0.2.5

### FP16 / BF16 Support
| Feature | Description |
|---|---|
| FP16 (float16) | Half-precision floating point |
| BF16 (bfloat16) | Brain floating point (better dynamic range) |
| FP32 Accumulation | Numerical stability via FP32 intermediate |
| Type Conversion | astype() for seamless dtype conversion |
```python
import pygpukit as gpk
import numpy as np

# FP16 operations
a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
b = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
c = a @ b  # FP16 matmul

# BF16 operations
arr = np.random.randn(1024, 1024).astype(np.float32)
a_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
b_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
c_bf16 = a_bf16 @ b_bf16             # BF16 matmul
result = c_bf16.astype(gpk.float32)  # Convert back to FP32
```
### Reduction Operations

| Operation | Description |
|---|---|
| `gpk.sum(a)` | Sum of all elements |
| `gpk.mean(a)` | Mean of all elements |
| `gpk.max(a)` | Maximum element |
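For example (whether these return Python scalars or 0-d arrays is left to the API Reference):

```python
import pygpukit as gpk
import numpy as np

x = gpk.from_numpy(np.arange(6, dtype=np.float32).reshape(2, 3))
total = gpk.sum(x)  # 0 + 1 + ... + 5 = 15
avg = gpk.mean(x)   # 15 / 6 = 2.5
top = gpk.max(x)    # 5
```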
### Operator Overloads
```python
c = a + b  # Element-wise add
c = a - b  # Element-wise subtract
c = a * b  # Element-wise multiply
c = a / b  # Element-wise divide
c = a @ b  # Matrix multiplication
```
## What's New in v0.2.4

### Single-Binary Distribution
| Feature | Description |
|---|---|
| Driver-only mode | Only nvcuda.dll (GPU driver) required |
| Dynamic NVRTC | JIT loaded at runtime, optional |
| No cudart dependency | Eliminated CUDA Runtime dependency |
| Smaller wheel | No bundled DLLs |
```python
import pygpukit as gp

# Works with just GPU drivers!
print(f"CUDA: {gp.is_cuda_available()}")     # True (if GPU driver installed)
print(f"NVRTC: {gp.is_nvrtc_available()}")   # True (if CUDA Toolkit installed)
print(f"NVRTC Path: {gp.get_nvrtc_path()}")  # Path to NVRTC DLL (if available)
```
### TF32 TensorCore GEMM
| Feature | Description |
|---|---|
| PTX mma.sync | Direct TensorCore access via inline PTX assembly |
| cp.async Pipeline | Double-buffered async memory transfers |
| TF32 Precision | 19-bit mantissa (vs FP32's 23-bit), ~0.1% per-op error |
| SM 80+ Required | Ampere architecture (RTX 30XX+) required |
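TF32 is the default path for FP32 inputs. To trade throughput for full FP32 precision, set `PYGPUKIT_NO_TF32=1` (the variable is documented under Performance below; that it must be set before the first kernel launch is an assumption in this sketch):

```python
import os
os.environ["PYGPUKIT_NO_TF32"] = "1"  # assumption: read before kernel selection

import pygpukit as gpk
import numpy as np

a = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float32))
b = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float32))
c = a @ b  # full-precision FP32 GEMM (18 vs 31 TFLOPS TF32, per the tables below)
```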
## Performance

### Benchmark Comparison (RTX 3090 Ti, 8192×8192)
| Library | FP32 | TF32 | FP16 | BF16 | Requirements |
|---|---|---|---|---|---|
| NumPy (OpenBLAS) | ~0.8 TFLOPS | — | — | — | CPU only |
| cuBLAS | ~21 TFLOPS | ~59 TFLOPS | ~75 TFLOPS | ~83 TFLOPS | CUDA Toolkit |
| PyGPUkit (CUTLASS) | 18 TFLOPS | 31 TFLOPS | 63 TFLOPS | 63 TFLOPS | GPU drivers only |
Built-in matmul kernels are pre-compiled. Driver-Only and Full (JIT) modes have identical matmul performance. JIT is only needed for custom kernels.
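Numbers like those above can be roughly reproduced with a wall-clock sketch. This assumes `to_numpy()` forces completion of the pending GEMM via the device-to-host copy (PyGPUkit's explicit synchronization API, if any, is not shown in this README), so the result slightly understates pure GEMM throughput:

```python
import time
import numpy as np
import pygpukit as gpk

n = 8192
a = gpk.from_numpy(np.random.randn(n, n).astype(np.float32))
b = gpk.from_numpy(np.random.randn(n, n).astype(np.float32))

(a @ b).to_numpy()  # warmup; the D2H copy also forces completion
t0 = time.perf_counter()
(a @ b).to_numpy()
dt = time.perf_counter() - t0
print(f"{2 * n**3 / dt / 1e12:.1f} TFLOPS (includes D2H copy overhead)")
```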
### PyGPUkit Performance by Matrix Size
| Matrix Size | FP32 (NO_TF32) | TF32 (CUTLASS) | FP16 (CUTLASS) | BF16 (CUTLASS) |
|---|---|---|---|---|
| 2048×2048 | 9.6 TFLOPS | 13 TFLOPS | 15 TFLOPS | 21 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22 TFLOPS | 44 TFLOPS | 44 TFLOPS |
| 8192×8192 | 18 TFLOPS | 31 TFLOPS | 63 TFLOPS | 63 TFLOPS |
Note: CUTLASS is automatic for compatible sizes (16-aligned). Use `PYGPUKIT_NO_TF32=1` for full FP32 precision.
### GEMV Performance (RTX 5090, SM120a)
For LLM decode (M=1), custom GEMV kernels significantly outperform cuBLASLt:
| Model Layer | K | N | cuBLASLt | BF16 GEMV | NVF4 GEMV | Memory |
|---|---|---|---|---|---|---|
| Qwen-7B hidden | 4096 | 4096 | 413us | 97us | 152us | 73% less |
| Qwen-7B MLP | 4096 | 11008 | 418us | 96us | 153us | 73% less |
| Qwen-72B hidden | 8192 | 8192 | 799us | 266us | 265us | 73% less |
| Qwen-72B MLP | 8192 | 29568 | 1603us | 375us | 454us | 73% less |
| Kernel | Description | Use Case |
|---|---|---|
| BF16 GEMV | Custom BF16 kernel optimized for M=1 | Speed priority |
| NVF4 GEMV | 4-bit NVF4 weights with block scaling | Memory priority (73% reduction) |
Note: For large K (8192+), NVF4 matches BF16 speed while using 73% less memory. Ideal for memory-constrained LLM inference.
### NVF4-BF16 GEMM Performance (RTX 5090, SM120a)
4-bit NVF4 GEMM with BF16 I/O using CUTLASS block-scaled tensor operations:
| Matrix Size | TFLOPS | Notes |
|---|---|---|
| 4096×4096 | 68 | GPU-side quantization |
| 8192×8192 | 174 | 3-stage async pipeline |
| 16384×16384 | 316 | Direct write to user buffer |
Note: GPU-side BF16->NVF4 quantization with unit scaling. No host-device copies. Ideal for memory-bound LLM inference with 4x bandwidth reduction vs BF16.
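The memory figures are easy to sanity-check with back-of-envelope arithmetic. This assumes NVF4 stores 4 bits per weight plus one FP8 scale per 16-element block (a common NVFP4 layout; the README does not spell out the block size):

```python
bits_bf16 = 16
bits_nvf4 = 4 + 8 / 16            # 4-bit values + amortized per-block FP8 scale
print(bits_bf16 / bits_nvf4)      # ~3.6x smaller, i.e. roughly "4x bandwidth reduction"
print(1 - bits_nvf4 / bits_bf16)  # ~0.72, matching the "73% less memory" figures
```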
## Installation

```bash
pip install pygpukit
```

From source:

```bash
git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .
```
### Requirements
- Python 3.10+
- NVIDIA GPU with drivers installed
- CUDA 13.0+ (required for SM120/Blackwell features)
- Optional: CUDA Toolkit (for JIT compilation of custom kernels)
### Minimum Driver Versions (CUDA 13.x)
| Platform | Minimum Driver |
|---|---|
| Linux | 590.44.01 or later |
| Windows | 572.16 or later (Game Ready/Studio) |
Note: NVRTC (NVIDIA Runtime Compiler) is included in CUDA Toolkit. Pre-compiled GPU operations (matmul, add, mul, etc.) work with just GPU drivers.
### Supported GPUs
| Generation | Architecture | Examples | Status |
|---|---|---|---|
| Ampere | SM80-86 | A100, RTX 3090, RTX 3080 | Fully supported |
| Ada Lovelace | SM89 | RTX 4090, RTX 4080 | Fully supported |
| Hopper | SM90 | H100, H200 | Fully supported |
| Blackwell | SM100-120 | B100, B200, RTX 5090 | CUDA 13.0+ required |
| Turing/Older | SM < 80 | RTX 20XX, GTX 10XX | NOT supported |
### Runtime Modes
| Mode | Requirements | Features |
|---|---|---|
| Full JIT | GPU drivers + CUDA Toolkit | All features including custom kernels |
| Pre-compiled | GPU drivers only | Built-in ops (matmul, add, mul) |
| CPU simulation | None | Testing/development without GPU |
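The mode in effect can be probed at runtime with the availability checks from the v0.2.4 section above:

```python
import pygpukit as gp

if not gp.is_cuda_available():
    print("CPU simulation: no GPU driver detected")
elif gp.is_nvrtc_available():
    print("Full JIT: custom kernels available via gp.jit")
else:
    print("Pre-compiled: built-in ops only (matmul, add, mul, ...)")
```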
## Quick Start

### Basic Operations
```python
import pygpukit as gp

# Allocate arrays
x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")

# Operations
z = gp.add(x, y)
w = gp.matmul(x, y)

# CPU <-> GPU transfer
arr = z.to_numpy()
garr = gp.from_numpy(arr)
```
### Custom JIT Kernel (requires CUDA Toolkit)

```python
src = '''
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;
}
'''

if gp.is_nvrtc_available():
    kernel = gp.jit(src, func="scale")
    kernel(x, factor=0.5, n=x.size)  # x from the Basic Operations example above
else:
    print("JIT not available. Using pre-compiled ops.")
```
### Rust Scheduler

```python
import _pygpukit_rust as rust

# Memory Pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)

# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8 * 1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256 * 1024 * 1024)
result = evaluator.evaluate(task)

# GPU Partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8 * 1024**3))
manager.create_partition("inference", "Inference",
                         rust.PartitionLimits().memory(4 * 1024**3).compute(0.5))
```
## Features

### Core Infrastructure (Rust)
| Feature | Description |
|---|---|
| Memory Pool | LRU eviction, size-class free lists |
| Scheduler | Priority queue, memory reservation |
| Transfer Engine | Separate H2D/D2H streams, priority |
| Kernel Dispatch | Per-stream limits, lifecycle tracking |
### Advanced Scheduler
| Feature | Description |
|---|---|
| Admission Control | Deterministic admission, quota enforcement |
| QoS Policy | Guaranteed/Burstable/BestEffort tiers |
| Kernel Pacing | Bandwidth-based throttling per stream |
| GPU Partitioning | Resource isolation, multi-tenant support |
| Multi-LLM Execution | Concurrent AI model execution with stream isolation |
| asyncio Integration | Native Python async/await for concurrent inference |
## Project Goals
- Provide the smallest usable GPU runtime for Python
- Expose GPU scheduling (bandwidth, memory, partitioning)
- Make writing custom GPU kernels easy
- Serve as a building block for inference engines, DSP systems, and real-time workloads
## Project Structure

```
PyGPUkit/
  src/pygpukit/         # Python API (NumPy-compatible)
  native/               # C++ backend (CUDA Driver API, NVRTC)
  rust/                 # Rust backend (memory pool, scheduler)
    pygpukit-core/      # Pure Rust core logic
    pygpukit-python/    # PyO3 bindings
  docs/                 # Documentation guides
  examples/             # Demo scripts
  scripts/              # Build scripts, benchmarks
  tests/                # Test suite
```
## Roadmap

### Released
| Version | Highlights |
|---|---|
| v0.1 | GPUArray, NVRTC JIT, add/mul/matmul, wheels |
| v0.2.0 | Rust scheduler (QoS, partitioning), memory pool (LRU), 106 tests |
| v0.2.1 | API stabilization, error propagation |
| v0.2.2 | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
| v0.2.3 | TF32 TensorCore (PTX mma.sync), 28 TFLOPS |
| v0.2.4 | Single-binary distribution, dynamic NVRTC, driver-only mode |
| v0.2.5 | FP16/BF16 support, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |
| v0.2.6 | CUTLASS backend (31 TFLOPS TF32, 63 TFLOPS FP16/BF16), Multi-LLM concurrent execution |
| v0.2.7 | Epilogue fusion (linear+bias+gelu), Multi-SM kernels, API review |
| v0.2.8 | CUTLASS v4.3.3 update, auto-update workflow |
| v0.2.9 | Unified LLM interface (CausalTransformerModel), ModelSpec abstraction, GPT-2/LLaMA/Qwen3 support |
| v0.2.10 | Dynamic cuBLASLt loading, CUDA Graph optimizations, descriptor caching |
| v0.2.11 | Batch decode (6.8x speedup), Decode Strategy framework, Driver API async, Dual CUDA builds, RTX 5090 (SM120) |
| v0.2.12 | Advanced audio processing (ISTFT, Griffin-Lim, HPSS, CQT, pitch detection, time stretch) |
| v0.2.15 | FP8 I/O GEMM (blockwise scaling), Pure NVF4 (446 TFLOPS), New math ops (sin, cos, sqrt, rsqrt, abs, neg, clamp, where, sigmoid, tanh, argmax, min, sum_axis) |
### Planned
| Version | Goals |
|---|---|
| v0.3 | Triton backend, advanced ops (softmax), MPS/MIG |
## API Stability & Backward Compatibility

### Version Policy
- v0.2.x: Backward compatible within minor versions. New features may be added, but existing APIs remain stable.
- v0.3+: May introduce breaking changes with deprecation warnings in prior version.
### Stable Public API (v0.2.x)

All functions exported via `pygpukit.*` are part of the stable public API:
| Category | Functions |
|---|---|
| Factory | zeros, ones, empty, from_numpy |
| Elementwise | add, sub, mul, div |
| Math | exp, log, relu, gelu |
| Matrix | matmul, transpose |
| Reductions | sum, mean, max |
| Neural | layernorm, bias_add_inplace, linear_bias_gelu |
| Types | GPUArray, DataType, float32, float64, float16, bfloat16 |
| LLM | llm.SafeTensorsFile, llm.CausalTransformerModel, llm.load_model_from_safetensors |
| LLM (Experimental) | llm.Tokenizer (use HuggingFace tokenizers for production) |
### Deprecation Policy

APIs slated for removal will emit `DeprecationWarning` for at least one minor version before removal.
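In CI, the standard-library warning filter can turn these into hard failures (a sketch; the `module` pattern is an assumption about where the warnings originate):

```python
import warnings

# Escalate PyGPUkit deprecation warnings to errors during test runs
warnings.filterwarnings("error", category=DeprecationWarning, module=r"pygpukit(\..*)?")
```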
## Contributing
Contributions and discussions are welcome! Please open Issues for feature requests, bugs, or design proposals.
## License
MIT License
## Acknowledgements
Inspired by: CUDA Runtime, NVRTC, PyCUDA, CuPy, Triton
PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.