PyGPUkit — Lightweight GPU Runtime for Python
A minimal, modular GPU runtime with Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.
When GPU optimizations change your results, something is wrong.
A minimal, deterministic GPU runtime for Python.
Built for people who care about correctness, reproducibility, and real performance.
- CUDA Graph that doesn't lie
- cuBLASLt without hidden state
- FP8 / NVF4 / w8a16 done explicitly
- Rust-powered scheduler for real GPU concurrency
This is not a framework. This is a GPU runtime.
Why PyGPUkit Exists
Modern GPU stacks optimize aggressively.
Sometimes, they optimize correctness away.
PyGPUkit exists because:
- CUDA Graph replay can change numerical results
- cuBLASLt may depend on hidden workspace state
- Stream-0 synchronization hides performance bugs
- “It’s faster” often means “it’s nondeterministic”
PyGPUkit chooses:
- Explicit over implicit
- Determinism over magic
- Measurable behavior over benchmark-only claims
What PyGPUkit Is NOT
- ❌ Not a PyTorch replacement
- ❌ Not a training framework
- ❌ Not a convenience-first library
- ❌ Not safe if you ignore GPU semantics
- ❌ Not designed for "just works" expectations
PyGPUkit is for people who want to see and control what their GPU is actually doing.
Core Capabilities (TL;DR)
- 🚀 Driver-only deployment (no CUDA Toolkit required)
- 🧠 Deterministic CUDA Graph execution
- ⚙️ Explicit stream & memory control
- 🧮 FP8 / NVF4 / BF16 / TF32 done right
- 🎛️ Rust-based GPU scheduler with QoS & partitioning
- 🔊 GPU-native audio & DSP (no cuFFT dependency)
Real-World GPU Pathologies (Observed)
- Same input, different output with CUDA Graph replay
- FP8 GEMM producing correct averages but wrong tokens
- cuBLASLt performance variance across runs
- H2D stalls masked by stream-0 synchronization
All of these are reproducible.
All of them are documented.
All of them are why PyGPUkit exists.
These are not theoretical. They were all observed in production or real benchmarks.
Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation, quick start, basic usage |
| API Reference | Complete API documentation with examples |
| LLM Guide | SafeTensors, GPT-2/LLaMA/Qwen3 inference |
| Performance Tuning | TF32, FP16, CUTLASS optimization |
| Scheduler Guide | Multi-LLM concurrent execution |
What's New in v0.2.16
MoE (Mixture of Experts) Support
Full support for Mixtral-style MoE models with custom CUDA kernels:
| Component | Description |
|---|---|
| MoE Kernels | TopK routing, softmax, token permutation, gather/scatter |
| Grouped GEMM | Batched expert dispatch with per-row expert IDs |
| MoELayer | Python layer with router + expert FFN dispatch |
| MIXTRAL_SPEC | Auto-detection for Mixtral 8x7B models |
from pygpukit.llm import load_model_from_safetensors, detect_model_spec
# Auto-detect MoE model
spec = detect_model_spec(tensor_names) # Returns MIXTRAL_SPEC for MoE
model = load_model_from_safetensors("mixtral.safetensors", spec=spec)
Thinking Model Support
Qwen3 Thinking model support with <think>...</think> block parsing:
# examples/chat_cli_thinking.py
python examples/chat_cli_thinking.py --model F:/LLM/Qwen3-4B-Thinking
- Streaming output with thinking/answer separation
- /think command to toggle thinking display
- CUDA Graph support for faster decode
New GEMV Kernels (SM120)
| Kernel | A dtype | B dtype | Speedup vs BF16 / Best for |
|---|---|---|---|
| FP8/FP8 (W8A8) | FP8 E4M3 | FP8 E4M3 | 6-22x |
| NVF4/NVF4 (W4A4) | NVF4 | NVF4 | Memory priority |
| Int4 GEMV | BF16 | Int4 | Large K dimensions |
New GEMM Kernels (SM120)
| Kernel | Description |
|---|---|
| W8A16 GEMM | FP8 weight + BF16 activation (CUTLASS) |
| Int8 Native | Exact int8 via dp4a (CUDA cores) |
| Int4 via Int8 | 4-bit approximation via TensorCore |
| Grouped GEMM v2 | Per-row expert IDs for MoE |
Kernel Directory Restructure
Organized matmul kernels by {gemm|gemv}/{input}/{output}/{arch}/:
native/ops/matmul/
├── gemm/fp8/bf16/sm120/w8a16_gemm.cu
├── gemm/fp8/fp8/sm120/fp8_cutlass.cu
├── gemv/fp8/fp8/sm120/fp8_gemv.cu
├── gemv/nvf4/nvf4/sm120/nvf4_gemv.cu
└── gemv/int4/int4/sm120/int4_gemv.cu
Development Tooling
- Claude Code Skills: Build, benchmark, lint, test automation
- Subagents: kernel-reviewer, perf-analyzer, api-designer
- CONTRIBUTING.md: Contribution guidelines
- MCP Integration: Serena, Context7, Memory servers
Kernel Cleanup
Removed redundant slow kernels:
| Removed | Kept | Reason |
|---|---|---|
| FP8 GEMV basic | FP8 GEMV opt | [N,K] layout 3-9x faster |
| Int8 via FP8 | Int8 native dp4a | Exact results |
What's New in v0.2.15
Whisper ASR Module
Full GPU-accelerated Whisper speech recognition:
| Component | Description |
|---|---|
| WhisperEncoder | Conv1d stem + transformer with GPU attention |
| WhisperDecoder | Autoregressive decoder with cross-attention |
| WhisperModel | High-level API: from_pretrained(), transcribe() |
| Preprocessing | GPU mel spectrogram (30s pad/trim, normalization) |
| Streaming | transcribe_streaming() for long audio |
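A sketch of the high-level flow under stated assumptions: the import path pygpukit.asr, the checkpoint name, the audio file names, and the iteration style of transcribe_streaming() are illustrative, while from_pretrained(), transcribe(), and transcribe_streaming() are the entry points named in the table above:
from pygpukit.asr import WhisperModel  # import path is an assumption for illustration

model = WhisperModel.from_pretrained("whisper-base")   # checkpoint name illustrative
text = model.transcribe("speech.wav")                  # GPU mel spectrogram, 30s pad/trim
print(text)

# Long audio: the streaming variant processes the file incrementally
for segment in model.transcribe_streaming("long_audio.wav"):   # iteration style is an assumption
    print(segment)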
GEMV Kernels (SM120)
Optimized GEMV for LLM decode (M=1):
| Kernel | Feature | Speedup |
|---|---|---|
| BF16 GEMV | BF16x2 vectorized loads | 25-40% vs scalar |
| NVF4 GEMV | Pre-scaled LUT | 73% less bandwidth |
| Linear layer | Auto GEMV for M=1 | 1.3-2.4x vs matmul |
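A minimal sketch of what the M=1 fast path looks like from Python. The table above names the linear layer as the auto-GEMV entry point; the plain matmul call below is an illustrative stand-in, and the shapes and dtype are arbitrary:
import pygpukit as gpk

# Single-token decode: the activation is one row (M=1), the case the GEMV kernels target
x = gpk.ones((1, 4096), dtype="float16")      # activation, M=1
w = gpk.ones((4096, 14336), dtype="float16")  # MLP up-projection weight
y = gpk.matmul(x, w)                          # M=1 is where GEMV dispatch applies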
FP8 I/O GEMM (SM120)
Pure FP8 input/output GEMM for FP8 model inference (Llama 3.1 FP8, Qwen FP8, etc.):
| Function | Description |
|---|---|
| matmul_fp8_fp8_sm120 | FP8 E4M3 input -> FP8 E4M3 output (unity scaling) |
| matmul_fp8_fp8_blockwise_sm120 | FP8 with block-wise scale_A / scale_B |
| fp8_fp8_get_scale_sizes | Get required scale factor sizes for (M, N, K) |
| fp8_fp8_sm120_available | Check SM120 FP8 I/O availability |
import pygpukit as gpk
import numpy as np

# A_fp8, B_fp8: FP8 E4M3 GPUArrays (e.g. loaded as uint8 via gpk.from_numpy)
# M, N, K: example problem sizes
M, N, K = 4096, 4096, 4096

# Check availability
if gpk.fp8_fp8_sm120_available():
    # Get scale sizes for blockwise scaling
    sfa_size, sfb_size = gpk.fp8_fp8_get_scale_sizes(M, N, K)
    # Blockwise scaled FP8 GEMM (for real FP8 models)
    scale_a = gpk.from_numpy(np.ones(sfa_size, dtype=np.float32))
    scale_b = gpk.from_numpy(np.ones(sfb_size, dtype=np.float32))
    C = gpk.matmul_fp8_fp8_blockwise_sm120(A_fp8, B_fp8, scale_a, scale_b)
Pure NVF4 GEMM (446 TFLOPS)
GPU-side BF16->NVF4 quantization with 3-stage pipeline for maximum throughput:
| Matrix Size | TFLOPS | Notes |
|---|---|---|
| 8192x8192 | 261 | Branchless vectorized loads |
| 12288x12288 | 383 | 3-stage async pipeline |
| 16384x16384 | 446 | Direct write to user buffer |
New Math Operations
Extended math operations for GPU computing:
| Category | Operations |
|---|---|
| Trigonometric | sin, cos |
| Power/Root | sqrt, rsqrt |
| Sign | abs, neg |
| Comparison | clamp, where |
| Activation | sigmoid, tanh |
| Reduction | argmax, min, sum_axis |
import pygpukit as gpk
# Trigonometric
y = gpk.sin(x)
y = gpk.cos(x)
# Power operations
y = gpk.sqrt(x)
y = gpk.rsqrt(x) # 1/sqrt(x)
# Element-wise comparison
y = gpk.clamp(x, min_val=-1.0, max_val=1.0)
y = gpk.where(cond, x, y) # cond ? x : y
# New activations
y = gpk.sigmoid(x)
y = gpk.tanh(x)
# New reductions
idx = gpk.argmax(x) # Index of maximum
val = gpk.min(x) # Minimum value
y = gpk.sum_axis(x, 1) # Sum along axis
uint8/int8 NumPy Support
from_numpy now supports uint8 and int8 arrays for FP8 data handling:
# FP8 data stored as uint8
fp8_data = np.array([...], dtype=np.uint8)
gpu_fp8 = gpk.from_numpy(fp8_data)
Previous versions (v0.2.4 - v0.2.14): See CHANGELOG.md for complete release history.
LLM Support
PyGPUkit includes built-in support for loading and running LLM models. See the LLM Guide for detailed documentation.
Important: PyGPUkit's core responsibility is GPU execution, not tokenization.
- The model API expects token IDs as input, not raw text
- For production tokenization, use HuggingFace tokenizers
- The built-in Tokenizer class is experimental and intended for demos only
from pygpukit.llm import SafeTensorsFile, load_model_from_safetensors, detect_model_spec
# Load safetensors (memory-mapped, zero-copy)
st = SafeTensorsFile("model.safetensors")
print(f"Tensors: {st.num_tensors}, Size: {st.file_size / 1e9:.2f} GB")
# Load model with automatic architecture detection
spec = detect_model_spec(st.tensor_names)
model = load_model_from_safetensors("model.safetensors", dtype="float16", spec=spec)
# Generate with token IDs (use HuggingFace tokenizers for production)
input_ids = [1, 2, 3, 4] # Your tokenizer's output
output_ids = model.generate(input_ids, max_new_tokens=32)
| Component | Description |
|---|---|
| SafeTensorsFile | Memory-mapped .safetensors loading |
| CausalTransformerModel | Unified model for GPT-2, LLaMA, Qwen3 |
| load_model_from_safetensors | Load model with auto-detection |
| detect_model_spec | Auto-detect model architecture |
| Tokenizer | Experimental BPE tokenizer (demos only) |
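Since production tokenization is delegated to HuggingFace tokenizers, a typical end-to-end sketch looks like this (the tokenizers usage is an external-library illustration and the checkpoint name is arbitrary; only the PyGPUkit calls come from the API above):
from tokenizers import Tokenizer
from pygpukit.llm import SafeTensorsFile, load_model_from_safetensors, detect_model_spec

tok = Tokenizer.from_pretrained("gpt2")       # any HF tokenizer matching the checkpoint

st = SafeTensorsFile("model.safetensors")
spec = detect_model_spec(st.tensor_names)
model = load_model_from_safetensors("model.safetensors", dtype="float16", spec=spec)

input_ids = tok.encode("The quick brown fox").ids      # text -> token IDs (CPU side)
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tok.decode(output_ids))                          # token IDs -> text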
Performance
RTX 5090 Benchmark (SM120a, CUDA 13.1)
Standard Precision (8192x8192)
| Precision | TFLOPS | Notes |
|---|---|---|
| FP32 | 80 | CUDA cores |
| TF32 | 87 | TensorCore |
| FP16 | 170 | TensorCore |
| BF16 | 173 | TensorCore |
Quantized GEMM (M=8192, K=4096, N=14336)
| Format | TFLOPS | Error | Notes |
|---|---|---|---|
| FP8xFP8 | 217 | ~0.1% | CUTLASS SM120 blockwise |
| W8A16 | 50 | ~0.1% | FP8 weight, BF16 activation |
| Int8 (via FP8) | 142 | ~3.5% | TensorCore approximation |
| Int8 (dp4a) | 44 | 0% | Exact, CUDA cores |
| Int4 (via Int8) | 121 | ~0.1% | TensorCore approximation |
NVF4 (4-bit NormalFloat) GEMM
| Matrix Size | TFLOPS | Notes |
|---|---|---|
| 8192x8192 | 261 | Pre-quantized |
| 12288x12288 | 383 | 3-stage pipeline |
| 16384x16384 | 446 | Peak performance |
Note: NVF4xNVF4 achieves 4x memory bandwidth reduction vs BF16 with minimal accuracy loss.
RTX 3090 Ti Benchmark (SM86)
| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|---|---|---|---|---|
| 2048×2048 | 9.6 TFLOPS | 13 TFLOPS | 15 TFLOPS | 21 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22 TFLOPS | 44 TFLOPS | 44 TFLOPS |
| 8192×8192 | 18 TFLOPS | 31 TFLOPS | 63 TFLOPS | 63 TFLOPS |
Note: CUTLASS is automatic for compatible sizes (16-aligned). Use PYGPUKIT_NO_TF32=1 for full FP32 precision.
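A minimal sketch of forcing the full-precision FP32 path, assuming PYGPUKIT_NO_TF32 is read before the first matmul dispatch:
import os
os.environ["PYGPUKIT_NO_TF32"] = "1"   # assumption: must be set before kernels are dispatched

import pygpukit as gp

a = gp.ones((4096, 4096), dtype="float32")
b = gp.ones((4096, 4096), dtype="float32")
c = gp.matmul(a, b)                    # full FP32, no TF32 TensorCore substitution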
GEMV Performance (RTX 5090, SM120a)
For LLM decode (M=1), PyGPUkit provides custom GEMV kernels for each quantization format:
| Layer | K | N | BF16 | W8A8 | W4A16 | Int4 |
|---|---|---|---|---|---|---|
| Qwen-7B hidden | 4096 | 4096 | 65 us | 10 us | 140 us | 31 us |
| Qwen-7B MLP up | 4096 | 14336 | 125 us | 17 us | 141 us | 47 us |
| Qwen-7B MLP down | 14336 | 4096 | 399 us | 22 us | 404 us | 58 us |
| Qwen-72B hidden | 8192 | 8192 | 232 us | 21 us | 252 us | 51 us |
| Qwen-72B MLP up | 8192 | 29568 | 324 us | 146 us | 436 us | 112 us |
| Qwen-72B MLP down | 29568 | 8192 | 839 us | 170 us | 1393 us | 129 us |
| Kernel | Format | Memory vs BF16 | Best For |
|---|---|---|---|
| BF16 GEMV | A:BF16, B:BF16 | 100% | Baseline |
| W8A8 GEMV | A:FP8, B:FP8 | 50% | Speed priority (6-18x faster) |
| W4A16 GEMV | A:BF16, B:NVF4 | 25% | Memory priority |
| Int4 GEMV | A:BF16, B:Int4 | 25% | Large K dimensions |
Note: W8A8 (FP8/FP8) GEMV is 6-18x faster than BF16 for typical sizes on SM120. Int4 GEMV excels at very large K (29568+) where it matches or beats W8A8.
GEMV Quantization Trade-offs (Explicit)
Why is W4A16 faster than NVF4/NVF4 despite both using 4-bit weights?
| Kernel | A (Activation) | B (Weight) | Dequant Work | Speed |
|---|---|---|---|---|
| W4A16 | BF16 (native) | NVF4 (4-bit) | 1x (B only) | 104 us |
| NVF4/NVF4 | NVF4 (4-bit) | NVF4 (4-bit) | 2x (A + B) | 219 us |
Per Scale Block (32 elements):
| Operation | W4A16 | NVF4/NVF4 |
|---|---|---|
| Scale load | 1 (B) | 2 (A + B) |
| Scale decode (LUT) | 1 | 2 |
| Pre-scaled LUT build | 16 mul | 16 mul |
Per Element:
| Operation | W4A16 | NVF4/NVF4 |
|---|---|---|
| A conversion | BF16->float (free) | LUT lookup |
| B conversion | LUT lookup | LUT lookup |
Conclusion: NVF4/NVF4 trades speed for memory. Use when:
- Memory-constrained (A is 4x smaller)
- Batch inference with large A tensors
For single-token decode (M=1), W4A16 or FP8 is recommended.
Comprehensive GEMV Benchmark (RTX 5090, SM120a)
All GEMV kernels compared on Qwen2.5-7B gate_proj (K=3584, N=18944):
| Kernel | A dtype | B dtype | Weight Size | Time (us) | vs BF16 |
|---|---|---|---|---|---|
| BF16 | BF16 | BF16 | 129.5 MB | 119 | 1.00x |
| FP8/BF16 (W8A16) | BF16 | FP8 | 64.8 MB | 272 | 0.44x |
| FP8/FP8 (W8A8) | FP8 | FP8 | 64.8 MB | 19 | 6.2x |
| NVF4/BF16 (W4A16) | BF16 | NVF4 | 32.4 MB | 106 | 1.12x |
| NVF4/NVF4 (W4A4) | NVF4 | NVF4 | 32.4 MB | 217 | 0.55x |
Performance by Layer Type:
| Layer | K | N | Best Kernel | Speedup |
|---|---|---|---|---|
| gate_proj | 3584 | 18944 | FP8/FP8 | 6.2x |
| down_proj | 18944 | 3584 | FP8/FP8 | 22.7x |
| o_proj | 3584 | 3584 | FP8/FP8 | 6.8x |
| qkv_proj | 3584 | 512 | FP8/FP8 | 9.1x |
Recommendation: FP8/FP8 is optimal for SM120 (Blackwell). NVF4/BF16 (W4A16) provides the best balance when FP8 compute is unavailable.
NVF4-BF16 GEMM Performance (RTX 5090, SM120a)
4-bit NVF4 GEMM with BF16 I/O using CUTLASS block-scaled tensor operations:
| Matrix Size | NVF4xBF16 | NVF4xNVF4 | Notes |
|---|---|---|---|
| 4096×4096 | 64 TFLOPS | 87 TFLOPS | GPU-side quantization |
| 8192×8192 | 168 TFLOPS | 261 TFLOPS | 3-stage async pipeline |
| 16384×16384 | — | 446 TFLOPS | Peak performance |
Note: GPU-side BF16->NVF4 quantization with unit scaling. No host-device copies. Ideal for memory-bound LLM inference with 4x bandwidth reduction vs BF16.
Installation
pip install pygpukit
From source:
git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .
Requirements
- Python 3.10+
- NVIDIA GPU with drivers installed
- CUDA 13.0+ (required for SM120/Blackwell features)
- Optional: CUDA Toolkit (for JIT compilation of custom kernels)
Minimum Driver Versions (CUDA 13.x)
| Platform | Minimum Driver |
|---|---|
| Linux | 590.44.01 or later |
| Windows | 572.16 or later (Game Ready/Studio) |
Note: NVRTC (NVIDIA Runtime Compiler) is included in CUDA Toolkit. Pre-compiled GPU operations (matmul, add, mul, etc.) work with just GPU drivers.
Supported GPUs
| Generation | Architecture | Examples | Status |
|---|---|---|---|
| Ampere | SM80-86 | A100, RTX 3090, RTX 3080 | Fully supported |
| Ada Lovelace | SM89 | RTX 4090, RTX 4080 | Fully supported |
| Hopper | SM90 | H100, H200 | Fully supported |
| Blackwell | SM100-120 | B100, B200, RTX 5090 | CUDA 13.0+ required |
| Turing/Older | SM < 80 | RTX 20XX, GTX 10XX | NOT supported |
Runtime Modes
| Mode | Requirements | Features |
|---|---|---|
| Full JIT | GPU drivers + CUDA Toolkit | All features including custom kernels |
| Pre-compiled | GPU drivers only | Built-in ops (matmul, add, mul) |
| CPU simulation | None | Testing/development without GPU |
Quick Start
Basic Operations
import pygpukit as gp
# Allocate arrays
x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")
# Operations
z = gp.add(x, y)
w = gp.matmul(x, y)
# CPU <-> GPU transfer
arr = z.to_numpy()
garr = gp.from_numpy(arr)
Custom JIT Kernel (requires CUDA Toolkit)
src = '''
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;
}
'''

if gp.is_nvrtc_available():
    kernel = gp.jit(src, func="scale")
    kernel(x, factor=0.5, n=x.size)
else:
    print("JIT not available. Using pre-compiled ops.")
Rust Scheduler
import _pygpukit_rust as rust
# Memory Pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)
# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8*1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256*1024*1024)
result = evaluator.evaluate(task)
# GPU Partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8*1024**3))
manager.create_partition("inference", "Inference",
                         rust.PartitionLimits().memory(4*1024**3).compute(0.5))
Features
Core Infrastructure (Rust)
| Feature | Description |
|---|---|
| Memory Pool | LRU eviction, size-class free lists |
| Scheduler | Priority queue, memory reservation |
| Transfer Engine | Separate H2D/D2H streams, priority |
| Kernel Dispatch | Per-stream limits, lifecycle tracking |
Advanced Scheduler
| Feature | Description |
|---|---|
| Admission Control | Deterministic admission, quota enforcement |
| QoS Policy | Guaranteed/Burstable/BestEffort tiers |
| Kernel Pacing | Bandwidth-based throttling per stream |
| GPU Partitioning | Resource isolation, multi-tenant support |
| Multi-LLM Execution | Concurrent AI model execution with stream isolation |
| asyncio Integration | Native Python async/await for concurrent inference |
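The asyncio row above is about overlapping decode work from several models; a minimal sketch of the pattern using the documented synchronous generate() wrapped in asyncio.to_thread (the library's native async interface is not shown here, so treat this as an illustration, with file names and token IDs arbitrary):
import asyncio
from pygpukit.llm import load_model_from_safetensors

async def run(model, prompt_ids):
    # Off-load the blocking generate() call so several models can overlap
    return await asyncio.to_thread(model.generate, prompt_ids, max_new_tokens=32)

async def main():
    # spec/dtype kwargs omitted for brevity; see the LLM Support section above
    model_a = load_model_from_safetensors("model_a.safetensors")
    model_b = load_model_from_safetensors("model_b.safetensors")
    out_a, out_b = await asyncio.gather(
        run(model_a, [1, 2, 3]),
        run(model_b, [4, 5, 6]),
    )

asyncio.run(main())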
Project Goals
- Provide the smallest usable GPU runtime for Python
- Expose GPU scheduling (bandwidth, memory, partitioning)
- Make writing custom GPU kernels easy
- Serve as a building block for inference engines, DSP systems, and real-time workloads
Project Structure
PyGPUkit/
├── src/pygpukit/        # Python API (NumPy-compatible)
├── native/              # C++ backend (CUDA Driver API, NVRTC)
├── rust/                # Rust backend (memory pool, scheduler)
│   ├── pygpukit-core/   # Pure Rust core logic
│   └── pygpukit-python/ # PyO3 bindings
├── .claude/             # Claude Code configuration
│   ├── skills/          # Development workflow skills
│   └── agents/          # Specialized subagents
├── docs/                # Documentation guides
├── examples/            # Demo scripts
├── scripts/             # Build scripts, benchmarks
└── tests/               # Test suite
Roadmap
Released
| Version | Highlights |
|---|---|
| v0.1 | GPUArray, NVRTC JIT, add/mul/matmul, wheels |
| v0.2.0 | Rust scheduler (QoS, partitioning), memory pool (LRU), 106 tests |
| v0.2.1 | API stabilization, error propagation |
| v0.2.2 | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
| v0.2.3 | TF32 TensorCore (PTX mma.sync), 28 TFLOPS |
| v0.2.4 | Single-binary distribution, dynamic NVRTC, driver-only mode |
| v0.2.5 | FP16/BF16 support, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |
| v0.2.6 | CUTLASS backend (31 TFLOPS TF32, 63 TFLOPS FP16/BF16), Multi-LLM concurrent execution |
| v0.2.7 | Epilogue fusion (linear+bias+gelu), Multi-SM kernels, API review |
| v0.2.8 | CUTLASS v4.3.3 update, auto-update workflow |
| v0.2.9 | Unified LLM interface (CausalTransformerModel), ModelSpec abstraction, GPT-2/LLaMA/Qwen3 support |
| v0.2.10 | Dynamic cuBLASLt loading, CUDA Graph optimizations, descriptor caching |
| v0.2.11 | Batch decode (6.8x speedup), Decode Strategy framework, Driver API async, Dual CUDA builds, RTX 5090 (SM120) |
| v0.2.12 | Advanced audio processing (ISTFT, Griffin-Lim, HPSS, CQT, pitch detection, time stretch) |
| v0.2.15 | FP8 I/O GEMM (blockwise scaling), Pure NVF4 (446 TFLOPS), New math ops (sin, cos, sqrt, rsqrt, abs, neg, clamp, where, sigmoid, tanh, argmax, min, sum_axis) |
| v0.2.16 | MoE support (Mixtral), Thinking models (Qwen3), W8A8/W4A4 GEMV, W8A16/Int8/Int4 GEMM, Kernel restructure |
Planned
| Version | Goals |
|---|---|
| v0.3 | Triton backend, advanced ops (softmax), MPS/MIG |
API Stability & Backward Compatibility
Version Policy
- v0.2.x: Backward compatible within minor versions. New features may be added, but existing APIs remain stable.
- v0.3+: May introduce breaking changes with deprecation warnings in prior version.
Stable Public API (v0.2.x)
All functions exported via pygpukit.* are part of the stable public API:
| Category | Functions |
|---|---|
| Factory | zeros, ones, empty, from_numpy |
| Elementwise | add, sub, mul, div, neg, abs, clamp, where |
| Math | exp, log, sqrt, rsqrt, sin, cos, tanh, sigmoid, relu, gelu, softmax |
| Matrix | matmul, transpose |
| Reductions | sum, sum_axis, mean, max, min, argmax |
| Neural | layernorm, rmsnorm, silu, sdpa_causal, rope_inplace, bias_add_inplace, linear_bias_gelu |
| Types | GPUArray, DataType, float32, float64, float16, bfloat16, int32, int64, int8, uint8 |
| LLM | llm.SafeTensorsFile, llm.CausalTransformerModel, llm.load_model_from_safetensors |
| LLM (Experimental) | llm.Tokenizer (use HuggingFace tokenizers for production) |
Deprecation Policy
APIs to be removed will emit DeprecationWarning for at least one minor version before removal.
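If you want upcoming removals to fail your test suite early, the standard-library warnings filter is enough (this is generic Python, not a PyGPUkit API):
import warnings

# Turn DeprecationWarning into an error so deprecated PyGPUkit calls fail CI
warnings.simplefilter("error", DeprecationWarning)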
Contributing
See CONTRIBUTING.md for guidelines.
Quick Start:
- Fork and clone
- Create feature branch
- Build: ./build.sh 86 (Git Bash)
- Run checks: ruff check, mypy, pytest
- Submit PR
We Accept: Performance improvements, bug fixes, new GPU ops, documentation.
We Reject: cuda-python dependencies, training features, SM < 80 support.
License
MIT License
Acknowledgements
Inspired by and built upon:
- NVIDIA CUDA Toolkit - Runtime, Driver API, NVRTC
- CUTLASS - TensorCore GEMM optimization techniques
- Codon - High-performance Python compiler with GPU support
- CuPy
- Triton
PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.
If this project saved you from a silent GPU bug, or helped you trust your results again, consider giving it a ⭐.
Correctness deserves visibility.