PyGPUkit — Lightweight GPU Runtime for Python
A minimal, modular GPU runtime with Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.
When GPU optimizations change your results, something is wrong.
A minimal, deterministic GPU runtime for Python.
Built for people who care about correctness, reproducibility, and real performance.
- CUDA Graph that doesn't lie
- cuBLASLt without hidden state
- FP8 / NVF4 / w8a16 done explicitly
- Rust-powered scheduler for real GPU concurrency
This is not a framework. This is a GPU runtime.
Why PyGPUkit Exists
Modern GPU stacks optimize aggressively.
Sometimes, they optimize correctness away.
PyGPUkit exists because:
- CUDA Graph replay can change numerical results
- cuBLASLt may depend on hidden workspace state
- Stream-0 synchronization hides performance bugs
- “It’s faster” often means “it’s nondeterministic”
PyGPUkit chooses:
- Explicit over implicit
- Determinism over magic
- Measurable behavior over benchmark-only claims
What PyGPUkit Is NOT
- ❌ Not a PyTorch replacement
- ❌ Not a training framework
- ❌ Not a convenience-first library
- ❌ Not safe if you ignore GPU semantics
- ❌ Not designed for "just works" expectations
PyGPUkit is for people who want to see and control what their GPU is actually doing.
Core Capabilities (TL;DR)
- 🚀 Driver-only deployment (no CUDA Toolkit required)
- 🧠 Deterministic CUDA Graph execution
- ⚙️ Explicit stream & memory control
- 🧮 FP8 / NVF4 / BF16 / TF32 done right
- 🎛️ Rust-based GPU scheduler with QoS & partitioning
- 🔊 GPU-native audio & DSP (no cuFFT dependency)
Real-World GPU Pathologies (Observed)
- Same input, different output with CUDA Graph replay
- FP8 GEMM producing correct averages but wrong tokens
- cuBLASLt performance variance across runs
- H2D stalls masked by stream-0 synchronization
All of these are reproducible.
All of them are documented.
All of them are why PyGPUkit exists.
These are not theoretical. They were all observed in production or real benchmarks.
Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation, quick start, basic usage |
| API Reference | Complete API documentation with examples |
| LLM Guide | SafeTensors, GPT-2/LLaMA/Qwen3 inference |
| Performance Tuning | TF32, FP16, CUTLASS optimization |
| Scheduler Guide | Multi-LLM concurrent execution |
What's New in v0.2.18
Major Codebase Refactoring
Complete modularization of the codebase for better maintainability:
- Split monolithic files into modular `.inl` components
- Reorganized matmul kernel directory structure
- Standardized GEMM/GEMV naming conventions
- Modular pybind11 bindings
Kokoro-82M TTS
Text-to-speech synthesis with Japanese/English support:
```python
from pygpukit.tts import KokoroModel

model = KokoroModel.from_safetensors("kokoro-v1.0-82m.safetensors")
audio = model.generate("Hello world", voice="af_heart")
```
Positional Encoding Operations
New neural network operations for attention mechanisms:
| Function | Description |
|---|---|
| `pope_init_encoding` | Sinusoidal positional encoding (PoPE) |
| `pope_inplace` | Apply additive encoding to Q/K |
| `alibi_init_slopes` | ALiBi head-specific slopes |
| `alibi_compute_bias` | ALiBi attention bias matrix |
| `rope_init_ntk_aware` | NTK-aware RoPE for context extension |
| `rope_init_yarn` | YaRN dimension-wise interpolation |
| `rope_init_linear` | Linear position interpolation |
| `relu2` | ReLU squared activation (Primer) |
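As a concrete reference for what `alibi_init_slopes` and `alibi_compute_bias` compute, here is a minimal pure-Python sketch of the standard ALiBi recipe. The helper names and CPU implementation are illustrative only, not PyGPUkit's GPU API:

```python
import math

def alibi_slopes(n_heads: int) -> list[float]:
    # Standard ALiBi recipe for a power-of-two head count:
    # head i (1-indexed) gets slope 2^(-8*i/n_heads).
    assert n_heads & (n_heads - 1) == 0, "sketch assumes power-of-two heads"
    return [2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    # Additive attention bias: -slope * distance between query i and key j
    # (zero on the diagonal, growing penalty with distance).
    return [[-slope * (i - j) for j in range(seq_len)] for i in range(seq_len)]
```

The GPU kernels produce the same values; the point of `alibi_compute_bias` is simply to materialize this bias matrix on-device so it can be added to attention scores without a host round-trip.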
Unified Benchmark Suite
New `scripts/benchmark.py` for comprehensive performance testing across all dtypes and sizes.
QAT/Pruning/Sparsity Config
Model config support for quantization-aware training, pruning, and sparsity patterns.
Optimized BF16 GEMV
New optimized BF16 GEMV kernel with B[N,K] layout achieves 98-101% of peak memory bandwidth for typical LLM dimensions:
| Matrix (K × N) | Bandwidth | % of Peak |
|---|---|---|
| 2048 x 8192 | 1763 GB/s | 98% |
| 4096 x 14336 | 1810 GB/s | 101% |
W8A16 GEMM Fix
Fixed MMA A-fragment register mapping for m16n8k16 instruction. MoE models now produce correct output.
What's New in v0.2.17
Triton Backend MVP
Optional Triton backend for rapid kernel prototyping without C++ recompilation:
| Component | Description |
|---|---|
| pygpukit.triton | Triton wrapper module with GPUArray compatibility |
| TritonArray | Wrapper bridging PyGPUkit GPUArray to Triton |
| Triton Kernels | RMSNorm, LayerNorm, Softmax, Rotary |
| Hybrid Execution | Mix Triton + Native CUDA in same model |
```bash
# Install Triton (Windows)
pip install triton-windows
# Or: pip install pygpukit[triton]

# Hybrid chat example
python examples/chat_cli_triton.py --model /path/to/model --tokenizer /path/to/tokenizer.json
```
Kernel Routing Example:
```
RMSNorm  -> Triton (kernels/rmsnorm.py) - easy to modify
MatMul   -> Native CUDA (cuBLASLt)      - production performance
SDPA     -> Native CUDA (optimized)
KV Cache -> Native CUDA
```
Usage Pattern
```python
from pygpukit.triton import from_gpuarray, kernels, triton_available

if triton_available():
    # Wrap GPUArray for Triton
    x_triton = from_gpuarray(x_gpu)
    w_triton = from_gpuarray(weight_gpu)
    out_triton = from_gpuarray(out_gpu)

    # Call Triton kernel
    kernels.rmsnorm(x_triton, w_triton, out_triton, eps=1e-5)
```
What's New in v0.2.16
MoE (Mixture of Experts) Support
Full support for Mixtral-style MoE models with custom CUDA kernels:
| Component | Description |
|---|---|
| MoE Kernels | TopK routing, softmax, token permutation, gather/scatter |
| Grouped GEMM | Batched expert dispatch with per-row expert IDs |
| MoELayer | Python layer with router + expert FFN dispatch |
| MIXTRAL_SPEC | Auto-detection for Mixtral 8x7B models |
Thinking Model Support
Qwen3 Thinking model support with <think>...</think> block parsing.
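A minimal sketch of the `<think>...</think>` parsing such output requires (illustrative only; PyGPUkit's actual parser may differ):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the visible answer."""
    thoughts = "\n".join(re.findall(r"<think>(.*?)</think>", text, re.DOTALL))
    visible = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts.strip(), visible
```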
New GEMV Kernels (SM120)
| Kernel | A dtype | B dtype | Speedup vs BF16 |
|---|---|---|---|
| FP8/FP8 (W8A8) | FP8 E4M3 | FP8 E4M3 | 6-22x |
| NVF4/NVF4 (W4A4) | NVF4 | NVF4 | Memory priority |
| Int4 GEMV | BF16 | Int4 | Large K dimensions |
New GEMM Kernels (SM120)
| Kernel | Description |
|---|---|
| W8A16 GEMM | FP8 weight + BF16 activation (CUTLASS) |
| Int8 Native | Exact int8 via dp4a (CUDA cores) |
| Int4 via Int8 | 4-bit approximation via TensorCore |
| Grouped GEMM v2 | Per-row expert IDs for MoE |
Development Tooling
- Claude Code Skills: Build, benchmark, lint, test automation
- Subagents: kernel-reviewer, perf-analyzer, api-designer
- CONTRIBUTING.md: Contribution guidelines
Previous versions (v0.2.4 - v0.2.15): See CHANGELOG.md for complete release history.
LLM Support
PyGPUkit includes built-in support for loading and running LLM models. See the LLM Guide for detailed documentation.
Important: PyGPUkit's core responsibility is GPU execution, not tokenization.
- The model API expects token IDs as input, not raw text
- For production tokenization, use HuggingFace tokenizers
- The built-in `Tokenizer` class is experimental and intended for demos only
```python
from pygpukit.llm import SafeTensorsFile, load_model_from_safetensors, detect_model_spec

# Load safetensors (memory-mapped, zero-copy)
st = SafeTensorsFile("model.safetensors")
print(f"Tensors: {st.num_tensors}, Size: {st.file_size / 1e9:.2f} GB")

# Load model with automatic architecture detection
spec = detect_model_spec(st.tensor_names)
model = load_model_from_safetensors("model.safetensors", dtype="float16", spec=spec)

# Generate with token IDs (use HuggingFace tokenizers for production)
input_ids = [1, 2, 3, 4]  # Your tokenizer's output
output_ids = model.generate(input_ids, max_new_tokens=32)
```
| Component | Description |
|---|---|
| `SafeTensorsFile` | Memory-mapped `.safetensors` loading |
| `CausalTransformerModel` | Unified model for GPT-2, LLaMA, Qwen3 |
| `load_model_from_safetensors` | Load model with auto-detection |
| `detect_model_spec` | Auto-detect model architecture |
| `Tokenizer` | Experimental BPE tokenizer (demos only) |
Performance
RTX 5090 Benchmark (SM120a, CUDA 13.1)
Standard Precision (8192x8192)
| Precision | TFLOPS | Notes |
|---|---|---|
| FP32 | 80 | CUDA cores |
| TF32 | 87 | TensorCore |
| FP16 | 170 | TensorCore |
| BF16 | 173 | TensorCore |
Quantized GEMM (M=8192, K=4096, N=14336)
| Format | TFLOPS | Error | Notes |
|---|---|---|---|
| FP8xFP8 | 217 | ~0.1% | CUTLASS SM120 blockwise |
| W8A16 | 50 | ~0.1% | FP8 weight, BF16 activation |
| Int8 (via FP8) | 142 | ~3.5% | TensorCore approximation |
| Int8 (dp4a) | 44 | 0% | Exact, CUDA cores |
| Int4 (via Int8) | 121 | ~0.1% | TensorCore approximation |
NVF4 (4-bit NormalFloat) GEMM
| Matrix Size | TFLOPS | Notes |
|---|---|---|
| 8192x8192 | 261 | Pre-quantized |
| 12288x12288 | 383 | 3-stage pipeline |
| 16384x16384 | 446 | Peak performance |
Note: NVF4xNVF4 achieves 4x memory bandwidth reduction vs BF16 with minimal accuracy loss.
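The ~4x figure follows from simple bit accounting. A sketch, assuming 4-bit codes plus one 8-bit scale shared by each 32-element block (the scale width is an assumption, not taken from the source):

```python
# Bit accounting behind the ~4x bandwidth-reduction claim.
BITS_BF16 = 16
bits_nvf4 = 4 + 8 / 32           # code bits + amortized per-block scale bits = 4.25
reduction = BITS_BF16 / bits_nvf4
print(f"{reduction:.2f}x")       # ~3.76x, i.e. roughly 4x
```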
RTX 3090 Ti Benchmark (SM86)
| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|---|---|---|---|---|
| 2048×2048 | 9.6 TFLOPS | 13 TFLOPS | 15 TFLOPS | 21 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22 TFLOPS | 44 TFLOPS | 44 TFLOPS |
| 8192×8192 | 18 TFLOPS | 31 TFLOPS | 63 TFLOPS | 63 TFLOPS |
Note: CUTLASS is automatic for compatible sizes (16-aligned). Use `PYGPUKIT_NO_TF32=1` for full FP32 precision.
GEMV Performance (RTX 5090, SM120a)
For LLM decode (M=1), custom GEMV kernels for different quantization formats:
GEMV Bandwidth Utilization (v0.2.18)
Optimized BF16 GEMV achieves near-peak memory bandwidth for large matrices:
| K | N | BF16 BW | BF16 % | W8A16 BW | W8A16 % |
|---|---|---|---|---|---|
| 2048 | 2048 | 434 GB/s | 24% | 278 GB/s | 16% |
| 2048 | 8192 | 1763 GB/s | 98% | 434 GB/s | 24% |
| 8192 | 2048 | 543 GB/s | 30% | 363 GB/s | 20% |
| 4096 | 14336 | 1810 GB/s | 101% | 467 GB/s | 26% |
Note: BF16 GEMV with the optimized B[N,K] layout achieves 98-101% of peak bandwidth for typical LLM FFN dimensions. W8A16 (FP8 weight) includes dequantization overhead.
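The "% of peak" column above can be reproduced from the measured bandwidth and the RTX 5090's spec-sheet peak of 1792 GB/s (an assumption of this sketch; values a hair above 100% are measurement jitter):

```python
# Reproduce the "% of peak" column: a decode GEMV is weight-bandwidth bound,
# so utilization = measured GB/s over the card's peak memory bandwidth.
PEAK_GBPS = 1792  # RTX 5090 spec figure assumed here
for k, n, measured in [(2048, 8192, 1763), (4096, 14336, 1810)]:
    weight_bytes = k * n * 2  # BF16 weight matrix dominates the traffic
    pct = 100 * measured / PEAK_GBPS
    print(f"{k}x{n}: {weight_bytes / 1e6:.1f} MB weights, {pct:.0f}% of peak")
```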
GEMV Latency by Layer
| Layer | K | N | BF16 | W8A16 | W8A8 | W4A16 | W4A4 | Int4 |
|---|---|---|---|---|---|---|---|---|
| Qwen-7B hidden | 4096 | 4096 | 31 us | 108 us | 31 us | 142 us | 252 us | 33 us |
| Qwen-7B MLP up | 4096 | 14336 | 100 us | 272 us | 43 us | 140 us | 253 us | 49 us |
| Qwen-7B MLP down | 14336 | 4096 | 102 us | 330 us | 46 us | 403 us | 873 us | 59 us |
| Qwen-72B hidden | 8192 | 8192 | 112 us | 326 us | 46 us | 246 us | 497 us | 51 us |
| Qwen-72B MLP up | 8192 | 29568 | 324 us | 976 us | 180 us | 448 us | 509 us | 111 us |
| Qwen-72B MLP down | 29568 | 8192 | 839 us | — | 204 us | 1395 us | 1294 us | 125 us |
| Kernel | Format | Memory | Rel. Err (vs FP32) | Best For |
|---|---|---|---|---|
| BF16 | A:BF16, B:BF16 | 100% | ~0.6% | Baseline (highest accuracy) |
| W8A16 | A:BF16, B:FP8 | 50% | ~12% | Balanced speed/memory |
| W8A8 | A:FP8, B:FP8 | 50% | ~9% | Speed priority (6-18x faster) |
| W4A16 | A:BF16, B:NVF4 | 25% | ~15% | Memory priority |
| W4A4 | A:NVF4, B:NVF4 | 12.5% | ~20% | Maximum compression |
| Int4 | A:BF16, B:Int4 | 25% | ~15% | Large K dimensions |
Note: W8A8 (FP8/FP8) is fastest for typical sizes. W4A4 has 2x dequant overhead (both A and B). Int4 excels at very large K (29568+). W8A16 has K size limit (~16K).
GEMV Quantization Trade-offs (Explicit)
Why is W4A16 faster than NVF4/NVF4 despite both using 4-bit weights?
| Kernel | A (Activation) | B (Weight) | Dequant Work | Speed |
|---|---|---|---|---|
| W4A16 | BF16 (native) | NVF4 (4-bit) | 1x (B only) | 104 us |
| NVF4/NVF4 | NVF4 (4-bit) | NVF4 (4-bit) | 2x (A + B) | 219 us |
Per Scale Block (32 elements):
| Operation | W4A16 | NVF4/NVF4 |
|---|---|---|
| Scale load | 1 (B) | 2 (A + B) |
| Scale decode (LUT) | 1 | 2 |
| Pre-scaled LUT build | 16 mul | 16 mul |
Per Element:
| Operation | W4A16 | NVF4/NVF4 |
|---|---|---|
| A conversion | BF16->float (free) | LUT lookup |
| B conversion | LUT lookup | LUT lookup |
Conclusion: NVF4/NVF4 trades speed for memory. Use when:
- Memory-constrained (A is 4x smaller)
- Batch inference with large A tensors
For single-token decode (M=1), W4A16 or FP8 is recommended.
Comprehensive GEMV Benchmark (RTX 5090, SM120a)
All GEMV kernels compared on Qwen2.5-7B gate_proj (K=3584, N=18944):
| Kernel | A dtype | B dtype | Weight Size | Time (us) | vs BF16 |
|---|---|---|---|---|---|
| BF16 | BF16 | BF16 | 129.5 MB | 121 | 1.00x |
| FP8/BF16 (W8A16) | BF16 | FP8 | 64.8 MB | 275 | 0.44x |
| FP8/FP8 (W8A8) | FP8 | FP8 | 64.8 MB | 19 | 6.2x |
| NVF4/BF16 (W4A16) | BF16 | NVF4 | 32.4 MB | 125 | 0.97x |
| NVF4/NVF4 (W4A4) | NVF4 | NVF4 | 32.4 MB | 241 | 0.50x |
Performance by Layer Type:
| Layer | K | N | Best Kernel | Speedup |
|---|---|---|---|---|
| gate_proj | 3584 | 18944 | FP8/FP8 | 6.2x |
| down_proj | 18944 | 3584 | FP8/FP8 | 21.6x |
| o_proj | 3584 | 3584 | FP8/FP8 | 6.8x |
| qkv_proj | 3584 | 512 | FP8/FP8 | 8.7x |
Recommendation: FP8/FP8 is optimal for SM120 (Blackwell). NVF4/BF16 (W4A16) provides the best balance when FP8 compute is unavailable.
NVF4-BF16 GEMM Performance (RTX 5090, SM120a)
4-bit NVF4 GEMM with BF16 I/O using CUTLASS block-scaled tensor operations:
| Matrix Size | NVF4xBF16 | NVF4xNVF4 | Notes |
|---|---|---|---|
| 4096×4096 | 64 TFLOPS | 87 TFLOPS | GPU-side quantization |
| 8192×8192 | 168 TFLOPS | 261 TFLOPS | 3-stage async pipeline |
| 16384×16384 | — | 446 TFLOPS | Peak performance |
Note: GPU-side BF16->NVF4 quantization with unit scaling. No host-device copies. Ideal for memory-bound LLM inference with 4x bandwidth reduction vs BF16.
Installation

```bash
pip install pygpukit
```

From source:

```bash
git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .
```
Requirements
- Python 3.10+
- NVIDIA GPU with drivers installed
- CUDA 13.0+ (required for SM120/Blackwell features)
- Optional: CUDA Toolkit (for JIT compilation of custom kernels)
Minimum Driver Versions (CUDA 13.x)
| Platform | Minimum Driver |
|---|---|
| Linux | 590.44.01 or later |
| Windows | 572.16 or later (Game Ready/Studio) |
Note: NVRTC (NVIDIA Runtime Compiler) is included in CUDA Toolkit. Pre-compiled GPU operations (matmul, add, mul, etc.) work with just GPU drivers.
Supported GPUs
| Generation | Architecture | Examples | Status |
|---|---|---|---|
| Ampere | SM80-86 | A100, RTX 3090, RTX 3080 | Fully supported |
| Ada Lovelace | SM89 | RTX 4090, RTX 4080 | Fully supported |
| Hopper | SM90 | H100, H200 | Fully supported |
| Blackwell | SM100-120 | B100, B200, RTX 5090 | CUDA 13.0+ required |
| Turing/Older | SM < 80 | RTX 20XX, GTX 10XX | NOT supported |
Runtime Modes
| Mode | Requirements | Features |
|---|---|---|
| Full JIT | GPU drivers + CUDA Toolkit | All features including custom kernels |
| Pre-compiled | GPU drivers only | Built-in ops (matmul, add, mul) |
| CPU simulation | None | Testing/development without GPU |
Quick Start
Basic Operations
```python
import pygpukit as gp

# Allocate arrays
x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")

# Operations
z = gp.add(x, y)
w = gp.matmul(x, y)

# CPU <-> GPU transfer
arr = z.to_numpy()
garr = gp.from_numpy(arr)
```
Custom JIT Kernel (requires CUDA Toolkit)
```python
src = '''
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;
}
'''

if gp.is_nvrtc_available():
    kernel = gp.jit(src, func="scale")
    kernel(x, factor=0.5, n=x.size)
else:
    print("JIT not available. Using pre-compiled ops.")
```
Rust Scheduler
```python
import _pygpukit_rust as rust

# Memory pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)

# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8 * 1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256 * 1024 * 1024)
result = evaluator.evaluate(task)

# GPU partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8 * 1024**3))
manager.create_partition("inference", "Inference",
                         rust.PartitionLimits().memory(4 * 1024**3).compute(0.5))
```
Features
Core Infrastructure (Rust)
| Feature | Description |
|---|---|
| Memory Pool | LRU eviction, size-class free lists |
| Scheduler | Priority queue, memory reservation |
| Transfer Engine | Separate H2D/D2H streams, priority |
| Kernel Dispatch | Per-stream limits, lifecycle tracking |
Advanced Scheduler
| Feature | Description |
|---|---|
| Admission Control | Deterministic admission, quota enforcement |
| QoS Policy | Guaranteed/Burstable/BestEffort tiers |
| Kernel Pacing | Bandwidth-based throttling per stream |
| GPU Partitioning | Resource isolation, multi-tenant support |
| Multi-LLM Execution | Concurrent AI model execution with stream isolation |
| asyncio Integration | Native Python async/await for concurrent inference |
Project Goals
- Provide the smallest usable GPU runtime for Python
- Expose GPU scheduling (bandwidth, memory, partitioning)
- Make writing custom GPU kernels easy
- Serve as a building block for inference engines, DSP systems, and real-time workloads
Project Structure
```
PyGPUkit/
  src/pygpukit/          # Python API (NumPy-compatible)
  native/                # C++ backend (CUDA Driver API, NVRTC)
  rust/                  # Rust backend (memory pool, scheduler)
    pygpukit-core/       # Pure Rust core logic
    pygpukit-python/     # PyO3 bindings
  .claude/               # Claude Code configuration
    skills/              # Development workflow skills
    agents/              # Specialized subagents
  docs/                  # Documentation guides
  examples/              # Demo scripts
  scripts/               # Build scripts, benchmarks
  tests/                 # Test suite
```
Roadmap
Released
| Version | Highlights |
|---|---|
| v0.1 | GPUArray, NVRTC JIT, add/mul/matmul, wheels |
| v0.2.0 | Rust scheduler (QoS, partitioning), memory pool (LRU), 106 tests |
| v0.2.1 | API stabilization, error propagation |
| v0.2.2 | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
| v0.2.3 | TF32 TensorCore (PTX mma.sync), 28 TFLOPS |
| v0.2.4 | Single-binary distribution, dynamic NVRTC, driver-only mode |
| v0.2.5 | FP16/BF16 support, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |
| v0.2.6 | CUTLASS backend (31 TFLOPS TF32, 63 TFLOPS FP16/BF16), Multi-LLM concurrent execution |
| v0.2.7 | Epilogue fusion (linear+bias+gelu), Multi-SM kernels, API review |
| v0.2.8 | CUTLASS v4.3.3 update, auto-update workflow |
| v0.2.9 | Unified LLM interface (CausalTransformerModel), ModelSpec abstraction, GPT-2/LLaMA/Qwen3 support |
| v0.2.10 | Dynamic cuBLASLt loading, CUDA Graph optimizations, descriptor caching |
| v0.2.11 | Batch decode (6.8x speedup), Decode Strategy framework, Driver API async, Dual CUDA builds, RTX 5090 (SM120) |
| v0.2.12 | Advanced audio processing (ISTFT, Griffin-Lim, HPSS, CQT, pitch detection, time stretch) |
| v0.2.15 | FP8 I/O GEMM (blockwise scaling), Pure NVF4 (446 TFLOPS), New math ops (sin, cos, sqrt, rsqrt, abs, neg, clamp, where, sigmoid, tanh, argmax, min, sum_axis) |
| v0.2.16 | MoE support (Mixtral), Thinking models (Qwen3), W8A8/W4A4 GEMV, W8A16/Int8/Int4 GEMM, Kernel restructure |
| v0.2.17 | Triton backend MVP, hybrid execution (Triton + Native CUDA), TritonArray wrapper |
| v0.2.18 | Codebase refactoring, Kokoro TTS, Positional encoding (PoPE/ALiBi/YaRN/NTK), ReLU², Unified benchmark, BF16 GEMV (98% BW), W8A16 fix |
Planned
| Version | Goals |
|---|---|
| v0.3 | Advanced Triton ops (attention), MPS/MIG |
API Stability & Backward Compatibility
Version Policy
- v0.2.x: Backward compatible within minor versions. New features may be added, but existing APIs remain stable.
- v0.3+: May introduce breaking changes with deprecation warnings in prior version.
Stable Public API (v0.2.x)
All functions exported via pygpukit.* are part of the stable public API:
| Category | Functions |
|---|---|
| Factory | zeros, ones, empty, from_numpy |
| Elementwise | add, sub, mul, div, neg, abs, clamp, where |
| Math | exp, log, sqrt, rsqrt, sin, cos, tanh, sigmoid, relu, gelu, softmax |
| Matrix | matmul, transpose |
| Reductions | sum, sum_axis, mean, max, min, argmax |
| Neural | layernorm, rmsnorm, silu, sdpa_causal, rope_inplace, bias_add_inplace, linear_bias_gelu |
| Types | GPUArray, DataType, float32, float64, float16, bfloat16, int32, int64, int8, uint8 |
| LLM | llm.SafeTensorsFile, llm.CausalTransformerModel, llm.load_model_from_safetensors |
| LLM (Experimental) | llm.Tokenizer (use HuggingFace tokenizers for production) |
Deprecation Policy
APIs to be removed will emit DeprecationWarning for at least one minor version before removal.
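In practice this policy looks like the standard `warnings` pattern; `old_matmul` below is a hypothetical name for illustration, not a real PyGPUkit API:

```python
import warnings

def old_matmul(a, b):
    """Hypothetical deprecated wrapper; illustrates the deprecation pattern."""
    warnings.warn(
        "old_matmul is deprecated and will be removed in v0.3; use matmul instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return None
```

Callers see the warning once per call site under the default filter, which gives downstream code a full minor version to migrate.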
Contributing
See CONTRIBUTING.md for guidelines.
Quick Start:
- Fork and clone
- Create feature branch
- Build: `./build.sh 86` (Git Bash)
- Run checks: `ruff check`, `mypy`, `pytest`
- Submit PR

We Accept: performance improvements, bug fixes, new GPU ops, documentation. We Reject: cuda-python dependencies, training features, SM < 80 support.
License
MIT License
Acknowledgements
Inspired by and built upon:
- NVIDIA CUDA Toolkit - Runtime, Driver API, NVRTC
- CUTLASS - TensorCore GEMM optimization techniques
- Codon - High-performance Python compiler with GPU support
- CuPy
- Triton
PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.
If this project saved you from a silent GPU bug, or helped you trust your results again, consider giving it a ⭐.
Correctness deserves visibility.