PyGPUkit — Lightweight GPU Runtime for Python
A minimal, modular GPU runtime with Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.
When GPU optimizations change your results, something is wrong.
A minimal, deterministic GPU runtime for Python.
Built for people who care about correctness, reproducibility, and real performance.
- CUDA Graph that doesn't lie
- cuBLASLt without hidden state
- FP8 / NVF4 / w8a16 done explicitly
- Rust-powered scheduler for real GPU concurrency
This is not a framework. This is a GPU runtime.
Why PyGPUkit Exists
Modern GPU stacks optimize aggressively.
Sometimes, they optimize correctness away.
PyGPUkit exists because:
- CUDA Graph replay can change numerical results
- cuBLASLt may depend on hidden workspace state
- Stream-0 synchronization hides performance bugs
- “It’s faster” often means “it’s nondeterministic”
PyGPUkit chooses:
- Explicit over implicit
- Determinism over magic
- Measurable behavior over benchmark-only claims
What PyGPUkit Is NOT
- ❌ Not a PyTorch replacement
- ❌ Not a training framework
- ❌ Not a convenience-first library
- ❌ Not safe if you ignore GPU semantics
- ❌ Not designed for "just works" expectations
PyGPUkit is for people who want to see and control what their GPU is actually doing.
Core Capabilities (TL;DR)
- 🚀 Driver-only deployment (no CUDA Toolkit required)
- 🧠 Deterministic CUDA Graph execution
- ⚙️ Explicit stream & memory control
- 🧮 FP8 / NVF4 / BF16 / TF32 done right
- 🎛️ Rust-based GPU scheduler with QoS & partitioning
- 🔊 GPU-native audio & DSP (no cuFFT dependency)
Real-World GPU Pathologies (Observed)
- Same input, different output with CUDA Graph replay
- FP8 GEMM producing correct averages but wrong tokens
- cuBLASLt performance variance across runs
- H2D stalls masked by stream-0 synchronization
All of these are reproducible.
All of them are documented.
All of them are why PyGPUkit exists.
These are not theoretical. They were all observed in production or real benchmarks.
Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation, quick start, basic usage |
| API Reference | Complete API documentation with examples |
| LLM Guide | SafeTensors, GPT-2/LLaMA/Qwen3 inference |
| Performance Tuning | TF32, FP16, CUTLASS optimization |
| Scheduler Guide | Multi-LLM concurrent execution |
What's New in v0.2.18
Major Codebase Refactoring
Complete modularization of the codebase for better maintainability:
- Split monolithic files into modular `.inl` components
- Reorganized matmul kernel directory structure
- Standardized GEMM/GEMV naming conventions
- Modular pybind11 bindings
Kokoro-82M TTS
Text-to-speech synthesis with Japanese/English support:
```python
from pygpukit.tts import KokoroModel

model = KokoroModel.from_safetensors("kokoro-v1.0-82m.safetensors")
audio = model.generate("Hello world", voice="af_heart")
```
Positional Encoding Operations
New neural network operations for attention mechanisms:
| Function | Description |
|---|---|
| `pope_init_encoding` | Sinusoidal positional encoding (PoPE) |
| `pope_inplace` | Apply additive encoding to Q/K |
| `alibi_init_slopes` | ALiBi head-specific slopes |
| `alibi_compute_bias` | ALiBi attention bias matrix |
| `rope_init_ntk_aware` | NTK-aware RoPE for context extension |
| `rope_init_yarn` | YaRN dimension-wise interpolation |
| `rope_init_linear` | Linear position interpolation |
| `relu2` | ReLU squared activation (Primer) |
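As a concrete reference for what `alibi_init_slopes` and `alibi_compute_bias` compute, here is a minimal pure-Python sketch of the standard ALiBi recipe. The helper names and CPU implementation are illustrative only, not PyGPUkit's GPU API:

```python
import math

def alibi_slopes(n_heads: int) -> list[float]:
    # Standard ALiBi recipe for a power-of-two head count:
    # head i (1-indexed) gets slope 2^(-8*i/n_heads).
    assert n_heads & (n_heads - 1) == 0, "sketch assumes power-of-two heads"
    return [2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    # Additive attention bias: -slope * distance between query i and key j
    # (zero on the diagonal, growing penalty with distance).
    return [[-slope * (i - j) for j in range(seq_len)] for i in range(seq_len)]
```

The GPU kernels produce the same values; the point of `alibi_compute_bias` is simply to materialize this bias matrix on-device so it can be added to attention scores without a host round-trip.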
Unified Benchmark Suite
New `scripts/benchmark.py` for comprehensive performance testing across all dtypes and sizes.
QAT/Pruning/Sparsity Config
Model config support for quantization-aware training, pruning, and sparsity patterns.
Optimized BF16 GEMV
New optimized BF16 GEMV kernel with B[N,K] layout achieves 98-101% of peak memory bandwidth for typical LLM dimensions:
| Matrix (K × N) | Bandwidth | % of Peak |
|---|---|---|
| 2048 x 8192 | 1763 GB/s | 98% |
| 4096 x 14336 | 1810 GB/s | 101% |
W8A16 GEMM Fix
Fixed MMA A-fragment register mapping for m16n8k16 instruction. MoE models now produce correct output.
What's New in v0.2.17
Triton Backend MVP
Optional Triton backend for rapid kernel prototyping without C++ recompilation:
| Component | Description |
|---|---|
| pygpukit.triton | Triton wrapper module with GPUArray compatibility |
| TritonArray | Wrapper bridging PyGPUkit GPUArray to Triton |
| Triton Kernels | RMSNorm, LayerNorm, Softmax, Rotary |
| Hybrid Execution | Mix Triton + Native CUDA in same model |
```bash
# Install Triton (Windows)
pip install triton-windows
# Or: pip install pygpukit[triton]

# Hybrid chat example
python examples/chat_cli_triton.py --model /path/to/model --tokenizer /path/to/tokenizer.json
```
Kernel Routing Example:
```
RMSNorm  -> Triton (kernels/rmsnorm.py) - easy to modify
MatMul   -> Native CUDA (cuBLASLt)      - production performance
SDPA     -> Native CUDA (optimized)
KV Cache -> Native CUDA
```
Usage Pattern
```python
from pygpukit.triton import from_gpuarray, kernels, triton_available

if triton_available():
    # Wrap GPUArray for Triton
    x_triton = from_gpuarray(x_gpu)
    w_triton = from_gpuarray(weight_gpu)
    out_triton = from_gpuarray(out_gpu)

    # Call Triton kernel
    kernels.rmsnorm(x_triton, w_triton, out_triton, eps=1e-5)
```
What's New in v0.2.16
MoE (Mixture of Experts) Support
Full support for Mixtral-style MoE models with custom CUDA kernels:
| Component | Description |
|---|---|
| MoE Kernels | TopK routing, softmax, token permutation, gather/scatter |
| Grouped GEMM | Batched expert dispatch with per-row expert IDs |
| MoELayer | Python layer with router + expert FFN dispatch |
| MIXTRAL_SPEC | Auto-detection for Mixtral 8x7B models |
Thinking Model Support
Qwen3 Thinking model support with <think>...</think> block parsing.
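A minimal sketch of the `<think>...</think>` parsing such output requires (illustrative only; PyGPUkit's actual parser may differ):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the visible answer."""
    thoughts = "\n".join(re.findall(r"<think>(.*?)</think>", text, re.DOTALL))
    visible = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts.strip(), visible
```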
New GEMV Kernels (SM120)
| Kernel | A dtype | B dtype | Speedup vs BF16 |
|---|---|---|---|
| FP8/FP8 (W8A8) | FP8 E4M3 | FP8 E4M3 | 6-22x |
| NVF4/NVF4 (W4A4) | NVF4 | NVF4 | Memory priority |
| Int4 GEMV | BF16 | Int4 | Large K dimensions |
New GEMM Kernels (SM120)
| Kernel | Description |
|---|---|
| W8A16 GEMM | FP8 weight + BF16 activation (CUTLASS) |
| Int8 Native | Exact int8 via dp4a (CUDA cores) |
| Int4 via Int8 | 4-bit approximation via TensorCore |
| Grouped GEMM v2 | Per-row expert IDs for MoE |
Development Tooling
- Claude Code Skills: Build, benchmark, lint, test automation
- Subagents: kernel-reviewer, perf-analyzer, api-designer
- CONTRIBUTING.md: Contribution guidelines
Previous versions (v0.2.4 - v0.2.15): See CHANGELOG.md for complete release history.
LLM Support
PyGPUkit includes built-in support for loading and running LLM models. See the LLM Guide for detailed documentation.
Important: PyGPUkit's core responsibility is GPU execution, not tokenization.
- The model API expects token IDs as input, not raw text
- For production tokenization, use HuggingFace tokenizers
- The built-in `Tokenizer` class is experimental and intended for demos only
```python
from pygpukit.llm import SafeTensorsFile, load_model_from_safetensors, detect_model_spec

# Load safetensors (memory-mapped, zero-copy)
st = SafeTensorsFile("model.safetensors")
print(f"Tensors: {st.num_tensors}, Size: {st.file_size / 1e9:.2f} GB")

# Load model with automatic architecture detection
spec = detect_model_spec(st.tensor_names)
model = load_model_from_safetensors("model.safetensors", dtype="float16", spec=spec)

# Generate with token IDs (use HuggingFace tokenizers for production)
input_ids = [1, 2, 3, 4]  # Your tokenizer's output
output_ids = model.generate(input_ids, max_new_tokens=32)
```
| Component | Description |
|---|---|
| `SafeTensorsFile` | Memory-mapped `.safetensors` loading |
| `CausalTransformerModel` | Unified model for GPT-2, LLaMA, Qwen3 |
| `load_model_from_safetensors` | Load model with auto-detection |
| `detect_model_spec` | Auto-detect model architecture |
| `Tokenizer` | Experimental BPE tokenizer (demos only) |
Performance
RTX 5090 Benchmark (SM120a, CUDA 13.1)
Standard Precision (8192x8192)
| Precision | TFLOPS | Notes |
|---|---|---|
| FP32 | 80 | CUDA cores |
| TF32 | 87 | TensorCore |
| FP16 | 170 | TensorCore |
| BF16 | 173 | TensorCore |
Quantized GEMM (M=8192, K=4096, N=14336)
| Format | TFLOPS | Error | Notes |
|---|---|---|---|
| FP8xFP8 | 217 | ~0.1% | CUTLASS SM120 blockwise |
| W8A16 | 50 | ~0.1% | FP8 weight, BF16 activation |
| Int8 (via FP8) | 142 | ~3.5% | TensorCore approximation |
| Int8 (dp4a) | 44 | 0% | Exact, CUDA cores |
| Int4 (via Int8) | 121 | ~0.1% | TensorCore approximation |
NVF4 (4-bit NormalFloat) GEMM
| Matrix Size | TFLOPS | Notes |
|---|---|---|
| 8192x8192 | 261 | Pre-quantized |
| 12288x12288 | 383 | 3-stage pipeline |
| 16384x16384 | 446 | Peak performance |
Note: NVF4xNVF4 achieves 4x memory bandwidth reduction vs BF16 with minimal accuracy loss.
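The ~4x figure follows from simple bit accounting. A sketch, assuming 4-bit codes plus one 8-bit scale shared by each 32-element block (the scale width is an assumption, not taken from the source):

```python
# Bit accounting behind the ~4x bandwidth-reduction claim.
BITS_BF16 = 16
bits_nvf4 = 4 + 8 / 32           # code bits + amortized per-block scale bits = 4.25
reduction = BITS_BF16 / bits_nvf4
print(f"{reduction:.2f}x")       # ~3.76x, i.e. roughly 4x
```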
RTX 3090 Ti Benchmark (SM86)
| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|---|---|---|---|---|
| 2048×2048 | 9.6 TFLOPS | 13 TFLOPS | 15 TFLOPS | 21 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22 TFLOPS | 44 TFLOPS | 44 TFLOPS |
| 8192×8192 | 18 TFLOPS | 31 TFLOPS | 63 TFLOPS | 63 TFLOPS |
Note: CUTLASS is automatic for compatible sizes (16-aligned). Use `PYGPUKIT_NO_TF32=1` for full FP32 precision.
GEMV Performance (RTX 5090, SM120a)
For LLM decode (M=1), custom GEMV kernels for different quantization formats:
GEMV Bandwidth Utilization (v0.2.18)
Optimized BF16 GEMV achieves near-peak memory bandwidth for large matrices:
| K | N | BF16 BW | BF16 % | W8A16 BW | W8A16 % |
|---|---|---|---|---|---|
| 2048 | 2048 | 434 GB/s | 24% | 278 GB/s | 16% |
| 2048 | 8192 | 1763 GB/s | 98% | 434 GB/s | 24% |
| 8192 | 2048 | 543 GB/s | 30% | 363 GB/s | 20% |
| 4096 | 14336 | 1810 GB/s | 101% | 467 GB/s | 26% |
Note: BF16 GEMV with the optimized B[N,K] layout achieves 98-101% of peak bandwidth for typical LLM FFN dimensions. W8A16 (FP8 weight) includes dequantization overhead.
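The "% of peak" column above can be reproduced from the measured bandwidth and the RTX 5090's spec-sheet peak of 1792 GB/s (an assumption of this sketch; values a hair above 100% are measurement jitter):

```python
# Reproduce the "% of peak" column: a decode GEMV is weight-bandwidth bound,
# so utilization = measured GB/s over the card's peak memory bandwidth.
PEAK_GBPS = 1792  # RTX 5090 spec figure assumed here
for k, n, measured in [(2048, 8192, 1763), (4096, 14336, 1810)]:
    weight_bytes = k * n * 2  # BF16 weight matrix dominates the traffic
    pct = 100 * measured / PEAK_GBPS
    print(f"{k}x{n}: {weight_bytes / 1e6:.1f} MB weights, {pct:.0f}% of peak")
```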
GEMV Latency by Layer
| Layer | K | N | BF16 | W8A16 | W8A8 | W4A16 | W4A4 | Int4 |
|---|---|---|---|---|---|---|---|---|
| Qwen-7B hidden | 4096 | 4096 | 31 us | 108 us | 31 us | 142 us | 252 us | 33 us |
| Qwen-7B MLP up | 4096 | 14336 | 100 us | 272 us | 43 us | 140 us | 253 us | 49 us |
| Qwen-7B MLP down | 14336 | 4096 | 102 us | 330 us | 46 us | 403 us | 873 us | 59 us |
| Qwen-72B hidden | 8192 | 8192 | 112 us | 326 us | 46 us | 246 us | 497 us | 51 us |
| Qwen-72B MLP up | 8192 | 29568 | 324 us | 976 us | 180 us | 448 us | 509 us | 111 us |
| Qwen-72B MLP down | 29568 | 8192 | 839 us | — | 204 us | 1395 us | 1294 us | 125 us |
| Kernel | Format | Memory | Rel. Err (vs FP32) | Best For |
|---|---|---|---|---|
| BF16 | A:BF16, B:BF16 | 100% | ~0.6% | Baseline (highest accuracy) |
| W8A16 | A:BF16, B:FP8 | 50% | ~12% | Balanced speed/memory |
| W8A8 | A:FP8, B:FP8 | 50% | ~9% | Speed priority (6-18x faster) |
| W4A16 | A:BF16, B:NVF4 | 25% | ~15% | Memory priority |
| W4A4 | A:NVF4, B:NVF4 | 12.5% | ~20% | Maximum compression |
| Int4 | A:BF16, B:Int4 | 25% | ~15% | Large K dimensions |
Note: W8A8 (FP8/FP8) is fastest for typical sizes. W4A4 has 2x dequant overhead (both A and B). Int4 excels at very large K (29568+). W8A16 has K size limit (~16K).
GEMV Quantization Trade-offs (Explicit)
Why is W4A16 faster than NVF4/NVF4 despite both using 4-bit weights?
| Kernel | A (Activation) | B (Weight) | Dequant Work | Speed |
|---|---|---|---|---|
| W4A16 | BF16 (native) | NVF4 (4-bit) | 1x (B only) | 104 us |
| NVF4/NVF4 | NVF4 (4-bit) | NVF4 (4-bit) | 2x (A + B) | 219 us |
Per Scale Block (32 elements):
| Operation | W4A16 | NVF4/NVF4 |
|---|---|---|
| Scale load | 1 (B) | 2 (A + B) |
| Scale decode (LUT) | 1 | 2 |
| Pre-scaled LUT build | 16 mul | 16 mul |
Per Element:
| Operation | W4A16 | NVF4/NVF4 |
|---|---|---|
| A conversion | BF16->float (free) | LUT lookup |
| B conversion | LUT lookup | LUT lookup |
Conclusion: NVF4/NVF4 trades speed for memory. Use when:
- Memory-constrained (A is 4x smaller)
- Batch inference with large A tensors
For single-token decode (M=1), W4A16 or FP8 is recommended.
Comprehensive GEMV Benchmark (RTX 5090, SM120a)
All GEMV kernels compared on Qwen2.5-7B gate_proj (K=3584, N=18944):
| Kernel | A dtype | B dtype | Weight Size | Time (us) | vs BF16 |
|---|---|---|---|---|---|
| BF16 | BF16 | BF16 | 129.5 MB | 121 | 1.00x |
| FP8/BF16 (W8A16) | BF16 | FP8 | 64.8 MB | 275 | 0.44x |
| FP8/FP8 (W8A8) | FP8 | FP8 | 64.8 MB | 19 | 6.2x |
| NVF4/BF16 (W4A16) | BF16 | NVF4 | 32.4 MB | 125 | 0.97x |
| NVF4/NVF4 (W4A4) | NVF4 | NVF4 | 32.4 MB | 241 | 0.50x |
Performance by Layer Type:
| Layer | K | N | Best Kernel | Speedup |
|---|---|---|---|---|
| gate_proj | 3584 | 18944 | FP8/FP8 | 6.2x |
| down_proj | 18944 | 3584 | FP8/FP8 | 21.6x |
| o_proj | 3584 | 3584 | FP8/FP8 | 6.8x |
| qkv_proj | 3584 | 512 | FP8/FP8 | 8.7x |
Recommendation: FP8/FP8 is optimal for SM120 (Blackwell). NVF4/BF16 (W4A16) provides the best balance when FP8 compute is unavailable.
NVF4-BF16 GEMM Performance (RTX 5090, SM120a)
4-bit NVF4 GEMM with BF16 I/O using CUTLASS block-scaled tensor operations:
| Matrix Size | NVF4xBF16 | NVF4xNVF4 | Notes |
|---|---|---|---|
| 4096×4096 | 64 TFLOPS | 87 TFLOPS | GPU-side quantization |
| 8192×8192 | 168 TFLOPS | 261 TFLOPS | 3-stage async pipeline |
| 16384×16384 | — | 446 TFLOPS | Peak performance |
Note: GPU-side BF16->NVF4 quantization with unit scaling. No host-device copies. Ideal for memory-bound LLM inference with 4x bandwidth reduction vs BF16.
Installation

```bash
pip install pygpukit
```

From source:

```bash
git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .
```
Requirements
- Python 3.10+
- NVIDIA GPU with drivers installed
- CUDA 13.0+ (required for SM120/Blackwell features)
- Optional: CUDA Toolkit (for JIT compilation of custom kernels)
Minimum Driver Versions (CUDA 13.x)
| Platform | Minimum Driver |
|---|---|
| Linux | 590.44.01 or later |
| Windows | 572.16 or later (Game Ready/Studio) |
Note: NVRTC (NVIDIA Runtime Compiler) is included in CUDA Toolkit. Pre-compiled GPU operations (matmul, add, mul, etc.) work with just GPU drivers.
Supported GPUs
| Generation | Architecture | Examples | Status |
|---|---|---|---|
| Ampere | SM80-86 | A100, RTX 3090, RTX 3080 | Fully supported |
| Ada Lovelace | SM89 | RTX 4090, RTX 4080 | Fully supported |
| Hopper | SM90 | H100, H200 | Fully supported |
| Blackwell | SM100-120 | B100, B200, RTX 5090 | CUDA 13.0+ required |
| Turing/Older | SM < 80 | RTX 20XX, GTX 10XX | NOT supported |
Runtime Modes
| Mode | Requirements | Features |
|---|---|---|
| Full JIT | GPU drivers + CUDA Toolkit | All features including custom kernels |
| Pre-compiled | GPU drivers only | Built-in ops (matmul, add, mul) |
| CPU simulation | None | Testing/development without GPU |
Quick Start
Basic Operations
```python
import pygpukit as gp

# Allocate arrays
x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")

# Operations
z = gp.add(x, y)
w = gp.matmul(x, y)

# CPU <-> GPU transfer
arr = z.to_numpy()
garr = gp.from_numpy(arr)
```
Custom JIT Kernel (requires CUDA Toolkit)
```python
src = '''
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;
}
'''

if gp.is_nvrtc_available():
    kernel = gp.jit(src, func="scale")
    kernel(x, factor=0.5, n=x.size)
else:
    print("JIT not available. Using pre-compiled ops.")
```
Rust Scheduler
```python
import _pygpukit_rust as rust

# Memory pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)

# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8 * 1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256 * 1024 * 1024)
result = evaluator.evaluate(task)

# GPU partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8 * 1024**3))
manager.create_partition("inference", "Inference",
                         rust.PartitionLimits().memory(4 * 1024**3).compute(0.5))
```
Features
Core Infrastructure (Rust)
| Feature | Description |
|---|---|
| Memory Pool | LRU eviction, size-class free lists |
| Scheduler | Priority queue, memory reservation |
| Transfer Engine | Separate H2D/D2H streams, priority |
| Kernel Dispatch | Per-stream limits, lifecycle tracking |
Advanced Scheduler
| Feature | Description |
|---|---|
| Admission Control | Deterministic admission, quota enforcement |
| QoS Policy | Guaranteed/Burstable/BestEffort tiers |
| Kernel Pacing | Bandwidth-based throttling per stream |
| GPU Partitioning | Resource isolation, multi-tenant support |
| Multi-LLM Execution | Concurrent AI model execution with stream isolation |
| asyncio Integration | Native Python async/await for concurrent inference |
Project Goals
- Provide the smallest usable GPU runtime for Python
- Expose GPU scheduling (bandwidth, memory, partitioning)
- Make writing custom GPU kernels easy
- Serve as a building block for inference engines, DSP systems, and real-time workloads
Project Structure
```
PyGPUkit/
  src/pygpukit/          # Python API (NumPy-compatible)
  native/                # C++ backend (CUDA Driver API, NVRTC)
  rust/                  # Rust backend (memory pool, scheduler)
    pygpukit-core/       # Pure Rust core logic
    pygpukit-python/     # PyO3 bindings
  .claude/               # Claude Code configuration
    skills/              # Development workflow skills
    agents/              # Specialized subagents
  docs/                  # Documentation guides
  examples/              # Demo scripts
  scripts/               # Build scripts, benchmarks
  tests/                 # Test suite
```
Roadmap
Released
| Version | Highlights |
|---|---|
| v0.1 | GPUArray, NVRTC JIT, add/mul/matmul, wheels |
| v0.2.0 | Rust scheduler (QoS, partitioning), memory pool (LRU), 106 tests |
| v0.2.1 | API stabilization, error propagation |
| v0.2.2 | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
| v0.2.3 | TF32 TensorCore (PTX mma.sync), 28 TFLOPS |
| v0.2.4 | Single-binary distribution, dynamic NVRTC, driver-only mode |
| v0.2.5 | FP16/BF16 support, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |
| v0.2.6 | CUTLASS backend (31 TFLOPS TF32, 63 TFLOPS FP16/BF16), Multi-LLM concurrent execution |
| v0.2.7 | Epilogue fusion (linear+bias+gelu), Multi-SM kernels, API review |
| v0.2.8 | CUTLASS v4.3.3 update, auto-update workflow |
| v0.2.9 | Unified LLM interface (CausalTransformerModel), ModelSpec abstraction, GPT-2/LLaMA/Qwen3 support |
| v0.2.10 | Dynamic cuBLASLt loading, CUDA Graph optimizations, descriptor caching |
| v0.2.11 | Batch decode (6.8x speedup), Decode Strategy framework, Driver API async, Dual CUDA builds, RTX 5090 (SM120) |
| v0.2.12 | Advanced audio processing (ISTFT, Griffin-Lim, HPSS, CQT, pitch detection, time stretch) |
| v0.2.15 | FP8 I/O GEMM (blockwise scaling), Pure NVF4 (446 TFLOPS), New math ops (sin, cos, sqrt, rsqrt, abs, neg, clamp, where, sigmoid, tanh, argmax, min, sum_axis) |
| v0.2.16 | MoE support (Mixtral), Thinking models (Qwen3), W8A8/W4A4 GEMV, W8A16/Int8/Int4 GEMM, Kernel restructure |
| v0.2.17 | Triton backend MVP, hybrid execution (Triton + Native CUDA), TritonArray wrapper |
| v0.2.18 | Codebase refactoring, Kokoro TTS, Positional encoding (PoPE/ALiBi/YaRN/NTK), ReLU², Unified benchmark, BF16 GEMV (98% BW), W8A16 fix |
Planned
| Version | Goals |
|---|---|
| v0.3 | Advanced Triton ops (attention), MPS/MIG |
API Stability & Backward Compatibility
Version Policy
- v0.2.x: Backward compatible within minor versions. New features may be added, but existing APIs remain stable.
- v0.3+: May introduce breaking changes with deprecation warnings in prior version.
Stable Public API (v0.2.x)
All functions exported via pygpukit.* are part of the stable public API:
| Category | Functions |
|---|---|
| Factory | zeros, ones, empty, from_numpy |
| Elementwise | add, sub, mul, div, neg, abs, clamp, where |
| Math | exp, log, sqrt, rsqrt, sin, cos, tanh, sigmoid, relu, gelu, softmax |
| Matrix | matmul, transpose |
| Reductions | sum, sum_axis, mean, max, min, argmax |
| Neural | layernorm, rmsnorm, silu, sdpa_causal, rope_inplace, bias_add_inplace, linear_bias_gelu |
| Types | GPUArray, DataType, float32, float64, float16, bfloat16, int32, int64, int8, uint8 |
| LLM | llm.SafeTensorsFile, llm.CausalTransformerModel, llm.load_model_from_safetensors |
| LLM (Experimental) | llm.Tokenizer (use HuggingFace tokenizers for production) |
Deprecation Policy
APIs to be removed will emit DeprecationWarning for at least one minor version before removal.
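In practice this policy looks like the standard `warnings` pattern; `old_matmul` below is a hypothetical name for illustration, not a real PyGPUkit API:

```python
import warnings

def old_matmul(a, b):
    """Hypothetical deprecated wrapper; illustrates the deprecation pattern."""
    warnings.warn(
        "old_matmul is deprecated and will be removed in v0.3; use matmul instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return None
```

Callers see the warning once per call site under the default filter, which gives downstream code a full minor version to migrate.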
Contributing
See CONTRIBUTING.md for guidelines.
Quick Start:
- Fork and clone
- Create feature branch
- Build: `./build.sh 86` (Git Bash)
- Run checks: `ruff check`, `mypy`, `pytest`
- Submit PR

We Accept: performance improvements, bug fixes, new GPU ops, documentation. We Reject: cuda-python dependencies, training features, SM < 80 support.
License
MIT License
Acknowledgements
Inspired by and built upon:
- NVIDIA CUDA Toolkit - Runtime, Driver API, NVRTC
- CUTLASS - TensorCore GEMM optimization techniques
- Codon - High-performance Python compiler with GPU support
- CuPy
- Triton
PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.
If this project saved you from a silent GPU bug, or helped you trust your results again, consider giving it a ⭐.
Correctness deserves visibility.