rasptorch
Experimental PyTorch-like autograd engine with an optional Vulkan compute backend (Raspberry Pi 5-focused).
rasptorch is an experimental deep learning library inspired by PyTorch, with a specific goal: make training and running neural networks practical on the Raspberry Pi 5 while taking advantage of its GPU.
The project has two main parts:
- A small, NumPy-backed autograd engine and `nn` module that runs on the Raspberry Pi CPU.
- An experimental Vulkan-based backend, wired through a `device` API (`Tensor(..., device="cpu"|"gpu")`, `.to("gpu")`), meant to offload core tensor operations (elementwise ops, matmul, activations, reductions, etc.) to the Pi 5's GPU via Vulkan compute.
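The reverse-mode autograd idea behind the CPU engine can be illustrated with a minimal scalar sketch. Note this is purely illustrative: `Value` and its methods are not rasptorch API (rasptorch's engine operates on NumPy arrays, not Python scalars).

```python
class Value:
    """Minimal scalar reverse-mode autograd node (illustrative sketch only)."""
    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # filled in by the op that created this node

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # Product rule: each parent receives the other's value times out.grad
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # Addition passes the upstream gradient through unchanged
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x = Value(3.0)
y = Value(4.0)
z = x * y + x          # z = x*y + x
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

The real engine applies the same graph-plus-chain-rule mechanics, but to whole NumPy arrays per node.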
The Vulkan backend is implemented with real Vulkan compute shaders (GLSL
compiled to SPIR-V). It supports a small but useful set of kernels (see
rasptorch/shaders/ for the authoritative list).
High-level highlights:
- Elementwise ops: `+`, `*`, `-`, `neg`, `relu`, `gelu`, `silu`, `leaky_relu`, `elu` (plus scalar variants)
- Matmul: `@` (tiled/shared-memory shader)
- Tensor helpers: indexing/slicing via `tensor[...]`, `unsqueeze`, `squeeze`, `permute`, `transpose`, `flatten`, `max`, `min`, `argmax`, `argmin`, `cat`, `stack`, `split`, `chunk`
- Reductions: `sum`, `mean` (global and axis-based on CPU; global on GPU)
- Common broadcast forms: `(N,M) + (M,)` and `(N,M) * (M,)`
- Losses: `cross_entropy`, `binary_cross_entropy`, `binary_cross_entropy_with_logits`, `nll_loss`, `smooth_l1_loss`, `label_smoothing_cross_entropy`
- Optimizers and training utilities: `SGD`, `Adam`, `AdamW`, `RMSProp`, LR schedulers, gradient clipping, regularization helpers
- NN essentials: GPU row-wise `softmax`/`log_softmax`, 2D `LayerNorm`, `BatchNorm1d`, `BatchNorm2d`, `Embedding`, `MultiheadAttention`, `MaxPool2d`, `AvgPool2d`, and `GRU`
Performance notes:
- The fastest path is compute-only: keep tensors on GPU and avoid per-iteration `.numpy()` readbacks.
- Fusing ops and reusing output buffers can make the Vulkan path faster than NumPy for certain workloads on Raspberry Pi 5.
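The "reuse output buffers" idea can be illustrated with NumPy's `out=` parameter; this is a CPU-side sketch of the pattern, not rasptorch API, but the Vulkan no-alloc variants reuse GPU buffers in the same spirit:

```python
import numpy as np

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
out = np.empty_like(a)  # allocated once, reused every iteration

for _ in range(3):
    # Writing into a preallocated buffer avoids a fresh allocation per op;
    # the GPU no-alloc path avoids per-iteration buffer creation the same way.
    np.multiply(a, b, out=out)
    np.add(out, a, out=out)  # out now holds a*b + a
```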
See main.py for a simple training example and gpu_demo.py for a
focused correctness + benchmark suite for the Vulkan backend.
Demos
- Essentials demo (softmax/log_softmax, LayerNorm, Dropout, `no_grad`, `detach`):
  - CPU: `uv run essentials_demo.py --device cpu`
  - GPU-autograd (Vulkan): `uv run essentials_demo.py --device gpu`
Note: the Vulkan-backed GPU path requires working Vulkan drivers and `glslc` (the shader compiler)
on your PATH. Low-level backend helpers can still fall back to NumPy in some environments,
but `main.py --device gpu` is expected to run with Vulkan (`glslc` available) and fails with a clear error if it is not.
Modes
There are three execution modes exposed via main.py --device ...:
- `cpu`: NumPy autograd engine (PyTorch-like, runs on CPU)
- `gpu`: explicit Vulkan training path (forward + backward + SGD via purpose-built kernels)
- `gpu-autograd`: experimental GPU autograd (builds a graph on GPU for a growing but still incomplete set of ops)
Quickstart
- Use a virtual environment for best results (e.g. `.venv`, `.venv` + uv, `.venv` + poetry).
Installation
From PyPI (CPU-only):
pip install rasptorch
GPU (Pi 5 Vulkan):
pip install "rasptorch[gpu]"
Optional (for PyTorch bridge features and interop tooling):
pip install "rasptorch[torch]"
Dev/test:
pip install -e ".[dev]"
Notes for GPU mode:
- Requires working Vulkan drivers on your system.
- Requires `glslc` (shader compiler) available on PATH (for Raspberry Pi users).
Quick GPU validation:
- `uv run gpu_demo.py --smoke-only`
  - Initializes Vulkan strictly and runs fast correctness checks for core kernels.
  - If this fails, `uv run main.py --device gpu` will also fail.
Quick model saving check:
- `uv run main.py --device cpu --epochs 1 --save model.pth`
  - Works without `torch`; checkpoints are saved in a rasptorch NumPy archive format.
  - Inspect keys: `python -c "from rasptorch.checkpoint import load_checkpoint; print(load_checkpoint('model.pth').keys())"`
For local development from this repo:
pip install -e .
CLI Quickstart
rasptorch includes an agent-native CLI for tensor operations, model management, and training:
```shell
# Start the chat CLI (type "help" once it is up and running)
uv run rasptorch chat     # with uv
rasptorch chat            # with .venv or standalone Python

# GUI (Streamlit-based); access the UI at http://localhost:8501
uv run rasptorch ui       # with uv
rasptorch ui              # with .venv or standalone Python

# Show available commands (with uv, prefix each command below with "uv run")
rasptorch --help

# Create tensors
rasptorch tensor random --shape 2,3,4
rasptorch tensor zeros --shape 3,4

# Manage models
rasptorch model list
rasptorch model create-linear --input-size 10 --hidden-sizes "32,16" --output-size 2
rasptorch model convert-legacy --src legacy_model.pth --dst converted_model.pth

# JSON output for scripting/agents
rasptorch --json tensor random --shape 2,3,4
```
See rasptorch/rasptorch CLI.md for complete CLI documentation, including training, saving/loading models, and agent integration.
GPU Training (Vulkan)
There are currently two “modes” of training in this repo:
- CPU autograd training (PyTorch-like): uses the NumPy-backed autograd engine.
- Vulkan GPU training (explicit kernels): runs forward + backward + SGD updates on GPU using purpose-built compute shaders.
The Vulkan training path lives in rasptorch/gpu_training.py and currently supports a
2-layer MLP:
Linear -> ReLU -> Linear with MSE loss and SGD.
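The explicit trainer's forward/backward/update step can be sketched in NumPy with hand-derived gradients, which is what the purpose-built GPU kernels compute; all names here are illustrative stand-ins, not the `gpu_training.py` API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: learn y = sum(x) with a 2-layer MLP
X = rng.standard_normal((64, 8)).astype(np.float32)
Y = X.sum(axis=1, keepdims=True)

W1 = rng.standard_normal((8, 16)).astype(np.float32) * 0.1
b1 = np.zeros(16, dtype=np.float32)
W2 = rng.standard_normal((16, 1)).astype(np.float32) * 0.1
b2 = np.zeros(1, dtype=np.float32)
lr = 0.05

losses = []
for _ in range(200):
    # Forward: Linear -> ReLU -> Linear, MSE loss
    h_pre = X @ W1 + b1
    h = np.maximum(h_pre, 0.0)
    pred = h @ W2 + b2
    diff = pred - Y
    losses.append(float((diff ** 2).mean()))

    # Backward: hand-derived gradients, as the explicit GPU kernels do
    dpred = 2.0 * diff / diff.size
    dW2 = h.T @ dpred
    db2 = dpred.sum(axis=0)
    dh = dpred @ W2.T
    dh_pre = dh * (h_pre > 0)     # ReLU gradient mask
    dW1 = X.T @ dh_pre
    db1 = dh_pre.sum(axis=0)

    # SGD update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(losses[0], losses[-1])  # initial vs. final loss
```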
The general gpu-autograd path supports substantially more model-building pieces than the
explicit gpu trainer, including adaptive optimizers, additional activations, LayerNorm,
BatchNorm, Embedding, and MultiheadAttention.
Run it via:
uv run main.py --device gpu --epochs 50 --batch-size 32 --lr 0.1
Saving weights (.pth / .pt):
- `uv run main.py --device gpu --epochs 50 --save model.pth`
  - Saved as a rasptorch NumPy archive checkpoint (works without `torch`).
  - Load with `from rasptorch.checkpoint import load_checkpoint`.
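The "NumPy archive" idea can be sketched with `np.savez`. This is illustrative only; the real checkpoint layout is defined by `rasptorch.checkpoint`, so use `load_checkpoint` for actual files:

```python
import os
import tempfile
import numpy as np

# Hypothetical state dict: one named array per parameter
state = {
    "fc1.weight": np.arange(12, dtype=np.float32).reshape(3, 4),
    "fc1.bias": np.zeros(3, dtype=np.float32),
}

path = os.path.join(tempfile.mkdtemp(), "model.npz")
np.savez(path, **state)  # writes each parameter as a named array

loaded = np.load(path)
print(sorted(loaded.files))
```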
GPU Autograd (WIP)
There is now an experimental gpu-autograd mode that enables loss.backward() even when
the model and activations live on GPU, for a limited set of ops.
Run it via:
uv run main.py --device gpu-autograd --epochs 50 --batch-size 32 --lr 0.1
Currently supported (GPU) in autograd:
- `+`, `*`, `-` (scalar and tensor forms), `@` (matmul)
- scalar ops: `tensor + s`, `tensor * s`, `tensor / s`, plus `s + tensor`, `s * tensor`, `s - tensor`
- `neg`, `relu`, `gelu`, `silu`, `leaky_relu`, `elu`, `sigmoid`, `tanh`, `sum`, `mean`, `T` (2D transpose)
- tensor shape/join helpers: `unsqueeze`, `squeeze`, `flatten`, `permute` (common tensors up to 4D), `transpose(dim0, dim1)`, `cat`, `stack`, `split`, `chunk`
- axis-based `sum(axis=...)`/`mean(axis=...)` for 2D tensors with `axis=0/1` (GPU-native forward and backward)
- `functional.softmax`/`functional.log_softmax` (2D row-wise, `dim=-1`/`1`)
- `nn.LayerNorm` (2D inputs, 1D `normalized_shape`; `eps=1e-5` stays on GPU, other `eps` values fall back to CPU)
- `nn.BatchNorm1d`, `nn.BatchNorm2d`, `nn.Embedding`, `nn.MultiheadAttention`
- `Linear` backward (GPU grads for `weight`/`bias`)
- `SGD.step()` updates GPU parameters in-place (SGD + optional momentum/weight decay)
- `Adam.step()`, `AdamW.step()`, `RMSProp.step()` update GPU parameters in-place
- `functional.cross_entropy(logits, target_onehot)` (softmax cross-entropy, mean reduction)
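The softmax cross-entropy with one-hot targets and mean reduction described above can be sketched in NumPy (an illustration of the math, not rasptorch's kernel code):

```python
import numpy as np

def cross_entropy(logits, target_onehot):
    # Numerically stable log-softmax, then pick the target log-probability
    # via the one-hot mask and mean-reduce over the batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(target_onehot * log_probs).sum(axis=1).mean())

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.0]], dtype=np.float32)
onehot = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]], dtype=np.float32)
loss = cross_entropy(logits, onehot)
print(loss)
```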
Also available on the CPU autograd path:
- `GRU` backward
- `Tensor.__getitem__` / slicing
- axis-based `sum(axis=...)`/`mean(axis=...)`
- `max(axis=...)`/`min(axis=...)` with autograd
- `argmax(axis=...)`/`argmin(axis=...)`
- `nn.MaxPool2d`/`nn.AvgPool2d`
Also available across the library:
- LR schedulers: `StepLR`, `MultiStepLR`, `ExponentialLR`, `CosineAnnealingLR`, `ReduceLROnPlateau`, `WarmupScheduler`
- Initialization helpers: `kaiming_*`, `xavier_*`, `orthogonal_`, `uniform_`, `normal_`, `zeros_`, `ones_`, `constant_`
- Gradient utilities: `clip_grad_norm_`, `clip_grad_value_`, `l1_regularization`, `l2_regularization`, `total_variation_loss`
- AMP surface: `rasptorch.amp.autocast()` and `rasptorch.amp.GradScaler`
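As a concrete example of scheduler semantics, `StepLR` conventionally decays the learning rate by `gamma` once every `step_size` epochs. A one-function sketch of that standard rule (assuming rasptorch follows the usual PyTorch-style semantics):

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    # Decay the learning rate by gamma once every step_size epochs.
    return base_lr * (gamma ** (epoch // step_size))

print(step_lr(0.1, 0), step_lr(0.1, 9), step_lr(0.1, 10), step_lr(0.1, 25))
```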
Tip: rasptorch.no_grad() exists (like PyTorch) to disable graph building during evaluation.
Training Loop Utilities
There is now a small, reusable training loop helper in rasptorch.train that provides
PyTorch-like epoch logs (loss, accuracy/metrics, throughput) for any model.
Key pieces:
- `rasptorch.train.fit(...)`: train loop with optional validation
- `rasptorch.train.Accuracy()`: top-1 classification accuracy
- `rasptorch.train.classification_target_one_hot(C, device=...)`: converts integer labels -> one-hot
Example (classifier):
```python
from rasptorch import functional as F
from rasptorch.train import fit, Accuracy, classification_target_one_hot
from rasptorch.optim import SGD

model = ...
opt = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

fit(
    model,
    opt,
    train_loader,
    loss_fn=F.cross_entropy,
    device="gpu",
    epochs=10,
    val_loader=val_loader,
    target_transform=classification_target_one_hot(num_classes=10, device="gpu"),
    metrics=[Accuracy()],
)
```
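The integer-label-to-one-hot conversion performed by the target transform can be sketched in NumPy (illustrative only; the real helper also handles device placement):

```python
import numpy as np

def one_hot(labels, num_classes):
    # One row per label; set a single 1.0 at each label's column index.
    out = np.zeros((len(labels), num_classes), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out

oh = one_hot(np.array([2, 0, 1]), num_classes=4)
print(oh)
```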
Notes:
- Metrics like accuracy call `.numpy()` on logits, which triggers a GPU readback.
- `rasptorch.no_grad()` exists; evaluation can avoid building graphs.
- `mse_loss` is now implemented purely via tensor ops (`(pred-target)^2` + `mean()`), so the loss tensor itself is on GPU in `gpu-autograd` mode; training code typically reads it back via `.numpy()` for logging.
- Parameters and gradients stay on GPU; loss is read back to CPU for logging.
- `uv run main.py --device gpu` now requires Vulkan. If Vulkan init or shader compilation fails, it raises a clear error instead of silently falling back.
- Broadcasting is still limited; common 2D + 1D row-vector forms like `(N,M) + (M,)` and `(N,M) * (M,)` are supported.
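For broadcast forms like `(N,M) + (M,)`, the forward follows NumPy broadcasting rules and the backward must sum the upstream gradient back down to each operand's original shape. A NumPy sketch of that gradient-reduction step (illustrative, not rasptorch's kernel code):

```python
import numpy as np

def reduce_grad_to_shape(grad, shape):
    # Sum over leading axes that broadcasting prepended to the operand
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum over axes where the operand had size 1 but the output did not
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

a = np.ones((4, 3))          # (N, M)
b = np.ones(3)               # (M,) broadcast across N rows
out_grad = np.ones((4, 3))   # upstream gradient of (a + b)
gb = reduce_grad_to_shape(out_grad, b.shape)
print(gb)  # each entry of b received N = 4 gradient contributions
```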
PyTorch Bridge (Vulkan-Optimized Inference)
rasptorch.torch_bridge enables efficient inference of PyTorch models with GPU acceleration:
```python
from rasptorch.torch_bridge import convert_torch_model
import torch

# Convert a PyTorch model for GPU inference
torch_model = torch.nn.Sequential(
    torch.nn.Linear(10, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
).eval()

rasp_model = convert_torch_model(torch_model, device="gpu")

# Forward pass: compatible layers run on GPU, unsupported layers fall back to CPU
x = torch.randn(32, 10, dtype=torch.float32)
y = rasp_model(x)
```
Key optimizations:
- Parameter caching: Model weights are cached as GPU buffers at initialization (one-time cost).
- Direct GPU streaming: Tensor data flows CPU → Vulkan GPU without intermediate numpy materialization.
- Zero-copy Vulkan dispatch: Buffers are reused across kernel invocations.
- Automatic layer fallback: Unsupported layers transparently use CPU.
Supported layers: Conv2d, Linear, ReLU, GELU, Sigmoid, Tanh, BatchNorm2d, LayerNorm, MaxPool2d, AvgPool2d, Dropout
Additional APIs
Optimization:
- `rasptorch.optim`: `SGD`, `Adam`, `AdamW`, `RMSProp`
- `rasptorch.optim_sched`: `StepLR`, `MultiStepLR`, `ExponentialLR`, `CosineAnnealingLR`, `ReduceLROnPlateau`, `WarmupScheduler`
Initialization:
- `rasptorch.init`: `kaiming_uniform_`, `kaiming_normal_`, `xavier_uniform_`, `xavier_normal_`, `orthogonal_`, `constant_`, `zeros_`, `ones_`, `uniform_`, `normal_`
Regularization and gradient helpers:
- `rasptorch.utils`: `clip_grad_norm_`, `clip_grad_value_`, `l1_regularization`, `l2_regularization`, `total_variation_loss`
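Global-norm gradient clipping can be sketched in NumPy; this mirrors the usual `clip_grad_norm_` semantics (compute the total norm across all gradients, scale everything down if it exceeds `max_norm`), but is an illustration rather than rasptorch's implementation:

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    # Global L2 norm across all gradient arrays
    total_norm = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:
        for g in grads:
            g *= scale  # in-place, as the trailing-underscore name suggests
    return total_norm

grads = [np.array([3.0, 4.0])]       # global norm = 5
norm = clip_grad_norm(grads, max_norm=1.0)
print(norm, grads[0])                # grads scaled down to unit norm
```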
Tensor helpers:
- `Tensor.unsqueeze()`, `Tensor.squeeze()`, `Tensor.permute()`, `Tensor.transpose()`, `Tensor.flatten()`
- `Tensor.split()`, `Tensor.chunk()`
- `rasptorch.cat(...)`, `rasptorch.stack(...)`
GPU notes for tensor helpers:
- `unsqueeze`, `squeeze`, and `flatten` are view-based on GPU
- `cat`, `stack`, `split`, and `chunk` now use Vulkan device-to-device buffer copies
- `permute` / general `transpose(dim0, dim1)` are Vulkan-native for common tensors up to 4D
More modules:
- `rasptorch.nn`: `BatchNorm1d`, `BatchNorm2d`, `Embedding`, `MultiheadAttention`, `GRU`, `MaxPool2d`, `AvgPool2d`, `GELU`, `SiLU`, `LeakyReLU`, `ELU`
Mixed precision surface:
- `rasptorch.amp.autocast()`
- `rasptorch.amp.GradScaler`
- `Tensor.half()` / `Tensor.float()`
Benchmarks
gpu_demo.py prints timing stats (min/p50/p95/mean/std) for:
- CPU (NumPy)
- GPU compute+readback (includes `.numpy()` every iteration)
- GPU compute-only (no per-iteration readback)
- GPU fused compute-only and no-alloc variants (preallocated output buffers)
If you want the GPU to win, focus on the compute-only + fused/no-alloc numbers.
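The reported statistics can be reproduced with a small helper; this is a sketch of the methodology (per-iteration wall-clock timings summarized as min/p50/p95/mean/std), not gpu_demo.py's actual code:

```python
import time
import numpy as np

def timing_stats(fn, iters=50):
    # Time each iteration individually, then summarize the distribution.
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    t = np.asarray(times)
    return {
        "min": float(t.min()),
        "p50": float(np.percentile(t, 50)),
        "p95": float(np.percentile(t, 95)),
        "mean": float(t.mean()),
        "std": float(t.std()),
    }

m = np.ones((64, 64), dtype=np.float32)
stats = timing_stats(lambda: m @ m)
print({k: f"{v:.2e}" for k, v in stats.items()})
```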
Current Limitations
- GPU autograd for core operations (elementwise, matmul, reductions, most activations) is fully implemented and GPU-native.
- GPU-native tensor helper coverage is strongest for practical tensors up to 4D; generic higher-rank permutation is not on a dedicated Vulkan path yet.
- `GRU` autograd is currently CPU-backed; there is no dedicated Vulkan GRU autograd path yet.
- The mixed-precision API surface exists, but true fp16 Vulkan storage/compute kernels are not implemented yet. `autocast()` and `GradScaler` are currently preparatory/experimental.
- GPU reductions: global `sum()`/`mean()` are GPU-native, and 2D axis reductions (`axis=0/1`) are GPU-native for both forward and backward (using GPU broadcast operations). For higher-rank `sum(axis=...)`/`mean(axis=...)`, the forward currently uses CPU (then uploads), but the backward gradient broadcasting is GPU-native.
- PyTorch integration is optimized for Vulkan inference: `rasptorch.torch_bridge` supports `Conv2d`, `Linear`, `ReLU`, `BatchNorm2d`, `MaxPool2d`, `AvgPool2d`, `LayerNorm`, `Sigmoid`, `Tanh`, `GELU`, and `Dropout`. Model weights are cached as GPU buffers with no per-forward CPU transfers. Unsupported layers fall back to CPU automatically.
- Broadcasting: GPU elementwise `+`/`*` support NumPy-style broadcasting up to 8D for forward. Backward gradient reduction for broadcasted operands is GPU-native for ranks up to 8 (the optimized 2D row-vector path remains GPU-native as well).
Development & Tests
- `pytest` runs CPU tests by default.
- The backend smoke test runs everywhere:
  - With Vulkan available, it exercises real GPU kernels.
  - Without Vulkan, it exercises the NumPy fallback path.
- For a strict Vulkan-only check, run `uv run gpu_demo.py --smoke-only`.
Publishing (maintainers)
Build:
- `python -m pip install -U build twine`
- `python -m build`
Upload to PyPI:
- `python -m twine upload dist/*`