Tensor computation library with automatic differentiation
Polygrad Python
Python bindings for polygrad, a C11 port of tinygrad's compiler core.
Polygrad moves the compiler and runtime out of Python and into a reusable native library that can be called from multiple frontends, including Python, JavaScript, and R.
This package exposes that core as a lazy Tensor API with autograd, neural network layers, compiled training steps, and HuggingFace model loading.
- Project home and docs: https://polygrad.org
- Source code and issue tracker: https://github.com/polygrad/polygrad
Installation
pip install polygrad
Build requirements: A C compiler (gcc or clang) and Python development headers (python3-dev).
Runtime requirement: clang must be on PATH. polygrad compiles compute kernels at runtime.
Package requirements: Python >= 3.9 and numpy.
Platform support: Linux only. The current CPU runtime uses POSIX fork() and dlopen().
PyPI package notes:
- float32 and float64 dtypes
- tinygrad-style device= and .to(...)
- CPU works by default, CUDA is detected at runtime
- download_hf() additionally requires huggingface_hub
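The requirements above (Linux, clang on PATH, Python >= 3.9) can be checked before installing. The following is a minimal sketch using only the standard library; `preflight_problems` is a hypothetical helper written for this README, not part of polygrad.

```python
import shutil
import sys


def preflight_problems(platform=None, clang_path=None, version=None):
    """Return a list of problems; an empty list means the host looks usable.

    All arguments default to values probed from the running interpreter.
    Passing them explicitly is only meant for testing this helper.
    """
    platform = sys.platform if platform is None else platform
    # shutil.which returns None when the executable is not on PATH.
    clang_path = shutil.which("clang") if clang_path is None else clang_path
    version = sys.version_info[:2] if version is None else tuple(version)

    problems = []
    if not platform.startswith("linux"):
        problems.append(f"unsupported platform: {platform} (Linux only)")
    if not clang_path:
        problems.append("clang not found on PATH (kernels are compiled at runtime)")
    if version < (3, 9):
        problems.append("Python >= 3.9 is required")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print("warning:", problem)
```

The check is advisory only: it does not verify the Python development headers or the numpy dependency.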
For development:
# Editable install (compiles C sources, auto-syncs from repo)
pip install -e py/
# Or build the shared library manually and point to it
make
export POLYGRAD_LIB=/path/to/build/libpolygrad.so
Quick Start
from polygrad import Tensor
# Create tensors
a = Tensor.rand(3, 4)
b = Tensor.rand(4, 5)
# Matrix multiply + softmax
c = (a @ b).softmax(-1)
print(c.numpy())
# Autograd
x = Tensor([1.0, 2.0, 3.0])
x.requires_grad = True
loss = (x * x).sum()
loss.backward()
print(x.grad.numpy()) # [2.0, 4.0, 6.0]
Devices
from polygrad import Device, Tensor
x = Tensor.rand(4)
if Device.cuda_available():
    y = (x * 2).to('cuda')
    print(y.device)   # CUDA
    print(y.numpy())  # executes on CUDA, returns host numpy array
else:
    y = (x * 2).to('cpu')
    print(y.device)   # CPU
Training Example
from polygrad import Tensor
from polygrad.nn import Linear, SGD, get_parameters
Tensor.manual_seed(42)
model = Linear(2, 1)
opt = SGD(get_parameters(model), lr=0.01)
for i in range(100):
    opt.zero_grad()
    x = Tensor([[1.0, 2.0], [3.0, 4.0]])
    target = Tensor([[5.0], [11.0]])
    loss = (model(x) - target).square().mean()
    loss.backward()
    opt.step()
print(f"loss: {loss.item():.4f}")
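For readers who want to see what the loop above computes, here is a dependency-free sketch of the same problem: gradient descent on mean squared error for a two-input linear model. It does not use polygrad. The zero initialization is an assumption made for determinism (the library presumably initializes `Linear` randomly), and `train_linear` is a hypothetical name.

```python
def train_linear(xs, ys, lr=0.01, steps=100):
    """Plain gradient descent on mean squared error for y = w . x + b."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0  # assumption: zero init, chosen for reproducibility
    losses = []
    for _ in range(steps):
        preds = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in xs]
        errs = [p - y for p, y in zip(preds, ys)]
        losses.append(sum(e * e for e in errs) / n)
        # d(loss)/dw_j = (2/n) * sum_i err_i * x_ij
        grad_w = [2.0 / n * sum(e * x[j] for e, x in zip(errs, xs)) for j in range(d)]
        # d(loss)/db = (2/n) * sum_i err_i
        grad_b = 2.0 / n * sum(errs)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b, losses


w, b, losses = train_linear([[1.0, 2.0], [3.0, 4.0]], [5.0, 11.0])
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The data is fit exactly by w = [1, 2], b = 0, so the loss should shrink toward zero as training proceeds.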
Tensor API
Construction
| Method | Description |
|---|---|
| Tensor(data) | From list, numpy array, or scalar |
| Tensor.zeros(*shape) | Tensor of zeros |
| Tensor.ones(*shape) | Tensor of ones |
| Tensor.full(shape, val) | Tensor filled with value |
| Tensor.rand(*shape) | Uniform random [0, 1) |
| Tensor.randn(*shape) | Standard normal |
| Tensor.randint(low, high, shape) | Random integers [low, high) |
| Tensor.arange(stop, start=0, step=1) | Arithmetic progression |
| Tensor.linspace(start, stop, steps) | Evenly spaced values |
| Tensor.eye(n) | Identity matrix |
| Tensor.empty(*shape) | Uninitialized tensor |
| Tensor.manual_seed(seed) | Set random seed |
Properties
| Property | Type | Description |
|---|---|---|
| shape | tuple | Dimension sizes |
| ndim | int | Number of dimensions |
| dtype | str | 'float32' or 'float64' |
| device | str | Current tensor device ('CPU' or 'CUDA') |
| T | Tensor | Transpose of last two dims |
| requires_grad | bool | Settable; enables autograd |
| grad | Tensor/None | Gradient after .backward() |
Realization & Conversion
| Method | Returns | Description |
|---|---|---|
| realize() | Tensor | Execute lazy graph, return self |
| numpy() | ndarray | Realize and return numpy array |
| item() | float | Scalar value |
| tolist() | list | Nested Python list |
| to(device) | Tensor | Copy tensor view to 'cpu' or 'cuda' |
| cpu() / cuda() | Tensor | Convenience wrappers for to(...) |
| numel() | int | Total elements |
| size(dim=None) | tuple/int | Shape or dimension size |
| detach() | Tensor | Copy without graph |
| clone() | Tensor | Copy preserving requires_grad |
Arithmetic
a + b, a - b, a * b, a / b, -a, a ** b
All support broadcasting and scalar operands.
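As an illustration of what broadcasting means for operand shapes, the sketch below computes the result shape under NumPy-style rules. That polygrad follows exactly these rules is an assumption based on its tinygrad lineage; `broadcast_shape` is a hypothetical helper, not a library function.

```python
def broadcast_shape(a, b):
    """Return the broadcast of two shapes under NumPy-style rules.

    Raises ValueError when the shapes are incompatible.
    """
    result = []
    # Walk both shapes from the trailing dimension; missing leading dims count as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        # A size-1 dimension stretches to match the other operand.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


print(broadcast_shape((3, 4), (4,)))    # (3, 4)
print(broadcast_shape((3, 1), (1, 5)))  # (3, 5)
```

A scalar operand behaves like the empty shape `()`, which broadcasts against anything.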
Comparisons
a < b, a == b, a != b, a > b, a >= b, a <= b
Each comparison returns a float tensor (1.0 = true, 0.0 = false).
Element-wise Math
| Method | Description |
|---|---|
| exp() | e^x |
| log() | ln(x) |
| sqrt() | Square root |
| square() | x^2 |
| abs() | Absolute value |
| sign() | Sign (-1, 0, +1) |
| reciprocal() | 1/x |
| rsqrt() | 1/sqrt(x) |
| sin(), cos(), tan() | Trigonometric |
| ceil(), floor(), round(), trunc() | Rounding |
| isnan(), isinf() | NaN/Inf detection |
| exp2(), log2() | Base-2 functions |
| where(x, y) | Conditional: self ? x : y |
| maximum(other) | Element-wise max |
| minimum(other) | Element-wise min |
| clamp(min_=None, max_=None) | Clamp to range |
Activations
| Method | Description |
|---|---|
| relu() | max(0, x) |
| relu6() | clamp(relu(x), 0, 6) |
| leaky_relu(neg_slope=0.01) | Leaky ReLU |
| sigmoid() | 1 / (1 + e^-x) |
| tanh() | Hyperbolic tangent |
| gelu() | Gaussian Error Linear Unit |
| quick_gelu() | Fast GELU approximation |
| silu() / swish() | x * sigmoid(x) |
| elu(alpha=1.0) | Exponential Linear Unit |
| softplus(beta=1.0) | log(1 + e^(beta*x)) / beta |
| mish() | x * tanh(softplus(x)) |
| hardtanh(min_val=-1, max_val=1) | Clamped linear |
| hardswish() | Hard swish |
| hardsigmoid() | Hard sigmoid |
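The formulas in the table can be written out as scalar reference functions, which is handy when checking tensor results by hand. This is a sketch in plain Python built from the formulas listed above, not polygrad's implementation; it is not hardened against overflow for large inputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu6(x):
    # clamp(relu(x), 0, 6)
    return min(max(0.0, x), 6.0)


def silu(x):
    # x * sigmoid(x), also known as swish
    return x * sigmoid(x)


def softplus(x, beta=1.0):
    # log(1 + e^(beta*x)) / beta
    return math.log1p(math.exp(beta * x)) / beta


def mish(x):
    # x * tanh(softplus(x))
    return x * math.tanh(softplus(x))


for name, fn in [("relu6", relu6), ("silu", silu), ("softplus", softplus), ("mish", mish)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```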
Reductions
| Method | Description |
|---|---|
| sum(axis=None, keepdim=False) | Sum along axes |
| max(axis=None, keepdim=False) | Maximum along axes |
| min(axis=None, keepdim=False) | Minimum along axes |
| mean(axis=None, keepdim=False) | Mean along axes |
| var(axis=None, keepdim=False, correction=1) | Variance |
| std(axis=None, keepdim=False, correction=1) | Standard deviation |
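The `correction` argument of var() and std() defaults to 1. Assuming it has the usual meaning (the divisor is n - correction, so 1 gives the sample variance and 0 the population variance), the effect can be sketched with the standard library:

```python
import statistics


def var(values, correction=1):
    """Variance with divisor (n - correction); an illustrative reference only."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - correction)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(var(data, correction=0))  # population variance, matches statistics.pvariance
print(var(data, correction=1))  # sample variance, matches statistics.variance
```

With the default of 1, results should line up with `statistics.variance` rather than with `numpy.var`, whose default divisor is n.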
Movement / Shape
| Method | Description |
|---|---|
| reshape(*shape) / view(*shape) | Reshape (supports -1) |
| permute(*order) | Permute dimensions |
| transpose(dim0=-2, dim1=-1) | Swap two dimensions |
| expand(*shape) | Broadcast to shape |
| squeeze(dim=None) | Remove size-1 dims |
| unsqueeze(dim) | Add size-1 dim |
| flatten(start_dim=0, end_dim=-1) | Flatten dim range |
| unflatten(dim, sizes) | Split dim into multiple |
| shrink(arg) | Slice: [(start, end), ...] |
| pad(arg) | Pad: [(before, after), ...] |
| flip(axis) | Reverse along axes |
| repeat(*repeats) | Tile tensor |
Linear Algebra
| Method | Description |
|---|---|
| matmul(other) / dot(other) / @ | Matrix multiplication |
| linear(weight, bias=None) | x @ weight.T + bias |
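The expression x @ weight.T + bias implies that weight is stored as (out_features, in_features). A nested-list sketch of that layout, written for illustration and independent of polygrad:

```python
def linear(x, weight, bias=None):
    """x: [batch][in_f], weight: [out_f][in_f], bias: [out_f] -> [batch][out_f]."""
    # Each output column is the dot product of an input row with one weight row,
    # which is exactly x @ weight.T.
    out = [[sum(xi * wi for xi, wi in zip(row, w_row)) for w_row in weight] for row in x]
    if bias is not None:
        out = [[v + b for v, b in zip(row, bias)] for row in out]
    return out


print(linear([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0]], [0.0]))  # [[5.0], [11.0]]
```

The inputs here reuse the data from the Training Example: the weight [1, 2] with zero bias reproduces its targets exactly.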
Normalization & Loss
| Method | Description |
|---|---|
| softmax(axis=-1) | Softmax normalization |
| log_softmax(axis=-1) | Log-softmax |
| layernorm(axis=-1, eps=1e-5) | Layer normalization |
| cross_entropy(target, axis=-1) | Cross-entropy loss |
| binary_crossentropy(target) | Binary cross-entropy |
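For reference, softmax, log-softmax, and cross-entropy over a single vector can be sketched as below. Two assumptions are made here: the target is a probability distribution (one-hot included) over the same axis, and the max-subtraction trick is used for numerical stability. Whether polygrad's cross_entropy() expects this target format is not stated in this README.

```python
import math


def log_softmax(logits):
    # Subtract the max before exponentiating so large logits do not overflow.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]


def cross_entropy(logits, target_probs):
    """Negative sum of target * log_softmax(logits); target is a distribution."""
    return -sum(t * lp for t, lp in zip(target_probs, log_softmax(logits)))


print(softmax([1.0, 2.0, 3.0]))
print(cross_entropy([0.0, 0.0], [1.0, 0.0]))  # ln(2) for a uniform prediction
```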
Advanced Operations
| Method | Description |
|---|---|
| Tensor.einsum(formula, *operands) | Einstein summation |
| rearrange(formula, **kwargs) | einops-style rearrange |
| Tensor.cat(*tensors, dim=0) | Concatenate along dim |
| Tensor.stack(*tensors, dim=0) | Stack along new dim |
| split(sizes, dim=0) | Split into chunks |
| chunk(n, dim=0) | Split into n chunks |
| __getitem__ | Indexing: int, slice, None, Ellipsis |
Autograd
x = Tensor([1.0, 2.0])
x.requires_grad = True
loss = (x * x).sum()
loss.backward()
print(x.grad.numpy()) # [2.0, 4.0]
Call backward() on a scalar loss before calling item() or numpy() on the loss.
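A quick way to sanity-check a gradient such as the one above is a central finite difference. The sketch below does this in plain Python for the same loss, sum(x * x), whose analytic gradient is 2x. `numerical_grad` is a hypothetical helper, not a polygrad function.

```python
def loss_fn(x):
    return sum(v * v for v in x)


def numerical_grad(f, x, eps=1e-6):
    """Approximate df/dx_i with (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


print(numerical_grad(loss_fn, [1.0, 2.0]))  # approximately [2.0, 4.0]
```

Comparing these values against x.grad.numpy() with a small tolerance is a common test pattern for autograd code.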
nn Module
Layers
from polygrad.nn import Linear, LayerNorm, RMSNorm, Embedding, Dropout
| Class | Computation | Description |
|---|---|---|
| Linear(in_f, out_f, bias=True) | y = x @ W.T + b | Fully connected layer |
| LayerNorm(shape, eps=1e-5) | (x - mean) / sqrt(var + eps) * w + b | Layer normalization |
| RMSNorm(dim, eps=1e-5) | x / rms(x) * w | Root mean square normalization |
| Embedding(vocab, dim) | Lookup table | Token embedding |
| Dropout(p=0.5) | Random zeroing | Training-only (controlled by Tensor.training) |
| GroupNorm(groups, channels) | Group normalization | Per-group normalization |
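The two normalization formulas in the table can be written out for a single vector as follows. This is an illustrative sketch: the biased (divide by n) variance in LayerNorm and the placement of eps inside the square root for RMSNorm are conventional choices assumed here, not details confirmed by this README.

```python
import math


def layernorm(x, w, b, eps=1e-5):
    """(x - mean) / sqrt(var + eps) * w + b over one vector."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # assumption: biased variance
    return [(v - mean) / math.sqrt(var + eps) * wi + bi for v, wi, bi in zip(x, w, b)]


def rmsnorm(x, w, eps=1e-5):
    """x / rms(x) * w, with rms(x) = sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)  # assumption: eps inside sqrt
    return [v / rms * wi for v, wi in zip(x, w)]


print(layernorm([1.0, 2.0, 3.0], [1.0] * 3, [0.0] * 3))
print(rmsnorm([3.0, 4.0], [1.0, 1.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so its output is not centered.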
Optimizers
from polygrad.nn import SGD, Adam, AdamW, get_parameters
| Class | Signature |
|---|---|
| SGD | SGD(params, lr=0.01, momentum=0.0, weight_decay=0.0) |
| Adam | Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0) |
| AdamW | AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01) |
All optimizers have step() and zero_grad() methods.
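The difference between Adam and AdamW lies in how weight decay is applied. The sketch below shows one scalar update in the textbook formulation, using the default hyperparameters from the table. It is an assumption that polygrad's optimizers follow this formulation; `adam_step` is a hypothetical function written for illustration.

```python
import math


def adam_step(p, g, m, v, t, lr=0.001, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam (decoupled=False) or AdamW (decoupled=True) update of a scalar.

    p: parameter, g: gradient, m/v: running moments, t: 1-based step count.
    Returns the updated (p, m, v).
    """
    b1, b2 = betas
    if weight_decay and not decoupled:
        g = g + weight_decay * p  # Adam: L2 penalty folded into the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)  # bias correction for the zero-initialized moments
    v_hat = v / (1 - b2 ** t)
    if weight_decay and decoupled:
        p = p * (1 - lr * weight_decay)  # AdamW: decay applied directly to the weight
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


print(adam_step(1.0, 1.0, 0.0, 0.0, t=1))
print(adam_step(1.0, 1.0, 0.0, 0.0, t=1, weight_decay=0.01, decoupled=True))
```

On the first step the bias-corrected update has magnitude close to lr regardless of the gradient scale, which is why both results differ from 1.0 by roughly 0.001.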
State Dict
from polygrad.nn import get_parameters, get_state_dict, load_state_dict
params = get_parameters(model) # List of Tensor
sd = get_state_dict(model) # {'weight': Tensor, 'bias': Tensor, ...}
load_state_dict(model2, sd) # Load params into another model
Compiled Training Steps
Compile a training step into a reusable C program. The first call traces the computation graph; subsequent calls execute with zero scheduling overhead.
from polygrad import Tensor
from polygrad.nn import Linear, SGD, get_parameters, compile_step
Tensor.manual_seed(42)
model = Linear(4, 1)
opt = SGD(get_parameters(model), lr=0.01)
# Sample inputs (shapes must match at runtime)
x = Tensor.rand(8, 4)
y = Tensor.rand(8, 1)
def train_step(model, opt, x, y):
    loss = (model(x) - y).square().mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss
# Compile: traces forward + backward + optimizer into one PolyStep
step = compile_step(train_step, model, opt, x, y)
# Run: executes compiled kernels with current buffer data
for i in range(100):
    x._data[:] = ...  # update input data in-place
    y._data[:] = ...
    step.run()
    print(f"step {i}: loss = {step.loss_value():.4f}")
compile_step returns a CompiledTrainingStep with:
- run() -- execute all compiled kernels (forward + backward + optimizer)
- loss_value() -- read the loss scalar from the output buffer
- n_kernels -- number of compiled kernels
- n_intermediates -- number of pre-allocated intermediate buffers
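The training loop above writes new data through slice assignment (`x._data[:] = ...`) instead of rebinding `x`. A plausible reason, assumed here rather than documented, is that the compiled step keeps references to the buffers it was traced with. The sketch illustrates that general Python behavior with a plain list; `FakeCompiledStep` is a hypothetical stand-in, not a polygrad class.

```python
class FakeCompiledStep:
    """Hypothetical stand-in that remembers the buffer it was built against."""

    def __init__(self, buf):
        self._buf = buf  # holds a reference to the object, not a copy

    def run(self):
        return sum(self._buf)


buf = [1.0, 2.0, 3.0]
step = FakeCompiledStep(buf)

buf[:] = [10.0, 20.0, 30.0]  # in-place: same object, new contents
in_place = step.run()        # sees the new data

buf = [100.0, 200.0, 300.0]  # rebinding: the name now points to a new object
after_rebind = step.run()    # still reads the old object

print(in_place, after_rebind)
```

The same distinction applies to numpy arrays, where `arr[:] = new` overwrites the existing memory and `arr = new` does not.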
HuggingFace Model Loading
Load pre-trained models directly from HuggingFace format (config.json + safetensors).
To use download_hf(), install the optional Hub client first:
pip install huggingface_hub
from polygrad.hf import load_hf, download_hf, generate
import numpy as np
import json
from pathlib import Path
# Download a small GPT-2 checkpoint from HuggingFace Hub
model_path = download_hf('hf-internal-testing/tiny-random-gpt2')
config = json.loads((Path(model_path) / 'config.json').read_text())
vocab_size = config['vocab_size']
max_seq_len = 16
# Load into a PolyInstance
inst = load_hf(model_path, max_batch=1, max_seq_len=max_seq_len)
# Run forward pass
tokens = np.array([[1, 2, 3, 4]], dtype=np.float32)
outputs = inst.forward(
    x=tokens,
    positions=np.arange(tokens.shape[1], dtype=np.float32).reshape(1, -1),
    arange=np.arange(max_seq_len, dtype=np.float32)
)
logits = outputs['output'].reshape(1, max_seq_len, vocab_size)
# Autoregressive generation
result = generate(inst, tokens, max_new_tokens=2, temperature=1.0, top_k=10)
HF API
| Function | Description |
|---|---|
| load_hf(path, max_batch=1, max_seq_len=0) | Load model from local directory |
| load_hf_bytes(config, weights, ...) | Load from raw bytes (no filesystem) |
| download_hf(repo_id, cache_dir=None) | Download from HuggingFace Hub |
| generate(inst, tokens, max_new_tokens, ...) | Autoregressive text generation |
Supported model types: GPT-2. Weight formats: F32, F16, BF16 safetensors (single or sharded).
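The `temperature` and `top_k` arguments passed to generate() in the example above usually correspond to temperature-scaled top-k sampling over the last position's logits. The sketch below shows that technique in plain Python; it is a generic illustration under that assumption, and `sample_top_k` is a hypothetical function, not how polygrad necessarily implements generate().

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=10, rng=random):
    """Sample one token index from the top_k highest logits.

    temperature must be > 0; lower values make the choice more deterministic.
    """
    k = min(top_k, len(logits))
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Unnormalized softmax weights; subtracting the max avoids overflow.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 5.0, 0.2, -1.0]
print(sample_top_k(logits, top_k=1))                            # always index 1
print(sample_top_k(logits, top_k=2, rng=random.Random(0)))      # index 1 or 2
```

With top_k=1 the procedure reduces to greedy decoding, which is convenient for reproducible tests.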
How It Works
- Lazy evaluation: Operations build a UOp graph in the C core. No computation happens until realize(), numpy(), item(), or backward().
- One FFI call per op: Each Tensor method calls one C function via ctypes. The C core handles all op composition (e.g., gelu = 0.5*x*(1+tanh(sqrt(2/pi)*(x+0.044715*x^3)))).
- Realize boundaries: Some ops (softmax, layernorm, var) insert implicit .realize() calls to create kernel boundaries for the scheduler.
- Autograd: backward() calls C's poly_grad() for each parameter, then realizes the gradient tensors.
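The lazy-evaluation idea in the first bullet can be sketched in a few lines of Python: operators only record graph nodes, and values are computed when realize() is called. This toy `Lazy` class is hypothetical and far simpler than the UOp graph in the C core; it only shows the deferred-execution pattern.

```python
class Lazy:
    """Toy lazy scalar: arithmetic builds a graph, realize() evaluates it."""

    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def __add__(self, other):
        return Lazy("add", (self, _wrap(other)))

    def __mul__(self, other):
        return Lazy("mul", (self, _wrap(other)))

    def realize(self):
        # Constants already carry a value; other nodes compute and cache it once.
        if self.value is None:
            a, b = (node.realize() for node in self.inputs)
            self.value = a + b if self.op == "add" else a * b
        return self.value


def _wrap(x):
    return x if isinstance(x, Lazy) else Lazy("const", value=float(x))


def const(x):
    return Lazy("const", value=float(x))


y = const(2.0) * 3 + 1
print(y.value)      # None: nothing has been computed yet
print(y.realize())  # 7.0
```

A real scheduler would fuse the recorded operations into kernels before executing them, which is the step this sketch leaves out.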
Limitations
- CPU is the default path. CUDA execution requires a working CUDA runtime (libcuda and libnvrtc) on the host.
- Conv2d and BatchNorm are stubs (forward raises NotImplementedError).
Tests
python -m pytest py/tests/ -v # 130 tests (tensor + nn + compiled step + GPT-2 + HF loading + instance)