A small, readable reverse-mode autograd library built on numpy and pyccolo
pycograd
A small, readable reverse-mode automatic-differentiation library, built on numpy
and pyccolo. Write ordinary numeric Python
— including numpy calls like np.exp, np.dot, np.sum and operators like
@ — and get correct gradients, with no special "autodiff namespace."
The transform API is modelled on JAX: grad,
vmap, and jvp are function-to-function transforms you compose freely, and a
program can be captured into an inspectable, optimizable graph — a typed SSA form
much like a JAX jaxpr. The difference is that pycograd is small enough to read
in an afternoon, and it differentiates the numpy you already write rather than a
look-alike array API.
Install
pip install pycograd
Quickstart
Hand any numpy function to grad / value_and_grad; the array argument is lifted
onto the tape for you.
import numpy as np
from pycograd import value_and_grad
def f(x):
    return np.sum(np.sin(x * x))  # ordinary numpy -- and it differentiates
x = np.array([0.5, 1.0, 1.5])
value, (g,) = value_and_grad(f)(x)
# g == 2 * x * cos(x * x)
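The closed form in the comment above can be checked without pycograd at all. The following is a minimal numpy-only sketch that compares 2 * x * cos(x * x) against central finite differences; the helper names analytic_grad and numeric_grad are illustrative and not part of the library.

```python
import numpy as np

def f(x):
    return np.sum(np.sin(x * x))

def analytic_grad(x):
    # d/dx_i sum(sin(x^2)) = 2 * x_i * cos(x_i^2)
    return 2 * x * np.cos(x * x)

def numeric_grad(fn, x, eps=1e-6):
    # Central differences, perturbing one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return g

x = np.array([0.5, 1.0, 1.5])
max_err = np.max(np.abs(analytic_grad(x) - numeric_grad(f, x)))
```

If value_and_grad behaves as described, its g should agree with analytic_grad(x) to the same tolerance.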
Composable transforms
The transforms are borrowed from JAX, and like JAX's they compose. grad and
value_and_grad differentiate; vmap vectorizes a function written for one
example over a whole batch in a single pass; jvp (with jacfwd / jacrev)
gives forward-mode and Jacobians. Composing vmap with grad yields something a
plain batched backward cannot — the gradient of each example separately,
stacked over the batch (what gradient clipping and DP-SGD need):
import numpy as np
from pycograd import grad, vmap, cross_entropy
rng = np.random.default_rng(0)
N = 64
w = rng.standard_normal((2, 3)) # shared weights ...
b = rng.standard_normal(3) # ... and bias
X = rng.standard_normal((N, 2)) # N points, each (2,)
Y = np.eye(3)[rng.integers(0, 3, N)] # N one-hot labels, each (3,)
def per_example_loss(w, b, x, y):  # one (2,) point + one label -> scalar
    return cross_entropy(x @ w + b, y)
# in_axes maps over X and Y, holds w and b shared:
gw, gb, _, _ = vmap(grad(per_example_loss), in_axes=(None, None, 0, 0))(w, b, X, Y)
# gw: (N, 2, 3) gb: (N, 3) -- one gradient per example
# their mean over the batch equals the gradient of the batch-mean loss
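For this particular model the per-example gradients can also be written by hand, which makes the shapes and the batch-mean claim easy to verify. The sketch below uses only numpy and assumes cross_entropy means softmax cross-entropy applied to logits (an assumption about the op, not something confirmed here); the clipping threshold C is an arbitrary illustrative value.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N = 64
w = rng.standard_normal((2, 3))
b = rng.standard_normal(3)
X = rng.standard_normal((N, 2))
Y = np.eye(3)[rng.integers(0, 3, N)]

# For softmax cross-entropy, d(loss_i)/d(logits_i) = softmax(logits_i) - y_i.
dz = softmax(X @ w + b) - Y              # (N, 3)
gw = np.einsum("ni,nj->nij", X, dz)      # (N, 2, 3): outer(x_i, dz_i) per example
gb = dz                                  # (N, 3)

# Gradient of the batch-mean loss, computed the usual batched way.
gw_full = X.T @ dz / N
gb_full = dz.mean(axis=0)

# Per-example clipping: rescale each example's gradient to norm at most C.
C = 1.0
norms = np.sqrt((gw ** 2).sum(axis=(1, 2)) + (gb ** 2).sum(axis=1))
scale = np.minimum(1.0, C / norms)
gw_clipped = gw * scale[:, None, None]
gb_clipped = gb * scale[:, None]
```

The clipping step is the part a plain batched backward cannot do, since it needs each example's gradient norm before averaging.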
relu, softmax, cross_entropy, layer_norm, and scaled-dot-product
attention ship as first-class, finite-difference-checked ops, so models stay
plain numpy and the transforms see straight through them.
Inspecting the graph
A numpy function can be captured into a graph instead of run — the same idea as
a JAX jaxpr. capture records the forward, value_and_grad differentiates it into
a combined forward+backward graph, and optimize cleans that up.
import numpy as np
from pycograd import capture, value_and_grad, optimize
def forward(x, w, b):
    h = np.tanh(x @ w + b)
    return np.sum(h * h)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 2))
b = rng.standard_normal(2)
g = capture(forward, x, w, b) # trace once over (shape, dtype) inputs
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  outputs: %7
}
value_and_grad(g) returns one graph holding the value and the gradient w.r.t.
every input (grad(g) keeps just the gradients). Written naïvely, the backward
pass is wasteful — it recomputes
tanh (%13, %14), doubles a multiply (%10, %11), and broadcasts a
constant 1.0 (%8, %9):
# value_and_grad(g) -- BEFORE
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  %8 = const 1.0 -> f64[]
  %9 = broadcast_to %8 [4, 2] -> f64[4,2]
  %10 = mul %9 %5 -> f64[4,2]
  %11 = mul %9 %5 -> f64[4,2]
  %12 = add %10 %11 -> f64[4,2]
  %13 = tanh %4 -> f64[4,2]  # recomputes %5
  %14 = tanh %4 -> f64[4,2]  # recomputes %5
  %15 = mul %13 %14 -> f64[4,2]  # recomputes %6
  %16 = sub 1.0 %15 -> f64[4,2]
  %17 = mul %12 %16 -> f64[4,2]
  %18 = sum %17 {axis=0} -> f64[2]
  %19 = transpose %1 [1, 0] -> f64[2,3]
  %20 = matmul %17 %19 -> f64[4,3]
  %21 = transpose %0 [1, 0] -> f64[3,4]
  %22 = matmul %21 %17 -> f64[3,2]
  outputs: %7, %20, %22, %18
}
optimize removes the redundancy by common-subexpression elimination, constant
folding, and dead-code elimination — the recomputed tanh/mul collapse back
onto %5/%6 and the broadcast folds away:
opt = optimize(value_and_grad(g))
# optimize(value_and_grad(g)) -- AFTER
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  %12 = add %5 %5 -> f64[4,2]  # was mul %9 %5 twice; 1.0 broadcast folded away
  %16 = sub 1.0 %6 -> f64[4,2]  # reuses %6 = tanh^2 instead of recomputing tanh
  %17 = mul %12 %16 -> f64[4,2]
  %18 = sum %17 {axis=0} -> f64[2]  # grad wrt b
  %19 = transpose %1 [1, 0] -> f64[2,3]
  %20 = matmul %17 %19 -> f64[4,3]  # grad wrt x
  %21 = transpose %0 [1, 0] -> f64[3,4]
  %22 = matmul %21 %17 -> f64[3,2]  # grad wrt w
  outputs: %7, %20, %22, %18
}
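To make the cleanup concrete, here is a minimal sketch of common-subexpression elimination followed by dead-code elimination over a toy SSA list. The (name, op, args) tuple format and the cse / dce helpers are hypothetical and chosen for illustration; they are not pycograd's IR or its actual passes. The input is the tanh/mul fragment of the BEFORE graph.

```python
def cse(nodes):
    """Merge nodes that apply the same op to the same (already merged) args."""
    seen = {}    # (op, args) -> name of the first node that computed it
    alias = {}   # name of a duplicate -> name of the surviving node
    out = []
    for name, op, args in nodes:
        args = tuple(alias.get(a, a) for a in args)  # rewrite uses of duplicates
        key = (op, args)
        if key in seen:
            alias[name] = seen[key]
        else:
            seen[key] = name
            out.append((name, op, args))
    return out, alias

def dce(nodes, outputs):
    """Keep only nodes reachable from the outputs (nodes are in topological order)."""
    live = set(outputs)
    kept = []
    for name, op, args in reversed(nodes):
        if name in live:
            live.update(args)
            kept.append((name, op, args))
    return kept[::-1]

before = [
    ("%4", "add", ("%3", "%2")),
    ("%5", "tanh", ("%4",)),
    ("%6", "mul", ("%5", "%5")),
    ("%13", "tanh", ("%4",)),       # duplicate of %5
    ("%14", "tanh", ("%4",)),       # duplicate of %5
    ("%15", "mul", ("%13", "%14")), # becomes a duplicate of %6 once its args merge
    ("%16", "sub", ("1.0", "%15")),
]

merged, alias = cse(before)
outputs = [alias.get(o, o) for o in ["%16"]]
after = dce(merged, outputs)
```

Because duplicates are rewritten before the lookup key is built, merging %13 and %14 into %5 is what lets %15 collapse onto %6 in the same pass.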
Because the graph carries (shape, dtype) for every value, eval_shape /
summary can report a net's output shapes and parameter counts without running
it, and a captured forward can be handed to another framework — see below.
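The idea of reporting shapes without running anything can be sketched in a few lines: propagate shapes through the ops instead of arrays. The infer_shape helper and the tuple-based program below are hypothetical stand-ins for illustration, covering only the five ops in the captured forward graph; they say nothing about how eval_shape or summary are actually implemented.

```python
import math
import numpy as np

def infer_shape(op, *shapes):
    # Shape rules for the handful of ops used in the forward graph above.
    if op == "matmul":
        (m, k), (k2, n) = shapes
        if k != k2:
            raise ValueError(f"matmul inner dims differ: {k} vs {k2}")
        return (m, n)
    if op in ("add", "mul"):
        return np.broadcast_shapes(*shapes)
    if op == "tanh":
        return shapes[0]
    if op == "sum":          # full reduction to a scalar
        return ()
    raise NotImplementedError(op)

inputs = {"%0": (4, 3), "%1": (3, 2), "%2": (2,)}
program = [
    ("%3", "matmul", ("%0", "%1")),
    ("%4", "add", ("%3", "%2")),
    ("%5", "tanh", ("%4",)),
    ("%6", "mul", ("%5", "%5")),
    ("%7", "sum", ("%6",)),
]

shapes = dict(inputs)
for name, op, args in program:
    shapes[name] = infer_shape(op, *(shapes[a] for a in args))

# Treating %1 (w) and %2 (b) as the parameters: 3*2 + 2 entries.
n_params = math.prod(shapes["%1"]) + math.prod(shapes["%2"])
```

No array is ever allocated here, which is why this kind of report stays cheap even for large models.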
Training models
For writing models, %load_ext pycograd enables a small DSL (built on
pipescript): a params{ ... } block
declares the weights, a |> pipeline is the forward written once, and
weights.grad differentiates it. Here is a 2-layer MLP classifier:
%load_ext pycograd
import numpy as np
from pycograd import relu, softmax, cross_entropy
rng = np.random.default_rng(42)
# synthetic 3-class data: X is (N, 2), Y one-hot (N, 3)
centers = np.array([[2.0, 2.0], [-2.0, 2.0], [0.0, -2.5]])
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])
Y = np.eye(3)[np.repeat(np.arange(3), 40)]
with params{
    w1 = 0.3 * rng.standard_normal((2, 16)); b1 = np.zeros(16)
    w2 = 0.3 * rng.standard_normal((16, 3)); b2 = np.zeros(3)
} as weights:
    logits = $ |> $ @ w1 + b1 |> relu |> $ @ w2 + b2  # the model, written once
    forward = $ |> logits |> softmax
    obj = |> X |> logits |> cross_entropy($, Y)
for _ in range(200):
    value, grads = weights.grad(obj)  # backprop
    weights.step(grads, 0.5)          # in-place SGD
Weights are referred to by name, frozen[...] holds one of them fixed, and any optimizer
can consume the gradients: swap the loop for train(weights, obj, 200, Adam(lr=cosine_decay(0.05, 200))). The same forward is also what vmap and
the compiler below consume.
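For readers who want to see what the loop amounts to without the DSL, here is a numpy-only sketch of the same 2-layer MLP with hand-written backprop and plain SGD. It assumes the objective is the batch-mean softmax cross-entropy on logits, and it uses a step size of 0.1 (an arbitrary, conservative choice for the sketch); neither is taken from pycograd's internals.

```python
import numpy as np

rng = np.random.default_rng(42)
centers = np.array([[2.0, 2.0], [-2.0, 2.0], [0.0, -2.5]])
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])
Y = np.eye(3)[np.repeat(np.arange(3), 40)]

params = {
    "w1": 0.3 * rng.standard_normal((2, 16)), "b1": np.zeros(16),
    "w2": 0.3 * rng.standard_normal((16, 3)), "b2": np.zeros(3),
}

def loss_and_grads(p):
    # Forward: affine -> relu -> affine -> log-softmax -> mean cross-entropy.
    a = X @ p["w1"] + p["b1"]
    h = np.maximum(a, 0.0)
    z = h @ p["w2"] + p["b2"]
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss = -(Y * logp).sum(axis=1).mean()
    # Backward: chain rule applied by hand, in reverse order of the forward.
    dz = (np.exp(logp) - Y) / len(X)
    dh = dz @ p["w2"].T
    da = dh * (a > 0)
    grads = {"w1": X.T @ da, "b1": da.sum(axis=0),
             "w2": h.T @ dz, "b2": dz.sum(axis=0)}
    return loss, grads

initial_loss, _ = loss_and_grads(params)
for _ in range(200):
    loss, grads = loss_and_grads(params)
    for name in params:
        params[name] -= 0.1 * grads[name]  # in-place SGD step
final_loss, _ = loss_and_grads(params)
```

Everything inside loss_and_grads after the forward pass is what an autodiff tape derives automatically, which is the part the DSL version leaves out.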
Compile to PyTorch / JAX / TensorFlow
The captured graph can be lowered onto another framework's autodiff. Pass
backend= and gradients come back from torch / jax / tf instead of the numpy
tape, matching to floating-point tolerance:
for backend in ("torch", "jax", "tf"):
    v, grads = weights.grad(obj, backend=backend, jit=True)  # same model, framework autodiff
compile_to(forward, "torch") instead returns a plain function over the
framework's own tensors, and to_torch_module / export_torchscript /
export_onnx package a trained net for shipping with no pycograd dependency.
Examples & notebooks
The bundled demos (logistic regression, MLP, LayerNorm/Dropout, single-head Transformer block, GRU/LSTM) train from scratch and are gradient-checked against finite differences:
python -m pycograd.examples
The notebooks/ directory goes deeper; each notebook is an executable
walk-through:
- pycograd_demo: linear classifier → MLP → highway net → self-attention → a Transformer encoder block.
- pycograd_vmap_demo: where vmap earns its keep (per-sample gradients, gradient clipping, batched attention).
- pycograd_rnn_demo / pycograd_rwkv_demo: GRU/LSTM and RWKV (trained in parallel, sampled one token at a time).
- pycograd_compile_*: parity against PyTorch, JAX, TensorFlow, and Apple MPS, plus TorchScript / ONNX export.
- pycograd_graph_viz_demo: the graph IR, its rendering, and the optimization passes shown above.
How it works
- Var is a reverse-mode tape node wrapping a numpy array. Arithmetic operators are overloaded so that running a program builds a computation graph; Var.backward() then walks it in reverse to accumulate gradients.
- Operator overloading alone is not enough. The moment user code calls a numpy function such as np.exp(x), numpy's ufunc machinery takes over and the gradient link is lost. (Var sets __array_ufunc__ = None so this fails loudly instead of silently producing a wrong gradient.) pyccolo supplies the missing piece: its before_call event lets a handler replace the function being called, swapping np.exp for a differentiable d_exp transparently, so idiomatic numpy code "just differentiates." The same mechanism routes scalar math.* through the numpy-backed primitives and powers the |> training DSL.
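The two points above can be illustrated with a toy tape. This is a minimal sketch, not pycograd's Var: it supports only + and *, the d_exp here is a stand-in written for the example, and the pyccolo call interception is not reproduced (d_exp is called explicitly instead of being swapped in for np.exp).

```python
import numpy as np

class Var:
    # Tell numpy not to apply ufuncs to this type; np.exp(Var) raises TypeError.
    __array_ufunc__ = None

    def __init__(self, value, parents=()):
        self.value = np.asarray(value, dtype=float)
        self.grad = np.zeros_like(self.value)
        # parents: (input Var, function mapping this node's grad to that input's grad)
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value,
                   ((self, lambda g: g), (other, lambda g: g)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, lambda g: g * other.value),
                    (other, lambda g: g * self.value)))

    def backward(self):
        # Post-order DFS puts inputs before consumers; walk it in reverse.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = np.ones_like(self.value)
        for v in reversed(order):
            for p, vjp in v.parents:
                p.grad = p.grad + vjp(v.grad)

def d_exp(x):
    # Differentiable exp: d/dx exp(x) = exp(x).
    out = np.exp(x.value)
    return Var(out, ((x, lambda g: g * out),))

x, y = Var(2.0), Var(3.0)
z = d_exp(x * y) + x          # z = exp(x*y) + x
z.backward()

try:
    np.exp(x)                 # plain numpy call on a tape node
    ufunc_blocked = False
except TypeError:
    ufunc_blocked = True
```

The try/except at the end shows the "fails loudly" behaviour: without a replacement such as d_exp, the numpy call is rejected instead of silently dropping the gradient link.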
License