
A small, readable reverse-mode autograd library built on numpy and pyccolo


pycograd


A small, readable reverse-mode automatic-differentiation library, built on numpy and pyccolo. Write ordinary numeric Python — including numpy calls like np.exp, np.dot, np.sum and operators like @ — and get correct gradients, with no special "autodiff namespace."

The transform API is modelled on JAX: grad, vmap, and jvp are function-to-function transforms you compose freely, and a program can be captured into an inspectable, optimizable graph — a typed SSA form much like a JAX jaxpr. The difference is that pycograd is small enough to read in an afternoon, and it differentiates the numpy you already write rather than a look-alike array API.

Install

pip install pycograd

Quickstart

Hand any numpy function to grad / value_and_grad; the array argument is lifted onto the tape for you.

import numpy as np
from pycograd import value_and_grad

def f(x):
    return np.sum(np.sin(x * x))          # ordinary numpy -- and it differentiates

x = np.array([0.5, 1.0, 1.5])
value, (g,) = value_and_grad(f)(x)
# g == 2 * x * cos(x * x)
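
The closed form in that last comment can be checked without pycograd at all. The sketch below compares it against central finite differences using only numpy; numeric_grad is a hypothetical helper written for this illustration, not part of the library.

```python
import numpy as np

def f(x):
    return np.sum(np.sin(x * x))

def numeric_grad(fn, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued fn (illustrative helper)."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step.flat[i] = eps
        g.flat[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return g

x = np.array([0.5, 1.0, 1.5])
analytic = 2 * x * np.cos(x * x)      # the formula from the quickstart comment
numeric = numeric_grad(f, x)
agree = np.allclose(analytic, numeric, atol=1e-6)
```

The gradient returned by value_and_grad should agree with both to roughly the same tolerance.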

Composable transforms

The transforms are borrowed from JAX and, like JAX's, they compose. grad and value_and_grad differentiate; vmap vectorizes a function written for one example over a whole batch in a single pass; jvp (with jacfwd / jacrev) gives forward-mode derivatives and Jacobians. Composing vmap with grad yields something a plain batched backward pass cannot produce, namely the gradient of each example separately, stacked over the batch, which is what per-example gradient clipping and DP-SGD need:

import numpy as np
from pycograd import grad, vmap, cross_entropy

rng = np.random.default_rng(0)
N = 64
w = rng.standard_normal((2, 3))            # shared weights ...
b = rng.standard_normal(3)                 # ... and bias
X = rng.standard_normal((N, 2))            # N points, each (2,)
Y = np.eye(3)[rng.integers(0, 3, N)]       # N one-hot labels, each (3,)

def per_example_loss(w, b, x, y):          # one (2,) point + one label -> scalar
    return cross_entropy(x @ w + b, y)

# in_axes maps over X and Y, holds w and b shared:
gw, gb, _, _ = vmap(grad(per_example_loss), in_axes=(None, None, 0, 0))(w, b, X, Y)
# gw: (N, 2, 3)   gb: (N, 3)   -- one gradient per example
# their batch-mean is exactly the ordinary full-batch gradient
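
To make the shapes and the batch-mean claim concrete, here is a hand-derived numpy sketch of the same per-example gradients. It assumes cross_entropy is softmax cross-entropy on logits with one-hot labels, averaged over the batch when given one; that reading is an assumption, not something the text above confirms. The local softmax helper is written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
w = rng.standard_normal((2, 3))
b = rng.standard_normal(3)
X = rng.standard_normal((N, 2))
Y = np.eye(3)[rng.integers(0, 3, N)]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# For softmax cross-entropy, dL/dlogits = p - y for each example.
delta = softmax(X @ w + b) - Y               # (N, 3)
gw = X[:, :, None] * delta[:, None, :]       # (N, 2, 3): outer(x_i, delta_i)
gb = delta                                   # (N, 3)

# Full-batch gradient of the mean loss, computed in one shot.
gw_full = X.T @ delta / N                    # (2, 3)
gb_full = delta.mean(axis=0)                 # (3,)

means_match = (np.allclose(gw.mean(axis=0), gw_full)
               and np.allclose(gb.mean(axis=0), gb_full))
```

The stacked arrays have the (N, 2, 3) and (N, 3) shapes quoted above, and averaging them over the leading axis recovers the full-batch gradient.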

relu, softmax, cross_entropy, layer_norm, and scaled-dot-product attention ship as first-class, finite-difference-checked ops, so models stay plain numpy and the transforms see straight through them.

Inspecting the graph

A numpy function can be captured into a graph instead of being run, the same idea as a JAX jaxpr. capture records the forward pass, value_and_grad differentiates it into a combined forward+backward graph, and optimize cleans that up.

import numpy as np
from pycograd import capture, value_and_grad, optimize

def forward(x, w, b):
    h = np.tanh(x @ w + b)
    return np.sum(h * h)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

g = capture(forward, x, w, b)              # trace once over (shape, dtype) inputs
# g -- the captured forward graph
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  outputs: %7
}

value_and_grad(g) returns one graph holding the value and the gradient w.r.t. every input (grad(g) keeps just the gradients). Emitted naively, the backward pass is wasteful: it recomputes tanh (%13, %14), computes the same multiply twice (%10, %11), and broadcasts a constant 1.0 (%8, %9):

# value_and_grad(g) -- BEFORE
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  %8 = const 1.0 -> f64[]
  %9 = broadcast_to %8 [4, 2] -> f64[4,2]
  %10 = mul %9 %5 -> f64[4,2]
  %11 = mul %9 %5 -> f64[4,2]
  %12 = add %10 %11 -> f64[4,2]
  %13 = tanh %4 -> f64[4,2]              # recomputes %5
  %14 = tanh %4 -> f64[4,2]              # recomputes %5
  %15 = mul %13 %14 -> f64[4,2]          # recomputes %6
  %16 = sub 1.0 %15 -> f64[4,2]
  %17 = mul %12 %16 -> f64[4,2]
  %18 = sum %17 {axis=0} -> f64[2]
  %19 = transpose %1 [1, 0] -> f64[2,3]
  %20 = matmul %17 %19 -> f64[4,3]
  %21 = transpose %0 [1, 0] -> f64[3,4]
  %22 = matmul %21 %17 -> f64[3,2]
  outputs: %7, %20, %22, %18
}
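
The backward half of this graph can be transcribed line by line into numpy and checked against finite differences. This is only a sketch of what the graph computes, with variables named after the SSA ids; numeric_grad is a hypothetical helper, not a pycograd function.

```python
import numpy as np

def forward(x, w, b):
    h = np.tanh(x @ w + b)
    return np.sum(h * h)

def numeric_grad(fn, a, eps=1e-6):
    """Central-difference gradient of a scalar-valued fn (illustrative helper)."""
    g = np.zeros_like(a)
    for i in range(a.size):
        step = np.zeros_like(a)
        step.flat[i] = eps
        g.flat[i] = (fn(a + step) - fn(a - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

v5 = np.tanh(x @ w + b)        # %3, %4, %5
v12 = 1.0 * v5 + 1.0 * v5      # %10 + %11: d(h*h)/dh = 2h
v16 = 1.0 - v5 * v5            # %16: tanh'(z) = 1 - tanh(z)^2
v17 = v12 * v16                # %17: gradient w.r.t. the pre-activation %4
grad_b = v17.sum(axis=0)       # %18
grad_x = v17 @ w.T             # %20
grad_w = x.T @ v17             # %22

ok_x = np.allclose(grad_x, numeric_grad(lambda a: forward(a, w, b), x), atol=1e-5)
ok_w = np.allclose(grad_w, numeric_grad(lambda a: forward(x, a, b), w), atol=1e-5)
ok_b = np.allclose(grad_b, numeric_grad(lambda a: forward(x, w, a), b), atol=1e-5)
```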

optimize removes the redundancy by common-subexpression elimination, constant folding, and dead-code elimination — the recomputed tanh/mul collapse back onto %5/%6 and the broadcast folds away:

opt = optimize(value_and_grad(g))
# optimize(value_and_grad(g)) -- AFTER
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  %12 = add %5 %5 -> f64[4,2]            # was add %10 %11 (each mul %9 %5); 1.0 broadcast folded away
  %16 = sub 1.0 %6 -> f64[4,2]          # reuses %6 = tanh^2 instead of recomputing tanh
  %17 = mul %12 %16 -> f64[4,2]
  %18 = sum %17 {axis=0} -> f64[2]      # grad wrt b
  %19 = transpose %1 [1, 0] -> f64[2,3]
  %20 = matmul %17 %19 -> f64[4,3]      # grad wrt x
  %21 = transpose %0 [1, 0] -> f64[3,4]
  %22 = matmul %21 %17 -> f64[3,2]      # grad wrt w
  outputs: %7, %20, %22, %18
}
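
As a rough illustration of two of these passes, the sketch below runs common-subexpression elimination and dead-code elimination over a fragment of the BEFORE graph, encoded as hypothetical (dest, op, args) triples. It is not pycograd's optimizer: it skips constant folding (so %10 survives instead of being folded into add %5 %5), and a fuller CSE would also canonicalize the argument order of commutative ops.

```python
def cse(instrs, outputs):
    """Common-subexpression elimination over (dest, op, args) triples."""
    seen = {}     # (op, canonical args) -> dest of the first occurrence
    alias = {}    # eliminated dest -> surviving dest
    kept = []
    for dest, op, args in instrs:
        args = tuple(alias.get(a, a) for a in args)   # rewrite to survivors
        key = (op, args)
        if key in seen:
            alias[dest] = seen[key]                   # duplicate: drop it
        else:
            seen[key] = dest
            kept.append((dest, op, args))
    return kept, [alias.get(o, o) for o in outputs]

def dce(instrs, outputs):
    """Dead-code elimination: keep only what the outputs depend on."""
    live = set(outputs)
    kept = []
    for dest, op, args in reversed(instrs):
        if dest in live:
            live.update(args)
            kept.append((dest, op, args))
    return kept[::-1]

# Fragment of the BEFORE graph; %4 and %9 are treated as inputs here.
before = [
    ("%5", "tanh", ("%4",)),
    ("%6", "mul", ("%5", "%5")),
    ("%10", "mul", ("%9", "%5")),
    ("%11", "mul", ("%9", "%5")),
    ("%12", "add", ("%10", "%11")),
    ("%13", "tanh", ("%4",)),
    ("%14", "tanh", ("%4",)),
    ("%15", "mul", ("%13", "%14")),
    ("%16", "sub", ("1.0", "%15")),
    ("%17", "mul", ("%12", "%16")),
]
after, outs = cse(before, ["%17"])
after = dce(after, outs)
```

On this fragment the duplicate tanh and mul instructions collapse onto %5 and %6, so %16 ends up reading %6 directly, as in the AFTER listing.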

Because the graph carries (shape, dtype) for every value, eval_shape / summary can report a net's output shapes and parameter counts without running it, and a captured forward can be handed to another framework — see below.
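
The idea behind shape-only evaluation can be sketched in a few lines: propagate shapes through the instructions and never touch array data. The infer_shapes function and the triple encoding below are hypothetical and cover only the ops in the forward graph above; they are not the library's eval_shape or summary.

```python
import numpy as np

def infer_shapes(instrs, env):
    """Propagate shapes through (dest, op, args) triples without computing values."""
    env = dict(env)
    for dest, op, args in instrs:
        shapes = [env[a] for a in args]
        if op == "matmul":                        # 2-D case only, for brevity
            (m, k), (k2, n) = shapes
            if k != k2:
                raise ValueError(f"matmul shape mismatch: {shapes}")
            env[dest] = (m, n)
        elif op in ("add", "mul"):
            env[dest] = np.broadcast_shapes(*shapes)
        elif op == "tanh":
            env[dest] = shapes[0]
        elif op == "sum":                         # full reduction to a scalar
            env[dest] = ()
        else:
            raise NotImplementedError(op)
    return env

inputs = {"%0": (4, 3), "%1": (3, 2), "%2": (2,)}
forward = [
    ("%3", "matmul", ("%0", "%1")),
    ("%4", "add", ("%3", "%2")),
    ("%5", "tanh", ("%4",)),
    ("%6", "mul", ("%5", "%5")),
    ("%7", "sum", ("%6",)),
]
shapes = infer_shapes(forward, inputs)
n_params = sum(int(np.prod(inputs[k])) for k in ("%1", "%2"))   # w and b
```

The inferred shapes match the annotations in the printed graph, and the parameter count falls out of the input shapes alone.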

Training models

For writing models, %load_ext pycograd enables a small DSL (built on pipescript): a params{ ... } block declares the weights, a |> pipeline is the forward written once, and weights.grad differentiates it. Here is a 2-layer MLP classifier:

%load_ext pycograd
import numpy as np
from pycograd import relu, softmax, cross_entropy

rng = np.random.default_rng(42)

# synthetic 3-class data: X is (N, 2), Y one-hot (N, 3)
centers = np.array([[2.0, 2.0], [-2.0, 2.0], [0.0, -2.5]])
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])
Y = np.eye(3)[np.repeat(np.arange(3), 40)]

with params{
    w1 = 0.3 * rng.standard_normal((2, 16)); b1 = np.zeros(16)
    w2 = 0.3 * rng.standard_normal((16, 3)); b2 = np.zeros(3)
} as weights:
    logits  = $ |> $ @ w1 + b1 |> relu |> $ @ w2 + b2     # the model, written once
    forward = $ |> logits |> softmax
    obj     = |> X |> logits |> cross_entropy($, Y)
    for _ in range(200):
        value, grads = weights.grad(obj)                  # backprop
        weights.step(grads, 0.5)                          # in-place SGD

Weights are referred to by name, frozen[...] holds a weight fixed, and any optimizer can consume the gradients: for example, swap the loop for train(weights, obj, 200, Adam(lr=cosine_decay(0.05, 200))). The same forward is also what vmap and the compiler below consume.

Compile to PyTorch / JAX / TensorFlow

The captured graph can be lowered onto another framework's autodiff. Pass backend= and gradients come back from torch / jax / tf instead of the numpy tape, matching to floating-point tolerance:

for backend in ("torch", "jax", "tf"):
    v, grads = weights.grad(obj, backend=backend, jit=True)   # same model, framework autodiff

compile_to(forward, "torch") instead returns a plain function over the framework's own tensors, and to_torch_module / export_torchscript / export_onnx package a trained net for shipping with no pycograd dependency.

Examples & notebooks

The bundled demos (logistic regression, MLP, LayerNorm/Dropout, single-head Transformer block, GRU/LSTM) train from scratch and are gradient-checked against finite differences:

python -m pycograd.examples

The notebooks/ directory goes deeper, each as an executable walk-through:

  • pycograd_demo — linear classifier → MLP → highway net → self-attention → a Transformer encoder block.
  • pycograd_vmap_demo — where vmap earns its keep: per-sample gradients, gradient clipping, batched attention.
  • pycograd_rnn_demo / pycograd_rwkv_demo — GRU/LSTM and RWKV (trained in parallel, sampled one token at a time).
  • pycograd_compile_* — parity against PyTorch, JAX, TensorFlow, and Apple MPS, plus TorchScript / ONNX export.
  • pycograd_graph_viz_demo — the graph IR, its rendering, and the optimization passes shown above.

How it works

  • Var is a reverse-mode tape node wrapping a numpy array. Arithmetic operators are overloaded so that running a program builds a computation graph; Var.backward() then walks it in reverse to accumulate gradients.

  • Operator overloading alone is not enough. The moment user code calls a numpy function — np.exp(x) — numpy's ufunc machinery takes over and the gradient link is lost. (Var sets __array_ufunc__ = None so this fails loudly instead of silently producing a wrong gradient.) pyccolo supplies the missing piece: its before_call event lets a handler replace the function being called, swapping np.exp for a differentiable d_exp transparently — so idiomatic numpy code "just differentiates." The same mechanism routes scalar math.* through the numpy-backed primitives and powers the |> training DSL.
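
Both points can be sketched in a few dozen lines. The Var below is a deliberately minimal stand-in, not pycograd's class: it supports only + and *, and d_exp is called by hand where pycograd would have pyccolo substitute it for np.exp. The example also shows that setting __array_ufunc__ = None makes np.exp raise a TypeError.

```python
import numpy as np

class Var:
    """Minimal reverse-mode tape node (illustrative, not pycograd's Var)."""
    __array_ufunc__ = None          # numpy ufuncs on a Var raise TypeError

    def __init__(self, value, parents=()):
        self.value = np.asarray(value, dtype=float)
        self.parents = parents      # pairs of (parent Var, gradient function)
        self.grad = np.zeros_like(self.value)

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value,
                   ((self, lambda g: g), (other, lambda g: g)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, lambda g: g * other.value),
                    (other, lambda g: g * self.value)))

    __radd__, __rmul__ = __add__, __mul__

    def backward(self):
        # Order nodes so each one follows its parents, then walk in reverse.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = np.ones_like(self.value)
        for v in reversed(order):
            for p, vjp in v.parents:
                p.grad = p.grad + vjp(v.grad)

def d_exp(x):
    """Differentiable stand-in for np.exp: d/dx exp(x) = exp(x)."""
    out = np.exp(x.value)
    return Var(out, ((x, lambda g: g * out),))

x = Var(1.5)
y = d_exp(x * x) + 3.0 * x          # y = exp(x^2) + 3x
y.backward()                        # x.grad is now 2x * exp(x^2) + 3

try:
    np.exp(x)                       # blocked by __array_ufunc__ = None
    ufunc_blocked = False
except TypeError:
    ufunc_blocked = True
```

Without the __array_ufunc__ = None line, numpy would try to coerce the Var itself rather than failing, which is the silent breakage the second bullet describes.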

License

BSD-3-Clause.
