A small, readable reverse-mode autograd library built on numpy and pyccolo

pycograd

A small, readable reverse-mode automatic-differentiation library, built on numpy and pyccolo. Write ordinary numeric Python — including numpy calls like np.exp, np.dot, np.sum and operators like @ — and get correct gradients, with no special "autodiff namespace."

The transform API is modelled on JAX: grad, vmap, and jvp are function-to-function transforms you compose freely, and a program can be captured into an inspectable, optimizable graph — a typed SSA form much like a JAX jaxpr. The difference is that pycograd is small enough to read in an afternoon, and it differentiates the numpy you already write rather than a look-alike array API.

Install

pip install pycograd

Quickstart

Hand any numpy function to grad / value_and_grad; the array argument is lifted onto the tape for you.

import numpy as np
from pycograd import value_and_grad

def f(x):
    return np.sum(np.sin(x * x))          # ordinary numpy -- and it differentiates

x = np.array([0.5, 1.0, 1.5])
value, (g,) = value_and_grad(f)(x)
# g matches the analytic derivative 2 * x * np.cos(x * x)
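
The analytic gradient quoted in the quickstart can be sanity-checked with plain numpy and central differences. This is a minimal sketch that does not call pycograd at all; numerical_grad is a hypothetical helper written for this illustration, not part of the library.

```python
import numpy as np

def f(x):
    return np.sum(np.sin(x * x))

def numerical_grad(fn, x, eps=1e-6):
    """Central-difference estimate of d fn / d x, one coordinate at a time.

    Hypothetical helper for illustration; not a pycograd function.
    """
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step.flat[i] = eps
        g.flat[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return g

x = np.array([0.5, 1.0, 1.5])
analytic = 2 * x * np.cos(x * x)      # d/dx sum(sin(x^2)) by the chain rule
numeric = numerical_grad(f, x)

assert np.allclose(analytic, numeric, atol=1e-6)
```

A reverse-mode gradient of f should agree with both arrays up to floating-point tolerance.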

Composable transforms

The transforms are borrowed from JAX, and like JAX's they compose. grad and value_and_grad differentiate; vmap vectorizes a function written for one example over a whole batch in a single pass; jvp (with jacfwd / jacrev) gives forward-mode derivatives and Jacobians. Composing vmap with grad yields something a plain batched backward pass cannot: the gradient of each example separately, stacked over the batch (what per-example gradient clipping and DP-SGD need):

import numpy as np
from pycograd import grad, vmap, cross_entropy

rng = np.random.default_rng(0)
N = 64
w = rng.standard_normal((2, 3))            # shared weights ...
b = rng.standard_normal(3)                 # ... and bias
X = rng.standard_normal((N, 2))            # N points, each (2,)
Y = np.eye(3)[rng.integers(0, 3, N)]       # N one-hot labels, each (3,)

def per_example_loss(w, b, x, y):          # one (2,) point + one label -> scalar
    return cross_entropy(x @ w + b, y)

# in_axes maps over X and Y, holds w and b shared:
gw, gb, _, _ = vmap(grad(per_example_loss), in_axes=(None, None, 0, 0))(w, b, X, Y)
# gw: (N, 2, 3)   gb: (N, 3)   -- one gradient per example
# averaged over the batch axis, they equal the gradient of the batch-mean loss
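
For this particular linear model the per-example gradients have a closed form, which makes the shapes and the batch-mean claim easy to check by hand. The sketch below uses plain numpy only. It assumes cross_entropy is softmax cross-entropy applied to logits for a single example; that definition is an assumption about the library, and log_softmax and loss_one are hypothetical helpers.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)          # shift for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def loss_one(w, b, x, y):
    # Assumed definition: softmax cross-entropy on logits, for one example.
    return -np.sum(y * log_softmax(x @ w + b))

rng = np.random.default_rng(0)
N = 64
w = rng.standard_normal((2, 3))
b = rng.standard_normal(3)
X = rng.standard_normal((N, 2))
Y = np.eye(3)[rng.integers(0, 3, N)]

# Closed form per example: dL/dlogits = softmax(logits) - y
delta = np.exp(log_softmax(X @ w + b)) - Y         # (N, 3)
gw = X[:, :, None] * delta[:, None, :]             # (N, 2, 3): outer(x_n, delta_n)
gb = delta                                         # (N, 3)

# Gradient of the batch-mean loss, computed the usual batched way.
full_gw = X.T @ delta / N                          # (2, 3)
full_gb = delta.mean(axis=0)                       # (3,)
assert np.allclose(gw.mean(axis=0), full_gw)
assert np.allclose(gb.mean(axis=0), full_gb)

# Per-example clipping to global norm <= 1, which needs the unsummed gradients.
norms = np.sqrt((gw ** 2).sum(axis=(1, 2)) + (gb ** 2).sum(axis=1))   # (N,)
scale = np.minimum(1.0, 1.0 / (norms + 1e-12))
clipped_gw = gw * scale[:, None, None]
clipped_gb = gb * scale[:, None]
```

The clipping step is the reason the stacked form matters: once gradients have been summed over the batch, the per-example norms can no longer be recovered.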

relu, softmax, cross_entropy, layer_norm, and scaled-dot-product attention ship as first-class, finite-difference-checked ops, so models stay plain numpy and the transforms see straight through them.

Inspecting the graph

A numpy function can be captured into a graph instead of being run, the same idea as a JAX jaxpr. capture records the forward pass, value_and_grad differentiates it into a combined forward+backward graph, and optimize cleans that up.

import numpy as np
from pycograd import capture, value_and_grad, optimize

def forward(x, w, b):
    h = np.tanh(x @ w + b)
    return np.sum(h * h)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

g = capture(forward, x, w, b)              # trace once over (shape, dtype) inputs

# g -- the captured forward graph, in textual form
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  outputs: %7
}

value_and_grad(g) returns one graph holding the value and the gradient w.r.t. every input (grad(g) keeps just the gradients). As first emitted, the backward pass is wasteful: it recomputes tanh (%13, %14) and its square (%15), duplicates a multiply (%10, %11), and broadcasts a constant 1.0 (%8, %9):

# value_and_grad(g) -- BEFORE
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  %8 = const 1.0 -> f64[]
  %9 = broadcast_to %8 [4, 2] -> f64[4,2]
  %10 = mul %9 %5 -> f64[4,2]
  %11 = mul %9 %5 -> f64[4,2]
  %12 = add %10 %11 -> f64[4,2]
  %13 = tanh %4 -> f64[4,2]              # recomputes %5
  %14 = tanh %4 -> f64[4,2]              # recomputes %5
  %15 = mul %13 %14 -> f64[4,2]          # recomputes %6
  %16 = sub 1.0 %15 -> f64[4,2]
  %17 = mul %12 %16 -> f64[4,2]
  %18 = sum %17 {axis=0} -> f64[2]
  %19 = transpose %1 [1, 0] -> f64[2,3]
  %20 = matmul %17 %19 -> f64[4,3]
  %21 = transpose %0 [1, 0] -> f64[3,4]
  %22 = matmul %21 %17 -> f64[3,2]
  outputs: %7, %20, %22, %18
}

optimize removes the redundancy by common-subexpression elimination, constant folding, and dead-code elimination — the recomputed tanh/mul collapse back onto %5/%6 and the broadcast folds away:

opt = optimize(value_and_grad(g))
# optimize(value_and_grad(g)) -- AFTER
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  %12 = add %5 %5 -> f64[4,2]            # was mul %9 %5 twice; 1.0 broadcast folded away
  %16 = sub 1.0 %6 -> f64[4,2]          # reuses %6 = tanh^2 instead of recomputing tanh
  %17 = mul %12 %16 -> f64[4,2]
  %18 = sum %17 {axis=0} -> f64[2]      # grad wrt b
  %19 = transpose %1 [1, 0] -> f64[2,3]
  %20 = matmul %17 %19 -> f64[4,3]      # grad wrt x
  %21 = transpose %0 [1, 0] -> f64[3,4]
  %22 = matmul %21 %17 -> f64[3,2]      # grad wrt w
  outputs: %7, %20, %22, %18
}
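
The optimized listing is short enough to replay by hand. The sketch below transcribes it line by line into plain numpy (v3 mirrors %3, and so on) and compares the three gradient outputs against central differences. It is a hand transcription for illustration, not output produced by pycograd; numerical_grad is a hypothetical helper.

```python
import numpy as np

def forward(x, w, b):
    h = np.tanh(x @ w + b)
    return np.sum(h * h)

def optimized_graph(x, w, b):
    """Hand transcription of the AFTER listing; v<k> mirrors %<k>."""
    v3 = x @ w
    v4 = v3 + b
    v5 = np.tanh(v4)
    v6 = v5 * v5
    v7 = np.sum(v6)
    v12 = v5 + v5                  # 2 * tanh
    v16 = 1.0 - v6                 # 1 - tanh^2, the derivative of tanh
    v17 = v12 * v16                # gradient w.r.t. the pre-activation %4
    v18 = np.sum(v17, axis=0)      # grad wrt b
    v19 = w.T
    v20 = v17 @ v19                # grad wrt x
    v21 = x.T
    v22 = v21 @ v17                # grad wrt w
    return v7, v20, v22, v18       # same order as "outputs:"

def numerical_grad(fn, a, eps=1e-6):
    """Central differences over every entry of a (hypothetical helper)."""
    g = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        step = np.zeros_like(a)
        step[idx] = eps
        g[idx] = (fn(a + step) - fn(a - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

value, gx, gw, gb = optimized_graph(x, w, b)
assert np.isclose(value, forward(x, w, b))
assert np.allclose(gx, numerical_grad(lambda a: forward(a, w, b), x), atol=1e-6)
assert np.allclose(gw, numerical_grad(lambda a: forward(x, a, b), w), atol=1e-6)
assert np.allclose(gb, numerical_grad(lambda a: forward(x, w, a), b), atol=1e-6)
```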

Because the graph carries (shape, dtype) for every value, eval_shape / summary can report a net's output shapes and parameter counts without running it, and a captured forward can be handed to another framework — see below.
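
To see why shapes can be reported without running anything, here is a minimal shape-propagation sketch over the five forward ops of the example graph. The node format and infer_shapes are hypothetical and written for this illustration; they are not pycograd's IR or its eval_shape, and the matmul rule covers only the 2-D case.

```python
import numpy as np

def infer_shapes(input_shapes, nodes):
    """Propagate shapes through (out, op, args) nodes without computing values."""
    shapes = dict(input_shapes)
    for out, op, args in nodes:
        if op == "matmul":                       # 2-D by 2-D only in this sketch
            (m, k), (k2, n) = shapes[args[0]], shapes[args[1]]
            if k != k2:
                raise ValueError(f"matmul shape mismatch at {out}: {k} vs {k2}")
            shapes[out] = (m, n)
        elif op in ("add", "mul"):               # elementwise ops follow broadcasting
            shapes[out] = np.broadcast_shapes(shapes[args[0]], shapes[args[1]])
        elif op == "tanh":                       # unary elementwise keeps the shape
            shapes[out] = shapes[args[0]]
        elif op == "sum":                        # full reduction yields a scalar
            shapes[out] = ()
        else:
            raise NotImplementedError(op)
    return shapes

inputs = {"%0": (4, 3), "%1": (3, 2), "%2": (2,)}
nodes = [
    ("%3", "matmul", ("%0", "%1")),
    ("%4", "add", ("%3", "%2")),
    ("%5", "tanh", ("%4",)),
    ("%6", "mul", ("%5", "%5")),
    ("%7", "sum", ("%6",)),
]

shapes = infer_shapes(inputs, nodes)
# Parameter count for w (%1) and b (%2), read off the shapes alone.
n_params = int(np.prod(inputs["%1"]) + np.prod(inputs["%2"]))
```

No array is ever allocated here: each rule maps input shapes to an output shape, which is all a summary of output shapes and parameter counts needs.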

Training models

For writing models, %load_ext pycograd enables a small DSL (built on pipescript): a params{ ... } block declares the weights, a |> pipeline is the forward written once, and weights.grad differentiates it. Here is a 2-layer MLP classifier:

%load_ext pycograd
import numpy as np
from pycograd import relu, softmax, cross_entropy

rng = np.random.default_rng(42)

# synthetic 3-class data: X is (N, 2), Y one-hot (N, 3)
centers = np.array([[2.0, 2.0], [-2.0, 2.0], [0.0, -2.5]])
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])
Y = np.eye(3)[np.repeat(np.arange(3), 40)]

with params{
    w1 = 0.3 * rng.standard_normal((2, 16)); b1 = np.zeros(16)
    w2 = 0.3 * rng.standard_normal((16, 3)); b2 = np.zeros(3)
} as weights:
    logits  = $ |> $ @ w1 + b1 |> relu |> $ @ w2 + b2     # the model, written once
    forward = $ |> logits |> softmax
    obj     = |> X |> logits |> cross_entropy($, Y)
    for _ in range(200):
        value, grads = weights.grad(obj)                  # backprop
        weights.step(grads, 0.5)                          # in-place SGD

Weights are referred to by name, frozen[...] holds a weight fixed, and any optimizer can consume the gradients: for example, swap the loop for train(weights, obj, 200, Adam(lr=cosine_decay(0.05, 200))). The same forward is also what vmap and the compiler below consume.
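
For comparison, the same 2-layer MLP can be trained in plain numpy with a hand-written backward pass. This sketch shows roughly what the weights.grad and weights.step calls stand for, under two assumptions that the text above does not confirm: that cross_entropy is softmax cross-entropy on logits averaged over the batch, and that step is plain SGD. loss_and_grads is a hypothetical helper, not a pycograd function.

```python
import numpy as np

rng = np.random.default_rng(42)
centers = np.array([[2.0, 2.0], [-2.0, 2.0], [0.0, -2.5]])
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])
Y = np.eye(3)[np.repeat(np.arange(3), 40)]

params = {
    "w1": 0.3 * rng.standard_normal((2, 16)), "b1": np.zeros(16),
    "w2": 0.3 * rng.standard_normal((16, 3)), "b2": np.zeros(3),
}

def loss_and_grads(p, X, Y):
    # Forward: linear -> relu -> linear -> mean softmax cross-entropy (assumed).
    z1 = X @ p["w1"] + p["b1"]
    h = np.maximum(z1, 0.0)
    logits = h @ p["w2"] + p["b2"]
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -np.mean(np.sum(Y * log_probs, axis=1))

    # Backward: chain rule applied by hand, last layer first.
    d_logits = (np.exp(log_probs) - Y) / len(X)
    d_h = d_logits @ p["w2"].T
    d_z1 = d_h * (z1 > 0)                        # relu passes gradient where z1 > 0
    grads = {
        "w2": h.T @ d_logits, "b2": d_logits.sum(axis=0),
        "w1": X.T @ d_z1,     "b1": d_z1.sum(axis=0),
    }
    return loss, grads

losses = []
for _ in range(200):
    loss, grads = loss_and_grads(params, X, Y)
    losses.append(loss)
    for name in params:
        params[name] -= 0.5 * grads[name]        # in-place SGD, learning rate 0.5
```

Everything inside the backward half of loss_and_grads is what an autodiff tape derives automatically, which is why the DSL version only has to state the forward pipeline once.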

Compile to PyTorch / JAX / TensorFlow

The captured graph can be lowered onto another framework's autodiff. Pass backend= and gradients come back from torch / jax / tf instead of the numpy tape, matching to floating-point tolerance:

for backend in ("torch", "jax", "tf"):
    v, grads = weights.grad(obj, backend=backend, jit=True)   # same model, framework autodiff

compile_to(forward, "torch") instead returns a plain function over the framework's own tensors, and to_torch_module / export_torchscript / export_onnx package a trained net for shipping with no pycograd dependency.

Examples & notebooks

The bundled demos (logistic regression, MLP, LayerNorm/Dropout, single-head Transformer block, GRU/LSTM) train from scratch and are gradient-checked against finite differences:

python -m pycograd.examples

The notebooks/ directory goes deeper, each as an executable walk-through:

  • pycograd_demo — linear classifier → MLP → highway net → self-attention → a Transformer encoder block.
  • pycograd_vmap_demo — where vmap earns its keep: per-sample gradients, gradient clipping, batched attention.
  • pycograd_rnn_demo / pycograd_rwkv_demo — GRU/LSTM and RWKV (trained in parallel, sampled one token at a time).
  • pycograd_compile_* — parity against PyTorch, JAX, TensorFlow, and Apple MPS, plus TorchScript / ONNX export.
  • pycograd_graph_viz_demo — the graph IR, its rendering, and the optimization passes shown above.

How it works

  • Var is a reverse-mode tape node wrapping a numpy array. Arithmetic operators are overloaded so that running a program builds a computation graph; Var.backward() then walks it in reverse to accumulate gradients.

  • Operator overloading alone is not enough. The moment user code calls a numpy function — np.exp(x) — numpy's ufunc machinery takes over and the gradient link is lost. (Var sets __array_ufunc__ = None so this fails loudly instead of silently producing a wrong gradient.) pyccolo supplies the missing piece: its before_call event lets a handler replace the function being called, swapping np.exp for a differentiable d_exp transparently — so idiomatic numpy code "just differentiates." The same mechanism routes scalar math.* through the numpy-backed primitives and powers the |> training DSL.
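
Both points can be illustrated with a toy tape in a few dozen lines. The sketch below is not pycograd's Var and does not use pyccolo: the class is a stripped-down stand-in that handles only same-shape operands, and call plus REPLACEMENTS are hypothetical names that mimic by hand what a call-replacing handler would do transparently.

```python
import numpy as np

class Var:
    """Toy reverse-mode tape node (illustrative only, not pycograd's Var)."""
    __array_ufunc__ = None    # numpy ufuncs raise TypeError on a Var operand

    def __init__(self, value, parents=()):
        self.value = np.asarray(value, dtype=float)
        self.parents = parents                    # tuple of (parent Var, vjp function)
        self.grad = np.zeros_like(self.value)

    def __add__(self, other):
        return Var(self.value + other.value,
                   ((self, lambda g: g), (other, lambda g: g)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, lambda g: g * other.value),
                    (other, lambda g: g * self.value)))

    def backward(self):
        # Topologically order the graph, then push gradients from output to inputs.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for parent, _ in v.parents:
                    visit(parent)
                order.append(v)
        visit(self)
        self.grad = np.ones_like(self.value)      # seed: gradient of sum(output)
        for v in reversed(order):
            for parent, vjp in v.parents:
                parent.grad = parent.grad + vjp(v.grad)

def d_exp(x):
    """Differentiable exp: records the local derivative exp(x) on the tape."""
    out = np.exp(x.value)
    return Var(out, ((x, lambda g: g * out),))

REPLACEMENTS = {np.exp: d_exp}                    # hypothetical swap table

def call(fn, *args):
    """Stand-in for an instrumented call site: swap fn when an argument is a Var."""
    if any(isinstance(a, Var) for a in args):
        fn = REPLACEMENTS.get(fn, fn)
    return fn(*args)

x = Var([0.5, 1.0, 1.5])
y = call(np.exp, x * x) + x                       # exp(x^2) + x
y.backward()
expected = 2 * x.value * np.exp(x.value ** 2) + 1.0
assert np.allclose(x.grad, expected)

# Without the swap, numpy refuses the Var instead of returning a detached result.
try:
    np.exp(x)
    raised = False
except TypeError:
    raised = True
assert raised
```

In this sketch the caller has to write call(np.exp, ...) explicitly. The point of routing through a before_call handler is that user code keeps writing np.exp(x) and the substitution happens at the call site.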

License

BSD-3-Clause.
