A small, readable reverse-mode autograd library built on numpy and pyccolo

pycograd

A small, readable reverse-mode automatic-differentiation library, built on numpy and pyccolo. Write ordinary numeric Python — including numpy calls like np.exp, np.dot, np.sum and operators like @ — and get correct gradients, with no special "autodiff namespace."

The transform API is modelled on JAX: grad, vmap, and jvp are function-to-function transforms you compose freely, and a program can be captured into an inspectable, optimizable graph — a typed SSA form much like a JAX jaxpr. The difference is that pycograd is small enough to read in an afternoon, and it differentiates the numpy you already write rather than a look-alike array API.

Install

pip install pycograd

Quickstart

Hand any numpy function to grad / value_and_grad; the array argument is lifted onto the tape for you.

import numpy as np
from pycograd import value_and_grad

def f(x):
    return np.sum(np.sin(x * x))          # ordinary numpy -- and it differentiates

x = np.array([0.5, 1.0, 1.5])
value, (g,) = value_and_grad(f)(x)
# g matches 2 * x * np.cos(x * x) up to floating-point rounding
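
The closed form in the comment above can be checked without pycograd at all. The following is a minimal sketch in plain numpy that compares the formula against central differences; `central_difference` is a hypothetical helper written for this illustration, not part of the library.

```python
import numpy as np

def f(x):
    return np.sum(np.sin(x * x))

def central_difference(fn, x, eps=1e-6):
    """Estimate the gradient of a scalar function one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        g.flat[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return g

x = np.array([0.5, 1.0, 1.5])
analytic = 2 * x * np.cos(x * x)      # the closed form from the Quickstart comment
numeric = central_difference(f, x)
max_err = np.max(np.abs(analytic - numeric))
```

If the two disagree by more than rounding noise, the formula (or the function) is wrong; the same comparison is a handy sanity check for any gradient returned by a transform.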

Composable transforms

The transforms are borrowed from JAX, and like JAX's they compose. grad and value_and_grad differentiate; vmap vectorizes a function written for one example over a whole batch in a single pass; jvp (with jacfwd / jacrev) gives forward-mode and Jacobians. Composing vmap with grad yields something a plain batched backward cannot — the gradient of each example separately, stacked over the batch (what gradient clipping and DP-SGD need):

import numpy as np
from pycograd import grad, vmap, cross_entropy

rng = np.random.default_rng(0)
N = 64
w = rng.standard_normal((2, 3))            # shared weights ...
b = rng.standard_normal(3)                 # ... and bias
X = rng.standard_normal((N, 2))            # N points, each (2,)
Y = np.eye(3)[rng.integers(0, 3, N)]       # N one-hot labels, each (3,)

def per_example_loss(w, b, x, y):          # one (2,) point + one label -> scalar
    return cross_entropy(x @ w + b, y)

# in_axes maps over X and Y, holds w and b shared:
gw, gb, _, _ = vmap(grad(per_example_loss), in_axes=(None, None, 0, 0))(w, b, X, Y)
# gw: (N, 2, 3)   gb: (N, 3)   -- one gradient per example
# their batch mean equals the ordinary full-batch gradient of the mean loss
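
To make the shapes concrete, here is a sketch of the same per-example gradients written out by hand in plain numpy. It assumes that `cross_entropy` takes logits and a one-hot label and computes `-sum(y * log(softmax(logits)))`; that is an assumption about the op, not something stated above. The clipping step at the end is one common use of per-example gradients, and `max_norm` is a hypothetical parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
w = rng.standard_normal((2, 3))
b = rng.standard_normal(3)
X = rng.standard_normal((N, 2))
Y = np.eye(3)[rng.integers(0, 3, N)]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Under the assumed loss, d(loss)/d(logits) = softmax(logits) - y for each example.
dlogits = softmax(X @ w + b) - Y               # (N, 3)

gw = np.einsum("ni,nj->nij", X, dlogits)       # (N, 2, 3): one outer product per example
gb = dlogits                                   # (N, 3)

# Gradient of the mean loss over the batch, computed the usual batched way.
full_gw = X.T @ dlogits / N
full_gb = dlogits.mean(axis=0)

# Per-example clipping: rescale each example's gradient to norm <= max_norm.
max_norm = 1.0
norms = np.sqrt((gw ** 2).sum(axis=(1, 2)) + (gb ** 2).sum(axis=1))
scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
clipped_gw = gw * scale[:, None, None]
clipped_gb = gb * scale[:, None]
```

The batched form only ever produces `full_gw` and `full_gb`; the clipping step needs the unreduced `(N, ...)` arrays, which is why the per-example stack matters.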

relu, softmax, cross_entropy, layer_norm, and scaled-dot-product attention ship as first-class, finite-difference-checked ops, so models stay plain numpy and the transforms see straight through them.
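
Forward mode, the idea behind the jvp transform mentioned in this section, can be pictured with dual numbers: every value carries a tangent, and each operation updates both. The sketch below is pure Python and is not pycograd's implementation; `Dual`, `dual_sin`, and `toy_jvp` are hypothetical names, and the library's own jvp may have a different signature.

```python
import math

class Dual:
    """A value paired with its directional derivative (tangent)."""
    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)
    __rmul__ = __mul__

def dual_sin(x):
    # chain rule: sin(u)' = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)

def toy_jvp(fn, primals, tangents):
    """Push one tangent vector through fn alongside the primal values."""
    out = fn(*[Dual(p, t) for p, t in zip(primals, tangents)])
    return out.value, out.tangent

def f(x, y):
    return dual_sin(x * y) + x

# Tangent (1, 0) selects the derivative with respect to x: y * cos(x * y) + 1.
value, tangent = toy_jvp(f, (0.5, 2.0), (1.0, 0.0))
```

One forward pass yields one directional derivative, so forward mode suits functions with few inputs, while reverse mode (grad) suits functions with few outputs.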

Inspecting the graph

A numpy function can be captured into a graph instead of run — the same idea as a JAX jaxpr. capture records the forward, value_and_grad differentiates it into a combined forward+backward graph, and optimize cleans that up.

import numpy as np
from pycograd import capture, value_and_grad, optimize

def forward(x, w, b):
    h = np.tanh(x @ w + b)
    return np.sum(h * h)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

g = capture(forward, x, w, b)              # trace once over (shape, dtype) inputs
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  outputs: %7
}

value_and_grad(g) returns one graph holding the value and the gradient w.r.t. every input (grad(g) keeps just the gradients). Written naïvely, the backward pass is wasteful — it recomputes tanh (%13, %14), doubles a multiply (%10, %11), and broadcasts a constant 1.0 (%8, %9):

# value_and_grad(g) -- BEFORE
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  %8 = const 1.0 -> f64[]
  %9 = broadcast_to %8 [4, 2] -> f64[4,2]
  %10 = mul %9 %5 -> f64[4,2]
  %11 = mul %9 %5 -> f64[4,2]
  %12 = add %10 %11 -> f64[4,2]
  %13 = tanh %4 -> f64[4,2]              # recomputes %5
  %14 = tanh %4 -> f64[4,2]              # recomputes %5
  %15 = mul %13 %14 -> f64[4,2]          # recomputes %6
  %16 = sub 1.0 %15 -> f64[4,2]
  %17 = mul %12 %16 -> f64[4,2]
  %18 = sum %17 {axis=0} -> f64[2]
  %19 = transpose %1 [1, 0] -> f64[2,3]
  %20 = matmul %17 %19 -> f64[4,3]
  %21 = transpose %0 [1, 0] -> f64[3,4]
  %22 = matmul %21 %17 -> f64[3,2]
  outputs: %7, %20, %22, %18
}
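
The gradient half of this graph can be replayed by hand in plain numpy and compared with finite differences. The sketch below is a reading of the printed nodes, not code generated by pycograd; the comments point at the node numbers each line corresponds to, and `numeric_grad` is a hypothetical helper.

```python
import numpy as np

def forward(x, w, b):
    h = np.tanh(x @ w + b)
    return np.sum(h * h)

def manual_value_and_grad(x, w, b):
    h = np.tanh(x @ w + b)               # %3, %4, %5
    value = np.sum(h * h)                # %6, %7
    d_pre = (h + h) * (1.0 - h * h)      # %12 times %16 gives %17
    gb = d_pre.sum(axis=0)               # %18
    gx = d_pre @ w.T                     # %19, %20
    gw = x.T @ d_pre                     # %21, %22
    return value, gx, gw, gb             # same order as the graph outputs

def numeric_grad(fn, arr, eps=1e-6):
    """Central differences over every entry of arr."""
    g = np.zeros_like(arr)
    for i in range(arr.size):
        step = np.zeros_like(arr)
        step.flat[i] = eps
        g.flat[i] = (fn(arr + step) - fn(arr - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

value, gx, gw, gb = manual_value_and_grad(x, w, b)
num_gx = numeric_grad(lambda a: forward(a, w, b), x)
num_gw = numeric_grad(lambda a: forward(x, a, b), w)
num_gb = numeric_grad(lambda a: forward(x, w, a), b)
```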

optimize removes the redundancy by common-subexpression elimination, constant folding, and dead-code elimination — the recomputed tanh/mul collapse back onto %5/%6 and the broadcast folds away:

opt = optimize(value_and_grad(g))
# optimize(value_and_grad(g)) -- AFTER
graph(%0:f64[4,3], %1:f64[3,2], %2:f64[2]) {
  %3 = matmul %0 %1 -> f64[4,2]
  %4 = add %3 %2 -> f64[4,2]
  %5 = tanh %4 -> f64[4,2]
  %6 = mul %5 %5 -> f64[4,2]
  %7 = sum %6 -> f64[]
  %12 = add %5 %5 -> f64[4,2]            # %10 and %11 were both mul %9 %5; the 1.0 broadcast folded away
  %16 = sub 1.0 %6 -> f64[4,2]          # reuses %6 = tanh^2 instead of recomputing tanh
  %17 = mul %12 %16 -> f64[4,2]
  %18 = sum %17 {axis=0} -> f64[2]      # grad wrt b
  %19 = transpose %1 [1, 0] -> f64[2,3]
  %20 = matmul %17 %19 -> f64[4,3]      # grad wrt x
  %21 = transpose %0 [1, 0] -> f64[3,4]
  %22 = matmul %21 %17 -> f64[3,2]      # grad wrt w
  outputs: %7, %20, %22, %18
}
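
Two of the passes named above are short enough to sketch in pure Python. The toy IR below, a list of `(name, op, args)` tuples, is an assumption made for this illustration and is not pycograd's internal representation. The sample program is a slice of the BEFORE graph, taken at the point where the multiply by 1.0 has already been folded, so `%8` and `%9` have no remaining users.

```python
def cse(instrs):
    """Common-subexpression elimination: keep the first node for each (op, args)."""
    seen = {}      # (op, args) -> name of the node that already computes it
    rename = {}    # eliminated name -> surviving name
    kept = []
    for name, op, args in instrs:
        args = tuple(rename.get(a, a) for a in args)   # follow earlier merges
        key = (op, args)
        if key in seen:
            rename[name] = seen[key]
        else:
            seen[key] = name
            kept.append((name, op, args))
    return kept, rename

def dce(instrs, outputs):
    """Dead-code elimination: walk backwards, keeping only what outputs need."""
    live = set(outputs)
    kept = []
    for name, op, args in reversed(instrs):
        if name in live:
            live.update(args)
            kept.append((name, op, args))
    return kept[::-1]

program = [
    ("%5", "tanh", ("%4",)),
    ("%6", "mul", ("%5", "%5")),
    ("%8", "const", ("1.0",)),
    ("%9", "broadcast_to", ("%8",)),     # unused once the multiply by 1.0 is folded
    ("%13", "tanh", ("%4",)),            # recomputes %5
    ("%14", "tanh", ("%4",)),            # recomputes %5
    ("%15", "mul", ("%13", "%14")),      # recomputes %6
    ("%16", "sub", ("1.0", "%15")),
]

merged, rename = cse(program)
outputs = [rename.get(o, o) for o in ("%6", "%16")]
optimized = dce(merged, outputs)
```

Because CSE rewrites arguments as it goes, `%15` becomes `mul %5 %5` and then merges into `%6`, which is how `%16` ends up reading `sub 1.0 %6` in the AFTER graph.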

Because the graph carries (shape, dtype) for every value, eval_shape / summary can report a net's output shapes and parameter counts without running it, and a captured forward can be handed to another framework — see below.
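
Shape reporting without execution amounts to running one small rule per op over the graph. The sketch below propagates shapes through the captured forward graph shown earlier; the rule table and `infer_shapes` are hypothetical and only cover the five ops in that example. Treating `%1` and `%2` (w and b) as the parameters is likewise an assumption for the count.

```python
import math
import numpy as np

def matmul_shape(a, b):
    if len(a) != 2 or len(b) != 2 or a[1] != b[0]:
        raise ValueError(f"cannot matmul {a} with {b}")
    return (a[0], b[1])

SHAPE_RULES = {
    "matmul": matmul_shape,
    "add": np.broadcast_shapes,      # numpy's own broadcasting rule
    "mul": np.broadcast_shapes,
    "tanh": lambda a: a,             # elementwise: shape unchanged
    "sum": lambda a: (),             # full reduction to a scalar
}

def infer_shapes(inputs, instrs):
    """Propagate shapes node by node; no array is ever allocated."""
    shapes = dict(inputs)
    for name, op, args in instrs:
        shapes[name] = SHAPE_RULES[op](*(shapes[a] for a in args))
    return shapes

inputs = {"%0": (4, 3), "%1": (3, 2), "%2": (2,)}
instrs = [
    ("%3", "matmul", ("%0", "%1")),
    ("%4", "add", ("%3", "%2")),
    ("%5", "tanh", ("%4",)),
    ("%6", "mul", ("%5", "%5")),
    ("%7", "sum", ("%6",)),
]

shapes = infer_shapes(inputs, instrs)
n_params = sum(math.prod(inputs[name]) for name in ("%1", "%2"))
```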

Training models

For writing models, %load_ext pycograd enables a small DSL (built on pipescript): a params{ ... } block declares the weights, a |> pipeline is the forward written once, and weights.grad differentiates it. Here is a 2-layer MLP classifier:

%load_ext pycograd
import numpy as np
from pycograd import relu, softmax, cross_entropy

rng = np.random.default_rng(42)

# synthetic 3-class data: X is (N, 2), Y one-hot (N, 3)
centers = np.array([[2.0, 2.0], [-2.0, 2.0], [0.0, -2.5]])
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])
Y = np.eye(3)[np.repeat(np.arange(3), 40)]

with params{
    w1 = 0.3 * rng.standard_normal((2, 16)); b1 = np.zeros(16)
    w2 = 0.3 * rng.standard_normal((16, 3)); b2 = np.zeros(3)
} as weights:
    logits  = $ |> $ @ w1 + b1 |> relu |> $ @ w2 + b2     # the model, written once
    forward = $ |> logits |> softmax
    obj     = |> X |> logits |> cross_entropy($, Y)
    for _ in range(200):
        value, grads = weights.grad(obj)                  # backprop
        weights.step(grads, 0.5)                          # in-place SGD
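
For comparison, here is roughly what the DSL program above does, written out in plain numpy with the backward pass derived by hand. This is a sketch under assumptions, not the library's code path: it assumes `cross_entropy` averages over the batch and takes logits, and that `weights.step(grads, lr)` performs `w -= lr * g` in place.

```python
import numpy as np

rng = np.random.default_rng(42)
centers = np.array([[2.0, 2.0], [-2.0, 2.0], [0.0, -2.5]])
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])
Y = np.eye(3)[np.repeat(np.arange(3), 40)]

w1 = 0.3 * rng.standard_normal((2, 16)); b1 = np.zeros(16)
w2 = 0.3 * rng.standard_normal((16, 3)); b2 = np.zeros(3)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

losses = []
for _ in range(200):
    # forward: x @ w1 + b1, relu, @ w2 + b2, softmax
    z1 = X @ w1 + b1
    a1 = np.maximum(z1, 0.0)
    probs = softmax(a1 @ w2 + b2)
    losses.append(-np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1)))

    # backward: chain rule, layer by layer
    dz2 = (probs - Y) / len(X)                 # mean cross-entropy w.r.t. logits
    dw2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ w2.T) * (z1 > 0)              # relu passes gradient where z1 > 0
    dw1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # in-place SGD step with learning rate 0.5
    for p, g in ((w1, dw1), (b1, db1), (w2, dw2), (b2, db2)):
        p -= 0.5 * g

probs = softmax(np.maximum(X @ w1 + b1, 0.0) @ w2 + b2)
accuracy = np.mean(probs.argmax(axis=1) == Y.argmax(axis=1))
```

The hand-written backward block is the part that `weights.grad(obj)` replaces; everything else maps line for line onto the pipeline.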

Weights are referred to by name, frozen[...] holds a weight fixed during training, and any optimizer can consume the gradients: for example, swap the loop for train(weights, obj, 200, Adam(lr=cosine_decay(0.05, 200))). The same forward is also what vmap and the compiler below consume.

Compile to PyTorch / JAX / TensorFlow

The captured graph can be lowered onto another framework's autodiff. Pass backend= and gradients come back from torch / jax / tf instead of the numpy tape, matching to floating-point tolerance:

for backend in ("torch", "jax", "tf"):
    v, grads = weights.grad(obj, backend=backend, jit=True)   # same model, framework autodiff

compile_to(forward, "torch") instead returns a plain function over the framework's own tensors, and to_torch_module / export_torchscript / export_onnx package a trained net for shipping with no pycograd dependency.

Examples & notebooks

The bundled demos (logistic regression, MLP, LayerNorm/Dropout, single-head Transformer block, GRU/LSTM) train from scratch and are gradient-checked against finite differences:

python -m pycograd.examples

The notebooks/ directory goes deeper, each as an executable walk-through:

  • pycograd_demo — linear classifier → MLP → highway net → self-attention → a Transformer encoder block.
  • pycograd_vmap_demo — where vmap earns its keep: per-sample gradients, gradient clipping, batched attention.
  • pycograd_rnn_demo / pycograd_rwkv_demo — GRU/LSTM and RWKV (trained in parallel, sampled one token at a time).
  • pycograd_compile_* — parity against PyTorch, JAX, TensorFlow, and Apple MPS, plus TorchScript / ONNX export.
  • pycograd_graph_viz_demo — the graph IR, its rendering, and the optimization passes shown above.

How it works

  • Var is a reverse-mode tape node wrapping a numpy array. Arithmetic operators are overloaded so that running a program builds a computation graph; Var.backward() then walks it in reverse to accumulate gradients.

  • Operator overloading alone is not enough. The moment user code calls a numpy function — np.exp(x) — numpy's ufunc machinery takes over and the gradient link is lost. (Var sets __array_ufunc__ = None so this fails loudly instead of silently producing a wrong gradient.) pyccolo supplies the missing piece: its before_call event lets a handler replace the function being called, swapping np.exp for a differentiable d_exp transparently — so idiomatic numpy code "just differentiates." The same mechanism routes scalar math.* through the numpy-backed primitives and powers the |> training DSL.
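
The two bullets above can be condensed into a toy example. The sketch below is not pycograd's `Var`; it is a minimal stand-in that supports only multiplication, and the `call` helper merely imitates, with a dictionary lookup, the effect of replacing the function being called. It does not use pyccolo, and `REPLACEMENTS` and `call` are hypothetical names.

```python
import numpy as np

class Var:
    """Toy tape node: a value, an accumulated gradient, and links to its inputs."""
    __array_ufunc__ = None            # np.exp(Var) now raises TypeError

    def __init__(self, value, parents=()):
        self.value = np.asarray(value, dtype=float)
        self.grad = np.zeros_like(self.value)
        self.parents = parents        # pairs of (input Var, output-grad -> input-grad)

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, lambda g: g * other.value),
                    (other, lambda g: g * self.value)))

    def backward(self):
        # Order nodes so each is processed after everything that consumes it.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = np.ones_like(self.value)   # seed: gradient of sum(output)
        for node in reversed(order):
            for parent, pullback in node.parents:
                parent.grad = parent.grad + pullback(node.grad)

def d_exp(x):
    """Differentiable replacement for np.exp on a Var."""
    out = np.exp(x.value)
    return Var(out, ((x, lambda g: g * out),))

REPLACEMENTS = {np.exp: d_exp}

def call(fn, arg):
    """Imitates call interception: swap fn for its differentiable twin if one exists."""
    return REPLACEMENTS.get(fn, fn)(arg)

x = Var([0.5, 1.0])
try:
    np.exp(x)                         # the raw ufunc refuses a Var
    failed_loudly = False
except TypeError:
    failed_loudly = True

y = call(np.exp, x * x)               # exp(x^2), recorded on the tape
y.backward()
expected = 2 * x.value * np.exp(x.value ** 2)
```

In the real library the swap happens transparently at the call site, so user code keeps writing `np.exp(x)`; the explicit `call` wrapper here only makes the substitution visible.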

License

BSD-3-Clause.
