Skip to main content

TinyGNN — A zero-dependency C++17 inference engine for Sparse Graph Neural Networks with Python bindings

Project description

TinyGNN

A dependency-free C++ inference engine optimized for Sparse Graph Neural Networks.

TinyGNN pivots from a dense-only deep learning runtime to a sparse-native architecture, demonstrating low-level systems programming, memory architecture optimizations, and parallel computing -- with zero external dependencies.


Project Roadmap

Phase Title Status
1 Hybrid Tensor Core Complete
2 Graph Data Loader Complete
3 Dense Compute Kernels (GEMM) Complete
4 Sparse Compute Kernels (SpMM) Complete
5 Activations & Utilities Complete
6 Multi-Architecture Layer Assembly Complete
7 Multi-Model End-to-End Pipeline & Reference Matching Complete
8 Python Bridge (pybind11) Complete
9 Software-Level Parallelism (OpenMP & AVX2) Complete
10 Operator Fusion (GAT & GraphSAGE) Complete
11 Packaging, CI/CD & Documentation Complete

Project Structure

TinyGNN/
├── CMakeLists.txt
├── setup.py                    # pybind11 Python extension build
├── pyproject.toml              # Phase 11: PEP 517/518 packaging metadata (PyPI)
├── MANIFEST.in                 # Phase 11: source distribution manifest
├── LICENSE                     # Phase 11: MIT license
├── CITATION.cff                # Phase 11: JOSS/JMLR citation metadata
├── CONTRIBUTING.md             # Phase 11: contributor guidelines
├── Doxyfile                    # Phase 11: Doxygen C++ documentation config
├── .github/
│   └── workflows/
│       └── ci.yml              # Phase 11: GitHub Actions CI/CD pipeline
├── docs/                       # Phase 11: Sphinx documentation
│   ├── conf.py                 # Sphinx configuration
│   ├── index.rst               # Documentation home page
│   ├── installation.rst        # Installation guide
│   ├── quickstart.rst          # Quick start tutorial
│   ├── python_api.rst          # Python API reference
│   ├── cpp_api.rst             # C++ API reference (Breathe + Doxygen)
│   ├── benchmarks.rst          # Benchmark results
│   └── Makefile                # Sphinx build helper
├── benchmarks/
│   ├── bench_parallel.cpp      # Phase 9: OpenMP + AVX2 thread-scaling benchmark
│   ├── bench_fusion.cpp        # Phase 10: fused vs. unfused GAT/SAGE runtime + memory
│   ├── massif_gat_fused.cpp    # Phase 10: Massif profiling (fused GAT heap)
│   ├── massif_gat_unfused.cpp  # Phase 10: Massif profiling (unfused GAT heap)
│   └── thread_scaling.png      # Phase 9: generated thread-scaling chart
├── include/
│   └── tinygnn/
│       ├── tensor.hpp          # StorageFormat enum + Tensor struct
│       ├── graph_loader.hpp    # GraphData + GraphLoader (CSV -> CSR pipeline)
│       ├── ops.hpp             # compute kernels + activations: matmul, spmm, relu, softmax, etc.
│       ├── layers.hpp          # GCN, GraphSAGE, GAT layer structs + helpers
│       └── model.hpp           # Model execution graph + binary weight/graph loaders
├── src/
│   ├── tensor.cpp
│   ├── graph_loader.cpp        # CSV parser + edge-list-to-CSR conversion
│   ├── ops.cpp                 # matmul, spmm, relu, leaky_relu, elu, sigmoid, tanh, gelu, softmax, log_softmax, add_bias
│   ├── layers.cpp              # add_self_loops, gcn_norm, GCNLayer, SAGELayer, GATLayer, edge_softmax
│   └── model.cpp               # Model::forward(), load_cora_binary(), load_weight_file()
├── python/
│   ├── tinygnn_ext.cpp         # Phase 8: pybind11 C++ bindings (280+ lines)
│   └── tinygnn/
│       └── __init__.py         # Phase 8: Python package re-exports
├── tests/
│   ├── test_tensor.cpp         # Phase 1 -- 104 assertions
│   ├── test_graph_loader.cpp   # Phase 2 -- 269 assertions
│   ├── test_matmul.cpp         # Phase 3 -- 268 assertions
│   ├── test_spmm.cpp           # Phase 4 -- 306 assertions
│   ├── test_activations.cpp    # Phase 5 -- 266 assertions
│   ├── test_gcn.cpp            # Phase 6a -- 764 assertions
│   ├── test_graphsage.cpp      # Phase 6b -- 1,011 assertions
│   ├── test_gat.cpp            # Phase 6c -- 1,284 assertions
│   ├── test_e2e.cpp            # Phase 7  -- 13,862 assertions (end-to-end pipeline)
│   └── test_python_bindings.py # Phase 8  -- 49 pytest tests (Python bridge)
├── datasets/                   # downloaded via scripts/fetch_datasets.py
│   ├── cora/                   # 2,708 nodes, 5,429 edges, 1,433 features
│   └── reddit/                 # 232,965 nodes, 114,615,892 edges, 602 features
├── weights/                    # exported by train_cora.py (gitignored)
│   ├── cora_graph.bin          # Cora graph in binary (features, labels, masks, edges)
│   ├── gcn_cora.bin            # GCN trained weights (~78% test accuracy)
│   ├── sage_cora.bin           # GraphSAGE trained weights (~81% test accuracy)
│   └── gat_cora.bin            # GAT trained weights (~80% test accuracy)
└── scripts/
    ├── install_wsl_tools.sh    # one-time WSL tooling setup
    ├── fetch_datasets.py       # download Cora + Reddit, convert to CSV
    ├── train_cora.py           # PyG training: GCN/SAGE/GAT → binary weight export
    ├── validate_cora.py        # PyG ↔ TinyGNN logit-level comparison
    ├── plot_scaling.py         # thread-scaling chart generator (matplotlib)
    ├── run_massif_phase9.sh    # Massif memory profiling (fused vs unfused)
    ├── build_python.py         # build Python extension + run tests
    ├── sanitizers.sh           # ASan + UBSan (27 configs, all phases)
    └── valgrind_all.sh         # Memcheck + Helgrind + Callgrind (all phases)

Installation

Option 1 — PyPI (Python users)

pip install tinygnn

Installs the pre-built Python extension wheel (_tinygnn_core) plus the tinygnn Python package. Supports Python 3.8–3.13 on Linux, macOS, and Windows (x86-64).

Development install from source

git clone https://github.com/AnubhavChoudhery/TinyGNN.git
cd TinyGNN
pip install -e ".[dev]"    # builds the C++ extension in-place + installs dev deps

Verify the install

python -c "import tinygnn; print(tinygnn.__version__)"    # → 0.1.5
python -m pytest tests/test_python_bindings.py -v         # 49 tests
python scripts/validate_cora.py --logit-check             # PyG logit comparison

Option 3 — Build from source (C++ only)

Prerequisites

  • C++17-capable compiler: GCC 8+, Clang 7+, MSVC 2019+
  • CMake 3.16+ (optional — g++ directly also works)
  • OpenMP (bundled with GCC; for macOS use brew install libomp)
  • CPU with AVX2 + FMA (Intel Haswell+ / AMD Zen+; scalar fallback always available)

With CMake

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --parallel
ctest --test-dir build -C Release --output-on-failure

CMake options:

Option Default Description
TINYGNN_BUILD_TESTS ON Build test executables
TINYGNN_BUILD_BENCHMARKS ON Build C++ benchmark executables

Directly with g++ (single test example)

g++ -std=c++17 -O2 -fopenmp -mavx2 -mfma -Wall -Wextra -I include \
    src/tensor.cpp src/graph_loader.cpp src/ops.cpp src/layers.cpp src/model.cpp \
    tests/test_tensor.cpp -o build/test_tensor && ./build/test_tensor

Python SDK Reference

Install: pip install tinygnn

Tensor

import tinygnn
import numpy as np
from scipy.sparse import csr_matrix

# Dense tensors from NumPy (float32)
arr = np.random.randn(10, 16).astype(np.float32)
X = tinygnn.Tensor.from_numpy(arr)
print(X.rows, X.cols)           # 10, 16
out = X.to_numpy()              # → np.ndarray shape (10, 16)

# 1-D arrays (bias vectors) → (1 × N) Tensor
b = tinygnn.Tensor.from_numpy_1d(np.zeros(16, dtype=np.float32))

# Sparse CSR from SciPy
mat = csr_matrix(...)
A = tinygnn.Tensor.from_scipy_csr(mat)

# Sparse CSR from PyG edge_index
import torch
ei = torch.randint(0, 100, (2, 500))
A = tinygnn.Tensor.from_edge_index(ei, num_nodes=100)

# Inspect CSR internals
rp = A.row_ptr_numpy()    # → np.ndarray int32
ci = A.col_ind_numpy()    # → np.ndarray int32

Ops

# Dense GEMM
C = tinygnn.matmul(A, B)   # Dense A (M×K), Dense B (K×N) → Dense C (M×N)

# Sparse-Dense SpMM
C = tinygnn.spmm(A, B)     # SparseCSR A (M×K), Dense B (K×N) → Dense C (M×N)

# In-place activations
tinygnn.relu_inplace(X)
tinygnn.leaky_relu_inplace(X, alpha=0.01)
tinygnn.elu_inplace(X, alpha=1.0)
tinygnn.sigmoid_inplace(X)
tinygnn.tanh_inplace(X)
tinygnn.gelu_inplace(X)
tinygnn.softmax_inplace(X)        # row-wise, numerically stable
tinygnn.log_softmax_inplace(X)    # row-wise log-softmax

# Bias broadcast  (X[i][j] += bias[0][j])
tinygnn.add_bias(X, bias)

Graph utilities

A_sl   = tinygnn.add_self_loops(A)      # insert identity self-loops
A_norm = tinygnn.gcn_norm(A)            # D̃^(-1/2)(A+I)D̃^(-1/2), adds self-loops
alpha  = tinygnn.edge_softmax(attn_csr) # sparse row-wise softmax
agg    = tinygnn.sage_max_aggregate(A, H)  # CSR neighborhood max-pool

Layers

import tinygnn

# GCN layer  (A must be gcn_norm output)
gcn = tinygnn.GCNLayer(F_in=64, F_out=32, bias=True,
                        activation=tinygnn.Activation.RELU)
gcn.set_weight(tinygnn.Tensor.from_numpy(W))     # Dense (F_in × F_out)
gcn.set_bias(tinygnn.Tensor.from_numpy_1d(b))
H2 = gcn.forward(A_norm, H)

# GraphSAGE layer
sage = tinygnn.SAGELayer(F_in=64, F_out=32,
                          aggregator=tinygnn.SAGELayer.Aggregator.Mean,
                          bias=True,
                          activation=tinygnn.Activation.RELU)
# Aggregator.Mean | Aggregator.Max
sage.set_weight_neigh(tinygnn.Tensor.from_numpy(Wn))
sage.set_weight_self(tinygnn.Tensor.from_numpy(Ws))
sage.set_bias(tinygnn.Tensor.from_numpy_1d(b))
H2 = sage.forward(A, H)    # A is raw adjacency (no gcn_norm)

# GAT layer  (A must include self-loops via add_self_loops)
gat = tinygnn.GATLayer(F_in=64, F_out=32, negative_slope=0.2,
                        bias=True, activation=tinygnn.Activation.NONE)
gat.set_weight(tinygnn.Tensor.from_numpy(W))
gat.set_attn_left(tinygnn.Tensor.from_numpy(al))
gat.set_attn_right(tinygnn.Tensor.from_numpy(ar))
gat.set_bias(tinygnn.Tensor.from_numpy_1d(b))
H2 = gat.forward(A_sl, H)

End-to-end: two-layer GCN on Cora

import tinygnn
import numpy as np

# Load pre-trained weights (produced by scripts/train_cora.py)
cora = tinygnn.load_cora_binary("weights/cora_graph.bin")

model = tinygnn.Model()
model.add_gcn_layer(1433, 64, bias=True, activation=tinygnn.Activation.RELU)
model.add_gcn_layer(64,   7,  bias=True, activation=tinygnn.Activation.NONE)
model.load_weights("weights/gcn_cora.bin")

logits      = model.forward(cora.adjacency, cora.features)  # Dense (2708 × 7)
predictions = logits.to_numpy().argmax(axis=1)               # (2708,) int

test_mask   = np.array(cora.test_mask)
labels      = np.array(cora.labels)
accuracy    = (predictions[test_mask] == labels[test_mask]).mean()
print(f"Test accuracy: {accuracy:.1%}")   # → ~78%

Enumerations

tinygnn.StorageFormat.Dense        # 0
tinygnn.StorageFormat.SparseCSR    # 1

tinygnn.Activation.NONE
tinygnn.Activation.RELU
tinygnn.Activation.ELU
tinygnn.Activation.SIGMOID
tinygnn.Activation.TANH
tinygnn.Activation.LEAKY_RELU
tinygnn.Activation.GELU
tinygnn.Activation.SOFTMAX
tinygnn.Activation.LOG_SOFTMAX

tinygnn.SAGELayer.Aggregator.Mean
tinygnn.SAGELayer.Aggregator.Max

GraphLoader (Python)

from tinygnn import GraphLoader

# Parse CSVs directly in Python
edges = GraphLoader.parse_edges("datasets/cora/edges.csv")
feats = GraphLoader.parse_features("datasets/cora/features.csv")
adj   = GraphLoader.edge_list_to_csr(edges, num_nodes=2708)

# Or use the full pipeline
data = GraphLoader.load("datasets/cora/edges.csv",
                        "datasets/cora/features.csv")

SDK Reference (C++ API)

Include the top-level headers — all APIs are in the global namespace:

#include "tinygnn/tensor.hpp"       // Tensor, StorageFormat
#include "tinygnn/graph_loader.hpp" // GraphLoader, GraphData
#include "tinygnn/ops.hpp"          // matmul, spmm, activations, utilities
#include "tinygnn/layers.hpp"       // GCNLayer, SAGELayer, GATLayer
#include "tinygnn/model.hpp"        // Model, load_cora_binary, load_weight_file

Link with tinygnn::tinygnn_core (CMake) or compile with all .cpp files under src/.

Tensor

// --- Construction (Dense) ---
Tensor t(rows, cols);                         // zero-initialized row-major float32
Tensor t(rows, cols, {v00, v01, ...});        // from initializer list

// --- Construction (SparseCSR) ---
Tensor t(rows, cols,
         std::vector<float>   values,
         std::vector<int32_t> row_ptr,        // size rows+1
         std::vector<int32_t> col_ind);

// --- Inspection ---
t.rows();                   // number of rows
t.cols();                   // number of columns
t.nnz();                    // non-zeros (SparseCSR only)
t.format();                 // StorageFormat::Dense | SparseCSR
t.memory_footprint_bytes(); // exact byte count including all arrays

// --- Element access (Dense only) ---
float v = t.at(i, j);      // bounds-checked read
t.set(i, j, value);        // bounds-checked write
float* p = t.data_ptr();   // raw pointer (row-major)

// --- CSR accessors (SparseCSR only) ---
const float*   t.csr_values();
const int32_t* t.csr_row_ptr();
const int32_t* t.csr_col_ind();

Graph Utilities

#include "tinygnn/ops.hpp"

// Add identity self-loops (modifies CSR structure)
Tensor A_sl = add_self_loops(A);

// Symmetric GCN normalization:  D̃^(-1/2) (A+I) D̃^(-1/2)
// (implicitly adds self-loops; pass raw adjacency, not add_self_loops output)
Tensor A_norm = gcn_norm(A);

// Sparse row-wise softmax over CSR non-zeros (attention weights)
Tensor alpha = edge_softmax(attention_csr);

// CSR neighborhood max-pooling: max_{j in N(i)} H[j]  per row
Tensor agg = sage_max_aggregate(A, H);

Graph I/O

#include "tinygnn/graph_loader.hpp"

// Load Cora binary (produced by scripts/train_cora.py)
GraphData data = load_cora_binary("weights/cora_graph.bin");
// data.adjacency  → SparseCSR Tensor  (N × N)
// data.features   → Dense Tensor      (N × F)
// data.labels     → std::vector<int>
// data.train_mask, val_mask, test_mask → std::vector<bool>

// Parse arbitrary CSV graphs
auto edges    = GraphLoader::parse_edges("edges.csv");
Tensor feats  = GraphLoader::parse_features("features.csv");
Tensor adj    = GraphLoader::edge_list_to_csr(edges, num_nodes);

// Full pipeline shortcut
GraphData gd  = GraphLoader::load("edges.csv", "features.csv");

Layers

All layers implement Tensor forward(const Tensor& A, const Tensor& H).

#include "tinygnn/layers.hpp"

// --- GCN (Kipf & Welling, ICLR 2017) ---
// A must be pre-normalized: A_norm = gcn_norm(raw_adj)
GCNLayer gcn(F_in, F_out, /*bias=*/true, Activation::ReLU);
gcn.set_weight(W);    // Dense Tensor F_in × F_out
gcn.set_bias(b);      // Dense Tensor 1 × F_out
Tensor H2 = gcn.forward(A_norm, H);

// --- GraphSAGE (Hamilton et al., NeurIPS 2017) ---
// A is raw (un-normalized) adjacency; module handles degree normalization
SAGELayer sage(F_in, F_out, SAGELayer::Aggregator::Mean, /*bias=*/true, Activation::ReLU);
// SAGELayer::Aggregator::Mean | Max
sage.set_weight_neigh(Wn);    // F_in × F_out
sage.set_weight_self(Ws);     // F_in × F_out
sage.set_bias(b);
Tensor H2 = sage.forward(A, H);

// --- GAT (Veličković et al., ICLR 2018) ---
// A must include self-loops: A_sl = add_self_loops(raw_adj)
GATLayer gat(F_in, F_out, /*negative_slope=*/0.2f, /*bias=*/true, Activation::None);
gat.set_weight(W);          // F_in × F_out
gat.set_attn_left(al);      // 1 × F_out  (destination attention)
gat.set_attn_right(ar);     // 1 × F_out  (source attention)
gat.set_bias(b);
Tensor H2 = gat.forward(A_sl, H);

Model (multi-layer execution graph)

#include "tinygnn/model.hpp"

Model model;

// Add layers in forward order
model.add_gcn_layer(1433, 64, /*bias=*/true, Activation::ReLU);
model.add_gcn_layer(64, 7,    /*bias=*/true, Activation::None);

// Load PyG-exported weights (TGNN binary format)
model.load_weights("weights/gcn_cora.bin");

// Run inference
GraphData cora = load_cora_binary("weights/cora_graph.bin");
Tensor logits  = model.forward(cora.adjacency, cora.features);
// → Dense Tensor (N × num_classes), call log_softmax_inplace for probabilities

Compute Ops

#include "tinygnn/ops.hpp"

// Dense GEMM:  C = A × B    (A: M×K Dense, B: K×N Dense → C: M×N Dense)
Tensor C = matmul(A, B);

// Sparse-Dense SpMM:  C = A × B    (A: M×K SparseCSR, B: K×N Dense → C: M×N Dense)
Tensor C = spmm(A, B);

// In-place activations (Dense only)
relu_inplace(X);
leaky_relu_inplace(X, /*alpha=*/0.01f);
elu_inplace(X, /*alpha=*/1.0f);
sigmoid_inplace(X);
tanh_inplace(X);
gelu_inplace(X);
softmax_inplace(X);       // row-wise; with max-subtraction stability
log_softmax_inplace(X);   // row-wise; numerically stable

// Bias broadcast  X[i][j] += bias[0][j]  for all i
add_bias(X, bias);        // bias: Dense 1 × F

CLI Reference (Python Scripts)

All scripts live under scripts/. Run with the active Python environment.

Fetch datasets

python scripts/fetch_datasets.py               # Cora + Reddit
python scripts/fetch_datasets.py --cora-only   # Cora only (~170 KB)

Writes to datasets/cora/ and datasets/reddit/ (gitignored).

Train & export Cora weights

# Prerequisite: pip install torch torch_geometric
python scripts/train_cora.py

Trains GCN, GraphSAGE (Mean), and GAT on Cora; exports weights/gcn_cora.bin, weights/sage_cora.bin, weights/gat_cora.bin, and weights/cora_graph.bin.

Validate logit-level agreement

python scripts/validate_cora.py               # weight-file accuracy check
python scripts/validate_cora.py --logit-check # fresh PyG train → logit diff

Run GNN benchmarks

python scripts/bench_gnn.py                   # 8 reps (default)
python scripts/bench_gnn.py --reps 20         # smoother medians
python scripts/bench_gnn.py --no-bench        # accuracy checks only
python scripts/bench_gnn.py --no-accuracy     # timing only
python scripts/bench_gnn.py --csv path/out.csv  # custom CSV prefix

Writes benchmarks/gnn_bench_results_timing.csv and benchmarks/gnn_bench_results_accuracy.csv.

Generate benchmark charts

python scripts/plot_gnn_bench.py    # produces 5 PNGs in benchmarks/
Output file Contents
gnn_runtime_comparison.png Wall-time per layer & graph size
gnn_memory_comparison.png Working-set memory (analytical, log scale)
gnn_speedup.png TinyGNN-NT vs TinyGNN-1T and vs PyG
gnn_scaling.png Runtime scaling with graph size
gnn_accuracy.png TinyGNN vs PyG max absolute logit diff

Thread-scaling and operator-fusion benchmarks

cmake --build build --target bench_parallel
./build/bench_parallel                        # OpenMP thread-scaling report

cmake --build build --target bench_fusion
OMP_NUM_THREADS=8 ./build/bench_fusion        # fused vs unfused GAT/SAGE

python scripts/plot_scaling.py                # → benchmarks/thread_scaling.png

Memory safety (Linux / WSL)

bash scripts/sanitizers.sh       # ASan + UBSan, 27 configs across all phases
bash scripts/valgrind_all.sh     # Memcheck + Helgrind + Callgrind

Documentation

doxygen Doxyfile                              # C++ API → docs/doxygen/html/
pip install sphinx sphinx-rtd-theme breathe
cd docs && make html                          # Python API → docs/_build/html/

GNN Benchmark Results

Machine-dependent results — all numbers vary with CPU core count, NUMA topology, memory bandwidth, and OS scheduler behaviour. Results from two different hardware environments are shown below. On machines with fewer cores, TinyGNN-MT may be slower than TinyGNN-1T for very small graphs (thread-launch overhead dominates). Run python scripts/bench_gnn.py --reps 8 && python scripts/plot_gnn_bench.py to reproduce on your own hardware.

Accuracy (TinyGNN vs PyTorch Geometric)

Synthetic graphs: N=30, F_in=8, F_out=4 — numerically identical weights fed to both engines.

Layer max|diff| mean|diff| Pass
GCNLayer 2.38e-07 4.32e-08
SAGELayer-Mean 4.77e-07 7.74e-08
SAGELayer-Max 4.77e-07 1.25e-07
GATLayer 1.79e-07 4.90e-08

All differences are well within float32 rounding tolerance (< 5×10⁻⁷).


Wall-time — Jai Ansh's machine (Intel i7-12650H · 16T · Dell G15 5520)

Hardware: Intel Core i7-12650H @ 2.30 GHz (10 cores / 16 threads), 16 GB DDR5-4800, NVIDIA RTX 3060 6 GB, Windows 11, GCC 13.3 (MinGW-UCRT64), Python 3.12. 8 timed repetitions, median reported.

Graph configs: small (N=1K, deg=5, F_in=64→32), medium (N=10K, deg=10, F_in=128→64), large (N=100K, deg=5, F_in=256→128).

Graph Layer TinyGNN-1T TinyGNN-16T PyG 16T vs 1T 16T vs PyG
small-1K GCN 0.29 ms 0.58 ms 2.05 ms 0.49× ¹ 3.51×
small-1K SAGE-Mean 0.38 ms 0.60 ms 1.65 ms 0.64× ¹ 2.77×
small-1K SAGE-Max 0.47 ms 0.78 ms 2.20 ms 0.61× ¹ 2.83×
small-1K GAT 0.46 ms 0.85 ms 2.17 ms 0.54× ¹ 2.55×
medium-10K GCN 9.74 ms 3.67 ms 33.10 ms 2.65× 9.02×
medium-10K SAGE-Mean 15.50 ms 3.17 ms 54.16 ms 4.90× 17.11×
medium-10K SAGE-Max 19.67 ms 3.63 ms 72.93 ms 5.42× 20.09×
medium-10K GAT 9.56 ms 3.98 ms 36.55 ms 2.40× 9.18×
large-100K GCN 262.59 ms 74.15 ms 597.18 ms 3.54× 8.05×
large-100K SAGE-Mean 490.37 ms 120.10 ms 594.53 ms 4.08× 4.95×
large-100K SAGE-Max 519.41 ms 89.52 ms 1010.78 ms 5.80× 11.29×
large-100K GAT 274.30 ms 73.70 ms 510.89 ms 3.72× 6.93×

¹ Small-graph thread overhead: spawning 16 threads costs more than the actual computation for N=1K graphs. Single-threaded is faster at this scale — see the speedup chart.

The charts below were generated from this hardware run:

GNN Runtime Comparison GNN Speedup GNN Memory Comparison


Wall-time — Anubhav's machine (Intel Core 9 270H · 20T)

Hardware: Intel Core Ultra 9 270H (20 logical threads), GCC 13.3, Python 3.11, Windows 11 + WSL2. 10 timed repetitions, median reported.

Graph Layer TinyGNN-1T TinyGNN-20T PyG 20T vs 1T 20T vs PyG
small-1K GCN 0.42 ms 0.84 ms 1.29 ms 0.49× ¹ 1.54×
small-1K SAGE-Mean 0.50 ms 0.73 ms 1.14 ms 0.68× ¹ 1.55×
small-1K SAGE-Max 0.46 ms 0.75 ms 1.61 ms 0.62× ¹ 2.15×
small-1K GAT 0.56 ms 1.20 ms 1.67 ms 0.47× ¹ 1.39×
medium-10K GCN 8.07 ms 3.67 ms 14.93 ms 2.20× 4.07×
medium-10K SAGE-Mean 14.23 ms 6.20 ms 24.03 ms 2.29× 3.87×
medium-10K SAGE-Max 11.70 ms 5.23 ms 35.29 ms 2.24× 6.74×
medium-10K GAT 8.62 ms 3.64 ms 18.31 ms 2.37× 5.04×
large-100K GCN 211.51 ms 55.99 ms 208.67 ms 3.78× 3.73×
large-100K SAGE-Mean 390.21 ms 82.47 ms 258.09 ms 4.73× 3.13×
large-100K SAGE-Max 390.84 ms 71.53 ms 393.57 ms 5.46× 5.50×
large-100K GAT 281.61 ms 79.85 ms 253.06 ms 3.53× 3.17×

¹ Small-graph thread overhead: on a 20-core CPU, spawning 20 threads costs more than the computation for N=1K. Single-threaded is faster at this scale.

(Important note the links and png s below are from Anubhav's system)

GNN Runtime Comparison GNN Speedup GNN Memory Comparison

(Important note the links and png s below are from Jai Ansh's system)

GNN Runtime Comparison GNN Speedup GNN Memory Comparison

Working-set memory (analytical estimate)

Memory is computed analytically from tensor dimensions rather than measured at runtime (C++ heap is invisible to Python memory profilers):

Graph Layer TinyGNN PyG estimate PyG / TinyGNN
small-1K GCN 0.58 MB 1.11 MB 1.93×
medium-10K GCN 11.36 MB 22.02 MB 1.94×
large-100K GCN 203.45 MB 403.03 MB 1.98×

TinyGNN uses roughly 2× less working-set memory than PyG across all tested scales, reflecting the absence of PyTorch tensor metadata, autograd buffers, and COO→CSR conversion overhead.


CI/CD

Every push to main and every pull request automatically triggers GitHub Actions:

  • C++ build + test on Linux (GCC + Clang), macOS, and Windows (MSVC)
  • Python build + test on Linux, macOS, Windows (Python 3.10 + 3.12)
  • ASan + UBSan sanitizer checks on Linux
  • Documentation build (Doxygen + Sphinx)

Valgrind Memcheck + Helgrind can be run locally via bash scripts/valgrind_all.sh (excluded from CI due to runtime overhead on large graph tests).


Datasets

TinyGNN tests against two real-world graph datasets. Download them with the included Python script:

# Cora only (no extra Python deps, ~170 KB download)
python3 scripts/fetch_datasets.py --cora-only

# Both Cora and Reddit (needs numpy + scipy, ~200 MB download)
pip install numpy scipy
python3 scripts/fetch_datasets.py
Dataset Source Nodes Edges Features CSV size
Cora LINQS (McCallum et al. 2000) 2,708 5,429 1,433 (binary bag-of-words) 7.5 MB
Reddit DGL / GraphSAGE (Hamilton et al. 2017) 232,965 114,615,892 602 (GloVe embeddings) 2.7 GB

Files are placed in datasets/cora/ and datasets/reddit/ (gitignored). Tests that require actual datasets skip gracefully when the files are not present.


Phase 1 -- Hybrid Tensor Core

Goal

Establish the fundamental data structure by expanding a contiguous memory design to support both dense and sparse graph layouts, without any third-party libraries.

Design

StorageFormat enum

enum class StorageFormat : uint8_t {
    Dense     = 0,   // Row-major contiguous storage
    SparseCSR = 1    // Compressed Sparse Row
};

Tensor struct layout

Dense:
  data_    -> [v00, v01, ..., v(R-1)(C-1)]   rows*cols floats, row-major
  strides_ -> {cols, 1}

SparseCSR:
  row_ptr_ -> size (rows+1)   row_ptr[i]..row_ptr[i+1] = col indices of row i
  col_ind_ -> size nnz        column index of each non-zero
  data_    -> size nnz        value of each non-zero

Memory footprint

Format Formula 1000x1000 example
Dense rows * cols * 4 bytes 4,000,000 bytes (3.8 MB)
SparseCSR nnz4 + nnz4 + (rows+1)*4 bytes 44,004 bytes with 5000 edges (43 KB)

Sparse uses ~1.1% of the memory of the equivalent dense matrix in this benchmark.

Results

Dense  1000x1000 memory          = 4,000,000 bytes
Sparse 1000x1000 (5000 nnz)      =    44,004 bytes
Memory ratio (sparse / dense)    =      1.10%
Very sparse (10 nnz in 10k x10k) =      0.01%

Test suite -- 104 assertions, 18 test functions, 0 failures

Category Functions
Dense construction and mutation 3
Memory footprint validation 2
Memory comparison / reduction ratios 2
Edge cases (empty, 1x1, 0-nnz, degenerate shapes, CSR data integrity) 6
Error handling (invalid args, out-of-bounds, wrong format access) 4
Utility (repr, enum values) 1
Total : 104
Passed: 104
Failed: 0

Phase 2 -- Graph Data Loader

Goal

Load irregular graph structures from CSV files and convert them to hardware-friendly sorted CSR format, ready for GNN inference.

Design

GraphLoader class

class GraphLoader {
public:
    // Edge-list CSV  ->  vector of (src, dst) pairs
    static std::vector<std::pair<int32_t, int32_t>>
        parse_edges(const std::string& path);

    // Feature CSV  ->  Dense tensor (num_nodes x num_features)
    static Tensor parse_features(const std::string& path);

    // Raw edge list  ->  sorted SparseCSR adjacency tensor
    static Tensor edge_list_to_csr(
        const std::vector<std::pair<int32_t, int32_t>>& edges,
        std::size_t num_nodes);

    // Full pipeline: CSV files  ->  GraphData (adjacency + features)
    static GraphData load(const std::string& edges_path,
                          const std::string& features_path);
};

CSR conversion algorithm -- O(E + V + E*log(E/V))

Step 1: Count out-degree per node         O(E)
Step 2: Build row_ptr via prefix sum      O(V)
Step 3: Fill col_ind via offset cursors   O(E)
Step 4: Sort col_ind within each row      O(E*log(E/V)) amortised

CSV format support

  • Automatic header row detection (skips non-numeric first lines)
  • Both LF and CRLF line endings
  • Whitespace-tolerant parsing
  • Out-of-order node IDs with gap zero-filling in feature tensor

Results (Cora-scale synthetic benchmark)

Nodes:     2,708
Edges:     10,556
Features:  1,433 per node

Adjacency: Tensor(2708x2708, SparseCSR, 95,284 bytes)
Features:  Tensor(2708x1433, Dense, 15,522,256 bytes)

Results (Reddit-scale synthetic benchmark)

Reddit (Hamilton et al. 2017, GraphSAGE paper): 232,965 nodes, 114,615,892 directed edges, 602 features. Generated synthetically with a fixed seed; full file pipeline writes ~1.4 GB edge CSV + ~282 MB feature CSV.

Nodes:     232,965
Edges:     114,615,892
Features:  602 per node

Adjacency: Tensor(232965x232965, SparseCSR, 917,859,000 bytes)  (~875 MB)
Features:  Tensor(232965x602,   Dense,      560,979,720 bytes)  (~535 MB)

In-memory CSR algorithm test:
  Edge generation (114M pairs):   1,350 ms
  CSR construction + sort:        5,591 ms
  Total:                          6,979 ms

Full file-pipeline test:
  Edge file write (1.4 GB):      25,491 ms
  Feature file write (282 MB):    3,748 ms
  GraphLoader::load():          119,047 ms
  Total:                        148,286 ms

Node 0 neighbors: 455 (CSR traversal matches raw CSV exactly)

Results (actual Cora dataset)

The real Cora citation network (McCallum et al. 2000): 2,708 papers linked by 5,429 directed citations, with 1,433 binary bag-of-words features per paper.

Nodes:     2,708
Edges:     5,429
Features:  1,433 per node (binary)

Adjacency: Tensor(2708x2708, SparseCSR, 54,268 bytes)     (~53 KB)
Features:  Tensor(2708x1433, Dense,     15,522,256 bytes)  (~14.8 MB)

Feature density: 49,216 / 3,880,564 entries are 1  (1.27%)
Node 0 neighbors: 3 (CSR traversal matches raw CSV exactly)

Results (actual Reddit dataset)

The real Reddit post network (Hamilton et al. 2017, GraphSAGE): 232,965 posts linked by 114,615,892 directed edges, with 602-dimensional GloVe word embeddings per post.

Nodes:     232,965
Edges:     114,615,892
Features:  602 per node (GloVe float)

Adjacency: Tensor(232965x232965, SparseCSR, 917,859,000 bytes)  (~875 MB)
Features:  Tensor(232965x602,   Dense,      560,979,720 bytes)  (~535 MB)

GraphLoader::load() time: 157 s  (WSL, Ubuntu 24.04, GCC 13.3)
Node 0 neighbors: 2,204

Test suite -- 269 assertions, 49 test functions, 0 failures

Category Functions Covers
Edge CSV parsing 6 header, no-header, CRLF, self-loops, trailing blanks
Feature CSV parsing 6 ordered, unordered, sparse IDs, negative values
CSR conversion 8 exact arrays, sort invariant, empty rows, memory footprint
Full load pipeline 4 node-0 neighbors, all-nodes verify, feature zero-expansion
Cora-scale validation 3 2708 nodes / 10556 edges / 1433 features / row_ptr invariants
Reddit-scale validation 4 232,965 nodes / 114,615,892 edges / in-memory CSR + full pipeline + node-0 match
Actual Cora dataset 4 2,708 nodes / 5,429 real citation edges / binary features / node-0 match
Actual Reddit dataset 4 232,965 nodes / 114,615,892 real edges / GloVe features / CSR invariants
Error handling 10 file-not-found, malformed lines, negative IDs, out-of-range
Total : 269
Passed: 269
Failed: 0

Phase 3 -- Dense Compute Kernels (GEMM)

Goal

Implement the baseline dense matrix multiplication needed for the GNN feature transformation H' = H x W, where H is the node feature matrix and W is the learned weight matrix.

Design

matmul function

// C = A x B
// A: Dense (M x K)   B: Dense (K x N)   C: Dense (M x N), newly allocated
Tensor matmul(const Tensor& A, const Tensor& B);

Loop order -- (i, k, j)

The implementation uses the (i, k, j) traversal order rather than the naive (i, j, k):

for i in [0, M):           // row of A and C
  for k in [0, K):         // shared dimension
    a_ik = A[i][k]         // hoisted into scalar register
    for j in [0, N):       // row of B, column of C
      C[i][j] += a_ik * B[k][j]

Why (i, k, j): A[i][k] is loop-invariant across the inner j-loop and lives in a register. B[k][j] is accessed row-sequentially (cache-friendly). The compiler can auto-vectorise the inner j-loop because the scalar a_ik eliminates a load dependency. __restrict__ pointers are used to communicate the no-aliasing guarantee to the compiler.

Complexity

Metric Value
Time complexity O(M * K * N)
FLOPs 2 * M * K * N (1 multiply + 1 add per A element)
Output allocation M * N * 4 bytes

Preconditions enforced

  • Both operands must be StorageFormat::Dense (sparse-dense SpMM is Phase 4)
  • A.cols() must equal B.rows() -- throws std::invalid_argument with both dimensions in the message
  • Zero-dimension tensors produce a valid zero-filled output

Results

matmul(128x128, 128x128)         =    1 ms
matmul(1024x32, 32x256)          =    4 ms  (output = 1024 KB)
matmul(256x256, I_256)           spot checks: 8/8 pass
matmul(5x8, 8x6) footprint       = 120 bytes (5*6*4)

GNN feature transform (H x W, one-hot nodes selecting rows of W):

H (3x4)  x  W (4x2)  ->  H' (3x2)
Node 0: [1, 2]   Node 1: [3, 4]   Node 2: [5, 6]

4x4 hardcoded result (spec required):

A = reshape([1..16], 4, 4)    B = reshape([17..32], 4, 4)

C = A x B:
  [  250,  260,  270,  280 ]
  [  618,  644,  670,  696 ]
  [  986, 1028, 1070, 1112 ]
  [ 1354, 1412, 1470, 1528 ]

Test suite -- 268 assertions, 30 test functions, 0 failures

Category Functions Covers
4x4 hardcoded result (spec) 3 full matrix, element spot-checks, 2x2 sub-block cross-check
Non-square shapes 3 (2x3)x(3x4), matrix x column-vector, row-vector x matrix
Identity and zero properties 4 AI=A, IA=A, non-square identity, zero matrix output
GNN feature transform 1 H x W with one-hot nodes verifying row selection from W
Algebraic properties 3 non-commutativity, associativity, scalar distributivity
Output tensor properties 3 format=Dense, shape MxN, footprint = MN4 bytes
Stress / scale 3 128x128 all-ones, 1024x32x256 rectangle, 256x256 identity
Dimension mismatch errors 5 basic throw, message content, 4 bad-shape combos, degenerate 0-dim
Sparse input errors 2 sparse A throws, sparse B throws (SparseCSR in message)
Edge / degenerate cases 3 1x1 scalar, all-zeros output, GNN layer chain (AB)C = A(BC)
Total : 268
Passed: 268
Failed: 0

Phase 4 -- Sparse Compute Kernels (SpMM)

Goal

Implement the sparse-dense matrix multiplication kernel that is the heart of GNN message-passing: H_agg = Adj × H, aggregating each node's neighbor features via the sparse adjacency matrix.

Design

spmm function

// C = A × B
// A: SparseCSR (M × K)   B: Dense (K × N)   C: Dense (M × N), newly allocated
Tensor spmm(const Tensor& A, const Tensor& B);

Algorithm -- CSR-SpMM

for i in [0, M):                        // row of A (node)
  for nz in [row_ptr[i], row_ptr[i+1]): // non-zeros in row i (neighbors)
    k = col_ind[nz]                     // column (neighbor node ID)
    a_val = data[nz]                    // edge weight (1.0 for unweighted)
    for j in [0, N):                    // feature dimension
      C[i][j] += a_val * B[k][j]       // accumulate neighbor's features

Why CSR-SpMM: The outer loop walks rows (nodes). The middle loop walks non-zero columns (graph neighbors) — this is the message-passing. The inner loop accumulates dense features — cache-friendly because B[k][j] is contiguous in memory and the compiler can auto-vectorise with the scalar a_val hoisted to a register.

Complexity

Metric Value
Time complexity O(nnz × N) — proportional to edges × features
FLOPs 2 × nnz × N (1 multiply + 1 add per edge per feature)
Output allocation M × N × 4 bytes

Preconditions enforced

  • A must be StorageFormat::SparseCSR (Dense A → "use matmul() instead")
  • B must be StorageFormat::Dense (Sparse B → "Sparse × Sparse not supported")
  • A.cols() must equal B.rows() — throws std::invalid_argument with dimensions in message
  • Early exit for degenerate cases (M=0, N=0, or nnz=0) producing valid zero-filled output

Results

3×3 hand-calculated SpMM (spec required):

A (SparseCSR):                   B (Dense):
  [1 1 0]  row_ptr=[0,2,3,5]      [1 2]
  [0 1 0]  col_ind=[0,1,1,0,2]    [3 4]
  [1 0 1]  values =[1,1,1,1,1]    [5 6]

C = spmm(A, B):
  [4, 6]   ← 1·[1,2] + 1·[3,4]
  [3, 4]   ← 1·[3,4]
  [6, 8]   ← 1·[1,2] + 1·[5,6]

4×4 weighted hand-calculated:

A (SparseCSR):                B (Dense):
  [2 0 0 0]  values=[2,3,     [1  0  2]
  [0 3 1 0]   1,1,2,1,1]      [0  1  0]
  [1 0 0 2]                    [3  0  1]
  [0 1 1 0]                    [0  2  0]

C = spmm(A, B):
  [2, 0, 4]  [3, 3, 1]  [1, 4, 2]  [3, 1, 1]

GNN message-passing (star graph, 5 nodes):

H_agg = spmm(Adj_star, H)
  Hub (node 0): sum of all features = [10, 40]
  Leaf i:       H[0] + H[i]  (receives from hub + self)

Equivalence with dense matmul verified for:

  • 32×32 sparse (~9% density) × 32×16 dense
  • Cora-scale: 2,708×2,708 sparse (3 nnz/row) × 2,708×32 dense

Test suite -- 306 assertions, 35 test functions, 0 failures

Category Functions Covers
3×3 hand-calculated (spec) 3 full matrix, element checks, column vector
4×4 weighted CSR 2 full result, element derivations
Non-square shapes 3 (5×3)×(3×4), (2×5)×(5×1), (1×4)×(4×3)
Identity & zero properties 4 sparse I × B = B, 64×64, zero-nnz, all-ones
GNN message-passing 3 triangle (fully connected), star graph, two-hop path chain
Dense matmul equivalence 2 small 3×3, medium 32×32 deterministic sparse
Output tensor properties 3 format=Dense, shape M×N, memory footprint
Stress / scale 3 512×512 identity, 2708-node Cora-scale, 1024×128
Wrong format errors 4 dense A throws, message check, sparse B throws, message check
Dimension mismatch 3 basic throw, message "dimension mismatch", multiple combos
Edge / degenerate cases 5 single nnz, empty rows, diagonal/self-loops, negative weights, 1×1
Total : 306
Passed: 306
Failed: 0

Phase 5 -- Activations & Utilities

Goal

Implement the element-wise activation functions and utilities needed between GNN layers: H' = Activation(Adj × H × W + b). These operate in-place on Dense tensors for zero-copy efficiency.

Design

Activation functions

Function Formula GNN Use Case
relu_inplace(X) max(0, x) GCN, GraphSAGE, GIN — standard hidden layer activation
leaky_relu_inplace(X, α) x if x≥0, else αx (default α=0.01) GAT attention coefficients (α=0.2)
elu_inplace(X, α) x if x≥0, else α(eˣ−1) (default α=1.0) EGNN, PNA — smooth negative region
sigmoid_inplace(X) 1/(1+e⁻ˣ) GGNN gating, link prediction, binary classification
tanh_inplace(X) tanh(x) GRU/LSTM graph networks (GGNN, MPNN)
gelu_inplace(X) x·Φ(x) ≈ 0.5x(1+tanh(√(2/π)(x+0.044715x³))) GPS, Graphormer, TokenGT — transformer GNNs
softmax_inplace(X) row-wise softmax with max-subtraction stability GAT attention normalization, classification output
log_softmax_inplace(X) row-wise log-softmax (numerically stable) NLL loss for node classification

Utility functions

Function Operation GNN Use Case
add_bias(X, bias) X[i][j] += bias[0][j] — broadcast (1×N) across M rows Linear layer bias: H' = AHW + b

Common properties

  • All activations modify the tensor in-place (O(1) extra memory)
  • All validate Dense format and throw std::invalid_argument on SparseCSR
  • Softmax / log-softmax use the max-subtraction trick for numerical stability
  • Sigmoid uses a two-branch implementation (x≥0 vs x<0) to prevent overflow
  • GELU uses the tanh approximation (same as PyTorch F.gelu(approximate='tanh'))

Complexity

Function Time Extra Memory
relu, leaky_relu, sigmoid, tanh O(M × N) O(1)
elu, gelu O(M × N) with exp/tanh on subset O(1)
softmax, log_softmax O(M × N) — 3 passes per row O(1)
add_bias O(M × N) O(1)

Results

GCN layer pipeline (identity Adj, 3 nodes, 2 features):

H' = ReLU(Adj × H × W + b)
  Node 0: [1.5, 0.0]   (relu clamps negative bias-shifted value)
  Node 1: [2.5, 0.0]
  Node 2: [3.5, 0.0]

GAT-style attention (LeakyReLU + Softmax):

Scores:  [0.5, -0.3, 1.2]
LeakyReLU(α=0.2): [0.5, -0.06, 1.2]
Softmax: attention weights sum to 1.0, largest score gets most weight

Node classification (Log-Softmax for NLL loss):

3 nodes × 4 classes
All log-probabilities ≤ 0, exp(row) sums to 1.0
Argmax preserved through log-softmax

Numerical stability verified:

softmax([1000, 1001])     — no NaN/Inf, correct probabilities
log_softmax([1000, 1001, 999]) — no NaN/Inf, exp sums to 1.0

Test suite -- 266 assertions, 70 test functions, 0 failures

Category Functions Covers
ReLU 5 basic, 2D matrix, all-positive, all-negative, zero preservation
Leaky ReLU 5 default α=0.01, GAT α=0.2, 2D, α=0→ReLU, α=1→identity
ELU 4 basic, custom α=2.0, saturation at −α, 2D mixed signs
Sigmoid 5 basic values, σ(x)+σ(−x)=1 symmetry, bounds (0,1), hand-calculated, 2D
Tanh 5 basic, odd symmetry, bounds (−1,1), hand-calculated, 2D
GELU 4 basic (0→0, large→x, neg→0), hand-calculated vs reference, zero symmetry, 2D
Softmax 7 row sums=1, bounds [0,1], uniform dist, hand-calculated, multi-row, numerical stability, argmax preserved
Log-Softmax 6 exp sums=1, all ≤ 0, consistent with softmax, uniform, numerical stability, multi-row
Add bias 6 basic broadcasting, zero bias, negative bias, single-row, wrong-rows throw, wrong-cols throw
GNN pipeline integration 3 full GCN layer, GAT attention, node classification
Error handling (SparseCSR) 10 all 9 activations + add_bias reject sparse input
Stress / scale 5 relu 1024×256, softmax 1024×128, sigmoid 512×512, bias 2708×32, gelu 256×256
Edge / degenerate 5 1×1 through all activations, zero-cols rejected, zero-rows no-op, double-apply idempotency
Total : 266
Passed: 266
Failed: 0

Phase 6 -- Multi-Architecture Layer Assembly

Goal

Assemble complete GNN layer architectures from the tensor primitives built in Phases 1–5, implementing the three canonical propagation strategies: GCN (spectral normalization), GraphSAGE (inductive mean/max aggregation), and GAT (attention-based message passing).

Design

Helper functions

// Â = A + I  — insert identity self-loops into CSR (two-pass: detect diag, rebuild)
Tensor add_self_loops(const Tensor& A);

// D̃^(-1/2) Â D̃^(-1/2)  — symmetric GCN normalization
Tensor gcn_norm(const Tensor& A);

// Row-wise sparse softmax over CSR non-zeros only (max-subtract, two-pass)
Tensor edge_softmax(const Tensor& A);

// Element-wise max pooling over CSR neighborhoods
Tensor sage_max_aggregate(const Tensor& A, const Tensor& H);

GCN Layer (GCNLayer)

Kipf & Welling, ICLR 2017

H' = σ( Â_norm · (H · W) + b )

  ̂   precomputed by gcn_norm():  D̃^(-1/2) (A+I) D̃^(-1/2)
A = precomputed
Forward:
  Step 1: HW  = matmul(H, W)       [N × F_in] × [F_in × F_out] → [N × F_out]
  Step 2: out = spmm(Â_norm, HW)  sparse message-passing
  Step 3: add_bias(out, b)           optional
  Step 4: σ(out)                     ReLU or None

GraphSAGE Layer (SAGELayer)

Hamilton, Ying & Leskovec, NeurIPS 2017

h_v' = σ( W_neigh · AGG({h_u : u ∈ N(v)}) + W_self · h_v + b )

AGG variants:
  Mean: spmm(A, H) / degree[v]             (normalized neighbor sum)
  Max:  element-wise max over CSR row       (sage_max_aggregate)

Forward:
  Step 1: agg    = Mean-spmm or sage_max_aggregate
  Step 2: h_neigh = matmul(agg, W_neigh)
  Step 3: h_self  = matmul(H,   W_self)
  Step 4: out    = h_neigh + h_self        (element-wise add)
  Step 5: add_bias(out, b)                 optional
  Step 6: σ(out)                            ReLU or None

GAT Layer (GATLayer)

Veličković et al., ICLR 2018

Forward (single attention head):
  Step 1: Wh          = matmul(H, W)                     [N × F_out]
  Step 2: src[i]      = a_l · Wh[i],  dst[j] = a_r · Wh[j]   per node
  Step 3: e_ij        = LeakyReLU(src[i] + dst[j])       SpSDDMM — one value per CSR edge
  Step 4: α           = edge_softmax(e_csr)               sparse softmax per row
  Step 5: out         = spmm(α, Wh) + b                  attention-weighted aggregation
  Step 6: σ(out)

Weights: W (F_in×F_out), a_l (1×F_out), a_r (1×F_out), b (1×F_out)
A must include self-loops (use add_self_loops before forward).

Phase 6a: GCN Layer

Test suite -- 764 assertions, 68 test functions, 0 failures

Category Tests Covers
add_self_loops 7 basic, already-has-diagonal, empty
gcn_norm 7 hand-verified D̃^(-1/2) Â D̃^(-1/2) values
GCNLayer construction + weight management 8 shape checks, wrong-format, wrong-shape throws
GCNLayer forward — 3-node hand-computed 6 row-by-row numerical verification
GCNLayer forward — no bias, no activation 5 identity propagation
Two-layer GCN stacking 4 output shape and feature composition
10-node ring graph — full pipeline 11 symmetry, degree normalization, activation
Error handling 12 wrong formats, dimension mismatches, error messages
Edge / degenerate cases 8 single node, isolated nodes, identity adjacency
Total : 764
Passed: 764
Failed: 0

Phase 6b: GraphSAGE Layer

Test suite -- 1,011 assertions, 60 test functions, 0 failures

Category Tests Covers
sage_max_aggregate 8 basic, empty graph, single node
SAGELayer construction + weight management 9 shape checks, aggregator type, wrong-format throws
SAGELayer Mean forward — 3-node hand-computed 7 degree normalization, W_neigh/W_self combination
SAGELayer Max forward — 3-node hand-computed 7 max pooling, W_neigh/W_self combination
Multi-layer GraphSAGE 4 two-layer composition, feature dimension flow
10-node graph tests 7 mean vs max comparison, scale, bias/activation
Error handling 10 dense A, sparse H, dimension mismatches, messages
Edge / degenerate cases 8 isolated nodes get zero aggregation
Total : 1011
Passed: 1011
Failed: 0

Phase 6c: GAT Layer

Test suite -- 1,284 assertions, 60 test functions, 0 failures

Category Tests Covers
edge_softmax 10 row sums = 1, uniform input, single-entry rows, numerical stability
GATLayer construction + weight management 12 shape checks, wrong-format, wrong-shape throws
GATLayer forward — 3-node hand-computed 10 attention score computation, softmax, weighted SpMM
GATLayer forward — bias and activation 8 bias add, ReLU gate
Multi-layer GAT + mixed layer stacking 6 GAT→GAT, GCN→GAT, SAGEMax→GAT compositions
Larger graph tests (10-node, 100-node) 8 scale, output shape, attention normalization
Error handling 10 wrong formats, dimension mismatches, negative slope validation
Edge / degenerate cases 6 self-loop-only, single node, all-zero attention
Total : 1284
Passed: 1284
Failed: 0

Phase 7 -- Multi-Model End-to-End Pipeline & Reference Matching

Goal

Close the loop from training to inference: train three GNN architectures (GCN, GraphSAGE, GAT) on the standard Cora citation dataset using PyTorch Geometric, export their FP32 weights to a custom binary format, and validate that TinyGNN's C++ inference engine reproduces the same classification accuracy.

Architecture

Python Training Pipeline (scripts/train_cora.py)

  • Loads Cora from PyG's Planetoid dataset (2,708 nodes, 10,556 directed edges, 1,433 features, 7 classes)
  • Trains three models with early stopping on validation accuracy:
    • GCN: 2-layer, 1433 → 64 → 7, ReLU, dropout 0.5
    • GraphSAGE: 2-layer Mean aggregation, 1433 → 64 → 7, ReLU, dropout 0.5
    • GAT: 2-layer, 8 heads × 8 features (concat) → 7 (single head), ELU, dropout 0.6
  • Exports binary graph data and per-model weight files

Binary Format

Graph data (cora_graph.bin):

num_nodes, num_features, num_classes, num_edges  (4 × uint32)
features   (N × F × float32, row-major)
labels     (N × int32)
train_mask, val_mask, test_mask  (N × uint8 each)
edge_src, edge_dst  (E × int32 each)

Weight files ({gcn,sage,gat}_cora.bin):

Magic "TGNN"  (4 bytes)
Version       (uint32 = 1)
TestAccuracy  (float32)
NumTensors    (uint32)
─ per tensor ─
NameLen (uint32) | Name (bytes) | Rows (uint32) | Cols (uint32) | Data (float32[])

C++ Model Class (model.hpp / model.cpp)

The Model class implements a dynamic execution graph that chains heterogeneous GNN layers:

Model model;
model.add_gcn_layer(1433, 64, true, Activation::ReLU);
model.add_gcn_layer(64, 7, true, Activation::None);
model.load_weights("weights/gcn_cora.bin");
Tensor logits = model.forward(adjacency, features);

Key features:

  • Heterogeneous layer support: GCN, SAGE, and GAT layers can be freely mixed in a single execution graph
  • Multi-head GAT: multiple independent GATLayer instances per logical layer, with concat or average aggregation
  • Automatic adjacency preprocessing: gcn_norm() for GCN layers, raw adjacency for SAGE, add_self_loops() for GAT — computed once and cached
  • Inter-layer activations: ELU, ReLU, or None applied between layers (separate from each layer's internal activation)

Weight Mapping (PyG → TinyGNN)

PyG Parameter TinyGNN Parameter Transform
GCNConv.lin.weight (out×in) GCNLayer.weight (in×out) Transpose
GCNConv.bias (out) GCNLayer.bias (1×out) Reshape
SAGEConv.lin_l.weight (out×in) SAGELayer.weight_neigh (in×out) Transpose
SAGEConv.lin_r.weight (out×in) SAGELayer.weight_self (in×out) Transpose
GATConv.lin.weight[h*F:(h+1)*F] GATLayer.weight per head (in×F) Slice + Transpose
GATConv.att_src[0,h,:] GATLayer.attn_left (1×F) Extract head
GATConv.att_dst[0,h,:] GATLayer.attn_right (1×F) Extract head

Accuracy Results

Model PyG Accuracy C++ Accuracy Match
GCN (2-layer, 64 hidden) 78.3% 78.3% Exact
GraphSAGE (2-layer, Mean, 64 hidden) 80.6% 80.6% Exact
GAT (8-head + 1-head, ELU) 80.3% 79.3% ±1% (FP rounding)

Test Suite -- 13,862 assertions, 26 test functions, 0 failures

Category Tests Covers
Data loading 3 Binary graph loader, weight file loader, data integrity
Weight shape validation 3 GCN/SAGE/GAT tensor shapes match architecture
Model API 5 Layer addition, empty model throws, missing weight throws
Small graph forward pass 4 GCN/SAGE/GAT on 3-node graph, multi-head concat
Logit sanity 4 All outputs finite, log-softmax sums to 1
Prediction distribution 3 All 7 classes present in predictions
End-to-end accuracy 4 GCN/SAGE/GAT match PyG ±5%, train accuracy ≥ 85%
Tests run:    26
Assertions:   13862
Passed:       13862
Failed:       0

Phase 8 -- Python Bridge (pybind11)

Goal

Expose the entire TinyGNN C++ inference engine to Python via pybind11, enabling researchers to import tinygnn and run GNN inference from Python with NumPy/SciPy/PyTorch interoperability.

Architecture

pybind11 Bindings (python/tinygnn_ext.cpp)

A single C++ source wrapping the full TinyGNN API:

  • Tensor — zero-copy NumPy interop via from_numpy(), to_numpy(), SciPy CSR via from_scipy_csr(), PyG edge_index via from_edge_index()
  • Ops — all 8 activations, matmul, spmm, add_bias (in-place semantics preserved)
  • Graph utilities — add_self_loops, gcn_norm, edge_softmax, sage_max_aggregate
  • Layers — GCNLayer, SAGELayer (Mean/Max), GATLayer with full weight management
  • Model — dynamic execution graph with weight loading from TGNN binary files
  • I/O — load_cora_binary, load_weight_file, GraphLoader

Build System (setup.py)

  • Uses setuptools + pybind11 C++ extension
  • MinGW cross-compilation with static GCC runtime linking (no DLL dependencies)
  • Produces _tinygnn_core.cp312-win_amd64.pyd (~558 KB)
  • Clean pip install -e . or python setup.py build_ext --inplace

Python Package (python/tinygnn/)

import tinygnn
import numpy as np

# Create tensor from NumPy
X = tinygnn.Tensor.from_numpy(np.random.randn(10, 16).astype(np.float32))

# Build and run GCN model
model = tinygnn.Model()
model.add_gcn_layer(1433, 64, activation=tinygnn.Activation.RELU)
model.add_gcn_layer(64, 7, activation=tinygnn.Activation.NONE)
model.load_weights("weights/gcn_cora.bin")
logits = model.forward(adjacency, features)
result = logits.to_numpy()  # → NumPy array

Zero-Copy Interop

Method Direction Data Type
Tensor.from_numpy(arr) NumPy → TinyGNN 2D float32/float64 array → Dense Tensor
Tensor.from_numpy_1d(arr) NumPy → TinyGNN 1D array → (1×N) Dense Tensor (bias)
Tensor.to_numpy() TinyGNN → NumPy Dense Tensor → 2D float32 array
Tensor.from_scipy_csr(mat) SciPy → TinyGNN csr_matrix → SparseCSR Tensor
Tensor.from_edge_index(ei, N) PyG → TinyGNN (2×E) int32 COO → SparseCSR Tensor
Tensor.row_ptr_numpy() TinyGNN → NumPy CSR row_ptr as int32 array
Tensor.col_ind_numpy() TinyGNN → NumPy CSR col_ind as int32 array

Cora Validation (scripts/validate_cora.py)

Two validation modes:

Mode 1 — Weight File Validation: Loads pre-trained weights, runs TinyGNN inference, verifies test accuracy within 1% of expected.

Mode 2 — Logit-Level Comparison: Trains a fresh GCN in PyG, exports weights, runs both PyG and TinyGNN on identical inputs, asserts logits match:

Metric Result
Max absolute logit diff 0.000002
Mean absolute logit diff 0.000000
Mean relative diff 0.000001
Prediction agreement 100.0%

Test Suite -- 49 pytest tests, 0 failures

Category Tests Covers
Dense Tensor 9 creation, from_numpy, to_numpy, roundtrip, repr, memory
Sparse CSR Tensor 5 creation, CSR accessors, from_scipy, from_edge_index, error
Ops (matmul/spmm) 4 identity, values, SpMM identity, SpMM aggregation
Activations 9 relu, leaky_relu, elu, sigmoid, tanh, gelu, softmax, log_softmax, add_bias
Graph utilities 4 add_self_loops, gcn_norm, edge_softmax, sage_max_aggregate
GCN Layer 3 construction, identity weight forward, ReLU forward
SAGE Layer 4 Mean/Max construction, Mean forward, Max forward
GAT Layer 2 construction, forward pass
Model 3 GCN model build+forward, SAGE model build+forward, weight file loading
Enums 3 StorageFormat, Activation, Aggregator
Error handling 3 wrong ndim, sparse-to-numpy, bad edge_index shape
Tests run:    49
Passed:       49
Failed:       0

Phase 9 -- Software-Level Parallelism (OpenMP & AVX2)

Goal

Maximize CPU throughput by adding thread-level parallelism (OpenMP) and data-level parallelism (AVX2 SIMD intrinsics) to the compute kernels, achieving near-linear speedup on multi-core CPUs.

Design

OpenMP Parallelism

All outer (row/node) loops are parallelized with OpenMP pragmas:

Kernel Pragma Schedule Rationale
matmul #pragma omp parallel for dynamic Rows are uniform, but dynamic handles cache-sharing effects
spmm #pragma omp parallel for dynamic Power-law graphs have highly variable row lengths
softmax_inplace #pragma omp parallel for dynamic Each row is independent (3 passes: max, exp+sum, normalize)
log_softmax_inplace #pragma omp parallel for dynamic Same row-independent structure as softmax
relu/elu/sigmoid/tanh/gelu #pragma omp parallel for static Element-wise ops with uniform work per iteration
add_bias #pragma omp parallel for static Uniform row-based broadcasting

AVX2 + FMA Intrinsics

The inner feature-dimension loops of matmul and spmm use 256-bit SIMD (8 floats per cycle):

// Inner loop of spmm (8-wide FMA):
const __m256 va = _mm256_set1_ps(a_val);          // broadcast edge weight
__m256 vc = _mm256_loadu_ps(crow + j);             // load 8 output features
__m256 vb = _mm256_loadu_ps(brow + j);             // load 8 neighbor features
vc = _mm256_fmadd_ps(va, vb, vc);                  // fused multiply-add
_mm256_storeu_ps(crow + j, vc);                     // store 8 results

Scalar tail handles feature dimensions not divisible by 8. Code compiles cleanly without AVX2 via #ifdef __AVX2__ guards.

Build Flags

g++ -std=c++17 -O2 -fopenmp -mavx2 -mfma ...

CMake automatically detects OpenMP and AVX2/FMA support via find_package(OpenMP) and check_cxx_compiler_flag.

Benchmark Results

Machine-dependent results — numbers below measured on Intel Core 9 270H (20 logical threads), GCC 13.3, Ubuntu 24.04 (WSL2). Thread counts, memory bandwidth, and topology will produce different speedup figures on other hardware.

Tested on Intel Core 9 270H (20 threads), GCC 13.3, Ubuntu 24.04 (WSL2):

Cora-scale (2,708 nodes x 1,433 features, avg degree 3.9)

Threads SpMM Speedup GEMM Speedup Softmax Speedup
1 6.4 ms 1.00x 500 ms 1.00x 15.4 ms 1.00x
2 4.0 ms 1.58x 250 ms 2.00x 10.4 ms 1.48x
4 3.4 ms 1.85x 167 ms 2.99x 6.6 ms 2.35x
8 2.6 ms 2.49x 114 ms 4.38x 4.6 ms 3.35x

Large-scale (50,000 nodes x 256 features, avg degree 15)

Threads SpMM Speedup Softmax Speedup
1 152 ms 1.00x 69.0 ms 1.00x
2 81 ms 1.87x 65.0 ms 1.06x
4 57 ms 2.68x 56.7 ms 1.22x
8 68 ms 2.24x 48.0 ms 1.44x

Key observations:

  • GEMM achieves super-linear scaling (4.38x on 8 threads) due to improved cache utilization when threads share L2/L3
  • SpMM scales well (2.49x on 8 threads for Cora) despite irregular memory access patterns
  • Softmax scales efficiently (3.35x on 8 threads) because rows are fully independent
  • Large-scale SpMM saturates at 4 threads — memory bandwidth becomes the bottleneck

Thread-Scaling Chart

Thread-Scaling Chart


Phase 10 -- Operator Fusion (GAT & GraphSAGE)

Goal

Fuse multi-step GNN layer operations into single row-wise loops, eliminating intermediate tensor allocations and reducing peak memory consumption while improving runtime through better data locality.

Design

GAT Fusion: SpSDDMM + edge_softmax + SpMM

The unfused GAT forward pass allocates three nnz-sized intermediate tensors:

Step Operation Intermediate Size
3 SpSDDMM edge_logits vector nnz × 4 bytes
3 Build CSR attn_csr (rp + ci + vals copies) (N+1)×4 + 2×nnz×4 bytes
4 edge_softmax alpha CSR (another full copy) (N+1)×4 + 2×nnz×4 bytes

The fused kernel replaces Steps 3–5 with a single row-wise loop:

for each node i:                                    // parallelized with OpenMP
    for j  N(i):
        e_ij = LeakyReLU(src_score[i] + dst_score[j])   // SpSDDMM
    α = softmax(e_row)                                    // edge_softmax
    out[i] = Σ_j α_ij · Wh[j]                            // SpMM (AVX2 FMA)

This eliminates ~7 × nnz bytes of intermediate allocations, replacing them with a single per-row buffer of size max_degree × 4 bytes.

SAGE Fusion: Aggregation + Dual-Matmul

The unfused SAGE forward allocates:

Step Operation Intermediate Size
1 spmm(A, H) agg tensor N × F_in × 4 bytes
2 matmul(agg, W_neigh) h_neigh tensor N × F_out × 4 bytes
3 matmul(H, W_self) h_self tensor N × F_out × 4 bytes

The fused kernel computes everything row-by-row:

for each node i:                                    // parallelized with OpenMP
    agg_row = mean({H[j] : j  N(i)})                // AVX2 accumulate
    out[i] = agg_row · W_neigh + H[i] · W_self       // dual-matmul (AVX2 FMA)

This eliminates the N × F_in aggregation tensor and the N × F_out h_self tensor, needing only an F_in-sized per-row buffer.

Benchmark Results

Machine-dependent results — fused vs. unfused timings below are representative for AVX2 CPUs with multi-level caches. The crossover point (where fusion starts winning) shifts with L2/L3 cache size and memory bandwidth.

Tested on Intel Core 9 270H (8 threads), GCC 13.3, Ubuntu 24.04 (WSL2):

GAT: Fused SpSDDMM + edge_softmax + SpMM

Configuration Unfused Fused Speedup Memory Ratio
Cora-like (2708×1433→8, deg≈5) 4,867 µs 5,391 µs 0.90× 2.8×
Medium (5000×128→64, deg≈10) 4,211 µs 3,129 µs 1.35× 1.4×
Large (10000×64→32, deg≈20) 5,717 µs 3,534 µs 1.62× 2.6×
XL (20000×32→16, deg≈30) 12,323 µs 4,726 µs 2.61× 5.6×

SAGE: Fused Aggregation + Dual-Matmul

Configuration Unfused Fused Speedup Memory Ratio
Cora-like (2708×1433→128, deg≈5) 60,259 µs 20,209 µs 2.98× 13.1×
Medium (5000×128→64, deg≈10) 6,107 µs 1,858 µs 3.29× 4.0×
Large (10000×64→32, deg≈20) 5,631 µs 3,293 µs 1.71× 4.0×
XL (20000×128→64, deg≈30) 30,015 µs 18,246 µs 1.64× 4.0×

Key observations:

  • GAT fusion scales with graph density — at deg≈30, the nnz-sized intermediates dominate and fusion saves 5.6× memory with 2.61× speedup
  • SAGE fusion excels when F_in is large — Cora layer 1 (F_in=1433) shows 13.1× memory reduction and 2.98× speedup by avoiding the N×F_in aggregate tensor
  • Cora-like GAT is slightly slower (0.90×) because the per-row softmax buffer allocation overhead dominates for very small neighborhoods (deg≈5)

Valgrind Massif Profiling

Massif heap profiling on GAT (N=10000, F_in=64, F_out=32, deg≈20, nnz≈210K):

Variant Peak Heap Savings
Unfused 10,857,004 bytes (10.35 MB)
Fused 7,420,824 bytes (7.08 MB) 3.44 MB (31.6%)

Run profiling: bash scripts/run_massif_phase9.sh


Phase 11 -- Packaging, CI/CD & Documentation

Goal

Make TinyGNN a "one-click install" package ready for JMLR MLOSS or JOSS submission, with professional CI/CD, comprehensive documentation, and PyPI packaging.

PyPI Packaging

TinyGNN ships as a standard Python package with a pybind11 C++ extension:

pip install tinygnn

Package structure:

  • pyproject.toml — PEP 517/518 metadata (name, version, authors, classifiers, dependencies)
  • setup.py — Extension build configuration (C++17 flags, OpenMP, AVX2, MinGW/MSVC/GCC support)
  • MANIFEST.in — Source distribution manifest (ensures headers, sources, docs are included)
  • LICENSE — MIT License
  • CITATION.cff — Machine-readable citation metadata for JOSS/JMLR

Classifiers:

  • Development Status :: 4 - Beta
  • Topic :: Scientific/Engineering :: Artificial Intelligence
  • License :: OSI Approved :: MIT License
  • Python 3.8–3.13 support

CI/CD (GitHub Actions)

Every push to main and every PR triggers 5 parallel jobs:

Job Platform What it does
C++ Build + Test Ubuntu (GCC + Clang), macOS, Windows (MSVC) CMake build → CTest (18,000+ assertions)
Python Build + Test Ubuntu (3.10, 3.12), macOS (3.12), Windows (3.12) pip install → pytest (49 tests)
Sanitizers Ubuntu (GCC) ASan + UBSan + leak detection (-DCMAKE_EXE_LINKER_FLAGS)
Valgrind Ubuntu (GCC) Memcheck (0 errors, 0 leaks) + Helgrind (0 data races)
Documentation Ubuntu Doxygen (C++) + Sphinx (Python) → artifact upload

Documentation

C++ API (Doxygen):

  • Configured via Doxyfile in project root
  • Extracts from annotated headers in include/tinygnn/
  • Generates HTML + XML (for Breathe cross-reference)
  • Class diagrams, include graphs (via Graphviz)
  • Build: doxygen Doxyfiledocs/doxygen/html/

Python API (Sphinx):

  • Read the Docs theme (sphinx-rtd-theme)
  • Cross-references C++ docs via Breathe bridge
  • Pages: Installation, Quick Start, Python API, C++ API, Benchmarks, Contributing
  • Build: cd docs && make htmldocs/_build/html/

Authors

Jai Ansh Singh Bindra and Anubhav Choudhery (under JBAC EdTech)


Cumulative Test Statistics

Phase Test file Functions Assertions Result
1 test_tensor.cpp 18 104 104 / 104
2 test_graph_loader.cpp 49 269 269 / 269
3 test_matmul.cpp 30 268 268 / 268
4 test_spmm.cpp 35 306 306 / 306
5 test_activations.cpp 70 266 266 / 266
6a test_gcn.cpp 68 764 764 / 764
6b test_graphsage.cpp 60 1,011 1,011 / 1,011
6c test_gat.cpp 60 1,284 1,284 / 1,284
7 test_e2e.cpp 26 13,862 13,862 / 13,862
8 test_python_bindings.py 49 49 (pytest) 49 / 49
Total 465 18,183 18,183 / 18,183

Memory Safety

All seven phases pass the full sanitizer matrix with zero errors:

Config Tensor Graph Loader Matmul SpMM Activations GCN GraphSAGE GAT E2E
ASan + LSan Pass Pass Pass Pass Pass Pass Pass Pass Pass
UBSan Pass Pass Pass Pass Pass Pass Pass Pass Pass
ASan + UBSan combined Pass Pass Pass Pass Pass Pass Pass Pass Pass
Valgrind Memcheck 0 errors, 0 leaks 0 errors, 0 leaks 0 errors, 0 leaks 0 errors, 0 leaks 0 errors, 0 leaks 0 errors, 0 leaks 0 errors, 0 leaks 0 errors, 0 leaks 0 errors, 0 leaks
Valgrind Helgrind 0 races 0 races 0 races 0 races 0 races 0 races 0 races 0 races 0 races

Design Principles

  • Zero dependencies -- no Eigen, no BLAS, no Boost, no external libraries of any kind
  • Exact memory accounting -- memory_footprint_bytes() returns precise byte counts for both storage formats
  • Fail-fast validation -- CSR construction validates row_ptr monotonicity, column index bounds, and size consistency; matmul validates format and inner dimensions before allocating output
  • Unified interface -- Dense and SparseCSR tensors share the same Tensor type; format is a runtime tag, not a separate class hierarchy
  • Cache-aware loop ordering -- GEMM uses (i,k,j) order with a register-hoisted scalar to enable compiler auto-vectorisation on the inner loop
  • Explicit SIMD -- AVX2 FMA intrinsics (_mm256_fmadd_ps) in matmul and SpMM inner loops give 8-wide float throughput with scalar fallback for portability
  • Thread-level parallelism -- OpenMP parallel for on all outer loops (rows/nodes) with schedule(dynamic) for load-balanced graph kernels
  • Sparse-native message passing -- SpMM walks the CSR structure directly (row_ptr → col_ind → accumulate), avoiding any sparse-to-dense conversion
  • In-place activations -- all 8 activation functions modify tensors in-place with O(1) extra memory, avoiding unnecessary allocations in GNN inference pipelines
  • Numerical stability -- softmax/log-softmax use the max-subtraction trick; sigmoid uses a two-branch formula to prevent overflow for extreme inputs
  • Operator fusion -- GAT fuses SpSDDMM + edge_softmax + SpMM into a single row-wise loop (eliminating nnz-sized intermediate CSR tensors); SAGE fuses aggregation + dual-matmul (eliminating N×F_in intermediate)
  • One-click install -- pip install tinygnn builds the C++ extension, links statically, and ships a clean Python package; GitHub Actions CI/CD validates every commit across 4 platforms

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tinygnn-0.1.5.tar.gz (251.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tinygnn-0.1.5-cp312-cp312-win_amd64.whl (795.1 kB view details)

Uploaded CPython 3.12Windows x86-64

File details

Details for the file tinygnn-0.1.5.tar.gz.

File metadata

  • Download URL: tinygnn-0.1.5.tar.gz
  • Upload date:
  • Size: 251.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for tinygnn-0.1.5.tar.gz
Algorithm Hash digest
SHA256 00be0801b581db50989ff5241a938c745d6658de25a969c7d76f26214b842ef0
MD5 5d9914e209bd8b83afa21b6353d943c8
BLAKE2b-256 b6c8a6a25be340d629adbbc282b5655dab49478818aba104f49b7db95c9e77bc

See more details on using hashes here.

File details

Details for the file tinygnn-0.1.5-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: tinygnn-0.1.5-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 795.1 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for tinygnn-0.1.5-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 a062f6309d0d89613b12e610ff64676337b4ee20e6d73951119e6bcf06ed21b3
MD5 f591e9a8a932dd97c414bbd2d655814c
BLAKE2b-256 286033734bed1014734516794385153ffbaab7005d3c309c7ef73bb5c271a97a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page