TinyGNN — A zero-dependency C++17 inference engine for Sparse Graph Neural Networks with Python bindings
A dependency-free C++ inference engine optimized for Sparse Graph Neural Networks.
TinyGNN pivots from a dense-only deep learning runtime to a sparse-native architecture, demonstrating low-level systems programming, memory architecture optimizations, and parallel computing -- with zero external dependencies.
Project Roadmap
| Phase | Title | Status |
|---|---|---|
| 1 | Hybrid Tensor Core | Complete |
| 2 | Graph Data Loader | Complete |
| 3 | Dense Compute Kernels (GEMM) | Complete |
| 4 | Sparse Compute Kernels (SpMM) | Complete |
| 5 | Activations & Utilities | Complete |
| 6 | Multi-Architecture Layer Assembly | Complete |
| 7 | Multi-Model End-to-End Pipeline & Reference Matching | Complete |
| 8 | Python Bridge (pybind11) | Complete |
| 9 | Software-Level Parallelism (OpenMP & AVX2) | Complete |
| 10 | Operator Fusion (GAT & GraphSAGE) | Complete |
| 11 | Packaging, CI/CD & Documentation | Complete |
Project Structure
TinyGNN/
├── CMakeLists.txt
├── setup.py # pybind11 Python extension build
├── pyproject.toml # Phase 11: PEP 517/518 packaging metadata (PyPI)
├── MANIFEST.in # Phase 11: source distribution manifest
├── LICENSE # Phase 11: MIT license
├── CITATION.cff # Phase 11: JOSS/JMLR citation metadata
├── CONTRIBUTING.md # Phase 11: contributor guidelines
├── Doxyfile # Phase 11: Doxygen C++ documentation config
├── .github/
│ └── workflows/
│ └── ci.yml # Phase 11: GitHub Actions CI/CD pipeline
├── docs/ # Phase 11: Sphinx documentation
│ ├── conf.py # Sphinx configuration
│ ├── index.rst # Documentation home page
│ ├── installation.rst # Installation guide
│ ├── quickstart.rst # Quick start tutorial
│ ├── python_api.rst # Python API reference
│ ├── cpp_api.rst # C++ API reference (Breathe + Doxygen)
│ ├── benchmarks.rst # Benchmark results
│ └── Makefile # Sphinx build helper
├── benchmarks/
│ ├── bench_parallel.cpp # Phase 9: OpenMP + AVX2 thread-scaling benchmark
│ ├── bench_fusion.cpp # Phase 10: fused vs. unfused GAT/SAGE runtime + memory
│ ├── massif_gat_fused.cpp # Phase 10: Massif profiling (fused GAT heap)
│ ├── massif_gat_unfused.cpp # Phase 10: Massif profiling (unfused GAT heap)
│ └── thread_scaling.png # Phase 9: generated thread-scaling chart
├── include/
│ └── tinygnn/
│ ├── tensor.hpp # StorageFormat enum + Tensor struct
│ ├── graph_loader.hpp # GraphData + GraphLoader (CSV -> CSR pipeline)
│ ├── ops.hpp # compute kernels + activations: matmul, spmm, relu, softmax, etc.
│ ├── layers.hpp # GCN, GraphSAGE, GAT layer structs + helpers
│ └── model.hpp # Model execution graph + binary weight/graph loaders
├── src/
│ ├── tensor.cpp
│ ├── graph_loader.cpp # CSV parser + edge-list-to-CSR conversion
│ ├── ops.cpp # matmul, spmm, relu, leaky_relu, elu, sigmoid, tanh, gelu, softmax, log_softmax, add_bias
│ ├── layers.cpp # add_self_loops, gcn_norm, GCNLayer, SAGELayer, GATLayer, edge_softmax
│ └── model.cpp # Model::forward(), load_cora_binary(), load_weight_file()
├── python/
│ ├── tinygnn_ext.cpp # Phase 8: pybind11 C++ bindings (280+ lines)
│ └── tinygnn/
│ └── __init__.py # Phase 8: Python package re-exports
├── tests/
│ ├── test_tensor.cpp # Phase 1 -- 104 assertions
│ ├── test_graph_loader.cpp # Phase 2 -- 269 assertions
│ ├── test_matmul.cpp # Phase 3 -- 268 assertions
│ ├── test_spmm.cpp # Phase 4 -- 306 assertions
│ ├── test_activations.cpp # Phase 5 -- 266 assertions
│ ├── test_gcn.cpp # Phase 6a -- 764 assertions
│ ├── test_graphsage.cpp # Phase 6b -- 1,011 assertions
│ ├── test_gat.cpp # Phase 6c -- 1,284 assertions
│ ├── test_e2e.cpp # Phase 7 -- 13,862 assertions (end-to-end pipeline)
│ └── test_python_bindings.py # Phase 8 -- 49 pytest tests (Python bridge)
├── datasets/ # downloaded via scripts/fetch_datasets.py
│ ├── cora/ # 2,708 nodes, 5,429 edges, 1,433 features
│ └── reddit/ # 232,965 nodes, 114,615,892 edges, 602 features
├── weights/ # exported by train_cora.py (gitignored)
│ ├── cora_graph.bin # Cora graph in binary (features, labels, masks, edges)
│ ├── gcn_cora.bin # GCN trained weights (~78% test accuracy)
│ ├── sage_cora.bin # GraphSAGE trained weights (~81% test accuracy)
│ └── gat_cora.bin # GAT trained weights (~80% test accuracy)
└── scripts/
├── install_wsl_tools.sh # one-time WSL tooling setup
├── fetch_datasets.py # download Cora + Reddit, convert to CSV
├── train_cora.py # PyG training: GCN/SAGE/GAT → binary weight export
├── validate_cora.py # PyG ↔ TinyGNN logit-level comparison
├── plot_scaling.py # thread-scaling chart generator (matplotlib)
├── run_massif_phase9.sh # Massif memory profiling (fused vs unfused)
├── build_python.py # build Python extension + run tests
├── sanitizers.sh # ASan + UBSan (27 configs, all phases)
└── valgrind_all.sh # Memcheck + Helgrind + Callgrind (all phases)
Installation
Option 1 — PyPI (Python users)
pip install tinygnn
Installs the pre-built Python extension wheel (_tinygnn_core) plus the tinygnn Python package. Supports Python 3.8–3.13 on Linux, macOS, and Windows (x86-64).
Option 2 — Development install from source
git clone https://github.com/AnubhavChoudhery/TinyGNN.git
cd TinyGNN
pip install -e ".[dev]" # builds the C++ extension in-place + installs dev deps
Verify the install
python -c "import tinygnn; print(tinygnn.__version__)" # → 0.1.5
python -m pytest tests/test_python_bindings.py -v # 49 tests
python scripts/validate_cora.py --logit-check # PyG logit comparison
Option 3 — Build from source (C++ only)
Prerequisites
- C++17-capable compiler: GCC 8+, Clang 7+, MSVC 2019+
- CMake 3.16+ (optional — g++ directly also works)
- OpenMP (bundled with GCC; on macOS: brew install libomp)
- CPU with AVX2 + FMA (Intel Haswell+ / AMD Zen+; scalar fallback always available)
With CMake
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --parallel
ctest --test-dir build -C Release --output-on-failure
CMake options:
| Option | Default | Description |
|---|---|---|
| TINYGNN_BUILD_TESTS | ON | Build test executables |
| TINYGNN_BUILD_BENCHMARKS | ON | Build C++ benchmark executables |
Directly with g++ (single test example)
g++ -std=c++17 -O2 -fopenmp -mavx2 -mfma -Wall -Wextra -I include \
src/tensor.cpp src/graph_loader.cpp src/ops.cpp src/layers.cpp src/model.cpp \
tests/test_tensor.cpp -o build/test_tensor && ./build/test_tensor
Python SDK Reference
Install: pip install tinygnn
Tensor
import tinygnn
import numpy as np
from scipy.sparse import csr_matrix
# Dense tensors from NumPy (float32)
arr = np.random.randn(10, 16).astype(np.float32)
X = tinygnn.Tensor.from_numpy(arr)
print(X.rows, X.cols) # 10, 16
out = X.to_numpy() # → np.ndarray shape (10, 16)
# 1-D arrays (bias vectors) → (1 × N) Tensor
b = tinygnn.Tensor.from_numpy_1d(np.zeros(16, dtype=np.float32))
# Sparse CSR from SciPy
mat = csr_matrix(...)
A = tinygnn.Tensor.from_scipy_csr(mat)
# Sparse CSR from PyG edge_index
import torch
ei = torch.randint(0, 100, (2, 500))
A = tinygnn.Tensor.from_edge_index(ei, num_nodes=100)
# Inspect CSR internals
rp = A.row_ptr_numpy() # → np.ndarray int32
ci = A.col_ind_numpy() # → np.ndarray int32
Ops
# Dense GEMM
C = tinygnn.matmul(A, B) # Dense A (M×K), Dense B (K×N) → Dense C (M×N)
# Sparse-Dense SpMM
C = tinygnn.spmm(A, B) # SparseCSR A (M×K), Dense B (K×N) → Dense C (M×N)
# In-place activations
tinygnn.relu_inplace(X)
tinygnn.leaky_relu_inplace(X, alpha=0.01)
tinygnn.elu_inplace(X, alpha=1.0)
tinygnn.sigmoid_inplace(X)
tinygnn.tanh_inplace(X)
tinygnn.gelu_inplace(X)
tinygnn.softmax_inplace(X) # row-wise, numerically stable
tinygnn.log_softmax_inplace(X) # row-wise log-softmax
# Bias broadcast (X[i][j] += bias[0][j])
tinygnn.add_bias(X, bias)
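The "numerically stable" note on the softmax ops refers to the standard max-subtraction trick. A minimal NumPy sketch of that trick (a reference for the math only, not the tinygnn kernel):

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    # Subtracting the row max leaves the result mathematically unchanged
    # but keeps every exponent <= 0, so exp() never overflows.
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1000.0, 1000.0], [0.0, 1.0]], dtype=np.float32)
print(softmax_rows(x)[0])  # -> [0.5 0.5] despite the huge inputs
```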
Graph utilities
A_sl = tinygnn.add_self_loops(A) # insert identity self-loops
A_norm = tinygnn.gcn_norm(A) # D̃^(-1/2)(A+I)D̃^(-1/2), adds self-loops
alpha = tinygnn.edge_softmax(attn_csr) # sparse row-wise softmax
agg = tinygnn.sage_max_aggregate(A, H) # CSR neighborhood max-pool
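What gcn_norm computes can be written in a few lines of dense NumPy (an illustrative reference only — the library performs the equivalent computation directly on CSR):

```python
import numpy as np

def gcn_norm_dense(A: np.ndarray) -> np.ndarray:
    """Symmetric GCN normalization D^(-1/2) (A + I) D^(-1/2) on a dense adjacency."""
    A_tilde = A + np.eye(A.shape[0], dtype=A.dtype)  # add self-loops
    deg = A_tilde.sum(axis=1)                        # degrees of A + I
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale row i by d_inv_sqrt[i] and column j by d_inv_sqrt[j]
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Two nodes joined by a bidirectional edge: degrees of A+I are [2, 2],
# so the normalized matrix has every entry equal to 0.5.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
print(gcn_norm_dense(A))  # -> every entry 0.5
```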
Layers
import tinygnn
# GCN layer (A must be gcn_norm output)
gcn = tinygnn.GCNLayer(F_in=64, F_out=32, bias=True,
activation=tinygnn.Activation.RELU)
gcn.set_weight(tinygnn.Tensor.from_numpy(W)) # Dense (F_in × F_out)
gcn.set_bias(tinygnn.Tensor.from_numpy_1d(b))
H2 = gcn.forward(A_norm, H)
# GraphSAGE layer
sage = tinygnn.SAGELayer(F_in=64, F_out=32,
aggregator=tinygnn.SAGELayer.Aggregator.Mean,
bias=True,
activation=tinygnn.Activation.RELU)
# Aggregator.Mean | Aggregator.Max
sage.set_weight_neigh(tinygnn.Tensor.from_numpy(Wn))
sage.set_weight_self(tinygnn.Tensor.from_numpy(Ws))
sage.set_bias(tinygnn.Tensor.from_numpy_1d(b))
H2 = sage.forward(A, H) # A is raw adjacency (no gcn_norm)
# GAT layer (A must include self-loops via add_self_loops)
gat = tinygnn.GATLayer(F_in=64, F_out=32, negative_slope=0.2,
bias=True, activation=tinygnn.Activation.NONE)
gat.set_weight(tinygnn.Tensor.from_numpy(W))
gat.set_attn_left(tinygnn.Tensor.from_numpy(al))
gat.set_attn_right(tinygnn.Tensor.from_numpy(ar))
gat.set_bias(tinygnn.Tensor.from_numpy_1d(b))
H2 = gat.forward(A_sl, H)
End-to-end: two-layer GCN on Cora
import tinygnn
import numpy as np
# Load pre-trained weights (produced by scripts/train_cora.py)
cora = tinygnn.load_cora_binary("weights/cora_graph.bin")
model = tinygnn.Model()
model.add_gcn_layer(1433, 64, bias=True, activation=tinygnn.Activation.RELU)
model.add_gcn_layer(64, 7, bias=True, activation=tinygnn.Activation.NONE)
model.load_weights("weights/gcn_cora.bin")
logits = model.forward(cora.adjacency, cora.features) # Dense (2708 × 7)
predictions = logits.to_numpy().argmax(axis=1) # (2708,) int
test_mask = np.array(cora.test_mask)
labels = np.array(cora.labels)
accuracy = (predictions[test_mask] == labels[test_mask]).mean()
print(f"Test accuracy: {accuracy:.1%}") # → ~78%
Enumerations
tinygnn.StorageFormat.Dense # 0
tinygnn.StorageFormat.SparseCSR # 1
tinygnn.Activation.NONE
tinygnn.Activation.RELU
tinygnn.Activation.ELU
tinygnn.Activation.SIGMOID
tinygnn.Activation.TANH
tinygnn.Activation.LEAKY_RELU
tinygnn.Activation.GELU
tinygnn.Activation.SOFTMAX
tinygnn.Activation.LOG_SOFTMAX
tinygnn.SAGELayer.Aggregator.Mean
tinygnn.SAGELayer.Aggregator.Max
GraphLoader (Python)
from tinygnn import GraphLoader
# Parse CSVs directly in Python
edges = GraphLoader.parse_edges("datasets/cora/edges.csv")
feats = GraphLoader.parse_features("datasets/cora/features.csv")
adj = GraphLoader.edge_list_to_csr(edges, num_nodes=2708)
# Or use the full pipeline
data = GraphLoader.load("datasets/cora/edges.csv",
"datasets/cora/features.csv")
SDK Reference (C++ API)
Include the top-level headers — all APIs are in the global namespace:
#include "tinygnn/tensor.hpp" // Tensor, StorageFormat
#include "tinygnn/graph_loader.hpp" // GraphLoader, GraphData
#include "tinygnn/ops.hpp" // matmul, spmm, activations, utilities
#include "tinygnn/layers.hpp" // GCNLayer, SAGELayer, GATLayer
#include "tinygnn/model.hpp" // Model, load_cora_binary, load_weight_file
Link with tinygnn::tinygnn_core (CMake) or compile with all .cpp files under src/.
Tensor
// --- Construction (Dense) ---
Tensor t(rows, cols); // zero-initialized row-major float32
Tensor t(rows, cols, {v00, v01, ...}); // from initializer list
// --- Construction (SparseCSR) ---
Tensor t(rows, cols,
std::vector<float> values,
std::vector<int32_t> row_ptr, // size rows+1
std::vector<int32_t> col_ind);
// --- Inspection ---
t.rows(); // number of rows
t.cols(); // number of columns
t.nnz(); // non-zeros (SparseCSR only)
t.format(); // StorageFormat::Dense | SparseCSR
t.memory_footprint_bytes(); // exact byte count including all arrays
// --- Element access (Dense only) ---
float v = t.at(i, j); // bounds-checked read
t.set(i, j, value); // bounds-checked write
float* p = t.data_ptr(); // raw pointer (row-major)
// --- CSR accessors (SparseCSR only) ---
const float* t.csr_values();
const int32_t* t.csr_row_ptr();
const int32_t* t.csr_col_ind();
Graph Utilities
#include "tinygnn/ops.hpp"
// Add identity self-loops (modifies CSR structure)
Tensor A_sl = add_self_loops(A);
// Symmetric GCN normalization: D̃^(-1/2) (A+I) D̃^(-1/2)
// (implicitly adds self-loops; pass raw adjacency, not add_self_loops output)
Tensor A_norm = gcn_norm(A);
// Sparse row-wise softmax over CSR non-zeros (attention weights)
Tensor alpha = edge_softmax(attention_csr);
// CSR neighborhood max-pooling: max_{j in N(i)} H[j] per row
Tensor agg = sage_max_aggregate(A, H);
Graph I/O
#include "tinygnn/graph_loader.hpp"
// Load Cora binary (produced by scripts/train_cora.py)
GraphData data = load_cora_binary("weights/cora_graph.bin");
// data.adjacency → SparseCSR Tensor (N × N)
// data.features → Dense Tensor (N × F)
// data.labels → std::vector<int>
// data.train_mask, val_mask, test_mask → std::vector<bool>
// Parse arbitrary CSV graphs
auto edges = GraphLoader::parse_edges("edges.csv");
Tensor feats = GraphLoader::parse_features("features.csv");
Tensor adj = GraphLoader::edge_list_to_csr(edges, num_nodes);
// Full pipeline shortcut
GraphData gd = GraphLoader::load("edges.csv", "features.csv");
Layers
All layers implement Tensor forward(const Tensor& A, const Tensor& H).
#include "tinygnn/layers.hpp"
// --- GCN (Kipf & Welling, ICLR 2017) ---
// A must be pre-normalized: A_norm = gcn_norm(raw_adj)
GCNLayer gcn(F_in, F_out, /*bias=*/true, Activation::ReLU);
gcn.set_weight(W); // Dense Tensor F_in × F_out
gcn.set_bias(b); // Dense Tensor 1 × F_out
Tensor H2 = gcn.forward(A_norm, H);
// --- GraphSAGE (Hamilton et al., NeurIPS 2017) ---
// A is raw (un-normalized) adjacency; module handles degree normalization
SAGELayer sage(F_in, F_out, SAGELayer::Aggregator::Mean, /*bias=*/true, Activation::ReLU);
// SAGELayer::Aggregator::Mean | Max
sage.set_weight_neigh(Wn); // F_in × F_out
sage.set_weight_self(Ws); // F_in × F_out
sage.set_bias(b);
Tensor H2 = sage.forward(A, H);
// --- GAT (Veličković et al., ICLR 2018) ---
// A must include self-loops: A_sl = add_self_loops(raw_adj)
GATLayer gat(F_in, F_out, /*negative_slope=*/0.2f, /*bias=*/true, Activation::None);
gat.set_weight(W); // F_in × F_out
gat.set_attn_left(al); // 1 × F_out (destination attention)
gat.set_attn_right(ar); // 1 × F_out (source attention)
gat.set_bias(b);
Tensor H2 = gat.forward(A_sl, H);
Model (multi-layer execution graph)
#include "tinygnn/model.hpp"
Model model;
// Add layers in forward order
model.add_gcn_layer(1433, 64, /*bias=*/true, Activation::ReLU);
model.add_gcn_layer(64, 7, /*bias=*/true, Activation::None);
// Load PyG-exported weights (TGNN binary format)
model.load_weights("weights/gcn_cora.bin");
// Run inference
GraphData cora = load_cora_binary("weights/cora_graph.bin");
Tensor logits = model.forward(cora.adjacency, cora.features);
// → Dense Tensor (N × num_classes); call log_softmax_inplace for log-probabilities
Compute Ops
#include "tinygnn/ops.hpp"
// Dense GEMM: C = A × B (A: M×K Dense, B: K×N Dense → C: M×N Dense)
Tensor C = matmul(A, B);
// Sparse-Dense SpMM: C = A × B (A: M×K SparseCSR, B: K×N Dense → C: M×N Dense)
Tensor C = spmm(A, B);
// In-place activations (Dense only)
relu_inplace(X);
leaky_relu_inplace(X, /*alpha=*/0.01f);
elu_inplace(X, /*alpha=*/1.0f);
sigmoid_inplace(X);
tanh_inplace(X);
gelu_inplace(X);
softmax_inplace(X); // row-wise; with max-subtraction stability
log_softmax_inplace(X); // row-wise; numerically stable
// Bias broadcast X[i][j] += bias[0][j] for all i
add_bias(X, bias); // bias: Dense 1 × F
CLI Reference (Python Scripts)
All scripts live under scripts/. Run with the active Python environment.
Fetch datasets
python scripts/fetch_datasets.py # Cora + Reddit
python scripts/fetch_datasets.py --cora-only # Cora only (~170 KB)
Writes to datasets/cora/ and datasets/reddit/ (gitignored).
Train & export Cora weights
# Prerequisite: pip install torch torch_geometric
python scripts/train_cora.py
Trains GCN, GraphSAGE (Mean), and GAT on Cora; exports weights/gcn_cora.bin, weights/sage_cora.bin, weights/gat_cora.bin, and weights/cora_graph.bin.
Validate logit-level agreement
python scripts/validate_cora.py # weight-file accuracy check
python scripts/validate_cora.py --logit-check # fresh PyG train → logit diff
Run GNN benchmarks
python scripts/bench_gnn.py # 8 reps (default)
python scripts/bench_gnn.py --reps 20 # smoother medians
python scripts/bench_gnn.py --no-bench # accuracy checks only
python scripts/bench_gnn.py --no-accuracy # timing only
python scripts/bench_gnn.py --csv path/out.csv # custom CSV prefix
Writes benchmarks/gnn_bench_results_timing.csv and benchmarks/gnn_bench_results_accuracy.csv.
Generate benchmark charts
python scripts/plot_gnn_bench.py # produces 5 PNGs in benchmarks/
| Output file | Contents |
|---|---|
| gnn_runtime_comparison.png | Wall-time per layer & graph size |
| gnn_memory_comparison.png | Working-set memory (analytical, log scale) |
| gnn_speedup.png | TinyGNN-NT vs TinyGNN-1T and vs PyG |
| gnn_scaling.png | Runtime scaling with graph size |
| gnn_accuracy.png | TinyGNN vs PyG max absolute logit diff |
Thread-scaling and operator-fusion benchmarks
cmake --build build --target bench_parallel
./build/bench_parallel # OpenMP thread-scaling report
cmake --build build --target bench_fusion
OMP_NUM_THREADS=8 ./build/bench_fusion # fused vs unfused GAT/SAGE
python scripts/plot_scaling.py # → benchmarks/thread_scaling.png
Memory safety (Linux / WSL)
bash scripts/sanitizers.sh # ASan + UBSan, 27 configs across all phases
bash scripts/valgrind_all.sh # Memcheck + Helgrind + Callgrind
Documentation
doxygen Doxyfile # C++ API → docs/doxygen/html/
pip install sphinx sphinx-rtd-theme breathe
cd docs && make html # Python API → docs/_build/html/
GNN Benchmark Results
Machine-dependent results — all numbers vary with CPU core count, NUMA topology, memory bandwidth, and OS scheduler behaviour. Results from two different hardware environments are shown below. On machines with fewer cores, TinyGNN-MT may be slower than TinyGNN-1T for very small graphs (thread-launch overhead dominates). To reproduce on your own hardware, run:
python scripts/bench_gnn.py --reps 8 && python scripts/plot_gnn_bench.py
Accuracy (TinyGNN vs PyTorch Geometric)
Synthetic graphs: N=30, F_in=8, F_out=4 — numerically identical weights fed to both engines.
| Layer | max abs diff | mean abs diff | Pass |
|---|---|---|---|
| GCNLayer | 2.38e-07 | 4.32e-08 | ✅ |
| SAGELayer-Mean | 4.77e-07 | 7.74e-08 | ✅ |
| SAGELayer-Max | 4.77e-07 | 1.25e-07 | ✅ |
| GATLayer | 1.79e-07 | 4.90e-08 | ✅ |
All differences are well within float32 rounding tolerance (< 5×10⁻⁷).
Wall-time — Jai Ansh's machine (Intel i7-12650H · 16T · Dell G15 5520)
Hardware: Intel Core i7-12650H @ 2.30 GHz (10 cores / 16 threads), 16 GB DDR5-4800, NVIDIA RTX 3060 6 GB, Windows 11, GCC 13.3 (MinGW-UCRT64), Python 3.12. 8 timed repetitions, median reported.
Graph configs: small (N=1K, deg=5, F_in=64→32), medium (N=10K, deg=10, F_in=128→64), large (N=100K, deg=5, F_in=256→128).
| Graph | Layer | TinyGNN-1T | TinyGNN-16T | PyG | 16T vs 1T | 16T vs PyG |
|---|---|---|---|---|---|---|
| small-1K | GCN | 0.29 ms | 0.58 ms | 2.05 ms | 0.49× ¹ | 3.51× |
| small-1K | SAGE-Mean | 0.38 ms | 0.60 ms | 1.65 ms | 0.64× ¹ | 2.77× |
| small-1K | SAGE-Max | 0.47 ms | 0.78 ms | 2.20 ms | 0.61× ¹ | 2.83× |
| small-1K | GAT | 0.46 ms | 0.85 ms | 2.17 ms | 0.54× ¹ | 2.55× |
| medium-10K | GCN | 9.74 ms | 3.67 ms | 33.10 ms | 2.65× | 9.02× |
| medium-10K | SAGE-Mean | 15.50 ms | 3.17 ms | 54.16 ms | 4.90× | 17.11× |
| medium-10K | SAGE-Max | 19.67 ms | 3.63 ms | 72.93 ms | 5.42× | 20.09× |
| medium-10K | GAT | 9.56 ms | 3.98 ms | 36.55 ms | 2.40× | 9.18× |
| large-100K | GCN | 262.59 ms | 74.15 ms | 597.18 ms | 3.54× | 8.05× |
| large-100K | SAGE-Mean | 490.37 ms | 120.10 ms | 594.53 ms | 4.08× | 4.95× |
| large-100K | SAGE-Max | 519.41 ms | 89.52 ms | 1010.78 ms | 5.80× | 11.29× |
| large-100K | GAT | 274.30 ms | 73.70 ms | 510.89 ms | 3.72× | 6.93× |
¹ Small-graph thread overhead: spawning 16 threads costs more than the actual computation for N=1K graphs. Single-threaded is faster at this scale — see the speedup chart.
The charts below were generated from this hardware run:
Wall-time — Anubhav's machine (Intel Core Ultra 9 270H · 20T)
Hardware: Intel Core Ultra 9 270H (20 logical threads), GCC 13.3, Python 3.11, Windows 11 + WSL2. 10 timed repetitions, median reported.
| Graph | Layer | TinyGNN-1T | TinyGNN-20T | PyG | 20T vs 1T | 20T vs PyG |
|---|---|---|---|---|---|---|
| small-1K | GCN | 0.42 ms | 0.84 ms | 1.29 ms | 0.49× ¹ | 1.54× |
| small-1K | SAGE-Mean | 0.50 ms | 0.73 ms | 1.14 ms | 0.68× ¹ | 1.55× |
| small-1K | SAGE-Max | 0.46 ms | 0.75 ms | 1.61 ms | 0.62× ¹ | 2.15× |
| small-1K | GAT | 0.56 ms | 1.20 ms | 1.67 ms | 0.47× ¹ | 1.39× |
| medium-10K | GCN | 8.07 ms | 3.67 ms | 14.93 ms | 2.20× | 4.07× |
| medium-10K | SAGE-Mean | 14.23 ms | 6.20 ms | 24.03 ms | 2.29× | 3.87× |
| medium-10K | SAGE-Max | 11.70 ms | 5.23 ms | 35.29 ms | 2.24× | 6.74× |
| medium-10K | GAT | 8.62 ms | 3.64 ms | 18.31 ms | 2.37× | 5.04× |
| large-100K | GCN | 211.51 ms | 55.99 ms | 208.67 ms | 3.78× | 3.73× |
| large-100K | SAGE-Mean | 390.21 ms | 82.47 ms | 258.09 ms | 4.73× | 3.13× |
| large-100K | SAGE-Max | 390.84 ms | 71.53 ms | 393.57 ms | 5.46× | 5.50× |
| large-100K | GAT | 281.61 ms | 79.85 ms | 253.06 ms | 3.53× | 3.17× |
¹ Small-graph thread overhead: on a 20-core CPU, spawning 20 threads costs more than the computation for N=1K. Single-threaded is faster at this scale.
(Note: the chart links and PNGs below were generated on Jai Ansh's system.)
Working-set memory (analytical estimate)
Memory is computed analytically from tensor dimensions rather than measured at runtime (C++ heap is invisible to Python memory profilers):
| Graph | Layer | TinyGNN | PyG estimate | PyG / TinyGNN |
|---|---|---|---|---|
| small-1K | GCN | 0.58 MB | 1.11 MB | 1.93× |
| medium-10K | GCN | 11.36 MB | 22.02 MB | 1.94× |
| large-100K | GCN | 203.45 MB | 403.03 MB | 1.98× |
TinyGNN uses roughly half the working-set memory of PyG across all tested scales, reflecting the absence of PyTorch tensor metadata, autograd buffers, and COO→CSR conversion overhead.
CI/CD
Every push to main and every pull request automatically triggers GitHub Actions:
- C++ build + test on Linux (GCC + Clang), macOS, and Windows (MSVC)
- Python build + test on Linux, macOS, Windows (Python 3.10 + 3.12)
- ASan + UBSan sanitizer checks on Linux
- Documentation build (Doxygen + Sphinx)
Valgrind Memcheck + Helgrind can be run locally via bash scripts/valgrind_all.sh (excluded from CI due to runtime overhead on large graph tests).
Datasets
TinyGNN tests against two real-world graph datasets. Download them with the included Python script:
# Cora only (no extra Python deps, ~170 KB download)
python3 scripts/fetch_datasets.py --cora-only
# Both Cora and Reddit (needs numpy + scipy, ~200 MB download)
pip install numpy scipy
python3 scripts/fetch_datasets.py
| Dataset | Source | Nodes | Edges | Features | CSV size |
|---|---|---|---|---|---|
| Cora | LINQS (McCallum et al. 2000) | 2,708 | 5,429 | 1,433 (binary bag-of-words) | 7.5 MB |
| Reddit | DGL / GraphSAGE (Hamilton et al. 2017) | 232,965 | 114,615,892 | 602 (GloVe embeddings) | 2.7 GB |
Files are placed in datasets/cora/ and datasets/reddit/ (gitignored).
Tests that require actual datasets skip gracefully when the files are not present.
Phase 1 -- Hybrid Tensor Core
Goal
Establish the fundamental data structure by expanding a contiguous memory design to support both dense and sparse graph layouts, without any third-party libraries.
Design
StorageFormat enum
enum class StorageFormat : uint8_t {
Dense = 0, // Row-major contiguous storage
SparseCSR = 1 // Compressed Sparse Row
};
Tensor struct layout
Dense:
data_ -> [v00, v01, ..., v(R-1)(C-1)] rows*cols floats, row-major
strides_ -> {cols, 1}
SparseCSR:
row_ptr_ -> size (rows+1) entries [row_ptr[i], row_ptr[i+1]) index the non-zeros of row i
col_ind_ -> size nnz column index of each non-zero
data_ -> size nnz value of each non-zero
Memory footprint
| Format | Formula | 1000x1000 example |
|---|---|---|
| Dense | rows * cols * 4 bytes | 4,000,000 bytes (3.8 MB) |
| SparseCSR | nnz*4 + nnz*4 + (rows+1)*4 bytes | 44,004 bytes with 5000 edges (43 KB) |
Sparse uses ~1.1% of the memory of the equivalent dense matrix in this benchmark.
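The footprint arithmetic in the table can be checked directly — a quick sketch of the formulas, not library code:

```python
def dense_bytes(rows: int, cols: int) -> int:
    # rows * cols float32 values, 4 bytes each
    return rows * cols * 4

def csr_bytes(rows: int, nnz: int) -> int:
    # values (nnz) + col_ind (nnz) + row_ptr (rows + 1), 4 bytes each
    return nnz * 4 + nnz * 4 + (rows + 1) * 4

dense = dense_bytes(1000, 1000)  # 4,000,000 bytes
sparse = csr_bytes(1000, 5000)   # 44,004 bytes
print(f"{sparse / dense:.2%}")   # -> 1.10%
```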
Results
Dense 1000x1000 memory = 4,000,000 bytes
Sparse 1000x1000 (5000 nnz) = 44,004 bytes
Memory ratio (sparse / dense) = 1.10%
Very sparse (10 nnz in 10k x 10k) = 0.01%
Test suite -- 104 assertions, 18 test functions, 0 failures
| Category | Functions |
|---|---|
| Dense construction and mutation | 3 |
| Memory footprint validation | 2 |
| Memory comparison / reduction ratios | 2 |
| Edge cases (empty, 1x1, 0-nnz, degenerate shapes, CSR data integrity) | 6 |
| Error handling (invalid args, out-of-bounds, wrong format access) | 4 |
| Utility (repr, enum values) | 1 |
Total : 104
Passed: 104
Failed: 0
Phase 2 -- Graph Data Loader
Goal
Load irregular graph structures from CSV files and convert them to hardware-friendly sorted CSR format, ready for GNN inference.
Design
GraphLoader class
class GraphLoader {
public:
// Edge-list CSV -> vector of (src, dst) pairs
static std::vector<std::pair<int32_t, int32_t>>
parse_edges(const std::string& path);
// Feature CSV -> Dense tensor (num_nodes x num_features)
static Tensor parse_features(const std::string& path);
// Raw edge list -> sorted SparseCSR adjacency tensor
static Tensor edge_list_to_csr(
const std::vector<std::pair<int32_t, int32_t>>& edges,
std::size_t num_nodes);
// Full pipeline: CSV files -> GraphData (adjacency + features)
static GraphData load(const std::string& edges_path,
const std::string& features_path);
};
CSR conversion algorithm -- O(E + V + E*log(E/V))
Step 1: Count out-degree per node O(E)
Step 2: Build row_ptr via prefix sum O(V)
Step 3: Fill col_ind via offset cursors O(E)
Step 4: Sort col_ind within each row O(E*log(E/V)) amortised
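The four steps above map directly onto a short Python sketch (illustrative only; the C++ implementation lives in src/graph_loader.cpp):

```python
def edge_list_to_csr(edges, num_nodes):
    """Convert (src, dst) pairs to CSR arrays: (row_ptr, col_ind)."""
    # Step 1: count out-degree per node -- O(E)
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1
    # Step 2: build row_ptr via prefix sum -- O(V)
    row_ptr = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        row_ptr[i + 1] = row_ptr[i] + counts[i]
    # Step 3: fill col_ind via per-row offset cursors -- O(E)
    col_ind = [0] * len(edges)
    cursor = row_ptr[:-1].copy()
    for src, dst in edges:
        col_ind[cursor[src]] = dst
        cursor[src] += 1
    # Step 4: sort column indices within each row -- O(E log(E/V)) amortised
    for i in range(num_nodes):
        col_ind[row_ptr[i]:row_ptr[i + 1]] = sorted(col_ind[row_ptr[i]:row_ptr[i + 1]])
    return row_ptr, col_ind

edges = [(0, 2), (0, 1), (2, 0), (1, 2)]
print(edge_list_to_csr(edges, 3))  # -> ([0, 2, 3, 4], [1, 2, 2, 0])
```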
CSV format support
- Automatic header row detection (skips non-numeric first lines)
- Both LF and CRLF line endings
- Whitespace-tolerant parsing
- Out-of-order node IDs with gap zero-filling in feature tensor
Results (Cora-scale synthetic benchmark)
Nodes: 2,708
Edges: 10,556
Features: 1,433 per node
Adjacency: Tensor(2708x2708, SparseCSR, 95,284 bytes)
Features: Tensor(2708x1433, Dense, 15,522,256 bytes)
Results (Reddit-scale synthetic benchmark)
Reddit (Hamilton et al. 2017, GraphSAGE paper): 232,965 nodes, 114,615,892 directed edges, 602 features. Generated synthetically with a fixed seed; full file pipeline writes ~1.4 GB edge CSV + ~282 MB feature CSV.
Nodes: 232,965
Edges: 114,615,892
Features: 602 per node
Adjacency: Tensor(232965x232965, SparseCSR, 917,859,000 bytes) (~875 MB)
Features: Tensor(232965x602, Dense, 560,979,720 bytes) (~535 MB)
In-memory CSR algorithm test:
Edge generation (114M pairs): 1,350 ms
CSR construction + sort: 5,591 ms
Total: 6,979 ms
Full file-pipeline test:
Edge file write (1.4 GB): 25,491 ms
Feature file write (282 MB): 3,748 ms
GraphLoader::load(): 119,047 ms
Total: 148,286 ms
Node 0 neighbors: 455 (CSR traversal matches raw CSV exactly)
Results (actual Cora dataset)
The real Cora citation network (McCallum et al. 2000): 2,708 papers linked by 5,429 directed citations, with 1,433 binary bag-of-words features per paper.
Nodes: 2,708
Edges: 5,429
Features: 1,433 per node (binary)
Adjacency: Tensor(2708x2708, SparseCSR, 54,268 bytes) (~53 KB)
Features: Tensor(2708x1433, Dense, 15,522,256 bytes) (~14.8 MB)
Feature density: 49,216 / 3,880,564 entries are 1 (1.27%)
Node 0 neighbors: 3 (CSR traversal matches raw CSV exactly)
Results (actual Reddit dataset)
The real Reddit post network (Hamilton et al. 2017, GraphSAGE): 232,965 posts linked by 114,615,892 directed edges, with 602-dimensional GloVe word embeddings per post.
Nodes: 232,965
Edges: 114,615,892
Features: 602 per node (GloVe float)
Adjacency: Tensor(232965x232965, SparseCSR, 917,859,000 bytes) (~875 MB)
Features: Tensor(232965x602, Dense, 560,979,720 bytes) (~535 MB)
GraphLoader::load() time: 157 s (WSL, Ubuntu 24.04, GCC 13.3)
Node 0 neighbors: 2,204
Test suite -- 269 assertions, 49 test functions, 0 failures
| Category | Functions | Covers |
|---|---|---|
| Edge CSV parsing | 6 | header, no-header, CRLF, self-loops, trailing blanks |
| Feature CSV parsing | 6 | ordered, unordered, sparse IDs, negative values |
| CSR conversion | 8 | exact arrays, sort invariant, empty rows, memory footprint |
| Full load pipeline | 4 | node-0 neighbors, all-nodes verify, feature zero-expansion |
| Cora-scale validation | 3 | 2708 nodes / 10556 edges / 1433 features / row_ptr invariants |
| Reddit-scale validation | 4 | 232,965 nodes / 114,615,892 edges / in-memory CSR + full pipeline + node-0 match |
| Actual Cora dataset | 4 | 2,708 nodes / 5,429 real citation edges / binary features / node-0 match |
| Actual Reddit dataset | 4 | 232,965 nodes / 114,615,892 real edges / GloVe features / CSR invariants |
| Error handling | 10 | file-not-found, malformed lines, negative IDs, out-of-range |
Total : 269
Passed: 269
Failed: 0
Phase 3 -- Dense Compute Kernels (GEMM)
Goal
Implement the baseline dense matrix multiplication needed for the GNN feature transformation H' = H x W, where H is the node feature matrix and W is the learned weight matrix.
Design
matmul function
// C = A x B
// A: Dense (M x K) B: Dense (K x N) C: Dense (M x N), newly allocated
Tensor matmul(const Tensor& A, const Tensor& B);
Loop order -- (i, k, j)
The implementation uses the (i, k, j) traversal order rather than the naive (i, j, k):
for i in [0, M): // row of A and C
for k in [0, K): // shared dimension
a_ik = A[i][k] // hoisted into scalar register
for j in [0, N): // row of B, column of C
C[i][j] += a_ik * B[k][j]
Why (i, k, j): A[i][k] is loop-invariant across the inner j-loop and lives in a register. B[k][j] is accessed row-sequentially (cache-friendly). The compiler can auto-vectorise the inner j-loop because the scalar a_ik eliminates a load dependency. __restrict__ pointers are used to communicate the no-aliasing guarantee to the compiler.
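A direct Python transcription of the (i, k, j) loop, checked against the 4x4 hardcoded result from the test suite (a sketch of the loop order only — the C++ kernel adds __restrict__ and vectorisation):

```python
def matmul_ikj(A, B):
    """C = A x B using the cache-friendly (i, k, j) loop order."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for k in range(K):
            a_ik = A[i][k]          # hoisted scalar, invariant over j
            for j in range(N):      # sequential access into B[k] and C[i]
                C[i][j] += a_ik * B[k][j]
    return C

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[17, 18, 19, 20], [21, 22, 23, 24], [25, 26, 27, 28], [29, 30, 31, 32]]
print(matmul_ikj(A, B)[0])  # -> [250.0, 260.0, 270.0, 280.0]
```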
Complexity
| Metric | Value |
|---|---|
| Time complexity | O(M * K * N) |
| FLOPs | 2 * M * K * N (1 multiply + 1 add per A element) |
| Output allocation | M * N * 4 bytes |
Preconditions enforced
- Both operands must be StorageFormat::Dense (sparse-dense SpMM is Phase 4)
- A.cols() must equal B.rows() -- throws std::invalid_argument with both dimensions in the message
- Zero-dimension tensors produce a valid zero-filled output
Results
matmul(128x128, 128x128) = 1 ms
matmul(1024x32, 32x256) = 4 ms (output = 1024 KB)
matmul(256x256, I_256) spot checks: 8/8 pass
matmul(5x8, 8x6) footprint = 120 bytes (5*6*4)
GNN feature transform (H x W, one-hot nodes selecting rows of W):
H (3x4) x W (4x2) -> H' (3x2)
Node 0: [1, 2] Node 1: [3, 4] Node 2: [5, 6]
4x4 hardcoded result (spec required):
A = reshape([1..16], 4, 4) B = reshape([17..32], 4, 4)
C = A x B:
[ 250, 260, 270, 280 ]
[ 618, 644, 670, 696 ]
[ 986, 1028, 1070, 1112 ]
[ 1354, 1412, 1470, 1528 ]
Test suite -- 268 assertions, 30 test functions, 0 failures
| Category | Functions | Covers |
|---|---|---|
| 4x4 hardcoded result (spec) | 3 | full matrix, element spot-checks, 2x2 sub-block cross-check |
| Non-square shapes | 3 | (2x3)x(3x4), matrix x column-vector, row-vector x matrix |
| Identity and zero properties | 4 | AI=A, IA=A, non-square identity, zero matrix output |
| GNN feature transform | 1 | H x W with one-hot nodes verifying row selection from W |
| Algebraic properties | 3 | non-commutativity, associativity, scalar distributivity |
| Output tensor properties | 3 | format=Dense, shape MxN, footprint = M*N*4 bytes |
| Stress / scale | 3 | 128x128 all-ones, 1024x32x256 rectangle, 256x256 identity |
| Dimension mismatch errors | 5 | basic throw, message content, 4 bad-shape combos, degenerate 0-dim |
| Sparse input errors | 2 | sparse A throws, sparse B throws (SparseCSR in message) |
| Edge / degenerate cases | 3 | 1x1 scalar, all-zeros output, GNN layer chain (AB)C = A(BC) |
Total : 268
Passed: 268
Failed: 0
Phase 4 -- Sparse Compute Kernels (SpMM)
Goal
Implement the sparse-dense matrix multiplication kernel that is the heart of GNN message-passing: H_agg = Adj × H, aggregating each node's neighbor features via the sparse adjacency matrix.
Design
spmm function
// C = A × B
// A: SparseCSR (M × K) B: Dense (K × N) C: Dense (M × N), newly allocated
Tensor spmm(const Tensor& A, const Tensor& B);
Algorithm -- CSR-SpMM
for i in [0, M): // row of A (node)
for nz in [row_ptr[i], row_ptr[i+1]): // non-zeros in row i (neighbors)
k = col_ind[nz] // column (neighbor node ID)
a_val = data[nz] // edge weight (1.0 for unweighted)
for j in [0, N): // feature dimension
C[i][j] += a_val * B[k][j] // accumulate neighbor's features
Why CSR-SpMM: The outer loop walks rows (nodes). The middle loop walks non-zero columns (graph neighbors) — this is the message-passing. The inner loop accumulates dense features — cache-friendly because B[k][j] is contiguous in memory and the compiler can auto-vectorise with the scalar a_val hoisted to a register.
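For reference, the CSR-SpMM pseudocode above translates line-for-line into Python — a scalar sketch, not the parallel C++ kernel:

```python
def spmm_csr(row_ptr, col_ind, vals, B, N):
    """C = A @ B where A is CSR (M x K) and B is dense (K x N)."""
    M = len(row_ptr) - 1
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):                            # row of A (node)
        for nz in range(row_ptr[i], row_ptr[i + 1]):
            k, a_val = col_ind[nz], vals[nz]      # neighbor id, edge weight
            for j in range(N):                    # feature dimension
                C[i][j] += a_val * B[k][j]        # accumulate neighbor features
    return C
```

On the 3x3 hand-calculated example under Results (row_ptr=[0,2,3,5], col_ind=[0,1,1,0,2], unit weights) it yields [[4,6],[3,4],[6,8]].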
Complexity
| Metric | Value |
|---|---|
| Time complexity | O(nnz × N) — proportional to edges × features |
| FLOPs | 2 × nnz × N (1 multiply + 1 add per edge per feature) |
| Output allocation | M × N × 4 bytes |
Preconditions enforced
- A must be StorageFormat::SparseCSR (Dense A → "use matmul() instead")
- B must be StorageFormat::Dense (Sparse B → "Sparse × Sparse not supported")
- A.cols() must equal B.rows() — throws std::invalid_argument with dimensions in message
- Early exit for degenerate cases (M=0, N=0, or nnz=0) producing valid zero-filled output
Results
3×3 hand-calculated SpMM (spec required):
A (SparseCSR): B (Dense):
[1 1 0] row_ptr=[0,2,3,5] [1 2]
[0 1 0] col_ind=[0,1,1,0,2] [3 4]
[1 0 1] values =[1,1,1,1,1] [5 6]
C = spmm(A, B):
[4, 6] ← 1·[1,2] + 1·[3,4]
[3, 4] ← 1·[3,4]
[6, 8] ← 1·[1,2] + 1·[5,6]
4×4 weighted hand-calculated:
A (SparseCSR): B (Dense):
[2 0 0 0] values=[2,3, [1 0 2]
[0 3 1 0] 1,1,2,1,1] [0 1 0]
[1 0 0 2] [3 0 1]
[0 1 1 0] [0 2 0]
C = spmm(A, B):
[2, 0, 4] [3, 3, 1] [1, 4, 2] [3, 1, 1]
GNN message-passing (star graph, 5 nodes):
H_agg = spmm(Adj_star, H)
Hub (node 0): sum of all features = [10, 40]
Leaf i: H[0] + H[i] (receives from hub + self)
Equivalence with dense matmul verified for:
- 32×32 sparse (~9% density) × 32×16 dense
- Cora-scale: 2,708×2,708 sparse (3 nnz/row) × 2,708×32 dense
Test suite -- 306 assertions, 35 test functions, 0 failures
| Category | Functions | Covers |
|---|---|---|
| 3×3 hand-calculated (spec) | 3 | full matrix, element checks, column vector |
| 4×4 weighted CSR | 2 | full result, element derivations |
| Non-square shapes | 3 | (5×3)×(3×4), (2×5)×(5×1), (1×4)×(4×3) |
| Identity & zero properties | 4 | sparse I × B = B, 64×64, zero-nnz, all-ones |
| GNN message-passing | 3 | triangle (fully connected), star graph, two-hop path chain |
| Dense matmul equivalence | 2 | small 3×3, medium 32×32 deterministic sparse |
| Output tensor properties | 3 | format=Dense, shape M×N, memory footprint |
| Stress / scale | 3 | 512×512 identity, 2708-node Cora-scale, 1024×128 |
| Wrong format errors | 4 | dense A throws, message check, sparse B throws, message check |
| Dimension mismatch | 3 | basic throw, message "dimension mismatch", multiple combos |
| Edge / degenerate cases | 5 | single nnz, empty rows, diagonal/self-loops, negative weights, 1×1 |
Total : 306
Passed: 306
Failed: 0
Phase 5 -- Activations & Utilities
Goal
Implement the element-wise activation functions and utilities needed between GNN layers: H' = Activation(Adj × H × W + b). These operate in-place on Dense tensors for zero-copy efficiency.
Design
Activation functions
| Function | Formula | GNN Use Case |
|---|---|---|
| relu_inplace(X) | max(0, x) | GCN, GraphSAGE, GIN — standard hidden layer activation |
| leaky_relu_inplace(X, α) | x if x≥0, else αx (default α=0.01) | GAT attention coefficients (α=0.2) |
| elu_inplace(X, α) | x if x≥0, else α(eˣ−1) (default α=1.0) | EGNN, PNA — smooth negative region |
| sigmoid_inplace(X) | 1/(1+e⁻ˣ) | GGNN gating, link prediction, binary classification |
| tanh_inplace(X) | tanh(x) | GRU/LSTM graph networks (GGNN, MPNN) |
| gelu_inplace(X) | x·Φ(x) ≈ 0.5x(1+tanh(√(2/π)(x+0.044715x³))) | GPS, Graphormer, TokenGT — transformer GNNs |
| softmax_inplace(X) | row-wise softmax with max-subtraction stability | GAT attention normalization, classification output |
| log_softmax_inplace(X) | row-wise log-softmax (numerically stable) | NLL loss for node classification |
Utility functions
| Function | Operation | GNN Use Case |
|---|---|---|
| add_bias(X, bias) | X[i][j] += bias[0][j] — broadcast (1×N) across M rows | Linear layer bias: H' = AHW + b |
Common properties
- All activations modify the tensor in-place (O(1) extra memory)
- All validate Dense format and throw std::invalid_argument on SparseCSR
- Softmax / log-softmax use the max-subtraction trick for numerical stability
- Sigmoid uses a two-branch implementation (x≥0 vs x<0) to prevent overflow
- GELU uses the tanh approximation (same as PyTorch F.gelu(approximate='tanh'))
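The max-subtraction trick can be demonstrated in a few lines of Python — a per-row sketch; the engine applies it in-place across all rows:

```python
import math

def softmax_row(row):
    """Row-wise softmax with max-subtraction: every exp() argument is <= 0,
    so large inputs like [1000, 1001] cannot overflow to inf/NaN."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]
```

Without the subtraction, `math.exp(1000)` overflows; with it, the result is exactly the softmax of the shifted (and therefore equivalent) logits.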
Complexity
| Function | Time | Extra Memory |
|---|---|---|
| relu, leaky_relu, sigmoid, tanh | O(M × N) | O(1) |
| elu, gelu | O(M × N) with exp/tanh on subset | O(1) |
| softmax, log_softmax | O(M × N) — 3 passes per row | O(1) |
| add_bias | O(M × N) | O(1) |
Results
GCN layer pipeline (identity Adj, 3 nodes, 2 features):
H' = ReLU(Adj × H × W + b)
Node 0: [1.5, 0.0] (relu clamps negative bias-shifted value)
Node 1: [2.5, 0.0]
Node 2: [3.5, 0.0]
GAT-style attention (LeakyReLU + Softmax):
Scores: [0.5, -0.3, 1.2]
LeakyReLU(α=0.2): [0.5, -0.06, 1.2]
Softmax: attention weights sum to 1.0, largest score gets most weight
Node classification (Log-Softmax for NLL loss):
3 nodes × 4 classes
All log-probabilities ≤ 0, exp(row) sums to 1.0
Argmax preserved through log-softmax
Numerical stability verified:
softmax([1000, 1001]) — no NaN/Inf, correct probabilities
log_softmax([1000, 1001, 999]) — no NaN/Inf, exp sums to 1.0
Test suite -- 266 assertions, 70 test functions, 0 failures
| Category | Functions | Covers |
|---|---|---|
| ReLU | 5 | basic, 2D matrix, all-positive, all-negative, zero preservation |
| Leaky ReLU | 5 | default α=0.01, GAT α=0.2, 2D, α=0→ReLU, α=1→identity |
| ELU | 4 | basic, custom α=2.0, saturation at −α, 2D mixed signs |
| Sigmoid | 5 | basic values, σ(x)+σ(−x)=1 symmetry, bounds (0,1), hand-calculated, 2D |
| Tanh | 5 | basic, odd symmetry, bounds (−1,1), hand-calculated, 2D |
| GELU | 4 | basic (0→0, large→x, neg→0), hand-calculated vs reference, zero symmetry, 2D |
| Softmax | 7 | row sums=1, bounds [0,1], uniform dist, hand-calculated, multi-row, numerical stability, argmax preserved |
| Log-Softmax | 6 | exp sums=1, all ≤ 0, consistent with softmax, uniform, numerical stability, multi-row |
| Add bias | 6 | basic broadcasting, zero bias, negative bias, single-row, wrong-rows throw, wrong-cols throw |
| GNN pipeline integration | 3 | full GCN layer, GAT attention, node classification |
| Error handling (SparseCSR) | 10 | all 9 activations + add_bias reject sparse input |
| Stress / scale | 5 | relu 1024×256, softmax 1024×128, sigmoid 512×512, bias 2708×32, gelu 256×256 |
| Edge / degenerate | 5 | 1×1 through all activations, zero-cols rejected, zero-rows no-op, double-apply idempotency |
Total : 266
Passed: 266
Failed: 0
Phase 6 -- Multi-Architecture Layer Assembly
Goal
Assemble complete GNN layer architectures from the tensor primitives built in Phases 1–5, implementing the three canonical propagation strategies: GCN (spectral normalization), GraphSAGE (inductive mean/max aggregation), and GAT (attention-based message passing).
Design
Helper functions
// Â = A + I — insert identity self-loops into CSR (two-pass: detect diag, rebuild)
Tensor add_self_loops(const Tensor& A);
// D̃^(-1/2) Â D̃^(-1/2) — symmetric GCN normalization
Tensor gcn_norm(const Tensor& A);
// Row-wise sparse softmax over CSR non-zeros only (max-subtract, two-pass)
Tensor edge_softmax(const Tensor& A);
// Element-wise max pooling over CSR neighborhoods
Tensor sage_max_aggregate(const Tensor& A, const Tensor& H);
GCN Layer (GCNLayer)
Kipf & Welling, ICLR 2017
H' = σ( Â_norm · (H · W) + b )
Â_norm = D̃^(-1/2) (A+I) D̃^(-1/2), precomputed by gcn_norm()
Forward:
Step 1: HW = matmul(H, W) [N × F_in] × [F_in × F_out] → [N × F_out]
Step 2: out = spmm(Â_norm, HW) sparse message-passing
Step 3: add_bias(out, b) optional
Step 4: σ(out) ReLU or None
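The four forward steps compose as follows — a dense Python sketch for illustration only (the engine performs Step 2 as a sparse SpMM; the inner `mm` helper is hypothetical, not engine API):

```python
def gcn_forward(A_norm, H, W, b):
    """H' = ReLU(A_norm @ (H @ W) + b), all matrices as lists of lists."""
    def mm(X, Y):  # plain dense matmul helper
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]
    HW = mm(H, W)                         # Step 1: feature transform
    out = mm(A_norm, HW)                  # Step 2: message passing (SpMM in C++)
    return [[max(0.0, out[i][j] + b[j])   # Steps 3-4: bias + ReLU
             for j in range(len(b))] for i in range(len(out))]
```

With an identity adjacency the layer reduces to a per-node linear map, which makes hand-verification easy.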
GraphSAGE Layer (SAGELayer)
Hamilton, Ying & Leskovec, NeurIPS 2017
h_v' = σ( W_neigh · AGG({h_u : u ∈ N(v)}) + W_self · h_v + b )
AGG variants:
Mean: spmm(A, H) / degree[v] (normalized neighbor sum)
Max: element-wise max over CSR row (sage_max_aggregate)
Forward:
Step 1: agg = Mean-spmm or sage_max_aggregate
Step 2: h_neigh = matmul(agg, W_neigh)
Step 3: h_self = matmul(H, W_self)
Step 4: out = h_neigh + h_self (element-wise add)
Step 5: add_bias(out, b) optional
Step 6: σ(out) ReLU or None
GAT Layer (GATLayer)
Veličković et al., ICLR 2018
Forward (single attention head):
Step 1: Wh = matmul(H, W) [N × F_out]
Step 2: src[i] = a_l · Wh[i], dst[j] = a_r · Wh[j] per node
Step 3: e_ij = LeakyReLU(src[i] + dst[j]) SpSDDMM — one value per CSR edge
Step 4: α = edge_softmax(e_csr) sparse softmax per row
Step 5: out = spmm(α, Wh) + b attention-weighted aggregation
Step 6: σ(out)
Weights: W (F_in×F_out), a_l (1×F_out), a_r (1×F_out), b (1×F_out)
A must include self-loops (use add_self_loops before forward).
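The edge_softmax step — softmax restricted to each CSR row's non-zeros — can be sketched as follows (illustrative Python; it touches only the CSR values array, leaving row_ptr/col_ind untouched):

```python
import math

def edge_softmax_vals(row_ptr, vals):
    """Row-wise softmax over CSR non-zeros only, with max-subtraction.
    Returns a new values array; the sparsity structure is unchanged."""
    out = list(vals)
    for i in range(len(row_ptr) - 1):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        if lo == hi:
            continue                          # empty row: nothing to normalize
        m = max(vals[lo:hi])
        exps = [math.exp(v - m) for v in vals[lo:hi]]
        s = sum(exps)
        out[lo:hi] = [e / s for e in exps]
    return out
```

Each row of attention weights sums to 1, and a single-entry row maps to exactly 1.0.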
Phase 6a: GCN Layer
Test suite -- 764 assertions, 68 test functions, 0 failures
| Category | Tests | Covers |
|---|---|---|
| add_self_loops | 7 | basic, already-has-diagonal, empty |
| gcn_norm | 7 | hand-verified D̃^(-1/2) Â D̃^(-1/2) values |
| GCNLayer construction + weight management | 8 | shape checks, wrong-format, wrong-shape throws |
| GCNLayer forward — 3-node hand-computed | 6 | row-by-row numerical verification |
| GCNLayer forward — no bias, no activation | 5 | identity propagation |
| Two-layer GCN stacking | 4 | output shape and feature composition |
| 10-node ring graph — full pipeline | 11 | symmetry, degree normalization, activation |
| Error handling | 12 | wrong formats, dimension mismatches, error messages |
| Edge / degenerate cases | 8 | single node, isolated nodes, identity adjacency |
Total : 764
Passed: 764
Failed: 0
Phase 6b: GraphSAGE Layer
Test suite -- 1,011 assertions, 60 test functions, 0 failures
| Category | Tests | Covers |
|---|---|---|
| sage_max_aggregate | 8 | basic, empty graph, single node |
| SAGELayer construction + weight management | 9 | shape checks, aggregator type, wrong-format throws |
| SAGELayer Mean forward — 3-node hand-computed | 7 | degree normalization, W_neigh/W_self combination |
| SAGELayer Max forward — 3-node hand-computed | 7 | max pooling, W_neigh/W_self combination |
| Multi-layer GraphSAGE | 4 | two-layer composition, feature dimension flow |
| 10-node graph tests | 7 | mean vs max comparison, scale, bias/activation |
| Error handling | 10 | dense A, sparse H, dimension mismatches, messages |
| Edge / degenerate cases | 8 | isolated nodes get zero aggregation |
Total : 1011
Passed: 1011
Failed: 0
Phase 6c: GAT Layer
Test suite -- 1,284 assertions, 60 test functions, 0 failures
| Category | Tests | Covers |
|---|---|---|
| edge_softmax | 10 | row sums = 1, uniform input, single-entry rows, numerical stability |
| GATLayer construction + weight management | 12 | shape checks, wrong-format, wrong-shape throws |
| GATLayer forward — 3-node hand-computed | 10 | attention score computation, softmax, weighted SpMM |
| GATLayer forward — bias and activation | 8 | bias add, ReLU gate |
| Multi-layer GAT + mixed layer stacking | 6 | GAT→GAT, GCN→GAT, SAGEMax→GAT compositions |
| Larger graph tests (10-node, 100-node) | 8 | scale, output shape, attention normalization |
| Error handling | 10 | wrong formats, dimension mismatches, negative slope validation |
| Edge / degenerate cases | 6 | self-loop-only, single node, all-zero attention |
Total : 1284
Passed: 1284
Failed: 0
Phase 7 -- Multi-Model End-to-End Pipeline & Reference Matching
Goal
Close the loop from training to inference: train three GNN architectures (GCN, GraphSAGE, GAT) on the standard Cora citation dataset using PyTorch Geometric, export their FP32 weights to a custom binary format, and validate that TinyGNN's C++ inference engine reproduces the same classification accuracy.
Architecture
Python Training Pipeline (scripts/train_cora.py)
- Loads Cora from PyG's Planetoid dataset (2,708 nodes, 10,556 directed edges, 1,433 features, 7 classes)
- Trains three models with early stopping on validation accuracy:
- GCN: 2-layer, 1433 → 64 → 7, ReLU, dropout 0.5
- GraphSAGE: 2-layer Mean aggregation, 1433 → 64 → 7, ReLU, dropout 0.5
- GAT: 2-layer, 8 heads × 8 features (concat) → 7 (single head), ELU, dropout 0.6
- Exports binary graph data and per-model weight files
Binary Format
Graph data (cora_graph.bin):
num_nodes, num_features, num_classes, num_edges (4 × uint32)
features (N × F × float32, row-major)
labels (N × int32)
train_mask, val_mask, test_mask (N × uint8 each)
edge_src, edge_dst (E × int32 each)
Weight files ({gcn,sage,gat}_cora.bin):
Magic "TGNN" (4 bytes)
Version (uint32 = 1)
TestAccuracy (float32)
NumTensors (uint32)
─ per tensor ─
NameLen (uint32) | Name (bytes) | Rows (uint32) | Cols (uint32) | Data (float32[])
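A minimal Python reader for this layout might look as follows — a sketch that assumes little-endian byte order, which the format description above does not state explicitly:

```python
import io
import struct

def read_tgnn(stream):
    """Parse the TGNN weight layout described above from a binary stream:
    magic, version, test accuracy, then (name, rows, cols, float32 data)
    per tensor. Little-endian is an assumption."""
    def u32():
        return struct.unpack("<I", stream.read(4))[0]
    assert stream.read(4) == b"TGNN", "bad magic"
    version = u32()
    test_acc = struct.unpack("<f", stream.read(4))[0]
    tensors = {}
    for _ in range(u32()):                      # NumTensors
        name = stream.read(u32()).decode()      # NameLen | Name
        rows, cols = u32(), u32()
        data = struct.unpack(f"<{rows * cols}f", stream.read(rows * cols * 4))
        tensors[name] = (rows, cols, data)
    return version, test_acc, tensors
```

The same field order is what Model::load_weights consumes on the C++ side.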
C++ Model Class (model.hpp / model.cpp)
The Model class implements a dynamic execution graph that chains heterogeneous GNN layers:
Model model;
model.add_gcn_layer(1433, 64, true, Activation::ReLU);
model.add_gcn_layer(64, 7, true, Activation::None);
model.load_weights("weights/gcn_cora.bin");
Tensor logits = model.forward(adjacency, features);
Key features:
- Heterogeneous layer support: GCN, SAGE, and GAT layers can be freely mixed in a single execution graph
- Multi-head GAT: multiple independent GATLayer instances per logical layer, with concat or average aggregation
- Automatic adjacency preprocessing: gcn_norm() for GCN layers, raw adjacency for SAGE, add_self_loops() for GAT — computed once and cached
- Inter-layer activations: ELU, ReLU, or None applied between layers (separate from each layer's internal activation)
Weight Mapping (PyG → TinyGNN)
| PyG Parameter | TinyGNN Parameter | Transform |
|---|---|---|
| GCNConv.lin.weight (out×in) | GCNLayer.weight (in×out) | Transpose |
| GCNConv.bias (out) | GCNLayer.bias (1×out) | Reshape |
| SAGEConv.lin_l.weight (out×in) | SAGELayer.weight_neigh (in×out) | Transpose |
| SAGEConv.lin_r.weight (out×in) | SAGELayer.weight_self (in×out) | Transpose |
| GATConv.lin.weight[h*F:(h+1)*F] | GATLayer.weight per head (in×F) | Slice + Transpose |
| GATConv.att_src[0,h,:] | GATLayer.attn_left (1×F) | Extract head |
| GATConv.att_dst[0,h,:] | GATLayer.attn_right (1×F) | Extract head |
Accuracy Results
| Model | PyG Accuracy | C++ Accuracy | Match |
|---|---|---|---|
| GCN (2-layer, 64 hidden) | 78.3% | 78.3% | Exact |
| GraphSAGE (2-layer, Mean, 64 hidden) | 80.6% | 80.6% | Exact |
| GAT (8-head + 1-head, ELU) | 80.3% | 79.3% | ±1% (FP rounding) |
Test Suite -- 13,862 assertions, 26 test functions, 0 failures
| Category | Tests | Covers |
|---|---|---|
| Data loading | 3 | Binary graph loader, weight file loader, data integrity |
| Weight shape validation | 3 | GCN/SAGE/GAT tensor shapes match architecture |
| Model API | 5 | Layer addition, empty model throws, missing weight throws |
| Small graph forward pass | 4 | GCN/SAGE/GAT on 3-node graph, multi-head concat |
| Logit sanity | 4 | All outputs finite, log-softmax sums to 1 |
| Prediction distribution | 3 | All 7 classes present in predictions |
| End-to-end accuracy | 4 | GCN/SAGE/GAT match PyG ±5%, train accuracy ≥ 85% |
Tests run: 26
Assertions: 13862
Passed: 13862
Failed: 0
Phase 8 -- Python Bridge (pybind11)
Goal
Expose the entire TinyGNN C++ inference engine to Python via pybind11, enabling researchers to import tinygnn and run GNN inference from Python with NumPy/SciPy/PyTorch interoperability.
Architecture
pybind11 Bindings (python/tinygnn_ext.cpp)
A single C++ source wrapping the full TinyGNN API:
- Tensor — zero-copy NumPy interop via from_numpy() / to_numpy(), SciPy CSR via from_scipy_csr(), PyG edge_index via from_edge_index()
- Ops — all 8 activations, matmul, spmm, add_bias (in-place semantics preserved)
- Graph utilities — add_self_loops, gcn_norm, edge_softmax, sage_max_aggregate
- Layers — GCNLayer, SAGELayer (Mean/Max), GATLayer with full weight management
- Model — dynamic execution graph with weight loading from TGNN binary files
- I/O — load_cora_binary, load_weight_file, GraphLoader
Build System (setup.py)
- Uses setuptools + pybind11 C++ extension
- MinGW cross-compilation with static GCC runtime linking (no DLL dependencies)
- Produces _tinygnn_core.cp312-win_amd64.pyd (~558 KB)
- Clean pip install -e . or python setup.py build_ext --inplace
Python Package (python/tinygnn/)
import tinygnn
import numpy as np
# Create tensor from NumPy
X = tinygnn.Tensor.from_numpy(np.random.randn(10, 16).astype(np.float32))
# Build and run GCN model
model = tinygnn.Model()
model.add_gcn_layer(1433, 64, activation=tinygnn.Activation.RELU)
model.add_gcn_layer(64, 7, activation=tinygnn.Activation.NONE)
model.load_weights("weights/gcn_cora.bin")
logits = model.forward(adjacency, features)
result = logits.to_numpy() # → NumPy array
Zero-Copy Interop
| Method | Direction | Data Type |
|---|---|---|
| Tensor.from_numpy(arr) | NumPy → TinyGNN | 2D float32/float64 array → Dense Tensor |
| Tensor.from_numpy_1d(arr) | NumPy → TinyGNN | 1D array → (1×N) Dense Tensor (bias) |
| Tensor.to_numpy() | TinyGNN → NumPy | Dense Tensor → 2D float32 array |
| Tensor.from_scipy_csr(mat) | SciPy → TinyGNN | csr_matrix → SparseCSR Tensor |
| Tensor.from_edge_index(ei, N) | PyG → TinyGNN | (2×E) int32 COO → SparseCSR Tensor |
| Tensor.row_ptr_numpy() | TinyGNN → NumPy | CSR row_ptr as int32 array |
| Tensor.col_ind_numpy() | TinyGNN → NumPy | CSR col_ind as int32 array |
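A COO-to-CSR conversion of the kind from_edge_index performs can be sketched via counting sort — illustrative Python; the actual binding does this in C++:

```python
def edge_index_to_csr(src, dst, num_nodes):
    """Convert a (2 x E) COO edge list into CSR (row_ptr, col_ind),
    with an implicit weight of 1.0 per edge."""
    row_ptr = [0] * (num_nodes + 1)
    for s in src:
        row_ptr[s + 1] += 1                 # count out-degree per row
    for i in range(num_nodes):
        row_ptr[i + 1] += row_ptr[i]        # prefix sum -> row offsets
    col_ind = [0] * len(src)
    cursor = list(row_ptr)                  # per-row write positions
    for s, d in zip(src, dst):
        col_ind[cursor[s]] = d
        cursor[s] += 1
    return row_ptr, col_ind
```

The counting-sort pass keeps the conversion O(E + N) with no intermediate adjacency matrix.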
Cora Validation (scripts/validate_cora.py)
Two validation modes:
Mode 1 — Weight File Validation: Loads pre-trained weights, runs TinyGNN inference, verifies test accuracy within 1% of expected.
Mode 2 — Logit-Level Comparison: Trains a fresh GCN in PyG, exports weights, runs both PyG and TinyGNN on identical inputs, asserts logits match:
| Metric | Result |
|---|---|
| Max absolute logit diff | 0.000002 |
| Mean absolute logit diff | 0.000000 |
| Mean relative diff | 0.000001 |
| Prediction agreement | 100.0% |
Test Suite -- 49 pytest tests, 0 failures
| Category | Tests | Covers |
|---|---|---|
| Dense Tensor | 9 | creation, from_numpy, to_numpy, roundtrip, repr, memory |
| Sparse CSR Tensor | 5 | creation, CSR accessors, from_scipy, from_edge_index, error |
| Ops (matmul/spmm) | 4 | identity, values, SpMM identity, SpMM aggregation |
| Activations | 9 | relu, leaky_relu, elu, sigmoid, tanh, gelu, softmax, log_softmax, add_bias |
| Graph utilities | 4 | add_self_loops, gcn_norm, edge_softmax, sage_max_aggregate |
| GCN Layer | 3 | construction, identity weight forward, ReLU forward |
| SAGE Layer | 4 | Mean/Max construction, Mean forward, Max forward |
| GAT Layer | 2 | construction, forward pass |
| Model | 3 | GCN model build+forward, SAGE model build+forward, weight file loading |
| Enums | 3 | StorageFormat, Activation, Aggregator |
| Error handling | 3 | wrong ndim, sparse-to-numpy, bad edge_index shape |
Tests run: 49
Passed: 49
Failed: 0
Phase 9 -- Software-Level Parallelism (OpenMP & AVX2)
Goal
Maximize CPU throughput by adding thread-level parallelism (OpenMP) and data-level parallelism (AVX2 SIMD intrinsics) to the compute kernels, achieving near-linear speedup on multi-core CPUs.
Design
OpenMP Parallelism
All outer (row/node) loops are parallelized with OpenMP pragmas:
| Kernel | Pragma | Schedule | Rationale |
|---|---|---|---|
| matmul | #pragma omp parallel for | dynamic | Rows are uniform, but dynamic handles cache-sharing effects |
| spmm | #pragma omp parallel for | dynamic | Power-law graphs have highly variable row lengths |
| softmax_inplace | #pragma omp parallel for | dynamic | Each row is independent (3 passes: max, exp+sum, normalize) |
| log_softmax_inplace | #pragma omp parallel for | dynamic | Same row-independent structure as softmax |
| relu/elu/sigmoid/tanh/gelu | #pragma omp parallel for | static | Element-wise ops with uniform work per iteration |
| add_bias | #pragma omp parallel for | static | Uniform row-based broadcasting |
AVX2 + FMA Intrinsics
The inner feature-dimension loops of matmul and spmm use 256-bit SIMD (8 floats per cycle):
// Inner loop of spmm (8-wide FMA):
const __m256 va = _mm256_set1_ps(a_val); // broadcast edge weight
__m256 vc = _mm256_loadu_ps(crow + j); // load 8 output features
__m256 vb = _mm256_loadu_ps(brow + j); // load 8 neighbor features
vc = _mm256_fmadd_ps(va, vb, vc); // fused multiply-add
_mm256_storeu_ps(crow + j, vc); // store 8 results
Scalar tail handles feature dimensions not divisible by 8. Code compiles cleanly without AVX2 via #ifdef __AVX2__ guards.
Build Flags
g++ -std=c++17 -O2 -fopenmp -mavx2 -mfma ...
CMake automatically detects OpenMP and AVX2/FMA support via find_package(OpenMP) and check_cxx_compiler_flag.
Benchmark Results
Machine-dependent results — numbers below measured on Intel Core 9 270H (20 logical threads), GCC 13.3, Ubuntu 24.04 (WSL2). Thread counts, memory bandwidth, and topology will produce different speedup figures on other hardware.
Cora-scale (2,708 nodes x 1,433 features, avg degree 3.9)
| Threads | SpMM | Speedup | GEMM | Speedup | Softmax | Speedup |
|---|---|---|---|---|---|---|
| 1 | 6.4 ms | 1.00x | 500 ms | 1.00x | 15.4 ms | 1.00x |
| 2 | 4.0 ms | 1.58x | 250 ms | 2.00x | 10.4 ms | 1.48x |
| 4 | 3.4 ms | 1.85x | 167 ms | 2.99x | 6.6 ms | 2.35x |
| 8 | 2.6 ms | 2.49x | 114 ms | 4.38x | 4.6 ms | 3.35x |
Large-scale (50,000 nodes x 256 features, avg degree 15)
| Threads | SpMM | Speedup | Softmax | Speedup |
|---|---|---|---|---|
| 1 | 152 ms | 1.00x | 69.0 ms | 1.00x |
| 2 | 81 ms | 1.87x | 65.0 ms | 1.06x |
| 4 | 57 ms | 2.68x | 56.7 ms | 1.22x |
| 8 | 68 ms | 2.24x | 48.0 ms | 1.44x |
Key observations:
- GEMM achieves super-linear scaling (4.38x on 8 threads) due to improved cache utilization when threads share L2/L3
- SpMM scales well (2.49x on 8 threads for Cora) despite irregular memory access patterns
- Softmax scales efficiently (3.35x on 8 threads) because rows are fully independent
- Large-scale SpMM saturates at 4 threads — memory bandwidth becomes the bottleneck
Thread-Scaling Chart
Phase 10 -- Operator Fusion (GAT & GraphSAGE)
Goal
Fuse multi-step GNN layer operations into single row-wise loops, eliminating intermediate tensor allocations and reducing peak memory consumption while improving runtime through better data locality.
Design
GAT Fusion: SpSDDMM + edge_softmax + SpMM
The unfused GAT forward pass allocates three nnz-sized intermediate tensors:
| Step | Operation | Intermediate | Size |
|---|---|---|---|
| 3 | SpSDDMM | edge_logits vector | nnz × 4 bytes |
| 3 | Build CSR | attn_csr (rp + ci + vals copies) | (N+1)×4 + 2×nnz×4 bytes |
| 4 | edge_softmax | alpha CSR (another full copy) | (N+1)×4 + 2×nnz×4 bytes |
The fused kernel replaces Steps 3–5 with a single row-wise loop:
for each node i: // parallelized with OpenMP
for j ∈ N(i):
e_ij = LeakyReLU(src_score[i] + dst_score[j]) // SpSDDMM
α = softmax(e_row) // edge_softmax
out[i] = Σ_j α_ij · Wh[j] // SpMM (AVX2 FMA)
This eliminates ~7 × nnz bytes of intermediate allocations, replacing them with a single per-row buffer of size max_degree × 4 bytes.
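The fused row loop can be sketched in Python for a single node — a scalar illustration of the SpSDDMM → softmax → SpMM pass (the C++ version parallelizes over rows with OpenMP and uses AVX2 FMA for the feature accumulation):

```python
import math

def gat_fused_row(i, row_ptr, col_ind, src_score, dst_score, Wh, slope=0.2):
    """One fused GAT row: edge logits (LeakyReLU), stable softmax, and the
    attention-weighted feature sum, with only a per-row buffer.
    Assumes row i is non-empty (self-loops inserted beforehand)."""
    lo, hi = row_ptr[i], row_ptr[i + 1]
    e = []
    for nz in range(lo, hi):                       # SpSDDMM per edge
        x = src_score[i] + dst_score[col_ind[nz]]
        e.append(x if x >= 0 else slope * x)       # LeakyReLU
    m = max(e)
    w = [math.exp(v - m) for v in e]               # edge_softmax (max-subtract)
    s = sum(w)
    F = len(Wh[0])
    out = [0.0] * F
    for wk, nz in zip(w, range(lo, hi)):           # SpMM accumulate
        alpha = wk / s
        row = Wh[col_ind[nz]]
        for j in range(F):
            out[j] += alpha * row[j]
    return out
```

The only scratch space is the per-row logit buffer `e`/`w` of length deg(i), matching the max_degree × 4-byte buffer described above.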
SAGE Fusion: Aggregation + Dual-Matmul
The unfused SAGE forward allocates:
| Step | Operation | Intermediate | Size |
|---|---|---|---|
| 1 | spmm(A, H) | agg tensor | N × F_in × 4 bytes |
| 2 | matmul(agg, W_neigh) | h_neigh tensor | N × F_out × 4 bytes |
| 3 | matmul(H, W_self) | h_self tensor | N × F_out × 4 bytes |
The fused kernel computes everything row-by-row:
for each node i: // parallelized with OpenMP
agg_row = mean({H[j] : j ∈ N(i)}) // AVX2 accumulate
out[i] = agg_row · W_neigh + H[i] · W_self // dual-matmul (AVX2 FMA)
This eliminates the N × F_in aggregation tensor and the N × F_out h_self tensor, needing only an F_in-sized per-row buffer.
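One row of the fused SAGE-Mean kernel, sketched in Python — a scalar illustration, with weight matrices stored (F_in × F_out) as in the Phase 7 weight-mapping table:

```python
def sage_fused_row(i, row_ptr, col_ind, H, W_neigh, W_self):
    """One fused GraphSAGE-Mean row: aggregate neighbors into a single
    F_in buffer, then apply both matmuls, with no N x F_in intermediate."""
    lo, hi = row_ptr[i], row_ptr[i + 1]
    F_in, F_out = len(H[0]), len(W_neigh[0])
    agg = [0.0] * F_in
    for nz in range(lo, hi):                      # sum over N(i)
        row = H[col_ind[nz]]
        for k in range(F_in):
            agg[k] += row[k]
    deg = max(hi - lo, 1)                         # isolated node -> zero agg
    agg = [a / deg for a in agg]                  # mean aggregation
    out = [0.0] * F_out
    for k in range(F_in):                         # dual-matmul on one row
        a_k, h_k = agg[k], H[i][k]
        for j in range(F_out):
            out[j] += a_k * W_neigh[k][j] + h_k * W_self[k][j]
    return out
```

The only scratch space is the F_in-sized `agg` buffer, reused across rows in the parallel C++ kernel.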
Benchmark Results
Machine-dependent results — fused vs. unfused timings below are representative for AVX2 CPUs with multi-level caches. The crossover point (where fusion starts winning) shifts with L2/L3 cache size and memory bandwidth.
Tested on Intel Core 9 270H (8 threads), GCC 13.3, Ubuntu 24.04 (WSL2):
GAT: Fused SpSDDMM + edge_softmax + SpMM
| Configuration | Unfused | Fused | Speedup | Memory Ratio |
|---|---|---|---|---|
| Cora-like (2708×1433→8, deg≈5) | 4,867 µs | 5,391 µs | 0.90× | 2.8× |
| Medium (5000×128→64, deg≈10) | 4,211 µs | 3,129 µs | 1.35× | 1.4× |
| Large (10000×64→32, deg≈20) | 5,717 µs | 3,534 µs | 1.62× | 2.6× |
| XL (20000×32→16, deg≈30) | 12,323 µs | 4,726 µs | 2.61× | 5.6× |
SAGE: Fused Aggregation + Dual-Matmul
| Configuration | Unfused | Fused | Speedup | Memory Ratio |
|---|---|---|---|---|
| Cora-like (2708×1433→128, deg≈5) | 60,259 µs | 20,209 µs | 2.98× | 13.1× |
| Medium (5000×128→64, deg≈10) | 6,107 µs | 1,858 µs | 3.29× | 4.0× |
| Large (10000×64→32, deg≈20) | 5,631 µs | 3,293 µs | 1.71× | 4.0× |
| XL (20000×128→64, deg≈30) | 30,015 µs | 18,246 µs | 1.64× | 4.0× |
Key observations:
- GAT fusion scales with graph density — at deg≈30, the nnz-sized intermediates dominate and fusion saves 5.6× memory with 2.61× speedup
- SAGE fusion excels when F_in is large — Cora layer 1 (F_in=1433) shows 13.1× memory reduction and 2.98× speedup by avoiding the N×F_in aggregate tensor
- Cora-like GAT is slightly slower (0.90×) because the per-row softmax buffer allocation overhead dominates for very small neighborhoods (deg≈5)
Valgrind Massif Profiling
Massif heap profiling on GAT (N=10000, F_in=64, F_out=32, deg≈20, nnz≈210K):
| Variant | Peak Heap | Savings |
|---|---|---|
| Unfused | 10,857,004 bytes (10.35 MB) | — |
| Fused | 7,420,824 bytes (7.08 MB) | 3.44 MB (31.6%) |
Run profiling: bash scripts/run_massif_phase9.sh
Phase 11 -- Packaging, CI/CD & Documentation
Goal
Make TinyGNN a "one-click install" package ready for JMLR MLOSS or JOSS submission, with professional CI/CD, comprehensive documentation, and PyPI packaging.
PyPI Packaging
TinyGNN ships as a standard Python package with a pybind11 C++ extension:
pip install tinygnn
Package structure:
- pyproject.toml — PEP 517/518 metadata (name, version, authors, classifiers, dependencies)
- setup.py — Extension build configuration (C++17 flags, OpenMP, AVX2, MinGW/MSVC/GCC support)
- MANIFEST.in — Source distribution manifest (ensures headers, sources, docs are included)
- LICENSE — MIT License
- CITATION.cff — Machine-readable citation metadata for JOSS/JMLR
Classifiers:
- Development Status :: 4 - Beta
- Topic :: Scientific/Engineering :: Artificial Intelligence
- License :: OSI Approved :: MIT License
- Python 3.8–3.13 support
CI/CD (GitHub Actions)
Every push to main and every PR triggers 5 parallel jobs:
| Job | Platform | What it does |
|---|---|---|
| C++ Build + Test | Ubuntu (GCC + Clang), macOS, Windows (MSVC) | CMake build → CTest (18,000+ assertions) |
| Python Build + Test | Ubuntu (3.10, 3.12), macOS (3.12), Windows (3.12) | pip install → pytest (49 tests) |
| Sanitizers | Ubuntu (GCC) | ASan + UBSan + leak detection (-DCMAKE_EXE_LINKER_FLAGS) |
| Valgrind | Ubuntu (GCC) | Memcheck (0 errors, 0 leaks) + Helgrind (0 data races) |
| Documentation | Ubuntu | Doxygen (C++) + Sphinx (Python) → artifact upload |
Documentation
C++ API (Doxygen):
- Configured via Doxyfile in project root
- Extracts from annotated headers in include/tinygnn/
- Generates HTML + XML (for Breathe cross-reference)
- Class diagrams, include graphs (via Graphviz)
- Build: doxygen Doxyfile → docs/doxygen/html/
Python API (Sphinx):
- Read the Docs theme (sphinx-rtd-theme)
- Cross-references C++ docs via Breathe bridge
- Pages: Installation, Quick Start, Python API, C++ API, Benchmarks, Contributing
- Build: cd docs && make html → docs/_build/html/
Authors
Jai Ansh Singh Bindra and Anubhav Choudhery (under JBAC EdTech)
Cumulative Test Statistics
| Phase | Test file | Functions | Assertions | Result |
|---|---|---|---|---|
| 1 | test_tensor.cpp | 18 | 104 | 104 / 104 |
| 2 | test_graph_loader.cpp | 49 | 269 | 269 / 269 |
| 3 | test_matmul.cpp | 30 | 268 | 268 / 268 |
| 4 | test_spmm.cpp | 35 | 306 | 306 / 306 |
| 5 | test_activations.cpp | 70 | 266 | 266 / 266 |
| 6a | test_gcn.cpp | 68 | 764 | 764 / 764 |
| 6b | test_graphsage.cpp | 60 | 1,011 | 1,011 / 1,011 |
| 6c | test_gat.cpp | 60 | 1,284 | 1,284 / 1,284 |
| 7 | test_e2e.cpp | 26 | 13,862 | 13,862 / 13,862 |
| 8 | test_python_bindings.py | 49 | 49 (pytest) | 49 / 49 |
| Total | — | 465 | 18,183 | 18,183 / 18,183 |
Memory Safety
All nine C++ test suites pass the full sanitizer matrix with zero errors:
| Config | Tensor | Graph Loader | Matmul | SpMM | Activations | GCN | GraphSAGE | GAT | E2E |
|---|---|---|---|---|---|---|---|---|---|
| ASan + LSan | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| UBSan | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| ASan + UBSan combined | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass | Pass |
| Valgrind Memcheck | 0 errors, 0 leaks | 0 errors, 0 leaks | 0 errors, 0 leaks | 0 errors, 0 leaks | 0 errors, 0 leaks | 0 errors, 0 leaks | 0 errors, 0 leaks | 0 errors, 0 leaks | 0 errors, 0 leaks |
| Valgrind Helgrind | 0 races | 0 races | 0 races | 0 races | 0 races | 0 races | 0 races | 0 races | 0 races |
Design Principles
- Zero dependencies -- no Eigen, no BLAS, no Boost, no external libraries of any kind
- Exact memory accounting -- memory_footprint_bytes() returns precise byte counts for both storage formats
- Fail-fast validation -- CSR construction validates row_ptr monotonicity, column index bounds, and size consistency; matmul validates format and inner dimensions before allocating output
- Unified interface -- Dense and SparseCSR tensors share the same Tensor type; format is a runtime tag, not a separate class hierarchy
- Cache-aware loop ordering -- GEMM uses (i,k,j) order with a register-hoisted scalar to enable compiler auto-vectorisation on the inner loop
- Explicit SIMD -- AVX2 FMA intrinsics (_mm256_fmadd_ps) in matmul and SpMM inner loops give 8-wide float throughput with scalar fallback for portability
- Thread-level parallelism -- OpenMP parallel for on all outer loops (rows/nodes) with schedule(dynamic) for load-balanced graph kernels
- Sparse-native message passing -- SpMM walks the CSR structure directly (row_ptr → col_ind → accumulate), avoiding any sparse-to-dense conversion
- In-place activations -- all 8 activation functions modify tensors in-place with O(1) extra memory, avoiding unnecessary allocations in GNN inference pipelines
- Numerical stability -- softmax/log-softmax use the max-subtraction trick; sigmoid uses a two-branch formula to prevent overflow for extreme inputs
- Operator fusion -- GAT fuses SpSDDMM + edge_softmax + SpMM into a single row-wise loop (eliminating nnz-sized intermediate CSR tensors); SAGE fuses aggregation + dual-matmul (eliminating N×F_in intermediate)
- One-click install -- pip install tinygnn builds the C++ extension, links statically, and ships a clean Python package; GitHub Actions CI/CD validates every commit across 4 platforms
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file tinygnn-0.1.5.tar.gz.
File metadata
- Download URL: tinygnn-0.1.5.tar.gz
- Upload date:
- Size: 251.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 00be0801b581db50989ff5241a938c745d6658de25a969c7d76f26214b842ef0 |
| MD5 | 5d9914e209bd8b83afa21b6353d943c8 |
| BLAKE2b-256 | b6c8a6a25be340d629adbbc282b5655dab49478818aba104f49b7db95c9e77bc |
File details
Details for the file tinygnn-0.1.5-cp312-cp312-win_amd64.whl.
File metadata
- Download URL: tinygnn-0.1.5-cp312-cp312-win_amd64.whl
- Upload date:
- Size: 795.1 kB
- Tags: CPython 3.12, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a062f6309d0d89613b12e610ff64676337b4ee20e6d73951119e6bcf06ed21b3 |
| MD5 | f591e9a8a932dd97c414bbd2d655814c |
| BLAKE2b-256 | 286033734bed1014734516794385153ffbaab7005d3c309c7ef73bb5c271a97a |