PyAGC: A PyTorch library for Attributed Graph Clustering.
Bridging Academia and Industry for Attributed Graph Clustering
PyAGC is a production-ready, modular library and comprehensive benchmark for Attributed Graph Clustering (AGC), built on PyTorch and PyTorch Geometric. It unifies 20+ state-of-the-art algorithms under a principled Encode-Cluster-Optimize (ECO) framework, provides mini-batch implementations that scale to 111 million nodes on a single 32GB GPU, and introduces a holistic evaluation protocol spanning supervised, unsupervised, and efficiency metrics across 12 diverse datasets.
Battle-tested in high-stakes industrial workflows at Ant Group (Fraud Detection, Anti-Money Laundering, User Profiling), PyAGC offers the community a robust, reproducible, and scalable platform to advance AGC research towards realistic deployment.
Table of Contents
- Why PyAGC?
- Key Features
- Project Structure
- Installation
- Quick Start
- The ECO Framework
- Benchmark
- Usage
- Extending PyAGC
- FAQ
- Citation
- Contributing
- License
- Acknowledgements
Why PyAGC?
Current AGC evaluation suffers from four critical limitations that PyAGC is designed to address:
| Problem | Status Quo | PyAGC Solution |
|---|---|---|
| The Cora-fication of Datasets | Over-reliance on small, homophilous citation networks | 12 datasets spanning 5 orders of magnitude, including industrial graphs with tabular features and low homophily |
| The Scalability Bottleneck | Full-batch training limits methods to ~10⁵ nodes | Mini-batch implementations enabling training on 111M+ nodes on a single 32GB GPU |
| The Supervised Metric Paradox | Unsupervised methods evaluated only with supervised metrics | Holistic evaluation with unsupervised structural metrics (Modularity, Conductance) + efficiency profiling |
| The Reproducibility Gap | Scattered codebases with hard-coded parameters | Unified, configuration-driven framework with strict YAML-based experiment management |
Key Features
- Diverse Dataset Collection: 12 graphs from 2.7K to 111M nodes across Citation, Social, E-commerce, and Web domains, featuring both textual and tabular attributes with varying homophily levels.
- Unified Algorithm Framework: 20+ SOTA methods organized under the Encode-Cluster-Optimize taxonomy with modular, interchangeable encoders, cluster heads, and optimization strategies.
- Holistic Evaluation Protocol: supervised metrics (ACC, NMI, ARI, F1), unsupervised structural metrics (Modularity, Conductance), and comprehensive efficiency profiling (time, memory).
- Production-Grade Scalability: GPU-accelerated KMeans (via PyTorch + Triton) and neighbor-sampling-based mini-batch training that scales deep clustering to 111M nodes on a single 32GB V100 GPU (a minimal sketch of the KMeans idea follows this list).
- Developer-Friendly Design: plug-and-play components, YAML-driven configuration, and clean abstractions that make prototyping new methods as easy as swapping a single config line.
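For intuition, here is a minimal sketch of the idea behind GPU-accelerated KMeans, written in plain PyTorch rather than the Triton-fused kernels the library actually ships; `KMeansClusterHead` is the real entry point, so treat this as illustration only:

import torch

def kmeans_gpu(z, n_clusters, n_iters=50):
    # Lloyd's algorithm with every distance computation on z's device
    perm = torch.randperm(z.size(0), device=z.device)[:n_clusters]
    centroids = z[perm].clone()                # initialize from random points
    for _ in range(n_iters):
        dist = torch.cdist(z, centroids)       # [N, K] pairwise distances
        assign = dist.argmin(dim=1)            # nearest-centroid assignment
        for k in range(n_clusters):
            mask = assign == k
            if mask.any():                     # keep old centroid if a cluster empties
                centroids[k] = z[mask].mean(dim=0)
    return assign

device = 'cuda' if torch.cuda.is_available() else 'cpu'
z = torch.randn(10_000, 64, device=device)     # stand-in for learned embeddings
clusters = kmeans_gpu(z, n_clusters=7)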
Project Structure
PyAGC/
├── pyagc/                    # Core library
│   ├── encoders/             # GNN backbones (GCN, GAT, SAGE, GIN, Transformers)
│   ├── clusters/             # Cluster heads (KMeans, DEC, DMoN, MinCut, Neuromap, ...)
│   ├── models/               # Full model implementations (20+ methods)
│   ├── data/                 # Unified dataset loaders
│   ├── metrics/              # Supervised + unsupervised metrics
│   ├── transforms/           # Graph augmentations (edge drop, feature mask)
│   └── utils/                # Checkpointing, logging, misc utilities
├── benchmark/                # Reproducible experiments
│   ├── <Method>/             # Per-method directory
│   │   ├── main.py           # Entry point
│   │   ├── train.conf.yaml   # Hyperparameter configuration
│   │   └── logs/             # Experiment logs per dataset
│   ├── data/                 # Cached datasets
│   └── results/              # Aggregated benchmark results
├── tests/                    # Unit tests
└── docs/                     # Documentation (Sphinx → ReadTheDocs)
Installation
From PyPI (Recommended)
pip install pyagc
From Source
git clone https://github.com/Cloudy1225/PyAGC.git
cd PyAGC
pip install -e .
Prerequisites
- Python >= 3.10
- PyTorch >= 2.6.0
- PyTorch Geometric >= 2.7.0
Quick Start
import torch
from torch_geometric.data import Data
from pyagc.data import get_dataset
from pyagc.encoders import GCN
from pyagc.models import DGI
from pyagc.clusters import KMeansClusterHead
from pyagc.metrics import label_metrics, structure_metrics
# 1. Load dataset
x, edge_index, y = get_dataset('Cora', root='data/')
data = Data(x=x, edge_index=edge_index, y=y)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 2. Build model (Encode + Optimize)
encoder = GCN(in_channels=data.num_features, hidden_channels=512, num_layers=1)
model = DGI(hidden_channels=512, encoder=encoder).to(device)
# 3. Train encoder
data = data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(200):
    loss = model.train_full(data, optimizer, epoch, verbose=(epoch % 50 == 0))
# 4. Cluster the learned embeddings (post-hoc KMeans)
model.eval()
with torch.no_grad():
    z = model.infer_full(data)
n_clusters = int(y.max().item()) + 1
kmeans = KMeansClusterHead(n_clusters=n_clusters)
clusters = kmeans.fit_predict(z)
# 5. Evaluate: supervised + unsupervised
sup = label_metrics(y, clusters, metrics=['ACC', 'NMI', 'ARI', 'F1'])
unsup = structure_metrics(edge_index, clusters, metrics=['Modularity', 'Conductance'])
print(f"ACC: {sup['ACC']:.4f} | NMI: {sup['NMI']:.4f} | ARI: {sup['ARI']:.4f}")
print(f"Modularity: {unsup['Modularity']:.4f} | Conductance: {unsup['Conductance']:.4f}")
The ECO Framework
PyAGC organizes the landscape of AGC algorithms under a unified Encode-Cluster-Optimize (ECO) framework:
               ┌──────────────────────────────────────────────┐
               │            Encode-Cluster-Optimize           │
               │                                              │
(A, X) ──────▶ │  ┌─────────┐   ┌──────────┐   ┌───────────┐  │ ──────▶ Clusters
               │  │ Encoder │──▶│ Cluster  │──▶│ Optimizer │  │
               │  │   (E)   │   │ Head (C) │   │    (O)    │  │
               │  └─────────┘   └──────────┘   └───────────┘  │
               └──────────────────────────────────────────────┘
| Module | Options | Examples |
|---|---|---|
| Encoder | Parametric | GCN, GAT, GraphSAGE, GIN, SGFormer, Polynormer |
| Encoder | Non-Parametric | Fixed graph filters, adaptive smoothing, Markov diffusion |
| Cluster | Differentiable | Softmax pooling (DMoN, MinCut, Neuromap), Prototype-based (DEC, DinkNet) |
| Cluster | Discrete (Post-hoc) | KMeans, Spectral Clustering, Subspace Clustering |
| Optimizer | Joint | End-to-end: self-supervised + clustering-specific loss |
| Optimizer | Decoupled | Pre-train encoder, then apply discrete clustering |
This decomposition enables plug-and-play experimentation: swap a GCN encoder for a GAT within DAEGC by changing one line in the config file.
Benchmark
Datasets
Our benchmark curates 12 datasets spanning 5 orders of magnitude in scale, diverse domains, feature modalities, and homophily levels:
| Scale | Dataset | Domain | #Nodes | #Edges | Avg. Deg. | #Feat. | Feat. Type | #Clusters | $\mathcal{H}_e$ | $\mathcal{H}_n$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | Cora | Citation | 2,708 | 10,556 | 3.9 | 1,433 | Textual | 7 | 0.81 | 0.83 |
| Tiny | Photo | Co-purchase | 7,650 | 238,162 | 31.1 | 745 | Textual | 8 | 0.83 | 0.84 |
| Small | Physics | Co-author | 34,493 | 495,924 | 14.4 | 8,415 | Textual | 5 | 0.93 | 0.92 |
| Small | HM | Co-purchase | 46,563 | 21,461,990 | 460.9 | 120 | Tabular | 21 | 0.16 | 0.35 |
| Small | Flickr | Social | 89,250 | 899,756 | 10.1 | 500 | Textual | 7 | 0.32 | 0.32 |
| Medium | ArXiv | Citation | 169,343 | 1,166,243 | 6.9 | 128 | Textual | 40 | 0.65 | 0.64 |
| Medium | Reddit | Social | 232,965 | 23,213,838 | 99.6 | 602 | Textual | 41 | 0.78 | 0.81 |
| Medium | MAG | Citation | 736,389 | 10,792,672 | 14.7 | 128 | Textual | 349 | 0.30 | 0.31 |
| Large | Pokec | Social | 1,632,803 | 44,603,928 | 27.3 | 56 | Tabular | 183 | 0.43 | 0.39 |
| Large | Products | Co-purchase | 2,449,029 | 61,859,140 | 25.4 | 100 | Textual | 47 | 0.81 | 0.82 |
| Large | WebTopic | Web | 2,890,331 | 24,754,822 | 8.6 | 528 | Tabular | 28 | 0.22 | 0.24 |
| Massive | Papers100M | Citation | 111,059,956 | 1,615,685,872 | 14.5 | 128 | Textual | 172 | 0.57 | 0.50 |
Key diversity dimensions:
- Scale: 5 orders of magnitude (2.7K → 111M nodes)
- Attributes: textual (bag-of-words, embeddings) and tabular (categorical + numerical)
- Structure: high-homophily (Physics, $\mathcal{H}_e$=0.93) to heterophilous (HM, $\mathcal{H}_e$=0.16)
- Domain: citation, co-purchase, co-author, social networks, web graphs
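The $\mathcal{H}_e$ and $\mathcal{H}_n$ columns above report edge and node homophily. Assuming the standard definitions (the fraction of edges joining same-label endpoints, and its per-node average), they can be reproduced with PyTorch Geometric's built-in utility; the toy graph below is purely illustrative:

import torch
from torch_geometric.utils import homophily

# Toy 3-node graph: nodes 0 and 1 share a label, node 2 differs
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
y = torch.tensor([0, 0, 1])

h_e = homophily(edge_index, y, method='edge')  # fraction of same-label edges
h_n = homophily(edge_index, y, method='node')  # per-node fraction, averaged
print(f"H_e={h_e:.2f}, H_n={h_n:.2f}")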
Algorithms
Traditional Methods
| Method | Venue | Encoder | Clusterer | Optimization |
|---|---|---|---|---|
| KMeans | – | None (raw features) | Discrete (KMeans) | Decoupled |
| Node2Vec | KDD'16 | Random Walk | Discrete (KMeans) | Decoupled |
Non-Parametric Methods
| Method | Venue | Encoder | Clusterer | Optimization |
|---|---|---|---|---|
| SSGC | ICLR'21 | Adaptive Filter | Discrete (KMeans) | Decoupled |
| SAGSC | AAAI'23 | Fixed Filter | Discrete (Subspace) | Decoupled |
| MS2CAG | KDD'25 | Fixed Filter | Discrete (SNEM) | Decoupled |
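Non-parametric methods replace a trained GNN with a fixed propagation rule over the graph. A minimal sketch of k-step feature smoothing with the symmetrically normalized adjacency, the building block behind fixed-filter encoders (not the library's exact implementation):

import torch
from torch_geometric.utils import add_self_loops, degree

def smooth_features(x, edge_index, k=2):
    # X' = (D^{-1/2} (A + I) D^{-1/2})^k X : k rounds of neighborhood averaging
    edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
    row, col = edge_index
    deg = degree(row, x.size(0), dtype=x.dtype)   # self-loops guarantee deg >= 1
    w = deg[row].rsqrt() * deg[col].rsqrt()       # symmetric normalization weights
    adj = torch.sparse_coo_tensor(edge_index, w, (x.size(0), x.size(0)))
    for _ in range(k):
        x = torch.sparse.mm(adj, x)               # one propagation step
    return x

The smoothed features are then handed directly to a discrete clusterer such as KMeans, which is why these methods are fast and memory-light.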
Deep Decoupled Methods
| Method | Venue | Encoder | Clusterer | Core Objective |
|---|---|---|---|---|
| GAE | NeurIPS-W'16 | GCN | KMeans | Graph Reconstruction |
| DGI | ICLR'19 | GCN | KMeans | Mutual Info Maximization |
| CCASSG | NeurIPS'21 | GCN | KMeans | Redundancy Reduction |
| S3GC | NeurIPS'22 | GCN | KMeans | Contrastive (Random Walk) |
| NS4GC | TKDE'24 | GCN | KMeans | Contrastive (Node Similarity) |
| MAGI | KDD'24 | GNN | KMeans | Contrastive (Modularity) |
Deep Joint Methods
| Method | Venue | Encoder | Clusterer | Core Objective |
|---|---|---|---|---|
| DAEGC | IJCAI'19 | GAT | Prototype (DEC) | Reconstruction + KL Div. |
| MinCut | ICML'20 | GCN | Softmax | Cut Minimization |
| DMoN | JMLR'23 | GCN | Softmax | Modularity Maximization |
| DinkNet | ICML'23 | GCN | Prototype | Dilation + Shrink Loss |
| Neuromap | NeurIPS'24 | GCN | Softmax | Map Equation |
Evaluation Protocol
We advocate a holistic evaluation that resolves the supervised metric paradox by pairing label-based metrics with label-free ones:
Supervised Alignment Metrics
Measure agreement with ground-truth labels (when available):
- ACC: Clustering Accuracy (with optimal Hungarian matching)
- NMI: Normalized Mutual Information
- ARI: Adjusted Rand Index
- Macro-F1: Macro-averaged F1 Score
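Computing ACC requires matching arbitrary cluster IDs to label IDs. A standard sketch using SciPy's Hungarian solver, for illustration only (`pyagc.metrics.label_metrics` handles this internally):

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    # Contingency matrix: rows are predicted clusters, columns are labels
    k = max(y_pred.max(), y_true.max()) + 1
    cont = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cont[p, t] += 1
    # Hungarian matching finds the cluster-to-label mapping maximizing agreement
    row, col = linear_sum_assignment(cont.max() - cont)
    return cont[row, col].sum() / y_pred.size

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])   # same partition, permuted IDs
print(clustering_accuracy(y_true, y_pred))  # 1.0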
Unsupervised Structural Metrics
Assess intrinsic cluster quality without labels โ critical for real-world deployment:
- Modularity: density of within-cluster edges vs. random expectation (higher is better)
- Conductance: fraction of edge volume pointing outside clusters (lower is better)
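For reference, assuming the standard definitions: modularity scores a partition against a degree-preserving random graph, $Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta(c_i, c_j)$, where $m$ is the edge count, $d_i$ the degree of node $i$, and $\delta(c_i, c_j) = 1$ iff $i$ and $j$ share a cluster; the conductance of a cluster $S$ is $\phi(S) = \mathrm{cut}(S, \bar{S}) / \min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))$, typically averaged over clusters.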
Efficiency Profiling
- Training time, inference latency, and peak GPU memory consumption
from pyagc.metrics import label_metrics, structure_metrics
# Supervised
sup = label_metrics(y_true, y_pred, metrics=['ACC', 'NMI', 'ARI', 'F1'])
# Unsupervised
unsup = structure_metrics(edge_index, y_pred, metrics=['Modularity', 'Conductance'])
Benchmark Results
Full results with all metrics are available in benchmark/results/ and in our paper. Complete benchmark tables covering ACC, ARI, F1, Modularity, Conductance, training time, and GPU memory are provided in the Structured Results and Unstructured Results.
Reproducibility
All experiments are fully reproducible via configuration files:
# Reproduce exact benchmark results
cd benchmark/DMoN
python main.py --config train.conf.yaml --dataset Cora --seed 0
python main.py --config train.conf.yaml --dataset Cora --seed 1
python main.py --config train.conf.yaml --dataset Cora --seed 2
python main.py --config train.conf.yaml --dataset Cora --seed 3
python main.py --config train.conf.yaml --dataset Cora --seed 4
Each run produces a timestamped log file in benchmark/<Method>/logs/<Dataset>/ containing:
- All hyperparameters
- Training loss curves
- Final metric values (supervised + unsupervised)
- Runtime and memory statistics
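A typical way to aggregate the five runs is mean ± standard deviation; a small sketch with hypothetical ACC values standing in for numbers parsed from the logs:

import statistics

accs = [0.742, 0.738, 0.751, 0.745, 0.739]  # hypothetical per-seed ACC values
print(f"ACC: {statistics.mean(accs):.4f} ± {statistics.stdev(accs):.4f}")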
Usage
Running Benchmarks
Each algorithm has a self-contained directory with main.py and a YAML configuration:
# Run DMoN on Cora
cd benchmark/DMoN
python main.py --config train.conf.yaml --dataset Cora
# Run DAEGC on Reddit (mini-batch)
cd benchmark/DAEGC
python main.py --config train.conf.yaml --dataset Reddit
Results are automatically logged to benchmark/<Method>/logs/<Dataset>/.
Custom Experiments
PyAGC's modular design makes it easy to compose new methods:
from pyagc.encoders import GCN, GAT
from pyagc.clusters import DMoNClusterHead, DECClusterHead
from pyagc.models import DMoN
# Swap GCN → GAT in DMoN by changing one line
encoder = GAT(in_channels=1433, hidden_channels=256, num_layers=2)
cluster_head = DMoNClusterHead(in_channels=256, n_clusters=7)
model = DMoN(encoder=encoder, cluster_head=cluster_head)
Or simply modify the YAML config:
encoder:
  type: GAT          # Changed from GCN
  hidden_channels: 256
  num_layers: 2
cluster:
  type: DMoN
  n_clusters: 7
Scaling to Large Graphs
PyAGC enables training on massive graphs via mini-batch neighbor sampling:
from torch_geometric.loader import NeighborLoader
# Create mini-batch loader
loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],
    batch_size=1024,
    shuffle=True,
)
# Mini-batch training loop
for batch in loader:
    batch = batch.to(device)
    loss = model.train_mini_batch(batch, optimizer)
Scalability highlight: Complex models (e.g., DAEGC) can be trained on Papers100M (111M nodes, 1.6B edges) on a single 32GB V100 GPU in under 2 hours.
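Inference on massive graphs is mini-batched the same way. A hedged sketch, assuming the trained encoder is exposed as `model.encoder` and that `data`, `model`, and `device` are set up as in the Quick Start:

import torch
from torch_geometric.loader import NeighborLoader

infer_loader = NeighborLoader(data, num_neighbors=[15, 10], batch_size=4096, shuffle=False)
model.eval()
embs = []
with torch.no_grad():
    for batch in infer_loader:
        batch = batch.to(device)
        z = model.encoder(batch.x, batch.edge_index)
        # Only the first batch_size nodes are seeds; the rest are sampled context
        embs.append(z[:batch.batch_size].cpu())
z = torch.cat(embs, dim=0)  # [num_nodes, hidden] embeddings for the full graph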
Extending PyAGC
Adding a New Encoder
from pyagc.encoders import GCN
from pyagc.models import DMoN

# Use any PyG-compatible encoder
encoder = GCN(
    in_channels=128,
    hidden_channels=256,
    num_layers=3,
    dropout=0.1
)
# Plug into any model
model = DMoN(encoder=encoder, n_clusters=7)
Adding a New Cluster Head
# pyagc/clusters/my_cluster_head.py
from pyagc.clusters import BaseClusterHead

class MyClusterHead(BaseClusterHead):
    def __init__(self, n_clusters, in_channels):
        super().__init__(n_clusters)
        # Define learnable parameters
        ...

    def forward(self, *args, **kwargs):
        # Return clustering loss
        ...
        return loss

    def cluster(self, z, soft=True):
        # Return soft assignment matrix P of shape [N, K]
        ...
        return p
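As a concrete, hedged illustration of this contract, a hypothetical `SoftmaxClusterHead` with a single linear projection and an entropy-minimizing clustering loss (illustrative only, not a head shipped with PyAGC):

import torch.nn as nn
import torch.nn.functional as F
from pyagc.clusters import BaseClusterHead

class SoftmaxClusterHead(BaseClusterHead):
    def __init__(self, n_clusters, in_channels):
        super().__init__(n_clusters)
        self.proj = nn.Linear(in_channels, n_clusters)

    def forward(self, z):
        p = F.softmax(self.proj(z), dim=1)     # soft assignments [N, K]
        # Clustering loss: per-node entropy, pushing assignments to be confident
        return -(p * (p + 1e-12).log()).sum(dim=1).mean()

    def cluster(self, z, soft=True):
        p = F.softmax(self.proj(z), dim=1)
        return p if soft else p.argmax(dim=1)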
Adding a New Model
# pyagc/models/my_model.py
from pyagc.models import BaseModel

class MyModel(BaseModel):
    def __init__(self, encoder, cluster_head, lambda_=1.0):
        super().__init__()
        self.encoder = encoder
        self.cluster_head = cluster_head
        self.lambda_ = lambda_  # weight on the clustering loss

    def forward(self, data):
        z = self.encoder(data.x, data.edge_index)
        return z

    def loss(self, data):
        z = self.forward(data)
        rep_loss = ...  # Representation learning loss
        clust_loss = self.cluster_head(z, data.edge_index)
        return rep_loss + self.lambda_ * clust_loss
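Putting the pieces together, a hedged composition of the hypothetical `MyModel` and `SoftmaxClusterHead` sketches above (the `rep_loss` placeholder must be filled in before this actually runs):

import torch
from pyagc.encoders import GCN

encoder = GCN(in_channels=128, hidden_channels=256, num_layers=2)
head = SoftmaxClusterHead(n_clusters=7, in_channels=256)
model = MyModel(encoder=encoder, cluster_head=head, lambda_=0.5)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    optimizer.zero_grad()
    loss = model.loss(data)   # representation loss + lambda_ * clustering loss
    loss.backward()
    optimizer.step()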
FAQ
Q: How do I run experiments on my own graph?
1. Format your graph as a PyTorch Geometric `Data` object with `x` (node features), `edge_index` (edge list), and optionally `y` (labels for evaluation).
2. Use any model from `pyagc.models` with your chosen encoder and cluster head.

Q: Can I use PyAGC without ground-truth labels?
Absolutely. This is the core use case PyAGC is designed for: use the unsupervised structural metrics (Modularity, Conductance) via `pyagc.metrics.structure_metrics` to evaluate cluster quality without any labels.

Q: How does mini-batch training work for graph clustering?
We use neighbor sampling (via PyTorch Geometric's `NeighborLoader`) to create computational subgraphs. The encoder processes these subgraphs, and losses are approximated over mini-batches. This decouples GPU memory from graph size, enabling training on graphs with 100M+ nodes on a single GPU.

Q: What GPU do I need?
All benchmark experiments were conducted on a single NVIDIA Tesla V100 (32GB). For small and medium datasets, a GPU with 8–16GB is sufficient. For Papers100M, we recommend at least 32GB of GPU memory.

Citation
If you find PyAGC useful in your research, please cite our paper:
@article{liu2026bridging,
  title         = {Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering},
  author        = {Yunhui Liu and Pengyu Qiu and Yu Xing and Yongchao Liu and Peng Du and Chuntao Hong and Jiajun Zheng and Tao Zheng and Tieke He},
  year          = {2026},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}
Contributing
We welcome contributions! Please see our contributing guidelines:
- Bug Reports: Open an issue with a minimal reproducible example.
- New Methods: Submit a PR adding your method under the ECO framework with a main.py, train.conf.yaml, and unit tests.
- New Datasets: Submit a PR with a data loader and dataset description.
- Documentation: Improvements to docs, tutorials, and examples are always appreciated.
License
PyAGC is released under the MIT License.
Acknowledgements
PyAGC is built upon the excellent open-source ecosystem around PyTorch and PyTorch Geometric.
We thank Ant Group for supporting the industrial validation of this benchmark.
GitHub · PyPI · Documentation · Paper
Made with ❤️ for the Graph ML Community