PyAGC: A PyTorch library for Attributed Graph Clustering.
Bridging Academia and Industry for Attributed Graph Clustering
Benchmark Paper | Survey Paper | Docs | PyPI | Benchmark Results | Awesome AGC Papers
PyAGC is a production-ready, modular library and comprehensive benchmark for Attributed Graph Clustering (AGC), built on PyTorch and PyTorch Geometric. It unifies 20+ state-of-the-art algorithms under a principled Encode-Cluster-Optimize (ECO) framework, provides mini-batch implementations that scale to 111 million nodes on a single 32GB GPU, and introduces a holistic evaluation protocol spanning supervised, unsupervised, and efficiency metrics across 12 diverse datasets.
Battle-tested in high-stakes industrial workflows at Ant Group (Fraud Detection, Anti-Money Laundering, User Profiling), PyAGC offers the community a robust, reproducible, and scalable platform to advance AGC research towards realistic deployment.
News
- [2026-03-24] Survey paper is now available on arXiv!
- [2026-03-02] Added Awesome AGC Papers, a curated list of attributed graph clustering papers!
- [2026-02-10] Benchmark paper is now available on arXiv!
- [2026-02-09] Initial release of PyAGC!
Table of Contents
- Why PyAGC?
- Key Features
- Project Structure
- Installation
- Quick Start
- The ECO Framework
- Benchmark
- Usage
- Extending PyAGC
- Awesome AGC Papers
- FAQ
- Citation
- Contributing
- License
- Acknowledgements
Why PyAGC?
Current AGC evaluation suffers from four critical limitations that PyAGC is designed to address:
| Problem | Status Quo | PyAGC Solution |
|---|---|---|
| The Cora-fication of Datasets | Over-reliance on small, homophilous citation networks | 12 datasets spanning 5 orders of magnitude, including industrial graphs with tabular features and low homophily |
| The Scalability Bottleneck | Full-batch training limits methods to ~10⁵ nodes | Mini-batch implementations enabling training on 111M+ nodes with a single 32GB GPU |
| The Supervised Metric Paradox | Unsupervised methods evaluated only with supervised metrics | Holistic evaluation with unsupervised structural metrics (Modularity, Conductance) + efficiency profiling |
| The Reproducibility Gap | Scattered codebases with hard-coded parameters | Unified, configuration-driven framework with strict YAML-based experiment management |
Key Features
- **Diverse Dataset Collection**: 12 graphs from 2.7K to 111M nodes across Citation, Social, E-commerce, and Web domains, featuring both textual and tabular attributes with varying homophily levels.
- **Unified Algorithm Framework**: 20+ SOTA methods organized under the Encode-Cluster-Optimize taxonomy with modular, interchangeable encoders, cluster heads, and optimization strategies.
- **Holistic Evaluation Protocol**: Supervised metrics (ACC, NMI, ARI, F1), unsupervised structural metrics (Modularity, Conductance), and comprehensive efficiency profiling (time, memory).
- **Production-Grade Scalability**: GPU-accelerated KMeans (via PyTorch + Triton) and neighbor-sampling-based mini-batch training that scales deep clustering to 111M nodes on a single 32GB V100 GPU.
- **Developer-Friendly Design**: Plug-and-play components, YAML-driven configuration, and clean abstractions that make prototyping new methods as easy as swapping a single config line.
- **Curated Paper Collection**: A companion Awesome AGC Papers list covering surveys, benchmarks, and the latest research in attributed graph clustering.
Project Structure
PyAGC/
├── pyagc/                    # Core library
│   ├── encoders/             # GNN backbones (GCN, GAT, SAGE, GIN, Transformers)
│   ├── clusters/             # Cluster heads (KMeans, DEC, DMoN, MinCut, Neuromap, ...)
│   ├── models/               # Full model implementations (20+ methods)
│   ├── data/                 # Unified dataset loaders
│   ├── metrics/              # Supervised + unsupervised metrics
│   ├── transforms/           # Graph augmentations (edge drop, feature mask)
│   └── utils/                # Checkpointing, logging, misc utilities
├── benchmark/                # Reproducible experiments
│   ├── <Method>/             # Per-method directory
│   │   ├── main.py           # Entry point
│   │   ├── train.conf.yaml   # Hyperparameter configuration
│   │   └── logs/             # Experiment logs per dataset
│   ├── data/                 # Cached datasets
│   └── results/              # Aggregated benchmark results
├── AWESOME_AGC.md            # Curated list of AGC papers
├── tests/                    # Unit tests
└── docs/                     # Documentation (Sphinx → ReadTheDocs)
Installation
From PyPI (Recommended)
pip install pyagc
From Source
git clone https://github.com/Cloudy1225/PyAGC.git
cd PyAGC
pip install -e .
Prerequisites
- Python >= 3.10
- PyTorch >= 2.6.0
- PyTorch Geometric >= 2.7.0
Quick Start
import torch
from torch_geometric.data import Data
from pyagc.data import get_dataset
from pyagc.encoders import GCN
from pyagc.models import DGI
from pyagc.clusters import KMeansClusterHead
from pyagc.metrics import label_metrics, structure_metrics
# 1. Load dataset
x, edge_index, y = get_dataset('Cora', root='data/')
data = Data(x=x, edge_index=edge_index, y=y)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 2. Build model (Encode + Optimize)
encoder = GCN(in_channels=data.num_features, hidden_channels=512, num_layers=1)
model = DGI(hidden_channels=512, encoder=encoder).to(device)
# 3. Train encoder
data = data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(200):
    loss = model.train_full(data, optimizer, epoch, verbose=(epoch % 50 == 0))
# 4. Cluster (Cluster projection)
model.eval()
with torch.no_grad():
    z = model.infer_full(data)
n_clusters = int(y.max().item()) + 1
kmeans = KMeansClusterHead(n_clusters=n_clusters)
clusters = kmeans.fit_predict(z)
# 5. Evaluate โ supervised + unsupervised
sup = label_metrics(y, clusters, metrics=['ACC', 'NMI', 'ARI', 'F1'])
unsup = structure_metrics(edge_index, clusters, metrics=['Modularity', 'Conductance'])
print(f"ACC: {sup['ACC']:.4f} | NMI: {sup['NMI']:.4f} | ARI: {sup['ARI']:.4f}")
print(f"Modularity: {unsup['Modularity']:.4f} | Conductance: {unsup['Conductance']:.4f}")
The ECO Framework
PyAGC organizes the landscape of AGC algorithms under a unified Encode-Cluster-Optimize (ECO) framework, formally introduced in our survey paper:
            ┌───────────────────────────────────────────────────────┐
            │                Encode-Cluster-Optimize                │
            │                                                       │
(A, X) ──▶  │   ┌───────────┐    ┌────────────┐    ┌─────────────┐  │  ──▶  Clusters
            │   │  Encoder  │───▶│  Cluster   │───▶│  Optimizer  │  │
            │   │    (E)    │    │  Head (C)  │    │     (O)     │  │
            │   └───────────┘    └────────────┘    └─────────────┘  │
            └───────────────────────────────────────────────────────┘
| Module | Options | Examples |
|---|---|---|
| Encoder | Parametric | GCN, GAT, GraphSAGE, GIN, SGFormer, Polynormer |
| Encoder | Non-Parametric | Fixed graph filters, adaptive smoothing, Markov diffusion |
| Cluster | Differentiable | Softmax pooling (DMoN, MinCut, Neuromap); prototype-based (DEC, DinkNet) |
| Cluster | Discrete (Post-hoc) | KMeans, Spectral Clustering, Subspace Clustering |
| Optimizer | Joint | End-to-end: self-supervised + clustering-specific loss |
| Optimizer | Decoupled | Pre-train encoder → apply discrete clustering |
This decomposition enables plug-and-play experimentation: swap the GCN encoder for a GAT within DAEGC by changing one line in the config file. For a comprehensive theoretical analysis of the ECO framework and an industrial perspective on AGC, see our survey paper.
Benchmark
Datasets
Our benchmark curates 12 datasets spanning 5 orders of magnitude in scale, diverse domains, feature modalities, and homophily levels:
| Scale | Dataset | Domain | #Nodes | #Edges | Avg. Deg. | #Feat. | Feat. Type | #Clusters | $\mathcal{H}_e$ | $\mathcal{H}_n$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | Cora | Citation | 2,708 | 10,556 | 3.9 | 1,433 | Textual | 7 | 0.81 | 0.83 |
| Tiny | Photo | Co-purchase | 7,650 | 238,162 | 31.1 | 745 | Textual | 8 | 0.83 | 0.84 |
| Small | Physics | Co-author | 34,493 | 495,924 | 14.4 | 8,415 | Textual | 5 | 0.93 | 0.92 |
| Small | HM | Co-purchase | 46,563 | 21,461,990 | 460.9 | 120 | Tabular | 21 | 0.16 | 0.35 |
| Small | Flickr | Social | 89,250 | 899,756 | 10.1 | 500 | Textual | 7 | 0.32 | 0.32 |
| Medium | ArXiv | Citation | 169,343 | 1,166,243 | 6.9 | 128 | Textual | 40 | 0.65 | 0.64 |
| Medium | Reddit | Social | 232,965 | 23,213,838 | 99.6 | 602 | Textual | 41 | 0.78 | 0.81 |
| Medium | MAG | Citation | 736,389 | 10,792,672 | 14.7 | 128 | Textual | 349 | 0.30 | 0.31 |
| Large | Pokec | Social | 1,632,803 | 44,603,928 | 27.3 | 56 | Tabular | 183 | 0.43 | 0.39 |
| Large | Products | Co-purchase | 2,449,029 | 61,859,140 | 25.4 | 100 | Textual | 47 | 0.81 | 0.82 |
| Large | WebTopic | Web | 2,890,331 | 24,754,822 | 8.6 | 528 | Tabular | 28 | 0.22 | 0.24 |
| Massive | Papers100M | Citation | 111,059,956 | 1,615,685,872 | 14.5 | 128 | Textual | 172 | 0.57 | 0.50 |
Key diversity dimensions:
- Scale: 5 orders of magnitude (2.7K to 111M nodes)
- Attributes: textual (bag-of-words, embeddings) and tabular (categorical + numerical)
- Structure: high-homophily (Physics, $\mathcal{H}_e$=0.93) to heterophilous (HM, $\mathcal{H}_e$=0.16)
- Domain: citation, co-purchase, co-author, social networks, web graphs
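The homophily columns in the table above can be read with the standard formulations (edge homophily as the fraction of edges joining same-label endpoints, node homophily as the average same-label fraction of each node's neighborhood). A sketch of the usual definitions, which the benchmark is assumed to follow:

```latex
% Edge homophily: fraction of edges whose endpoints share a label
\mathcal{H}_e = \frac{\left|\{(u, v) \in E : y_u = y_v\}\right|}{|E|}

% Node homophily: same-label neighbor fraction per node, averaged over all nodes
\mathcal{H}_n = \frac{1}{|V|} \sum_{v \in V}
    \frac{\left|\{u \in \mathcal{N}(v) : y_u = y_v\}\right|}{|\mathcal{N}(v)|}
```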
Algorithms
Traditional Methods
| Method | Venue | Encoder | Clusterer | Optimization |
|---|---|---|---|---|
| KMeans | – | None (raw features) | Discrete (KMeans) | Decoupled |
| Node2Vec | KDD'16 | Random Walk | Discrete (KMeans) | Decoupled |
Non-Parametric Methods
| Method | Venue | Encoder | Clusterer | Optimization |
|---|---|---|---|---|
| SSGC | ICLR'21 | Adaptive Filter | Discrete (KMeans) | Decoupled |
| SAGSC | AAAI'23 | Fixed Filter | Discrete (Subspace) | Decoupled |
| MS2CAG | KDD'25 | Fixed Filter | Discrete (SNEM) | Decoupled |
Deep Decoupled Methods
| Method | Venue | Encoder | Clusterer | Core Objective |
|---|---|---|---|---|
| GAE | NeurIPS-W'16 | GCN | KMeans | Graph Reconstruction |
| DGI | ICLR'19 | GCN | KMeans | Mutual Info Maximization |
| CCASSG | NeurIPS'21 | GCN | KMeans | Redundancy Reduction |
| S3GC | NeurIPS'22 | GCN | KMeans | Contrastive (Random Walk) |
| NS4GC | TKDE'24 | GCN | KMeans | Contrastive (Node Similarity) |
| MAGI | KDD'24 | GNN | KMeans | Contrastive (Modularity) |
Deep Joint Methods
| Method | Venue | Encoder | Clusterer | Core Objective |
|---|---|---|---|---|
| DAEGC | IJCAI'19 | GAT | Prototype (DEC) | Reconstruction + KL Div. |
| MinCut | ICML'20 | GCN | Softmax | Cut Minimization |
| DMoN | JMLR'23 | GCN | Softmax | Modularity Maximization |
| DinkNet | ICML'23 | GCN | Prototype | Dilation + Shrink Loss |
| Neuromap | NeurIPS'24 | GCN | Softmax | Map Equation |
Evaluation Protocol
We advocate a holistic evaluation that goes beyond supervised metrics alone, addressing the supervised metric paradox:
Supervised Alignment Metrics
Measure agreement with ground-truth labels (when available):
- ACC: Clustering Accuracy (with optimal Hungarian matching)
- NMI: Normalized Mutual Information
- ARI: Adjusted Rand Index
- Macro-F1: Macro-averaged F1 Score
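Since cluster ids are arbitrary, ACC first matches clusters to labels via the Hungarian algorithm. A minimal, self-contained sketch using SciPy (not PyAGC's own implementation, which may differ in details):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC with optimal cluster-to-label matching (Hungarian algorithm)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    # Contingency matrix: cost[i, j] counts co-occurrences of cluster i and label j
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    # Maximizing total matches = minimizing the negated contingency counts
    row, col = linear_sum_assignment(-cost)
    return cost[row, col].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # Same partition, permuted cluster ids
print(clustering_accuracy(y_true, y_pred))  # 1.0
```

Because the matching is optimal, any relabeling of a perfect partition still scores 1.0.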
Unsupervised Structural Metrics
Assess intrinsic cluster quality without labels โ critical for real-world deployment:
- Modularity: density of within-cluster edges vs. random expectation (higher is better)
- Conductance: fraction of edge volume pointing outside clusters (lower is better)
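For reference, the standard formulations these metrics follow (a sketch; PyAGC's exact implementation may differ in normalization or aggregation across clusters):

```latex
% Modularity: within-cluster edge density vs. the configuration-model expectation
Q = \frac{1}{2m} \sum_{i, j} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta(c_i, c_j)

% Conductance of a cluster S: boundary edges over the smaller side's edge volume
\phi(S) = \frac{\mathrm{cut}(S, \bar{S})}{\min\left(\mathrm{vol}(S), \mathrm{vol}(\bar{S})\right)}
```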
Efficiency Profiling
- Training time, inference latency, and peak GPU memory consumption
from pyagc.metrics import label_metrics, structure_metrics
# Supervised
sup = label_metrics(y_true, y_pred, metrics=['ACC', 'NMI', 'ARI', 'F1'])
# Unsupervised
unsup = structure_metrics(edge_index, y_pred, metrics=['Modularity', 'Conductance'])
Benchmark Results
Full results with all metrics are available in benchmark/results/ and our paper.
Complete benchmark results including ACC, ARI, F1, Modularity, Conductance, training time, and GPU memory are available in the Structured Results and Unstructured Results.
Reproducibility
All experiments are fully reproducible via configuration files:
# Reproduce exact benchmark results
cd benchmark/DMoN
python main.py --config train.conf.yaml --dataset Cora --seed 0
python main.py --config train.conf.yaml --dataset Cora --seed 1
python main.py --config train.conf.yaml --dataset Cora --seed 2
python main.py --config train.conf.yaml --dataset Cora --seed 3
python main.py --config train.conf.yaml --dataset Cora --seed 4
Each run produces a timestamped log file in logs/<Dataset>/<method>/ containing:
- All hyperparameters
- Training loss curves
- Final metric values (supervised + unsupervised)
- Runtime and memory statistics
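Final numbers are typically reported as mean and standard deviation over the five seeded runs. A minimal aggregation sketch with hypothetical ACC values (illustrative only, not actual benchmark results):

```python
import statistics

# Hypothetical ACC values from five seeded runs (seeds 0-4); illustrative only
acc_per_seed = [0.701, 0.713, 0.698, 0.705, 0.709]

mean = statistics.mean(acc_per_seed)
std = statistics.stdev(acc_per_seed)  # Sample standard deviation across seeds
print(f"ACC: {mean:.3f} ± {std:.3f}")
```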
Usage
Running Benchmarks
Each algorithm has a self-contained directory with main.py and a YAML configuration:
# Run DMoN on Cora
cd benchmark/DMoN
python main.py --config train.conf.yaml --dataset Cora
# Run DAEGC on Reddit (mini-batch)
cd benchmark/DAEGC
python main.py --config train.conf.yaml --dataset Reddit
Results are automatically logged to benchmark/<Method>/logs/<Dataset>/.
Custom Experiments
PyAGC's modular design makes it easy to compose new methods:
from pyagc.encoders import GCN, GAT
from pyagc.models import DMoN
# Swap GCN → GAT in DMoN by changing one line
encoder = GAT(in_channels=1433, hidden_channels=256, num_layers=2)
model = DMoN(encoder=encoder, n_features=256, n_clusters=7)
Or simply modify the YAML config:
encoder:
  type: GAT        # Changed from GCN
  hidden_channels: 256
  num_layers: 2
cluster:
  type: DMoN
  n_clusters: 7
Scaling to Large Graphs
PyAGC enables training on massive graphs via mini-batch neighbor sampling:
from torch_geometric.loader import NeighborLoader
# Create mini-batch loader
loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],
    batch_size=1024,
    shuffle=True,
)
# Mini-batch training loop
for batch in loader:
    batch = batch.to(device)
    loss = model.train_mini_batch(batch, optimizer)
Scalability highlight: Complex models (e.g., DAEGC) can be trained on Papers100M (111M nodes, 1.6B edges) on a single 32GB V100 GPU in under 2 hours.
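Why neighbor sampling decouples memory from graph size: with fanouts [15, 10] and a batch size of 1024, the sampled subgraph is bounded by a constant, whether the full graph has 2.7K or 111M nodes. A back-of-the-envelope sketch (ignoring deduplication of revisited nodes, so this is an upper bound):

```python
def max_sampled_nodes(batch_size, fanouts):
    """Upper bound on nodes in one neighbor-sampled batch (no deduplication)."""
    total, frontier = batch_size, batch_size
    for fanout in fanouts:
        frontier *= fanout  # Each frontier node samples at most `fanout` neighbors
        total += frontier
    return total

# Fanouts and batch size from the NeighborLoader example above
print(max_sampled_nodes(1024, [15, 10]))  # 169984, independent of total graph size
```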
Extending PyAGC
Adding a New Encoder
from pyagc.encoders import GCN
from pyagc.models import DMoN

# Use any PyG-compatible encoder
encoder = GCN(
    in_channels=128,
    hidden_channels=256,
    num_layers=3,
    dropout=0.1,
)

# Plug into any model
model = DMoN(encoder=encoder, n_clusters=7)
Adding a New Cluster Head
# pyagc/clusters/my_cluster_head.py
from pyagc.clusters import BaseClusterHead
class MyClusterHead(BaseClusterHead):
    def __init__(self, n_clusters, in_channels):
        super().__init__(n_clusters)
        # Define learnable parameters
        ...

    def forward(self, *args, **kwargs):
        # Return clustering loss
        ...
        return loss

    def cluster(self, z, soft=True):
        # Return soft assignment matrix P of shape [N, K]
        ...
        return p
Adding a New Model
# pyagc/models/my_model.py
from pyagc.models import BaseModel
class MyModel(BaseModel):
    def __init__(self, encoder, cluster_head, lambda_=1.0):
        super().__init__()
        self.encoder = encoder
        self.cluster_head = cluster_head
        self.lambda_ = lambda_  # Weight of the clustering loss

    def forward(self, data):
        z = self.encoder(data.x, data.edge_index)
        return z

    def loss(self, data):
        z = self.forward(data)
        rep_loss = ...  # Representation learning loss
        clust_loss = self.cluster_head(z, data.edge_index)
        return rep_loss + self.lambda_ * clust_loss
Awesome AGC Papers
We maintain a curated reading list of attributed graph clustering research, covering surveys, benchmarks, and the latest methods โ organized to complement the PyAGC benchmark.
View the Full Paper List →
Highlights
| Category | Notable Works |
|---|---|
| Survey & Benchmark | Beyond the Academic Monoculture (Ours), A Survey of Deep Graph Clustering, DGCBench, PyAGC Benchmark (Ours) |
| Non-Parametric | SSGC (ICLR'21), SAGSC (AAAI'23), MS2CAG (KDD'25) |
| Deep Decoupled | DGI (ICLR'19), S3GC (NeurIPS'22), MAGI (KDD'24) |
| Deep Joint | DMoN (JMLR'23), DinkNet (ICML'23), Neuromap (NeurIPS'24) |
Found a paper that should be listed? Feel free to open a PR to AWESOME_AGC.md!
FAQ
Q: How do I run experiments on my own graph?

1. Format your graph as a PyTorch Geometric `Data` object with `x` (node features), `edge_index` (edge list), and optionally `y` (labels for evaluation).
2. Use any model from `pyagc.models` with your chosen encoder and cluster head.

Q: Can I use PyAGC without ground-truth labels?

Absolutely: this is the core use case PyAGC is designed for. Use unsupervised structural metrics (Modularity, Conductance) via `pyagc.metrics.structure_metrics` to evaluate cluster quality without any labels.

Q: How does mini-batch training work for graph clustering?

We use neighbor sampling (via PyTorch Geometric's `NeighborLoader`) to create computational subgraphs. The encoder processes these subgraphs, and losses are approximated over mini-batches. This decouples GPU memory from graph size, enabling training on graphs with 100M+ nodes on a single GPU.

Q: What GPU do I need?

All benchmark experiments were conducted on a single NVIDIA Tesla V100 (32GB). For small/medium datasets, a GPU with 8-16GB is sufficient. For Papers100M, we recommend at least 32GB of GPU memory.

Citation
If you find PyAGC useful in your research, please cite our papers:
@article{liu2026bridging,
title={Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering},
author={Yunhui Liu and Pengyu Qiu and Yu Xing and Yongchao Liu and Peng Du and Chuntao Hong and Jiajun Zheng and Tao Zheng and Tieke He},
year={2026},
eprint={2602.08519},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@article{liu2026beyond,
title={Beyond the Academic Monoculture: A Unified Framework and Industrial Perspective for Attributed Graph Clustering},
author={Yunhui Liu and Yue Liu and Yongchao Liu and Tao Zheng and Stan Z. Li and Xinwang Liu and Tieke He},
year={2026},
eprint={2603.20829},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Contributing
We welcome contributions! Please see our contributing guidelines:
- Bug Reports: Open an issue with a minimal reproducible example.
- New Methods: Submit a PR adding your method under the ECO framework with a `main.py`, `train.conf.yaml`, and unit tests.
- New Datasets: Submit a PR with a data loader and dataset description.
- Paper List: Submit a PR to `AWESOME_AGC.md` to add newly published AGC papers.
- Documentation: Improvements to docs, tutorials, and examples are always appreciated.
License
PyAGC is released under the MIT License.
Acknowledgements
PyAGC is built upon the excellent open-source ecosystem of PyTorch and PyTorch Geometric.
We thank Ant Group for supporting the industrial validation of this benchmark.
GitHub ยท PyPI ยท Docs ยท Benchmark Paper ยท Survey Paper ยท Awesome AGC Papers
Made with ❤️ for the Graph ML Community