A collection of graph datasets in torch_geometric format.
Big Graph Dataset
This is a collaboration project to build a large, multi-domain set of graph datasets (the bgd package). Each dataset comprises many small graphs.
The aim of this project is to provide a large set of graph datasets for use in machine learning research. Currently, graph datasets are distributed across individual repositories, increasing workload as researchers must search for relevant resources. Once these datasets are found, there is additional labour in formatting the data for use in deep learning.
We aim to provide datasets that are:
- Composed of many small graphs
- Diverse in domain
- Diverse in tasks
- Well-documented
- Formatted uniformly across datasets for PyTorch Geometric
What we’re looking for
In short: anything! The idea behind this being a collaboration is that we cast a wide net over different domains and tasks.
There are a few rules for this first phase (see below), but the quick brief is that we're looking for datasets of small static graphs with well-defined tasks. "Static" just means that the structure of the graphs doesn't vary over time.
If your data is a bit more funky, for example multi-graphs or time-series on graphs, please get in touch and we can discuss how to include it.
In the examples I've provided, datasets are mostly sampled from one large graph - this is not compulsory.
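As a rough illustration of that sampling approach (the function and parameter names here are my own, not part of bgd), taking many small graphs from one large graph can be done with ego networks around random seed nodes:

```python
import random
import networkx as nx

def sample_ego_graphs(G, num_graphs, radius=1, seed=0):
    """Sample small graphs from one large graph by taking ego
    networks (all nodes within `radius` hops) around randomly
    chosen seed nodes."""
    rng = random.Random(seed)
    seeds = rng.sample(list(G.nodes), num_graphs)
    return [nx.ego_graph(G, n, radius=radius) for n in seeds]

# Example: sample 5 small graphs from one larger random graph
big_graph = nx.erdos_renyi_graph(500, 0.01, seed=42)
samples = sample_ego_graphs(big_graph, num_graphs=5)
```

The same idea works with any neighbourhood-sampling scheme (random walks, k-hop subgraphs, etc.) - the point is only that one large graph yields many small ones.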
Contributing
The source can be found in the [GitHub repository](https://github.com/neutralpronoun/big-graph-dataset), and documentation on the [Read the Docs page](https://big-graph-dataset.readthedocs.io/en/latest/).
The basics:
- Create your own git branch
- Copy bgd/example_dataset.py
- Have a look through it
- Re-tool it for your own dataset
See more in Getting Started.
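To give a feel for the re-tooling step, here is a minimal structural sketch of what a contributed dataset roughly needs. This is illustrative only - the class and method names are my own invention, not the actual bgd API; the real template lives in bgd/example_dataset.py:

```python
# Illustrative sketch only - not the real bgd API.
# A contributed dataset roughly needs: a download step, a
# processing step, and train/val/test splits of small graphs.

class MyContributedDataset:
    def __init__(self, num_graphs=100, split=(0.8, 0.1, 0.1)):
        self.num_graphs = num_graphs   # rule: controllable dataset size
        self.split = split             # rule: train/val/test split
        self.graphs = []

    def download(self):
        """Fetch raw data in-code (rule: keep the repo small)."""
        raise NotImplementedError

    def process(self):
        """Convert raw data into many small graphs (<400 nodes each)."""
        raise NotImplementedError

    def splits(self):
        """Return (train, val, test) lists of graphs."""
        n = len(self.graphs)
        a = int(self.split[0] * n)
        b = a + int(self.split[1] * n)
        return self.graphs[:a], self.graphs[a:b], self.graphs[b:]
```

The real template fills these steps in with PyTorch Geometric `Data` objects; the skeleton above is just the shape of the contract.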
I’ve provided code for sub-sampling graphs and producing statistics.
A few rules, demonstrated in bgd/real/example_dataset.py:
- The datasets need at least a train/val/test split
- Datasets should be many small (fewer than 400 nodes) graphs
- Ideally the number of graphs in each dataset should be controllable
- Data should be downloaded in-code to keep the repo small. If this isn't possible, let me know.
- Please cite your sources for data in the documentation - see the existing datasets for example documentation
- Where possible, start from existing datasets that have been used in the literature, or, if using generators, use generators that are well understood (for example, Erdős-Rényi graphs)
- Please document your dataset files with your name and contact information at the top. I'll check code and merge your branches all at once at the end of the project.
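As a concrete sketch of how a generator-based dataset can satisfy these rules (many small graphs, controllable count, train/val/test split, well-understood generator), here is a toy Erdős-Rényi example. The function name and split ratios are my own choices, not bgd conventions:

```python
import networkx as nx

def make_er_dataset(num_graphs=30, n=50, p=0.1, seed=0):
    """Generate `num_graphs` small Erdos-Renyi graphs (n << 400 nodes)
    and return train/val/test splits (80/10/10)."""
    graphs = [nx.erdos_renyi_graph(n, p, seed=seed + i)
              for i in range(num_graphs)]
    a = int(0.8 * num_graphs)          # rule: train/val/test split
    b = a + int(0.1 * num_graphs)
    return graphs[:a], graphs[a:b], graphs[b:]

train, val, test = make_er_dataset(num_graphs=30)
```

A real contribution would convert each graph to a PyTorch Geometric `Data` object and attach a task, but the rules above are already visible in this shape.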
Getting Started
Check out the Reddit dataset example notebook for a quick-start guide, then have a look at the source code for the bgd datasets.
My environment is specified in docs/requirements.txt; use pip install -r requirements.txt within a virtual environment (Conda etc.) to get everything installed.
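The setup step above might look like the following (the environment name and Python version are my own choices, shown with Conda, but any virtual environment works):

```shell
# Create and activate an isolated environment (name is arbitrary)
conda create -n big-graph-dataset python=3.10
conda activate big-graph-dataset

# Install the project's dependencies from the pinned requirements file
pip install -r docs/requirements.txt
```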
Datasets
Documentation for the datasets currently in the Big Graph Dataset project.
ToP (Topology Only Pre-Training)
Documentation for the Topology Only Pre-Training component of the project. We are using a pre-trained model to generate embeddings of the graphs in the datasets, hopefully to get some measure of how diverse the datasets are. Very much a work-in-progress!
Credits
This project is maintained by Alex O. Davies, a PhD student at the University of Bristol. Contributors, by default, will be given fair credit upon initial release of the project.
Should you wish your authorship to be anonymous, or if you have any further questions, please contact me at <alexander.davies@bristol.ac.uk>.
Citing
@misc{big-graph-dataset,
  title = {{Big Graph Dataset} Documentation},
  howpublished = {https://big-graph-dataset.readthedocs.io/}
}