
A collection of graph datasets in torch_geometric format.

Project description

Big Graph Dataset

This is a collaboration project to build a large, multi-domain set of graph datasets. Each dataset comprises many small graphs.

The aim of this project is to provide a large set of graph datasets for use in machine learning research. Currently, graph datasets are distributed across individual repositories, increasing workload as researchers have to search for relevant resources. Once these datasets are found, there is additional labour in formatting the data for use in deep learning.

We aim to provide datasets that are:
  • Composed of many small graphs

  • Diverse in domain

  • Diverse in tasks

  • Well-documented

  • Formatted uniformly across datasets for PyTorch Geometric
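As a rough illustration of what that uniform format means, here is a plain-Python mock of the fields a `torch_geometric.data.Data` object carries for one small graph (a sketch only; in the real datasets these fields are PyTorch tensors, not lists):

```python
# Minimal plain-Python mock of one graph in torch_geometric's layout:
# node features `x`, a COO edge list `edge_index`, and a graph label `y`.
graph = {
    "x": [[0.0], [1.0], [2.0]],           # one feature vector per node
    "edge_index": [[0, 1, 1, 2],          # source nodes
                   [1, 0, 2, 1]],         # target nodes (undirected, so both directions)
    "y": 1,                               # graph-level class label
}

num_nodes = len(graph["x"])
num_edges = len(graph["edge_index"][0]) // 2  # undirected edge count

print(num_nodes, num_edges)  # 3 2
```

Because every dataset shares this layout, downstream code can iterate over graphs from any domain without per-dataset glue.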

What we’re looking for

In short: anything! The idea behind this being a collaboration is that we cast a wide net over different domains and tasks.

There are a few rules for this first phase (see below), but the quick brief is that we're looking for datasets of small static graphs with well-defined tasks. "Static" just means that the structure of the graphs doesn't vary over time.

If your data is a bit more funky, for example multi-graphs or time-series on graphs, please get in touch and we can discuss how to include it.

In the examples I've provided, datasets are mostly sampled from one large graph, but this is not compulsory.

Contributing

The source can be found in the GitHub repository (https://github.com/neutralpronoun/big-graph-dataset), and documentation on the Read the Docs page (https://big-graph-dataset.readthedocs.io/en/latest/).

The basics:
  • Create your own git branch

  • Copy bgd/example_dataset.py

  • Have a look through

  • Re-tool it for your own dataset

See more in Getting Started.


I’ve provided code for sub-sampling graphs and producing statistics.
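The actual sub-sampling and statistics code lives in the repository; as a hypothetical sketch of the idea (not the project's implementation), one way to sample a small connected graph from a large one is breadth-first search from a seed node:

```python
from collections import deque

def bfs_sample(adjacency, start, max_nodes):
    """Sample a connected subgraph of at most `max_nodes` nodes by
    breadth-first search from `start`. `adjacency` maps node -> neighbours."""
    visited = {start}
    queue = deque([start])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in visited and len(visited) < max_nodes:
                visited.add(nbr)
                queue.append(nbr)
    return visited

# Toy "large" graph: a path 0-1-2-3-4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sorted(bfs_sample(adj, 0, 3)))  # [0, 1, 2]
```

Repeating this from many random seed nodes yields a dataset of many small graphs from one large source graph.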

A few rules, demonstrated in bgd/real/example_dataset.py:
  • The datasets need at least a train/val/test split

  • Datasets should consist of many small graphs (fewer than 400 nodes each)

  • Ideally the number of graphs in each dataset should be controllable

  • Data should be downloaded in-code to keep the repo small. If this isn’t possible, let me know.

  • Please cite your sources for data in documentation - see the existing datasets for example documentation

  • Where possible, start from existing datasets that have been used in the literature, or, if using generators, use generators that are well-understood (for example Erdős-Rényi graphs)
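Taken together, the rules above amount to something like the following standard-library sketch (hypothetical; the real datasets subclass PyTorch Geometric dataset classes rather than returning plain dicts):

```python
import random

def make_dataset(num_graphs, max_nodes=400, seed=0):
    """Generate `num_graphs` small Erdős-Rényi-style graphs (graph count
    controllable, node count capped) and return train/val/test splits."""
    rng = random.Random(seed)
    graphs = []
    for _ in range(num_graphs):
        n = rng.randint(10, max_nodes)          # graphs stay small
        edges = [(u, v) for u in range(n) for v in range(u + 1, n)
                 if rng.random() < 0.05]        # Erdős-Rényi G(n, p=0.05)
        graphs.append({"num_nodes": n, "edges": edges})
    # 80/10/10 train/val/test split
    n_train = int(0.8 * num_graphs)
    n_val = int(0.1 * num_graphs)
    return (graphs[:n_train],
            graphs[n_train:n_train + n_val],
            graphs[n_train + n_val:])

train, val, test = make_dataset(20)
print(len(train), len(val), len(test))  # 16 2 2
```

A well-understood generator like this makes a useful baseline, but datasets drawn from real, previously published data are preferred.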

Please document your dataset files with your name and contact information at the top. I’ll check code and merge your branches all at once at the end of the project.

Getting Started

Check out the Reddit dataset example notebook for a quick start guide, then have a look at the source code for the bgd datasets.

My environment is under docs/requirements.txt; use pip install -r requirements.txt within a virtual (Conda etc.) environment to get everything installed.

Datasets

Documentation for the datasets currently in the Big Graph Dataset project.

ToP (Topology Only Pre-Training)

Documentation for the Topology Only Pre-Training component of the project. We are using a pre-trained model to generate embeddings of the graphs in the datasets, hopefully to get some measure of how diverse the datasets are. Very much a work-in-progress!

Credits

This project is maintained by Alex O. Davies, a PhD student at the University of Bristol. Contributors, by default, will be given fair credit upon initial release of the project.

Should you wish your authorship to be anonymous, or if you have any further questions, please contact me at <alexander.davies@bristol.ac.uk>.

Citing

@misc{big-graph-dataset,
    title = {{Big Graph Dataset} Documentation},
    howpublished = {https://big-graph-dataset.readthedocs.io/}
}


Project details


Download files

Download the file for your platform.

Source Distribution

big_graph_dataset-0.0.8.post5.tar.gz (50.8 kB)

Uploaded Source

Built Distribution


big_graph_dataset-0.0.8.post5-py3-none-any.whl (110.9 kB)

Uploaded Python 3

File details

Details for the file big_graph_dataset-0.0.8.post5.tar.gz.

File metadata

  • Download URL: big_graph_dataset-0.0.8.post5.tar.gz
  • Upload date:
  • Size: 50.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for big_graph_dataset-0.0.8.post5.tar.gz
Algorithm Hash digest
SHA256 39a9facbebe41c161db6104cff31403d44015a4a2c3b142155594bd0085230f3
MD5 6c22f5f75e21ce44364235d193676770
BLAKE2b-256 a784dd792996ec19cca4bc0fb0e97447493ee1d73c221b1de9cbf062e6f7d79b


File details

Details for the file big_graph_dataset-0.0.8.post5-py3-none-any.whl.

File metadata

File hashes

Hashes for big_graph_dataset-0.0.8.post5-py3-none-any.whl
Algorithm Hash digest
SHA256 36e222a446e3805222dabd5d7362e0d2c4b42d9b35adf84eee60c21dfd3895bf
MD5 bb827d4cfdc5cf0ca8533667a09540ad
BLAKE2b-256 1648694b17004a47d0deb083477feb8d8f6f669403110a0627c01f0386279d7a

