
Datasets for the training of graph neural networks (GNNs) and subsequent visualization of attributional explanations of XAI methods

Project description


Visual Graph Datasets

This package provides the means to manage a collection of datasets, primarily for the training of graph neural networks (GNNs). Each dataset is represented by one folder. Inside that folder, each element of the dataset is represented by two files: (1) a metadata JSON file which contains the full graph representation as well as additional metadata such as the canonical index, the target value to be predicted, etc., and (2) a PNG image file which shows a domain-specific illustration of the graph (molecular drawings for chemical datasets, for example). These visualizations of each graph can be used to easily visualize the results of attributional graph XAI methods, which assign importance values to each node and edge of the original input graph.

Motivation

Usually, datasets are packaged as compactly as possible. For example, chemical graph datasets are commonly distributed as CSV files which only contain the index, a SMILES representation of the molecule, and the target value, looking something like this:

index, smiles, value
0, ccc, 0.24
1, ccc, 0.52
2, ccc, 1.77

This has the major advantage that even large datasets have file sizes of only a few MB. Such files are easy to host online and easy to store. The disadvantage, however, is that these files need to be processed before they can be used to train graph neural networks (GNNs): the encoded SMILES representation first has to be transformed into a graph representation, where node and edge features are generated by some kind of chemical pre-processor. Instead of placing the major storage and bandwidth requirements on the user, this places the major processing requirements on the user. Additionally, this format makes the visualization of generated explanations considerably harder.
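As a hedged illustration of the pre-processing step described above (the CSV text mirrors the placeholder example; the SMILES strings are not meaningful), such a compact file can be read with Python's standard csv module, after which every row still needs to be converted into a graph:

```python
import csv
import io

# Placeholder dataset in the compact CSV format shown above.
CSV_TEXT = """index, smiles, value
0, ccc, 0.24
1, ccc, 0.52
2, ccc, 1.77
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT), skipinitialspace=True)
rows = list(reader)
# At this point each row only holds (index, smiles, value) as strings; the
# SMILES representation still has to be turned into node and edge features
# by a chemical pre-processor before a GNN can consume it.
print(len(rows), rows[0]['smiles'])
```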

Ultimately, we decided to put the burden of downloading a larger amount of data on the user a single time, in exchange for simplifying and reducing the burden of pre-processing and data visualization for each individual training process.

Additionally, by distributing both canonical indexing and canonical visualizations we aim to make explanation results more comparable in the future.

Installation

First clone this repository:

git clone https://github.com/username/visual_graph_datasets.git

Then install it like this:

cd visual_graph_datasets
pip3 install -e .

Download datasets

NOTE: We strongly encourage storing datasets on an SSD instead of an HDD, as this can make a difference of multiple hours(!) when loading especially large datasets.

Datasets can be downloaded by name using the download command:

# Example for the dataset 'rb_dual_motifs'
python3 -m visual_graph_datasets.cli download "rb_dual_motifs"

By default, the dataset will be downloaded into the folder $HOME/.visual_graph_datasets/datasets, where HOME is the current user's home directory.

The dataset download destination can be changed in a config file by using the config command:

python3 -m visual_graph_datasets.cli config

This command will open the config file at $HOME/.visual_graph_datasets/config.yaml using the system's default text editor.

List available datasets

You can display a list of all datasets currently available from the configured remote file share provider, together with some metadata about them, by using the list command:

python3 -m visual_graph_datasets.cli list

Running the unittests

After installation, you can optionally run the unit tests to confirm that all datasets have been downloaded correctly and that everything works properly:

cd visual_graph_datasets
pytest ./tests/*

Usage

The datasets are mainly intended to be used in combination with other packages, but this package provides some basic utilities to load and explore the datasets within Python programs.

import os

from visual_graph_datasets.config import Config
from visual_graph_datasets.data import load_visual_graph_dataset

# The function only needs the absolute path to the dataset folder and will load the entire
# dataset from all the files within that folder.
# The function returns two dictionaries: the first maps the string names of the elements to
# their content dictionaries, and the second maps the integer indices of the elements to the
# very same content dictionaries. Two separate dictionaries are returned to provide different
# ways of accessing the data, which are needed in different situations.
dataset_path = os.path.join(Config().get_datasets_path(), 'rb_dual_motifs')
data_name_map, data_index_map = load_visual_graph_dataset(dataset_path)

Each content dictionary (i.e. each value of the two dicts returned by the function) has the following nested structure:

  • image_path: The absolute path to the image file that visualizes this element

  • metadata_path: the absolute path to the metadata file

  • metadata: A dict which contains all the metadata for that element
    • value: The target value for the element, which can be a single value (usually with regression) or a one-hot vector for classification.

    • index: The canonical index of this element within the dataset

    • (split): If defined, either “train” or “test” - assignment for the canonical train test split

    • graph: A dictionary which contains the entire graph representation of this element.
      • node_attributes: tensor of shape (V, N)

      • edge_attributes: tensor of shape (E, M)

      • edge_indices: tensor of shape (E, 2) which are the tuples of integer node indices that determine edges

      • node_coordinates: tensor of shape (V, 2) which contains the xy position of each node, in pixel values, within the corresponding image visualization of the element. This is the crucial information required to use the existing image representations to visualize attributional explanations!

With the following variable definitions:

  • V - the number of nodes in a graph

  • E - the number of edges in a graph

  • N - the number of node attributes / features associated with each node

  • M - the number of edge attributes / features associated with each edge
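Putting this together, a single content dictionary might look like the following sketch. All concrete values and paths are invented for illustration; only the keys and shapes follow the structure described above:

```python
# Hypothetical content dictionary for a graph with V=3 nodes, E=2 edge
# entries, N=1 node feature and M=1 edge feature. All values are made up.
data = {
    'image_path': '/home/user/.visual_graph_datasets/datasets/rb_dual_motifs/0.png',
    'metadata_path': '/home/user/.visual_graph_datasets/datasets/rb_dual_motifs/0.json',
    'metadata': {
        'index': 0,
        'value': [0.24],
        'graph': {
            'node_attributes': [[1.0], [0.0], [1.0]],                        # (V, N)
            'edge_attributes': [[1.0], [1.0]],                               # (E, M)
            'edge_indices': [[0, 1], [1, 2]],                                # (E, 2)
            'node_coordinates': [[12.0, 30.5], [40.0, 55.0], [70.5, 80.0]],  # (V, 2)
        },
    },
}

graph = data['metadata']['graph']
V = len(graph['node_attributes'])
E = len(graph['edge_indices'])
print(V, E)  # 3 2
```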

Datasets

Here is a list of the datasets currently included.

For more information about the individual datasets use the list command in the CLI (see above).

  • rb_dual_motifs

  • tadf
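To illustrate the main purpose of the node_coordinates field described in the Usage section, here is a hedged sketch of how node importances produced by an attributional XAI method could be drawn on top of a pre-rendered dataset image. It assumes matplotlib is installed; plot_node_importances is an illustrative helper, not part of this package:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display required
import matplotlib.pyplot as plt
import matplotlib.image as mpimg


def plot_node_importances(image_path, node_coordinates, node_importances):
    """Overlay one circle per node on the dataset image, with the radius
    scaled by the node's importance value. Illustrative helper only."""
    fig, ax = plt.subplots()
    ax.imshow(mpimg.imread(image_path))
    for (x, y), importance in zip(node_coordinates, node_importances):
        # node_coordinates are pixel positions within the image, so the
        # circles line up with the drawn nodes without any re-layout.
        ax.add_patch(plt.Circle((x, y), radius=30 * importance,
                                fill=False, color='red', linewidth=2))
    ax.axis('off')
    return fig
```

The same idea extends to edge importances by drawing lines between the coordinate pairs referenced by each entry of edge_indices.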

