A Python package for using IntelliGraphs benchmarking datasets.

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License

IntelliGraphs: Benchmark Datasets for Knowledge Graph Generation

IntelliGraphs is a collection of benchmark datasets designed for evaluating generative models of knowledge graphs. You can learn more about it in our preprint: IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation. The Python package provides easy access to the datasets, along with pre- and post-processing functions, baseline models, and evaluation tools for benchmarking new models.

Installation

IntelliGraphs can be installed with either pip or conda, depending on your preferred package manager. The package's dependencies are installed automatically as part of the installation.

Install with pip:

pip install intelligraphs         # Standard pip
uv pip install intelligraphs     # Using UV (faster)

Install with conda:

conda install -c thiv intelligraphs

Verifying the Installation

After installation, you can verify that IntelliGraphs has been successfully installed by running the following command in your Python environment:

import intelligraphs

print(intelligraphs.__version__)
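
If the import succeeds but the version attribute is missing in your release, the standard library can report the installed distribution's version instead. The following is a generic sketch using importlib.metadata; the helper name installed_version is ours and is not part of IntelliGraphs.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


found = installed_version("intelligraphs")
print(found if found is not None else "intelligraphs is not installed")
```

This check reads the package metadata only, so it works even when importing the package itself fails.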

Downloading the Datasets

The datasets required for this project can be obtained either manually or automatically through the IntelliGraphs Python package.

Manual Download

The datasets are hosted on Zenodo: https://doi.org/10.5281/zenodo.14787483

You can download the datasets and extract the files to your preferred directory.

Automatic Dataset Download

Use the DatasetDownloader class to download, verify, and extract datasets automatically.

from intelligraphs.data_loaders.downloader import DatasetDownloader  

downloader = DatasetDownloader()
downloader.download_and_verify_all()

This downloads, verifies integrity using MD5 checksums, and extracts datasets to the .data directory. To change the download directory, specify a custom path when initializing DatasetDownloader:

downloader = DatasetDownloader(download_dir=".custom_data_directory")
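
If you download the archives manually, you can run the same kind of integrity check yourself. The sketch below shows a generic MD5 comparison using only the standard library. It is not the DatasetDownloader implementation, the function names are hypothetical, and the demonstration uses a throwaway file rather than a real dataset archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True when the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_md5.lower()


# Demonstration on a temporary file containing the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_ok = matches_checksum(demo_path, "5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)
print("checksum matches:", demo_ok)
```

Reading in chunks keeps memory use constant, which matters for large archives.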

IntelliGraphs Data Loader

The IntelliGraphsDataLoader class is a utility for loading IntelliGraphs datasets that simplifies accessing and organizing the data for machine learning tasks. It provides functionality to download, extract, and load the datasets into PyTorch tensors.

Usage

Instantiate the DataLoader:

from intelligraphs import IntelliGraphsDataLoader
data_loader = IntelliGraphsDataLoader(dataset_name='syn-paths')

Load the Data:

train_loader, valid_loader, test_loader = data_loader.load_torch(
    batch_size=32,
    padding=True,
    shuffle_train=False,
    shuffle_valid=False,
    shuffle_test=False
)

Access the Data:

for batch in train_loader:
    # Perform training steps with the batch
    pass

for batch in valid_loader:
    # Perform validation steps with the batch
    pass

for batch in test_loader:
    # Perform testing steps with the batch
    pass
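
Graphs in a batch can differ in size, which is presumably why load_torch offers a padding option. The sketch below illustrates the general idea with plain Python lists. The triple layout, the pad value (-1, -1, -1), and the mask are assumptions made for illustration; they do not describe the package's actual tensor format.

```python
def pad_graphs(graphs, pad_triple=(-1, -1, -1)):
    """Pad every graph to the size of the largest one and return a mask.

    `graphs` is a list of graphs, each a list of (subject, predicate, object)
    integer triples. The pad value is a hypothetical choice for illustration.
    """
    max_len = max(len(graph) for graph in graphs)
    padded, mask = [], []
    for graph in graphs:
        missing = max_len - len(graph)
        padded.append(list(graph) + [pad_triple] * missing)
        mask.append([True] * len(graph) + [False] * missing)
    return padded, mask


batch = [
    [(0, 1, 2), (2, 0, 3)],
    [(4, 1, 5)],
]
padded_batch, padding_mask = pad_graphs(batch)
print(padded_batch)
print(padding_mask)
```

The mask lets downstream code ignore the padded positions when computing a loss.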

IntelliGraphs Synthetic KG Generator

`SynPathsGenerator`

This generator creates path graphs where each node represents a city in the Netherlands and each edge represents a mode of transport (cycle_to, drive_to, train_to).

Entities: Dutch cities
Relations: Modes of transport between cities
Use case: Structural learning
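
To make the structure concrete, here is a minimal sketch of sampling one such path graph. It is not the SynPathsGenerator implementation: the city list, the function name, and the sampling procedure are assumptions, and only the three relation names come from the description above.

```python
import random

# Illustrative values only; the real generator may use a different city list.
CITIES = ["Amsterdam", "Rotterdam", "Utrecht", "Eindhoven", "Groningen", "Leiden"]
RELATIONS = ["cycle_to", "drive_to", "train_to"]


def sample_path_graph(num_edges, rng):
    """Sample a directed path: num_edges triples over num_edges + 1 distinct cities."""
    nodes = rng.sample(CITIES, num_edges + 1)
    return [
        (head, rng.choice(RELATIONS), tail)
        for head, tail in zip(nodes, nodes[1:])
    ]


rng = random.Random(42)
graph = sample_path_graph(num_edges=3, rng=rng)
for triple in graph:
    print(triple)
```

Each triple's tail is the next triple's head, so the result is a single chain with no repeated city.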

`SynTIPRGenerator`

This generator creates graphs representing academic roles, timelines, and people. The nodes represent individuals, roles, and years, and the edges represent relationships like has_name, has_role, start_year, and end_year.

Entities: Names, roles, years
Relations: Relationships between academic roles and timeframes
Use case: Basic temporal reasoning and type checking

`SynTypesGenerator`

This generator creates graphs where nodes represent countries, languages, and cities, and edges represent relationships like spoken_in, part_of, and same_as.

Entities: Countries, languages, cities
Relations: Geographical and linguistic relationships
Use case: Type checking

Customization

Each generator class inherits from BaseSyntheticDatasetGenerator and can be customized by overriding methods or adjusting parameters. The base class provides utility methods for splitting datasets, checking for unique graphs, and visualizing graphs.

Extending Functionality

To create a new dataset generator, simply create a new class that inherits from BaseSyntheticDatasetGenerator and implement the sample_synthetic_data method to define your dataset's logic.

class MyCustomDatasetGenerator(BaseSyntheticDatasetGenerator):
    def sample_synthetic_data(self, num_graphs):
        # Implement your custom logic here
        pass

Data Generation

You can generate synthetic datasets by running the corresponding script for each generator. Each generator allows customization of dataset size, random seed, and other parameters.

python intelligraphs/generator/synthetic/synpaths_generator.py --train_size 60000 --val_size 20000 --test_size 20000 --num_edges 3 --random_seed 42 --dataset_name "syn-paths"
python intelligraphs/generator/synthetic/syntypes_generator.py  --train_size 60000 --val_size 20000 --test_size 20000 --num_edges 3 --random_seed 42 --dataset_name "syn-types"
python intelligraphs/generator/synthetic/syntipr_generator.py --train_size 50000 --val_size 10000 --test_size 10000 --num_edges 3 --random_seed 42 --dataset_name "syn-tipr"
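
The flags in these commands map naturally onto an argparse interface. The sketch below mirrors only the flags shown above; the default values are placeholders, and the real scripts may define additional options or different defaults.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Generate a synthetic dataset.")
    # Defaults below are placeholders, not the package's actual defaults.
    parser.add_argument("--train_size", type=int, default=1000)
    parser.add_argument("--val_size", type=int, default=100)
    parser.add_argument("--test_size", type=int, default=100)
    parser.add_argument("--num_edges", type=int, default=3)
    parser.add_argument("--random_seed", type=int, default=0)
    parser.add_argument("--dataset_name", type=str, default="my-dataset")
    return parser


args = build_parser().parse_args(
    ["--train_size", "60000", "--val_size", "20000", "--test_size", "20000",
     "--num_edges", "3", "--random_seed", "42", "--dataset_name", "syn-paths"]
)
total_graphs = args.train_size + args.val_size + args.test_size
print(args.dataset_name, total_graphs)
```

Fixing the random seed is what makes a generated dataset reproducible across runs.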

IntelliGraphs Verifier

Rules

Every dataset comes with a set of rules that describe the nature of the graphs. The ConstraintVerifier class includes a convenient method called print_rules() that allows you to view all the rules and their descriptions in a clean and organized format.

To use the print_rules() method, simply instantiate a subclass of ConstraintVerifier, such as SynPathsVerifier, and then call the print_rules() method on that instance to list the logical rules for a given dataset.

Example Usage

from intelligraphs.verifier.synthetic import SynPathsVerifier

# Initialize the verifier for the syn-paths dataset
verifier = SynPathsVerifier()

# Print the rules and their descriptions for the syn-paths dataset
verifier.print_rules()

When you call print_rules(), you'll get a formatted list of all the rules along with their corresponding descriptions. For example:

List of Rules and Descriptions:
        -> Rule 1:
           FOL: ∀x, y, z: connected(x, y) ∧ connected(y, z) ⇒ connected(x, z)
           Description: Ensures transitivity. If x is connected to y, and y is connected to z, then x should be connected to z.
        -> Rule 2:
           FOL: ∀x, y: edge(x, y) ⇒ connected(x, y)
           Description: If there's an edge between two nodes x and y, then x should be connected to y.
        ...
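
To see how the two example rules interact, the sketch below derives the connected relation from a set of edges: Rule 2 seeds it with the edges, and Rule 1 is applied repeatedly until nothing new can be added. This is an illustration of the logic only, not the verifier's implementation, and the function name is hypothetical.

```python
def connected_pairs(edges):
    """Derive connected(x, y) from edge(x, y) using the two rules above."""
    # Rule 2: every edge implies a connection.
    connected = set(edges)
    # Rule 1: keep adding (x, z) whenever (x, y) and (y, z) are present.
    changed = True
    while changed:
        changed = False
        for x, y in list(connected):
            for y2, z in list(connected):
                if y == y2 and (x, z) not in connected:
                    connected.add((x, z))
                    changed = True
    return connected


edges = {("Amsterdam", "Utrecht"), ("Utrecht", "Eindhoven")}
closure = connected_pairs(edges)
print(sorted(closure))
```

For the two edges in the example, the only derived pair is Amsterdam to Eindhoven.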

Baseline Models

Importing Baseline Models

Our baseline models are also available through the Python API. You can find them in the baseline_models module.

To import the Uniform Baseline model:

from intelligraphs.baseline_models import UniformBaseline

To import the Knowledge Graph Embedding (KGE) models:

from intelligraphs.baseline_models.knowledge_graph_embedding_model import KGEModel

Setup

To reproduce our experiments, you will need to set up a virtual environment with the required dependencies.

1. Create and activate a new conda environment:

First, create a dedicated environment named intelligraph_baseline with Python 3.10:
```
conda create -n intelligraph_baseline python=3.10
```
Activate the newly created environment:
```
conda activate intelligraph_baseline
```

2. Install the `intelligraphs` package:

To install the intelligraphs package, choose one of the following methods:

pip install -e .                       # Editable install, run from the root of a local clone
pip install intelligraphs              # From PyPI
conda install -c thiv intelligraphs    # From the thiv conda channel

3. Install additional dependencies for running baselines:

The baseline experiments require some extra Python packages. These packages include popular libraries such as PyTorch (for machine learning), PyYAML (for configuration file management), tqdm (for progress bars), Weights & Biases (for experiment tracking), NumPy, and SciPy. Install them using:
```
pip install torch pyyaml tqdm wandb numpy scipy
```

4. Set up Weights & Biases (wandb) for experiment tracking:

We use Weights & Biases (wandb) to track experiments, log metrics, and visualize results.
To start using wandb, create an account on their platform and log in from the command line:
```
wandb login
```
If you prefer not to use wandb for tracking, you can disable it by setting wandb to offline mode:
```
wandb offline
```
When set to offline, wandb will not sync any data to the cloud, but you can still run experiments locally.
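
Offline mode can also be selected from inside a Python script by setting the WANDB_MODE environment variable before wandb is initialized. The snippet below is a small sketch that only sets the variable; it does not import wandb or start a run.

```python
import os

# Must be set before wandb.init() is called for the setting to take effect.
os.environ["WANDB_MODE"] = "offline"

print("wandb mode:", os.environ["WANDB_MODE"])
```

Setting the variable in the script is convenient on shared machines where you cannot change the global wandb configuration.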

Uniform Baseline Model

The uniform baseline model is designed to serve as a simple reference baseline. It applies a random compression strategy to the synthetic and real-world datasets. You can run this baseline using the following command:

python benchmark/experiments/uniform_baseline_compression_test.py

It should complete in about a minute without GPU acceleration.

To run the graph sampling experiment using the uniform sampler, run the command:

python benchmark/experiments/uniform_baseline_graph_sampling.py
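
One way to read "compression" in this context: a model that assigns a graph probability p needs -log2(p) bits to encode it, so a uniform model gives a reference code length. The sketch below computes that length under the simplifying assumption that every triple is drawn uniformly and independently. It does not reproduce the benchmark script, which may also encode other information such as the number of triples.

```python
import math


def uniform_code_length_bits(num_triples, num_entities, num_relations):
    """Bits needed to encode a graph when every triple is equally likely.

    Each triple picks a subject and an object from num_entities options and a
    predicate from num_relations options, all uniformly and independently.
    """
    bits_per_triple = 2 * math.log2(num_entities) + math.log2(num_relations)
    return num_triples * bits_per_triple


# Toy numbers chosen for easy arithmetic, not taken from any dataset.
bits = uniform_code_length_bits(num_triples=3, num_entities=4, num_relations=2)
print(bits)  # 15.0
```

A learned model is useful as a compressor only if it needs fewer bits than a reference of this kind.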

Probabilistic KGE Models

We have developed three probabilistic Knowledge Graph Embedding (KGE) models based on TransE, DistMult, and ComplEx. These models are CUDA-compatible and can take advantage of multiple GPUs for accelerated training and inference.
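
For reference, the three underlying scoring functions can be written in a few lines. The sketch below uses their standard textbook forms with plain Python lists (TransE with an L2 distance; an L1 distance is also common). It says nothing about how the probabilistic variants in this package turn scores into probabilities, and the function names are ours.

```python
import math


def transe_score(head, relation, tail):
    """TransE: negative Euclidean distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def distmult_score(head, relation, tail):
    """DistMult: sum of element-wise products of the three vectors."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def complex_score(head, relation, tail):
    """ComplEx: real part of the trilinear product with the tail conjugated."""
    return sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail)).real


# A perfect translation (head + relation == tail) gets the maximum TransE score of 0.
print(transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]))
print(distmult_score([1, 2], [3, 4], [5, 6]))  # 63
print(complex_score([1 + 1j], [1j], [1 - 1j]))  # -2.0
```

Higher scores mean a more plausible triple in all three functions.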

The following commands allow you to run baseline experiments on various synthetic and real-world datasets. Each experiment is configured through a YAML file, which specifies the model and dataset parameters.

For the `syn-paths` dataset:

python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/syn-paths-transe.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/syn-paths-complex.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/syn-paths-distmult.yaml

For the `syn-tipr` dataset:

python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/syn-tipr-transe.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/syn-tipr-complex.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/syn-tipr-distmult.yaml

For the `syn-types` dataset:

python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/syn-types-transe.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/syn-types-complex.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/syn-types-distmult.yaml

For the `wd-articles` dataset:

python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/wd-articles-transe.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/wd-articles-complex.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/wd-articles-distmult.yaml

For the `wd-movies` dataset:

python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/wd-movies-transe.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/wd-movies-complex.yaml
python benchmark/experiments/probabilistic_kge_baselines.py  --config benchmark/configs/wd-movies-distmult.yaml
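
Because the commands follow a regular pattern, the full set can be generated rather than typed out. The sketch below builds one command per dataset and model pair. It assumes that every dataset, including syn-tipr, is run through the same script and that the config files follow the naming shown above. It prints the commands instead of executing them.

```python
from itertools import product

SCRIPT = "benchmark/experiments/probabilistic_kge_baselines.py"
DATASETS = ["syn-paths", "syn-tipr", "syn-types", "wd-articles", "wd-movies"]
MODELS = ["transe", "complex", "distmult"]


def build_commands():
    """Return one argument list per dataset/model pair."""
    return [
        ["python", SCRIPT, "--config", f"benchmark/configs/{dataset}-{model}.yaml"]
        for dataset, model in product(DATASETS, MODELS)
    ]


commands = build_commands()
for command in commands:
    # Printing keeps this sketch side-effect free; pass each list to
    # subprocess.run(command, check=True) to launch the experiments.
    print(" ".join(command))
```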

Dataset Verification

We have written test functions to check the graphs in the datasets against the list of rules. They can be run using:

python intelligraphs/data_validation/validate_data.py

If the data contains any violations, the script raises a DataError exception with a message similar to this:

intelligraphs.errors.custom_error.DataError: Violations found in a graph from the training dataset: 
        - Rule 6: An academic's tenure end year cannot be before its start year. The following violation(s) were found: (_time, start_year, 1996), (_time, end_year, 1994).
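
The check behind a message like this can be sketched in a few lines. The example below re-creates the Rule 6 violation from the message above using a local stand-in exception. It is not the package's validation code; the DataError class defined here and the check_tenure function are hypothetical.

```python
class DataError(Exception):
    """Local stand-in for the package's DataError; defined here for the sketch."""


def check_tenure(triples):
    """Raise DataError if a node's end_year precedes its start_year."""
    start, end = {}, {}
    for subject, predicate, value in triples:
        if predicate == "start_year":
            start[subject] = int(value)
        elif predicate == "end_year":
            end[subject] = int(value)
    for subject in start.keys() & end.keys():
        if end[subject] < start[subject]:
            raise DataError(
                f"end_year {end[subject]} is before start_year {start[subject]} "
                f"for {subject}"
            )


violating = [("_time", "start_year", 1996), ("_time", "end_year", 1994)]
try:
    check_tenure(violating)
    violation_message = None
except DataError as error:
    violation_message = str(error)
print(violation_message)
```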

How to Cite

If you use IntelliGraphs in your research, please cite the following paper:

@article{thanapalasingam2023intelligraphs,
  title={IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation},
  author={Thanapalasingam, Thiviyan and van Krieken, Emile and Bloem, Peter and Groth, Paul},
  journal={arXiv preprint arXiv:2307.06698},
  year={2023}
}

Reporting Issues

If you encounter any bugs or have any feature requests, please file an issue on the project's GitHub repository.

License

The IntelliGraphs datasets and the Python package are licensed under the CC-BY License. See LICENSE for more information.

Platform Compatibility/Issues

This package has been developed and tested on macOS and Linux. If you experience any problems on Windows or any other issues, please raise them on the project's GitHub repository.

Unit tests

Make sure to activate the virtual environment in which the intelligraphs package is installed.

To run the unit tests, install pytest and SciPy using either pip or conda:

pip install pytest scipy      # Using pip
conda install pytest scipy    # Using conda

pytest --version  # verify installation

Execute the unit tests using:

pytest

Release history

1.0.18

Feb 25, 2025

1.0.17

Feb 25, 2025

1.0.15

Feb 19, 2025

1.0.14

Feb 5, 2025

1.0.12

Feb 5, 2025

This version

1.0.11

Feb 5, 2025

1.0.10

Feb 1, 2025

1.0.9

Feb 1, 2025

1.0.8

Feb 1, 2025

1.0.7

Feb 1, 2025

1.0.6

Feb 1, 2025

1.0.5

Feb 1, 2025

1.0.4

Feb 1, 2025

1.0.3

Feb 1, 2025

0.1.1

Apr 12, 2023

0.1.0

Apr 12, 2023

0.0.3

Feb 1, 2025
