Python implementation and extension of RDF2Vec

Project description

What is SQLiteKG2Vec?

SQLiteKG2Vec is an experimental extension of the popular pyRDF2Vec library for training RDF2Vec embeddings. It might be merged into the main project in the future. The extension stores the statements of the KG, as well as the generated walks, in a simple SQLite database instead of keeping them in memory. This makes it possible to train embeddings for huge knowledge graphs without running into memory issues.

RDF2Vec is an unsupervised technique that builds on Word2Vec, which learns one embedding per word by predicting, in one of two ways:

the word based on its context: Continuous Bag-of-Words (CBOW);
the context based on a word: Skip-Gram (SG).
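
To make the difference between the two objectives concrete, here is a minimal sketch of the training examples each one would see for a single sentence. It is an illustration only, not how Word2Vec is implemented internally; the function name `training_pairs` and the `window` parameter are assumptions made for this example.

```python
from typing import List, Tuple


def training_pairs(
    sentence: List[str], window: int = 1
) -> Tuple[List[Tuple[List[str], str]], List[Tuple[str, str]]]:
    """Build toy CBOW and Skip-Gram training examples from one sentence.

    Hypothetical helper for illustration; not part of pyRDF2Vec or gensim.
    """
    cbow = []       # (context words, word to predict)
    skip_gram = []  # (input word, one context word to predict)
    for i, word in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        context = left + right
        cbow.append((context, word))
        skip_gram.extend((word, c) for c in context)
    return cbow, skip_gram


cbow, skip_gram = training_pairs(["Alice", "knows", "Bob"], window=1)
for context, target in cbow:
    print("CBOW:", context, "->", target)
for source, target in skip_gram:
    print("SG:  ", source, "->", target)
```

CBOW collapses the whole context window into one example per word, while Skip-Gram emits one example per (word, context word) pair.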

To create this embedding, RDF2Vec first creates "sentences" which can be fed to Word2Vec by extracting walks of a certain depth from a Knowledge Graph.
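
The idea of turning walks into "sentences" can be sketched on a toy graph. The snippet below is a simplified illustration under stated assumptions: the graph, the `random_walks` function and its parameters are made up for this example and do not reproduce the walkers shipped with pyRDF2Vec. Depth is counted here in hops (one predicate plus one object per hop).

```python
import random
from typing import Dict, List, Tuple

# Toy KG as adjacency lists: subject -> [(predicate, object), ...].
KG: Dict[str, List[Tuple[str, str]]] = {
    "Alice": [("knows", "Bob"), ("livesIn", "Vienna")],
    "Bob": [("worksFor", "ACME")],
    "Vienna": [("capitalOf", "Austria")],
}


def random_walks(
    kg: Dict[str, List[Tuple[str, str]]],
    start: str,
    max_depth: int,
    max_walks: int,
    seed: int = 42,
) -> List[List[str]]:
    """Sample walks from `start`; each walk is a list of tokens (a "sentence")."""
    rng = random.Random(seed)  # local generator, so results are repeatable
    walks = []
    for _ in range(max_walks):
        walk = [start]
        node = start
        for _ in range(max_depth):
            edges = kg.get(node, [])
            if not edges:  # dead end: stop this walk early
                break
            predicate, obj = rng.choice(edges)
            walk.extend([predicate, obj])
            node = obj
        walks.append(walk)
    return walks


walks = random_walks(KG, "Alice", max_depth=2, max_walks=3)
for walk in walks:
    print(" ".join(walk))
```

Each printed line is one sentence whose tokens (entities and predicates) play the role of words for Word2Vec.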

This repository contains an implementation of the algorithm in "RDF2Vec: RDF Graph Embeddings and Their Applications" by Petar Ristoski, Jessica Rosati, Tommaso Di Noia, Renato De Leone, Heiko Paulheim ([paper] [original code]).

Getting Started

For most use cases, pySQLiteKG2Vec can be used as follows to generate embeddings for the entities of a given Knowledge Graph (KG):

import csv

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs.io import open_from_pykeen_dataset
from pyrdf2vec.walkers import RandomWalker
from pyrdf2vec.walkers.vault.sqlitevault import SQLiteCorpusVaultFactory

with open_from_pykeen_dataset('dbpedia50') as kg:
    transformer = RDF2VecTransformer(
        Word2Vec(epochs=10),
        walkers=[RandomWalker(max_walks=200,
                              max_depth=4,
                              random_state=133,
                              with_reverse=False,
                              n_jobs=1)],
        vault_factory=SQLiteCorpusVaultFactory('corpus.db'),
        verbose=1
    )
    # Train RDF2Vec on every entity in the KG.
    ent = kg.entities()
    embeddings, _ = transformer.fit_transform(kg, ent)
    # Write one row per entity: its name followed by the vector components.
    with open('embeddings.tsv', 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        for name, vector in kg.pack(ent, embeddings):
            writer.writerow([name] + list(vector))
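
The snippet above writes one tab-separated row per entity: the entity name followed by the vector components. As a minimal sketch of how such a file could be read back, the following uses only the standard library. The helper name `load_embeddings` and the sample values are assumptions for illustration; a small file is written first so that the example is self-contained.

```python
import csv
import os
import tempfile
from typing import Dict, List


def load_embeddings(path: str) -> Dict[str, List[float]]:
    """Read rows of the form: name <TAB> x1 <TAB> x2 ... into a dict."""
    embeddings = {}
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row:  # skip empty lines
                embeddings[row[0]] = [float(x) for x in row[1:]]
    return embeddings


# Write a tiny file in the same layout (made-up values), then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.tsv")
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(
            [["ex:Alice", 0.1, -0.2], ["ex:Bob", 0.3, 0.4]]
        )
    loaded = load_embeddings(path)

print(loaded)
```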

Create from PyKEEN dataset

PyKEEN is a popular library for knowledge graph embeddings, and it provides a number of datasets that are commonly referenced in the scientific literature. An SQLite KG can be constructed from a PyKEEN dataset by specifying the name of the dataset or by passing the dataset instance.

In the following code snippet, the db100k dataset, which is a subsampling of DBpedia, is used to construct an SQLite KG.

from pyrdf2vec.graphs.io import open_from_pykeen_dataset

with open_from_pykeen_dataset('db100k', combined=True) as kg:
    # ...
    pass

Parameters:

combined - False if only the training set of a dataset shall be used for the training of RDF2Vec. True if all the sets (training, testing and validation) shall be used. It is False by default.

Create from TSV file

In order to save memory for big knowledge graphs, it might be a good idea to load the statements of such a knowledge graph from a TSV file into a SQLite KG. All the rows in the TSV file must have three columns, where the first column is the subject, the second is the predicate, and the last column is the object.

The following code snippet creates a new SQLite KG instance from the statements of the specified TSV file, which has been compressed using GZIP.

from pyrdf2vec.graphs.io import open_from_tsv_file

with open_from_tsv_file('statements.tsv.gz', compression='gzip') as kg:
    # ...
    pass

Parameters:

skip_header - True if the first row shall be skipped, for example because it is a header row. False if it shouldn't be skipped. It is False by default.
compression - specifies the compression type of the source TSV file. The default value is None, which means that the source isn't compressed. At the moment, only 'gzip' is supported as a compression type.
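
For reference, here is a minimal sketch of how a GZIP-compressed, three-column TSV file of statements could be produced with the Python standard library. The triples, the `write_statements` helper and the file location are assumptions for illustration; only the column order (subject, predicate, object) comes from the description above.

```python
import csv
import gzip
import os
import tempfile
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Made-up example statements: (subject, predicate, object).
triples: List[Triple] = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Alice", "ex:livesIn", "ex:Vienna"),
    ("ex:Vienna", "ex:capitalOf", "ex:Austria"),
]


def write_statements(path: str, rows: Iterable[Triple]) -> None:
    """Write one statement per line, tab-separated, GZIP-compressed."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "statements.tsv.gz")
    write_statements(path, triples)
    # Read the file back to check that every row has exactly three columns.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        rows_read = list(csv.reader(f, delimiter="\t"))

print(rows_read)
```

A file written this way has no header row, which matches the default of skip_header.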

Create from Pandas dataframe

A knowledge graph can be represented as a Pandas dataframe, and this method creates an SQLite KG from such a dataframe. While the dataframe can have more than three columns, the three columns representing the subject, predicate and object must be specified in this particular order.

The following code snippet creates a new SQLite KG instance from a dataframe.

from pyrdf2vec.graphs.io import open_from_dataframe

with open_from_dataframe(df, column_names=('subj', 'pred', 'obj')) as kg:
    # ...
    pass

Parameters:

column_names - a tuple of three column indices for the dataframe, each of which can be an integer or a string. The first entry of the tuple shall point to the subject, the second to the predicate, and the third to the object. (0, 1, 2) are the default indices.

Limitations

This extension has the following limitations in contrast to the original implementation.

Literals are ignored by this implementation for now.
Samplers (besides the default one) might not work properly.

Installation

pySQLiteKG2Vec can be installed in three ways:

from PyPI using pip:

pip install pySQLiteKG2Vec

from any compatible Python dependency manager (e.g., poetry):

poetry add pySQLiteKG2Vec

from source:

git clone https://github.com/IBCNServices/pyRDF2Vec.git
cd pyRDF2Vec
pip install .

Documentation

For more information on how to use pyRDF2Vec, visit our online documentation, which is automatically updated with the latest version of the main branch.

There, you can learn more about the modules and the functions available to you.

Contributions

Your help in the development of pyRDF2Vec is more than welcome.

The architecture of pyRDF2Vec makes it easy to create new extraction and sampling strategies, as well as new embedding techniques. To better understand how you can help, through pull requests and/or issues, please take a look at the CONTRIBUTING file.

FAQ

How to Ensure the Generation of Similar Embeddings?

pySQLiteKG2Vec's walking strategies, sampling strategies and Word2Vec all rely on randomness. To get reproducible embeddings, you first need to fix Python's hash seed to ensure determinism:

PYTHONHASHSEED=42 python foo.py

In addition, you must pass a random state to the walking strategy, which will implicitly use it for the sampling strategy:

from pyrdf2vec.walkers import RandomWalker

RandomWalker(2, None, random_state=42)

NOTE: the value of PYTHONHASHSEED (e.g., 42) is arbitrary; it only has to be the same for every run to ensure determinism.
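
To illustrate why the hash seed matters: by default, Python randomizes the hash of strings per process, and the iteration order of sets of strings depends on those hashes. The following minimal sketch starts two child interpreters with the same PYTHONHASHSEED and compares the hash they compute for the same string. The helper `string_hash` and the example IRI are assumptions for illustration, and the snippet says nothing about where exactly pyRDF2Vec depends on hashing.

```python
import os
import subprocess
import sys


def string_hash(seed: str) -> str:
    """Return hash() of a fixed string, computed in a fresh interpreter."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('http://example.org/Alice'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


first = string_hash("42")
second = string_hash("42")
# Same seed, separate processes: the two values are identical.
print(first == second)
```

Without a fixed seed, the two child processes would generally report different hashes.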

Finally, to ensure random determinism for Word2Vec, you must specify a single worker:

from pyrdf2vec.embedders import Word2Vec

Word2Vec(workers=1)

NOTE: using the n_jobs and mul_req parameters does not affect the random determinism.

Why Is the Extraction of Walks Faster if `max_walks=None`?

Currently, when max_walks=None, the BFS function (breadth-first search) is used. It is significantly faster than the DFS function (depth-first search) and extracts more walks.

We hope that this algorithmic complexity issue will be solved in the next release of pyRDF2Vec.
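
As a rough illustration of what an exhaustive breadth-first extraction looks like, the sketch below enumerates every walk of up to a given number of hops on a toy graph. It is a simplified example under stated assumptions: the graph and the `bfs_walks` function are made up here and do not reproduce pyRDF2Vec's internal BFS function.

```python
from collections import deque
from typing import Dict, List, Tuple

# Toy KG as adjacency lists: subject -> [(predicate, object), ...].
KG: Dict[str, List[Tuple[str, str]]] = {
    "Alice": [("knows", "Bob"), ("livesIn", "Vienna")],
    "Bob": [("worksFor", "ACME")],
    "Vienna": [("capitalOf", "Austria")],
}


def bfs_walks(
    kg: Dict[str, List[Tuple[str, str]]], start: str, max_depth: int
) -> List[List[str]]:
    """Enumerate all walks from `start` with at most `max_depth` hops."""
    queue = deque([[start]])
    walks = []
    while queue:
        walk = queue.popleft()
        hops = (len(walk) - 1) // 2  # each hop adds a predicate and an object
        edges = kg.get(walk[-1], [])
        if hops == max_depth or not edges:
            walks.append(walk)  # depth limit reached or dead end
            continue
        for predicate, obj in edges:
            queue.append(walk + [predicate, obj])
    return walks


all_walks = bfs_walks(KG, "Alice", max_depth=2)
for walk in all_walks:
    print(" ".join(walk))
```

Because every outgoing edge is followed, the result contains all walks up to the depth limit rather than a sample of them.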

How to Silence the tcmalloc Warning When Using FastText With Medium/Large KGs?

Set the TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD environment variable to a high value.

Referencing

If you use pyRDF2Vec in a scholarly article, we would appreciate a citation:

@article{pyrdf2vec,
  title        = {pyRDF2Vec: A Python Implementation and Extension of RDF2Vec},
  author       = {Vandewiele, Gilles and Steenwinckel, Bram and Agozzino, Terencio and Ongenae, Femke},
  year         = 2022,
  publisher    = {arXiv},
  doi          = {10.48550/ARXIV.2205.02283},
  url          = {https://arxiv.org/abs/2205.02283},
  copyright    = {Creative Commons Attribution 4.0 International},
  organization = {IDLab},
  keywords     = {Machine Learning (cs.LG), FOS: Computer and information sciences}
}

Project details

This version

1.0.0

Apr 2, 2023

Download files

Source Distribution

pysqlitekg2vec-1.0.0.tar.gz (45.2 kB view hashes)

Uploaded Apr 2, 2023 Source

Built Distribution

pysqlitekg2vec-1.0.0-py3-none-any.whl (65.3 kB view hashes)

Uploaded Apr 2, 2023 Python 3

Hashes for pysqlitekg2vec-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`05cf9d432e8a4b1877c792a79052ed66e78e8959ccc1a069f04de1dc65c2adf7`
MD5	`6bf96cd392fd85c9157953f55cc0cb22`
BLAKE2b-256	`f9c0aed4d5c7c39d7adc51f06317d27ec38cf7561f7f630b22f4935e9e404418`

Hashes for pysqlitekg2vec-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6d2475e1621feba433831b5bc4b5cd23746c4dad9f3b5c3029c3574f337708bc`
MD5	`392f454c993539707a808c10bcd73237`
BLAKE2b-256	`1a788b905000e18b4cdf603f42fc1914147ceb2cc664554199a0e33fff809d61`
