Python implementation and extension of RDF2Vec
What is RDF2Vec?
RDF2Vec is an unsupervised technique that builds on Word2Vec, which learns an embedding per word by predicting, in one of two ways:
the word from its context: Continuous Bag-of-Words (CBOW);
the context from a word: Skip-Gram (SG).
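As a rough illustration of the difference between the two objectives, the sketch below builds the training examples each one would see for a single sentence. The function name `training_pairs` and the `window` parameter are hypothetical and are not part of pyRDF2Vec or Word2Vec; real implementations add subsampling, negative sampling, and so on.

```python
def training_pairs(sentence, window=1):
    """Return (cbow, skip_gram) training examples for one sentence.

    Hypothetical helper for illustration only.
    """
    cbow, skip_gram = [], []
    for i, target in enumerate(sentence):
        # Context = tokens within `window` positions of the target.
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        # CBOW: predict the target word from its whole context.
        cbow.append((context, target))
        # Skip-Gram: predict each context word from the target word.
        skip_gram.extend((target, c) for c in context)
    return cbow, skip_gram


cbow, skip_gram = training_pairs(["ex:Alice", "ex:knows", "ex:Bob"], window=1)
```

With CBOW, each example pairs a list of context tokens with one target; with Skip-Gram, each example pairs one target with a single context token.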
To create these embeddings, RDF2Vec first extracts walks of a certain depth from a Knowledge Graph; these walks serve as the “sentences” that are fed to Word2Vec.
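The idea of turning a graph into sentences can be sketched in plain Python. The toy graph, the `ex:` identifiers, and the `extract_walks` function below are assumptions made for illustration; this is not the extraction code used by pyRDF2Vec.

```python
def extract_walks(graph, root, depth):
    """Enumerate all walks of up to `depth` hops starting at `root`.

    `graph` maps a subject to a list of (predicate, object) pairs.
    Hypothetical helper for illustration only.
    """
    walks = [[root]]
    for _ in range(depth):
        extended = []
        for walk in walks:
            edges = graph.get(walk[-1], [])
            if not edges:
                # Dead end: keep the walk as it is.
                extended.append(walk)
            for predicate, obj in edges:
                # Each hop appends the predicate and the object it leads to.
                extended.append(walk + [predicate, obj])
        walks = extended
    return walks


toy_graph = {
    "ex:Alice": [("ex:knows", "ex:Bob"), ("ex:worksAt", "ex:ACME")],
    "ex:Bob": [("ex:worksAt", "ex:ACME")],
}
walks = extract_walks(toy_graph, "ex:Alice", depth=2)
```

Each resulting walk is a list of tokens, which is the shape of input a Word2Vec implementation expects for one sentence.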
This repository contains an implementation of the algorithm in “RDF2Vec: RDF Graph Embeddings and Their Applications” by Petar Ristoski, Jessica Rosati, Tommaso Di Noia, Renato De Leone, Heiko Paulheim ([paper] [original code]).
Getting Started
Installation
pyRDF2Vec can be installed in two ways:
from PyPI using pip:
pip install pyRDF2vec
with any compatible Python dependency manager (e.g., poetry):
poetry add pyRDF2vec
Introduction
To create embeddings for a list of entities, two steps are required beforehand:
create a Knowledge Graph object;
define a walking strategy.
For a more elaborate example, see the example.py file:
PYTHONHASHSEED=42 python3 example.py
NOTE: PYTHONHASHSEED is set (e.g., to 42) to ensure determinism.
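One reason the seed matters is that Python randomizes string hashes per process unless PYTHONHASHSEED is fixed, and anything that depends on those hashes can then vary between runs. The sketch below checks that two fresh interpreters started with the same seed agree on a string hash. The helper name `string_hash_with_seed` is hypothetical, and the snippet does not use pyRDF2Vec itself.

```python
import os
import subprocess
import sys


def string_hash_with_seed(seed):
    """Start a fresh interpreter with PYTHONHASHSEED=seed and return hash('rdf2vec')."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('rdf2vec'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate processes with the same seed produce the same hash.
first = string_hash_with_seed(42)
second = string_hash_with_seed(42)
```

The variable must be set before the interpreter starts, which is why it is passed on the command line rather than set inside example.py.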
Create a Knowledge Graph Object
A Knowledge Graph object can be initialized in two ways:
from a file using RDFlib:
from pyrdf2vec.graphs import KG
# Define the label predicates: all triples with these predicates
# are excluded from the graph.
label_predicates = ["http://dl-learner.org/carcinogenesis#isMutagenic"]
kg = KG(location="samples/mutag/mutag.owl", label_predicates=label_predicates)
from a server using SPARQL:
from pyrdf2vec.graphs import KG
kg = KG(location="https://dbpedia.org/sparql", is_remote=True)
Define Walking Strategies With Their Sampling Strategy
All supported walking strategies can be found on the Wiki page.
Because the number of walks grows exponentially with the depth, exhaustively extracting all walks quickly becomes infeasible for larger Knowledge Graphs. To circumvent this issue, sampling strategies can be applied: they extract a fixed maximum number of walks per entity, sampled according to a chosen metric.
For example, to extract a maximum of 5 walks of depth 4 for each entity using the Random walking strategy and the Uniform sampling strategy (see the Wiki page for other sampling strategies), the following code snippet can be used:
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker
# Depth 4, at most 5 walks per entity, sampled uniformly
walkers = [RandomWalker(4, 5, UniformSampler())]
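The motivation for capping the number of walks can be made concrete without the library. The sketch below assumes an idealized graph in which every node has the same number of outgoing edges, and it samples from an already enumerated list. Both `count_walks` and `sample_walks` are hypothetical helpers, and pyRDF2Vec's samplers may work quite differently (for instance, by sampling while walking rather than afterwards).

```python
import random


def count_walks(branching_factor, depth):
    """Number of distinct walks of exactly `depth` hops in an idealized graph
    where every node has `branching_factor` outgoing edges."""
    return branching_factor ** depth


def sample_walks(walks, max_walks, seed=None):
    """Keep at most `max_walks` walks, chosen uniformly without replacement."""
    rng = random.Random(seed)
    if len(walks) <= max_walks:
        return list(walks)
    return rng.sample(walks, max_walks)


# With 10 edges per node, the walk count grows tenfold per extra hop.
counts = [count_walks(10, depth) for depth in range(1, 5)]

# Stand-in for 1000 enumerated walks; only 5 are kept.
all_walks = [("root", i) for i in range(1000)]
sampled = sample_walks(all_walks, max_walks=5, seed=42)
```

Capping the walks per entity keeps the training corpus size linear in the number of entities, at the cost of seeing only part of each entity's neighbourhood.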
Create Embeddings
Finally, embeddings for a list of entities are created as follows:
from pyrdf2vec import RDF2VecTransformer
# `walkers` is already a list, so it is passed as is rather than wrapped in another list
transformer = RDF2VecTransformer(walkers=walkers, sg=1)
# Entities should be a list of URIs that can be found in the Knowledge Graph
embeddings = transformer.fit_transform(kg, entities)
Documentation
For more information on how to use pyRDF2Vec, visit our online documentation, which is automatically updated with the latest version of the master branch.
There, you can learn more about the modules and the functions they provide.
Contributions
Your help in the development of pyRDF2Vec is more than welcome. To better understand how you can help, through pull requests or issues, please take a look at the CONTRIBUTING file.
Referencing
If you use pyRDF2Vec in a scholarly article, we would appreciate a citation:
@inproceedings{pyrdf2vec,
author = {Gilles Vandewiele and Bram Steenwinckel and Terencio Agozzino
and Michael Weyns and Pieter Bonte and Femke Ongenae
and Filip De Turck},
title = {{pyRDF2Vec: A python library for RDF2Vec}},
organization = {IDLab},
year = {2020},
url = {https://github.com/IBCNServices/pyRDF2Vec}
}