GPU-Accelerated RDF2Vec – A high-performance GPU implementation of RDF2Vec that uses CUDA and RAPIDS to generate embeddings for large, dense knowledge graphs at scale.

Project description

gpuRDF2Vec

A scalable, GPU-based implementation of RDF2Vec embeddings for large and dense knowledge graphs.

License: MIT

[!IMPORTANT] This package is under active development and currently in beta. The class and method design will most likely change, introducing breaking changes between releases.

Repository setup

The repository builds on two major libraries: PyTorch Lightning and the RAPIDS libraries cuDF and cuGraph. We provide exemplary installation details for CUDA 12.6:

  1. PyTorch installation (see the PyTorch installation page for the CUDA 12.6 build):
pip install torch torchvision torchaudio
  2. cuDF and cuGraph installation (see the detailed cuDF installation instructions for CUDA 12):
pip install \
    --extra-index-url=https://pypi.nvidia.com \
    "cudf-cu12==25.4.*" "dask-cudf-cu12==25.4.*" \
    "cugraph-cu12==25.4.*" "nx-cugraph-cu12==25.4.*"
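After installing, it can help to confirm that the expected packages are visible to the interpreter before running anything on the GPU. The following is a minimal sketch using only the standard library. The module names and the mapping to distribution names are assumptions based on the pip commands above, and the helper `report_environment` is hypothetical, not part of this package.

```python
from importlib import metadata, util

# Assumed mapping of import name -> pip distribution name (taken from the
# install commands above; adjust if your environment differs).
REQUIRED_MODULES = {
    "torch": "torch",
    "cudf": "cudf-cu12",
    "cugraph": "cugraph-cu12",
}


def report_environment(modules=REQUIRED_MODULES):
    """Return {module: version, "unknown", or None} without importing the modules."""
    report = {}
    for module_name, dist_name in modules.items():
        if util.find_spec(module_name) is None:
            # Module cannot be located on the import path.
            report[module_name] = None
            continue
        try:
            report[module_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            # Importable, but installed under a different distribution name.
            report[module_name] = "unknown"
    return report


if __name__ == "__main__":
    for name, version in report_environment().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

The check only inspects package metadata, so it does not verify that a CUDA device is actually usable.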

The requirement files and conda environment files can be found in the performance/env_files folder.

gpuRDF2Vec overview

RDF2Vec is a powerful technique to generate vector embeddings of entities in RDF graphs via random walks and Word2Vec. This repository provides a GPU-optimized reimplementation, enabling:

  • Speedups on dense graphs with millions of nodes
  • Scalability to industrial-scale knowledge bases
  • Reproducible experiments to test and qualify the overall implementation details
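To make the idea behind RDF2Vec concrete, the sketch below generates random walks over a tiny triple list in plain Python. It is a CPU-only illustration of the general technique, not this package's GPU implementation. The example triples and the helper names `build_adjacency` and `random_walks` are hypothetical; the parameter names `walk_number` and `walk_depth` mirror the ones used later in the quick start.

```python
import random
from collections import defaultdict


def build_adjacency(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
    return adjacency


def random_walks(triples, start, walk_number, walk_depth, seed=42):
    """Generate `walk_number` walks of at most `walk_depth` hops from `start`."""
    rng = random.Random(seed)
    adjacency = build_adjacency(triples)
    walks = []
    for _ in range(walk_number):
        walk, node = [start], start
        for _ in range(walk_depth):
            edges = adjacency.get(node)
            if not edges:  # dead end: stop this walk early
                break
            predicate, node = rng.choice(edges)
            walk.extend([predicate, node])
        walks.append(walk)
    return walks


# Made-up toy graph for illustration only.
triples = [
    ("Berlin", "capitalOf", "Germany"),
    ("Germany", "memberOf", "EU"),
    ("Berlin", "locatedIn", "Europe"),
]
walks = random_walks(triples, start="Berlin", walk_number=3, walk_depth=2)
```

Each walk is a sequence of alternating entities and predicates. Treating these sequences as sentences is what allows a Word2Vec model to learn an embedding per entity.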

Repository Structure

.
├── README.md
├── data
├── data_preparation
│   ├── converstion_to_ttl.py
│   └── merge_text_file.py
├── img
│   └── github_repo_header.png
├── jrdf2vec-1.3-SNAPSHOT.jar
├── performance
│   ├── env_files
│   │   ├── jrdf2vec_environment.yml
│   │   ├── jrdf2vec_requirements.txt
│   │   ├── pyrdf2vec_environment.yml
│   │   ├── pyrdf2vec_requirements.txt
│   │   ├── rdf2vecgpu_environment.yml
│   │   ├── rdf2vecgpu_requirements.txt
│   │   ├── sparkrdf2vec_environment.yml
│   │   └── sparkrdf2vec_requirements.txt
│   ├── evaluation_parameters.py
│   ├── gpu_rdf2vec_performance.py
│   ├── graph_creation.py
│   ├── graph_statistics.py
│   ├── jrdf2vec_based_performance.py
│   ├── pyrdf2vec_based_performance.py
│   ├── spark_rdf2vec_performance.py
│   └── wandb_analysis.py
├── src
│   ├── __init__.py
│   ├── corpus
│   │   ├── __init__.py
│   │   └── walk_corpus.py
│   ├── cpu_based_rdf2vec_approach.py
│   ├── embedders
│   │   ├── __init__.py
│   │   ├── word2vec.py
│   │   └── word2vec_loader.py
│   ├── gpu_rdf2vec.py
│   ├── helper
│   │   ├── __init__.py
│   │   └── functions.py
│   └── reader
│       ├── __init__.py
│       └── kg_reader.py
└── test
    ├── helper
    └── reader
        ├── functions_test.py
        └── kg_reader_test.py

Capability overview

  • GPU-backed walk generation using CUDA kernels
  • Batched Word2Vec training with PyTorch Lightning
  • Pluggable RDF loaders and Parquet, CSV, and TXT integration
  • Performance comparisons can be found in the performance folder

Quick start

from src.gpu_rdf2vec import GPU_RDF2Vec
# Instantiate the gpu RDF2Vec library settings
gpu_rdf2vec_model = GPU_RDF2Vec(
    walk_strategy="random",
    walk_depth=4,
    walk_number=100,
    embedding_model="skipgram",
    epochs=5,
    batch_size=None,
    vector_size=100,
    window_size=5,
    min_count=1,
    learning_rate=0.01,
    negative_samples=5,
    random_state=42,
    reproducible=False,
    multi_gpu=False,
    generate_artifact=False,
    cpu_count=20
)
# Path to the triple dataset
path = "data/wikidata5m/wikidata5m_kg.parquet"
# Load data and receive edge data
edge_data = gpu_rdf2vec_model.load_data(path)
# Fit the Word2Vec model and transform the dataset to an embedding
embeddings = gpu_rdf2vec_model.fit_transform(edge_df=edge_data, walk_vertices=None)
# Write the embeddings to a Parquet file. The returned object is a cuDF DataFrame
embeddings.to_parquet("data/wikidata5m/wikidata5m_embeddings.parquet", index=False)
  • Supported file formats: RDF files (via the pluggable loaders), Parquet, CSV, and TXT
  • gpuRDF2Vec Parameters:
    • walk_strategy: [random, bfs]
    • walk_depth: int
    • walk_number: int
    • embedding_model: [skipgram, cbow]
    • epochs: int
    • batch_size: [None | int]. If the batch size is None, it is estimated internally from the data loader and the number of CPUs provided
    • vector_size: int
    • window_size: int
    • min_count: int
    • learning_rate: float
    • negative_samples: int
    • random_state: int
    • reproducible: bool
    • multi_gpu: bool
    • generate_artifact: bool
    • cpu_count: int
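The window_size parameter is easiest to understand through the training pairs it produces. The sketch below assumes the standard symmetric Word2Vec window, where every token within window_size positions of the center token becomes a context token. Whether the package builds its center/context columns in exactly this way is an assumption, and `skipgram_pairs` is a hypothetical helper used only for illustration.

```python
def skipgram_pairs(walk, window_size):
    """Return (center, context) pairs for tokens within `window_size` positions."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, walk[j]))
    return pairs


# A made-up walk of depth 2: entity, predicate, entity, predicate, entity.
walk = ["Berlin", "capitalOf", "Germany", "memberOf", "EU"]
pairs = skipgram_pairs(walk, window_size=1)
```

A larger window produces more pairs per walk, so window_size, walk_depth, and walk_number together determine the size of the training corpus and therefore the memory needed per batch.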

Implementation Details

We achieve order-of-magnitude speedups over CPU-bound RDF2Vec on large and dense graphs by engineering both the walk extraction and the Word2Vec training pipelines:

  1. GPU-Native Walk Extraction

    • All random-walk and BFS operations leverage cuDF/cuGraph kernels to avoid CPU–GPU data transfers and minimize latency.
    • To generate k walks per node in one pass, we replicate node indices in a single cuDF DataFrame rather than looping—fully utilizing GPU parallelism and eliminating Python-loop overhead (∼15× speedup).
    • BFS walks currently use GPU-side recursive joins; future work will reconstruct walks entirely in CUDA to remove join overhead.
  2. cuDF→PyTorch Lightning Handoff

    • We replaced Lightning’s default CPU-based DataLoader with a cuDF-backed pipeline: context/center columns live on the GPU as DLPack tensors.
    • Initial deep-copy loads incur extra VRAM, but thereafter all sampling/preprocessing occurs on-device, eliminating PCIe stalls.
    • An “index-only” strategy (workers pull tensor indices instead of slices) uses CUDA’s pointer arithmetic for constant-time access, collapsing DataLoader overhead from ~85% of epoch time to near parity with model compute.
  3. Optimized Word2Vec Training

    • Batch-Size Heuristic: We estimate the per-sample GPU footprint from the cuDF loader, then set the initial batch = (total VRAM) / (4 × footprint). This “divide-by-four” rule quickly homes in on a viable batch size, reducing tuning runs.
    • Kernel Fusion: All sampling and tensor transforms were migrated into PyTorch’s C++ back end, removing Python loops and the GIL for consistently high throughput.
  4. Scalable Data-Parallel Training

    • We use PyTorch Distributed + NCCL: each GPU holds the same graph shard but a unique walk corpus.
    • Gradients are synchronized via all_reduce at regular intervals (~500 ms), amortizing PCIe/NVLink costs and ensuring linear scaling across nodes.
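Two of the ideas above reduce to simple arithmetic and can be sketched in plain Python: replicating start nodes so that all k walks per node are requested in one batch, and the "divide-by-four" batch-size rule. This is an illustration under stated assumptions, not the package's code. The package performs the replication on a cuDF DataFrame, whereas the sketch uses a list, and the function names as well as the VRAM and footprint figures are made up.

```python
def replicate_start_nodes(node_ids, walk_number):
    """Repeat every node id `walk_number` times so all walks start in one batch."""
    return [node for node in node_ids for _ in range(walk_number)]


def initial_batch_size(total_vram_bytes, per_sample_bytes):
    """Divide-by-four rule: batch = total VRAM / (4 * per-sample footprint)."""
    if per_sample_bytes <= 0:
        raise ValueError("per_sample_bytes must be positive")
    return max(1, total_vram_bytes // (4 * per_sample_bytes))


# Three nodes with two walks each become six start positions.
starts = replicate_start_nodes([0, 1, 2], walk_number=2)

# Hypothetical figures: a 24 GiB device and 64 bytes per training sample.
batch = initial_batch_size(total_vram_bytes=24 * 1024**3, per_sample_bytes=64)
```

The replicated start list can be handed to a batched walk routine in a single call, which is what removes the per-walk Python loop. The batch-size result is only a starting point and may still need to be reduced if training runs out of memory.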

License

This project is released under the MIT license.

Roadmap

Report issues and bugs

In case you have found a bug or unexpected behaviour, please reach out by opening an issue:

  1. When opening an issue, please tag the issue with the label Bug. Please include the following information:

    • Environment: OS, Python/CUDA/PyTorch/RAPIDS versions (cuDF, cuGraph)
    • Reproduction steps: Exact commands or small code snippet
    • Input data graph format & size (attach a minimal sample if possible)
    • Observed vs. expected behavior
    • Error messages / stack traces (copy-paste or attach logs)
  2. We aim to respond to open issues within 3 business days

  3. If you have identified a fix, fork the repo, branch off main, implement and test the fix, then open a PR referencing the issue.


Source Distribution

rdf2vecgpu-0.1.0.tar.gz (23.7 MB view details)

Uploaded Source

Built Distribution

rdf2vecgpu-0.1.0-py3-none-any.whl (24.3 kB view details)

Uploaded Python 3

File details

Details for the file rdf2vecgpu-0.1.0.tar.gz.

File metadata

  • Download URL: rdf2vecgpu-0.1.0.tar.gz
  • Upload date:
  • Size: 23.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.3

File hashes

Hashes for rdf2vecgpu-0.1.0.tar.gz
Algorithm Hash digest
SHA256 86b7ba3c74c705d6b189e6e82961a1d2e50bf224589dce27677bb3d11b0f3c89
MD5 f34a2a84c50a9e7ecdef2b628662d1a2
BLAKE2b-256 6cf4cf2f4bb6ede1716c2c0a7f2a0338c3545af4b836ab710274de8c13291d2e

File details

Details for the file rdf2vecgpu-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: rdf2vecgpu-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 24.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.3

File hashes

Hashes for rdf2vecgpu-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1dec3d5fba6bd8f580effeae51fe1be5cc3d265933a0c08900ba714e9c8f7083
MD5 e79b5c613b7b14f2821d9d20ba507915
BLAKE2b-256 0210d2f570c89d17f7cde81ee9b08bdc5b5a325fad737367e0162734d0171d44
