
RDF Uploader

A tool for uploading RDF data to different types of triple stores with consistent behavior across different endpoint types.


Recently, I've been working extensively with knowledge graphs, which often means working with several types of triple stores at once, such as MarkLogic, Blazegraph, RDFox, AWS Neptune, and Stardog. Each store comes with a different bulk-loading process, its own endpoint URLs, authentication rules, and named-graph conventions. Dealing with those differences became very tedious very quickly.

To get rid of the tedium, I built RDF-Uploader — a tool that streamlines the workflow and offers a consistent, high-performance method for uploading large RDF datasets.


Rationale

To understand why uploading to RDF stores is annoying, let's look at the usual ways of doing it.

  • Use RDFLib’s store.update method.

    This approach relies on the standard SPARQL Update protocol, so it will usually work with any triple store. However, it is also the slowest option. The simplest usage pattern is easy to write, but calling store.update in a loop, one triple at a time, is painfully inefficient (a minimal sketch of this loop appears after this list). Stores like AWS Neptune take roughly the same time to ingest a single triple as they do a batch of a thousand, and the difference is even greater with high-performance engines such as RDFox.

  • Use the proprietary bulk loader recommended by the store vendor.

    Most triple stores implement their own proprietary bulk-loading tools, and they’re usually far faster than looping over RDFLib’s store.update. The catch is that every vendor does it differently. AWS Neptune, for example, ingests data from an S3 bucket, while Blazegraph expects a file that already lives on the server's local disk.

    When your project has to target several stores at once, juggling these loader-specific workflows quickly becomes painful. Each path demands extra code to stage the files—either uploading to S3 or copying them onto the server—and each path requires additional permissions for developers and CI pipelines. In many organizations, granting that level of access simply isn’t feasible.

  • Use curl to post data to a bulk endpoint

    Almost all triple stores provide an HTTP endpoint for bulk loading data, either via the standard Graph Store Protocol or through proprietary means such as Stardog's CLI. This method is performant and doesn't require setting up any special access. However, it has a few challenges of its own. First, the endpoint implementation differs from store to store: some support the standard protocols, others roll their own. Second, using curl means loading all the data in a single transaction (a rough Python equivalent appears after this list). That is fine when the dataset is small, but every store has a limit on how much data it can receive at a time. For some stores this limit is fairly large, but it is always finite and always much smaller than the number of triples the store can ultimately hold. And if an error occurs during the upload of a very large block of data, the entire transaction has to be repeated.

    The latter limitation can be mitigated by splitting the dataset into smaller parts and posting them to the triple store separately. But this has to be done either manually, through a tedious and error-prone process, or by developing a complex program to automate the splitting.
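
For concreteness, here is a minimal sketch of the first approach, using RDFLib to issue one SPARQL Update request per triple. The endpoint URLs are placeholders; adjust them for your store.

from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

store = SPARQLUpdateStore(
    query_endpoint="http://localhost:3030/dataset/sparql",   # placeholder URL
    update_endpoint="http://localhost:3030/dataset/update",  # placeholder URL
)

g = Graph()
g.parse("file.ttl")

for s, p, o in g:
    # One HTTP round-trip per triple: portable, but painfully slow.
    store.update(f"INSERT DATA {{ {s.n3()} {p.n3()} {o.n3()} }}")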
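
And here is a rough Python equivalent of the single-transaction curl approach, posting a whole file to a Graph Store Protocol endpoint in one request. The URL is a placeholder; the exact path varies by store.

import httpx

with open("file.ttl", "rb") as f:
    data = f.read()

# The whole file goes up in one request, and therefore one transaction;
# if anything fails midway, the entire upload must be retried.
response = httpx.post(
    "http://localhost:3030/dataset/data?default",  # placeholder GSP endpoint
    content=data,
    headers={"Content-Type": "text/turtle"},
)
response.raise_for_status()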

To make the data loading experience less annoying, I created this tool. It combines the advantages of the above methods while eliminating their downsides.

Features

  • Ingest RDF data into SPARQL endpoints using asynchronous operations
  • Support for multiple RDF stores (MarkLogic, Blazegraph, Neptune, RDFox, and Stardog)
  • Authentication support for secure endpoints
  • Content type detection and customization
  • Concurrent uploads with configurable limits
  • Batching of RDF statements for efficient processing
  • Verbose output for detailed logging
  • Support for named graphs

Installation & Quick Start

Choose your preferred method:

pip

pip install rdf-uploader
rdf-uploader file.ttl --endpoint http://localhost:3030/dataset/sparql

pipx (without permanent installation)

pipx run rdf-uploader upload file.ttl --endpoint http://localhost:3030/dataset/sparql

Homebrew

The Homebrew formula for rdf-uploader lives in the private tap vladistan/homebrew-gizmos. This separate tap is required because the package is still new and hasn't yet met the popularity and stability thresholds for inclusion in homebrew-core. Use the following commands to install it from the private tap.

brew tap vladistan/homebrew-gizmos
brew install rdf-uploader

# Quick test
rdf-uploader file.ttl --endpoint http://localhost:3030/dataset/sparql

Docker

docker run -v $(pwd):/data vladistan/rdf-uploader:latest /data/file.ttl --endpoint http://localhost:3030/dataset/sparql

With Environment Variables

export RDF_ENDPOINT=http://localhost:3030/dataset/sparql
rdf-uploader file.ttl

With .envrc File

Create .envrc with your configuration, then run:

# .envrc file content
export RDF_ENDPOINT="http://localhost:3030/dataset/sparql"

# Command to run
rdf-uploader file.ttl

Usage Guide

Basic Operations

Upload a single file:

To upload a single file, just specify its name, the endpoint URL, and the endpoint type:

rdf-uploader upload file.ttl --endpoint http://localhost:3030/dataset/sparql --type blazegraph

The following endpoint types are supported:

  • marklogic
  • neptune
  • blazegraph
  • rdfox
  • stardog
  • generic (default)

Upload multiple files:

You can upload multiple files at once:

rdf-uploader upload file1.ttl file2.n3 --endpoint http://localhost:3030/dataset/sparql --type blazegraph

Use a named graph:

If you need to upload to a specific named graph, use the --graph option.

rdf-uploader upload poke-a.nq --endpoint https://crystalia.us-east-1.neptune.amazonaws.com:8182/sparql --type neptune --graph urn:default

Authentication

With credentials:

If the store requires authentication, you can pass credentials on the command line:

rdf-uploader upload file.ttl --endpoint http://localhost:3030/dataset/sparql --username myuser --password mypass

However, it is better to configure credentials using a configuration file or environment variables (see below).

Content Types & Format

Normally, the tool tries to determine the content type from the file's extension. Below is the list of recognized extensions and content types.

Supported formats

  • .ttl, .turtle: text/turtle
  • .nt: application/n-triples
  • .n3: text/n3
  • .nq, .nquads: application/n-quads
  • .rdf, .xml: application/rdf+xml
  • .jsonld: application/ld+json
  • .json: application/rdf+json
  • .trig: application/trig

Explicitly specify content type:

You can also specify the content type explicitly:

rdf-uploader upload file.ttl --content-type "text/turtle"

Performance Options

Control concurrency:

The --concurrent option allows you to specify the number of concurrent upload operations. For example, using --concurrent 10 will enable the uploader to process up to 10 files simultaneously, which can significantly speed up the upload process when dealing with multiple files.

rdf-uploader upload *.ttl --concurrent 10

Enable verbose output:

The --verbose option provides detailed output during the upload process. This can be useful for debugging or monitoring the progress of the upload, as it will display additional information about each step the uploader takes.

rdf-uploader upload file.ttl --verbose

Set batch size:

The --batch-size option lets you define the number of RDF statements to be included in each batch during the upload. For instance, --batch-size 5000 will group the RDF data into batches of 5000 statements, which can help manage memory usage and optimize performance for large datasets.

rdf-uploader upload file.ttl --batch-size 5000

Configuration

RDF Uploader offers three ways to configure parameters, with the following priority:

  1. Command-line arguments (highest priority)
  2. Environment variables (checked if CLI args not provided)
  3. .envrc file (checked if environment variables not set)

Command Line Options Reference

| Category    | Option         | Short | Description                     | Default              |
|-------------|----------------|-------|---------------------------------|----------------------|
| Files       | FILES...       |       | One or more RDF files to upload | (required)           |
| Endpoint    | --endpoint     | -e    | SPARQL endpoint URL             | (required)           |
|             | --type         | -t    | Endpoint type                   | generic              |
|             | --graph        | -g    | Named graph to upload to        | Default graph        |
|             | --store-name   | -s    | RDFox datastore name            | (required for RDFox) |
| Auth        | --username     | -u    | Username                        |                      |
|             | --password     | -p    | Password                        |                      |
| Content     | --content-type |       | Content type for RDF data       | Auto-detected        |
| Performance | --concurrent   | -c    | Max concurrent uploads          | 5                    |
|             | --batch-size   | -b    | Triples per batch               | 1000                 |
| Output      | --verbose      | -v    | Enable detailed output          | False                |

Environment Variables

Frequently used options can be put in environment variables, which keeps your CLI commands much shorter and reduces the risk of exposing credentials.

RDF Uploader supports two categories of environment variables: generic and endpoint-specific. Endpoint-specific variables (prefixed with the endpoint type, like MARKLOGIC_ENDPOINT) are tailored to particular triple store implementations and are checked first. If these specific variables aren't found, the uploader falls back to generic variables (prefixed with RDF_, like RDF_ENDPOINT) which apply to all endpoint types. This hierarchical approach allows you to configure default credentials and parameters while maintaining the ability to override settings for specific endpoint types when needed.
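
As an illustration, here is the lookup order sketched in Python. The resolve helper is hypothetical, not the library's actual code.

import os

# Illustrative only: endpoint-specific variables such as MARKLOGIC_ENDPOINT
# are consulted first, then the generic RDF_* fallbacks.
def resolve(setting: str, endpoint_type: str) -> str | None:
    specific = os.environ.get(f"{endpoint_type.upper()}_{setting}")  # e.g. MARKLOGIC_ENDPOINT
    generic = os.environ.get(f"RDF_{setting}")                       # e.g. RDF_ENDPOINT
    return specific or generic

endpoint = resolve("ENDPOINT", "marklogic")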

Below is the list of recognized environment variables.

General Configuration

# Generic endpoint URL and auth
export RDF_ENDPOINT=http://localhost:3030/dataset/sparql
export RDF_USERNAME=myuser
export RDF_PASSWORD=mypass

Endpoint-specific Configuration

# MarkLogic
export MARKLOGIC_ENDPOINT=http://marklogic-server:8000/v1/graphs
export MARKLOGIC_USERNAME=mluser
export MARKLOGIC_PASSWORD=mlpass

# Neptune
export NEPTUNE_ENDPOINT=https://your-neptune-instance.amazonaws.com:8182/sparql
export NEPTUNE_USERNAME=neptuneuser
export NEPTUNE_PASSWORD=neptunepass

# Blazegraph
export BLAZEGRAPH_ENDPOINT=http://blazegraph-server:9999/blazegraph/sparql
export BLAZEGRAPH_USERNAME=bguser
export BLAZEGRAPH_PASSWORD=bgpass

# RDFox
export RDFOX_ENDPOINT=http://rdfox-server:12110/datastores/default/content
export RDFOX_USERNAME=rdfoxuser
export RDFOX_PASSWORD=rdfoxpass
export RDFOX_STORE_NAME=mystore

# Stardog
export STARDOG_ENDPOINT=https://your-stardog-instance:5820/database
export STARDOG_USERNAME=sduser
export STARDOG_PASSWORD=sdpass

Using .envrc file

For a more convenient development workflow, you can use a .envrc file to store your environment variables. This is especially useful when working with multiple projects that require different configurations.

The .envrc file should be placed in your project's root directory. RDF Uploader automatically looks for this file and loads its variables when neither command-line options nor environment variables are set.

Example .envrc file:

export RDF_ENDPOINT=http://localhost:3030/dataset/sparql
export RDF_USERNAME=myuser
export RDF_PASSWORD=mypass

# MarkLogic configuration
export MARKLOGIC_ENDPOINT=http://marklogic-server:8000/v1/graphs
export MARKLOGIC_USERNAME=mluser
export MARKLOGIC_PASSWORD=mlpass

# Performance options
export RDF_BATCH_SIZE=10000
export RDF_WORKERS=4
export RDF_TIMEOUT=300

Programmatic Usage

Instead of using RDF-Uploader as a CLI tool, you can integrate it directly into your Python project. This lets you upload RDF files programmatically as your code generates them. Like the CLI tool, the library function accepts parameters either explicitly or from environment variables.

import asyncio
from pathlib import Path

from rdf_uploader.endpoints import EndpointType
from rdf_uploader.uploader import upload_rdf_file


async def main() -> None:
    # With explicit parameters
    await upload_rdf_file(
        file_path=Path("file.ttl"),
        endpoint="http://localhost:3030/dataset/sparql",
        endpoint_type=EndpointType.GENERIC,
        username="myuser",
        password="mypass",
    )

    # Using environment variables (e.g. RDF_ENDPOINT, RDF_USERNAME, RDF_PASSWORD)
    await upload_rdf_file(
        file_path=Path("file.ttl"),
        endpoint_type=EndpointType.GENERIC,
    )


asyncio.run(main())

License

This project is licensed under the MIT License - see the LICENSE file for details.


⭐️ If you find this repository helpful, please consider giving it a star!

Keywords: RDF, Knowledge Graphs, Graph Databases, AI, Triple Stores
