A Store back-end for rdflib to allow for reading and querying HDT documents

Online Documentation

Requirements

  • Python version 3.6.4 or higher

  • pip

  • gcc/clang with c++11 support

  • Python development headers

    You should have the Python.h header available on your system. For example, for Python 3.6, install the python3.6-dev package on Debian/Ubuntu systems.
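
The requirements above can be checked from Python itself before attempting a build. The following is a minimal sketch that uses only the standard library; the function names are illustrative and not part of rdflib-hdt, and it assumes that the include directory reported by sysconfig is where the development headers would be installed.

```python
import os
import sys
import sysconfig


def python_header_path() -> str:
    """Return the path where Python.h is expected for the running interpreter."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def check_build_requirements(min_version=(3, 6, 4)) -> dict:
    """Report whether the interpreter version and Python.h look suitable."""
    header = python_header_path()
    return {
        # Compare (major, minor, micro) against the documented minimum.
        "python_ok": sys.version_info[:3] >= min_version,
        "header_path": header,
        "header_found": os.path.isfile(header),
    }


if __name__ == "__main__":
    print(check_build_requirements())
```

If header_found is False, installing the development package for your Python version is the likely fix.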

Installation

Installation using pipenv or a virtualenv is strongly advised!

Manual installation

Requirement: pipenv

git clone https://github.com/Callidon/pyHDT
cd pyHDT/
./install.sh

Getting started

You can use the rdflib-hdt library in two modes: as an rdflib Graph or as a raw HDT document.

HDT Document usage

from rdflib.namespace import FOAF
from rdflib_hdt import HDTDocument

# Load an HDT file. Missing indexes are generated automatically.
# You can provide the index file by putting it in the same directory as the HDT file.
document = HDTDocument("test.hdt")

# Display some metadata about the HDT document itself
print(f"Number of RDF triples: {document.total_triples}")
print(f"Number of subjects: {document.nb_subjects}")
print(f"Number of predicates: {document.nb_predicates}")
print(f"Number of objects: {document.nb_objects}")
print(f"Number of shared subject-object: {document.nb_shared}")

# Fetch all triples that match { ?s foaf:name ?o }
# Use None to indicate variables
triples, cardinality = document.search_triples((None, FOAF.name, None))

print(f"Cardinality of (?s foaf:name ?o): {cardinality}")
for s, p, o in triples:
    print(s, p, o)

# The search also supports limit and offset
triples, cardinality = document.search_triples((None, FOAF.name, None), limit=10, offset=100)
# etc ...
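
The limit and offset parameters can be combined to walk through a large result set page by page. The helper below is a sketch, not part of rdflib-hdt: iter_pages is a hypothetical name, and it assumes only that search_triples(pattern, limit=..., offset=...) returns an (iterator, cardinality) pair as in the example above. It stops on an empty or short page rather than relying on the reported cardinality.

```python
from typing import Any, Iterator, List, Tuple

Triple = Tuple[Any, Any, Any]


def iter_pages(document: Any, pattern: Triple, page_size: int = 100) -> Iterator[List[Triple]]:
    """Yield successive pages of triples matching `pattern`.

    `document` is any object exposing
    search_triples(pattern, limit=..., offset=...) -> (iterator, cardinality).
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        triples, _cardinality = document.search_triples(
            pattern, limit=page_size, offset=offset
        )
        page = list(triples)
        if not page:
            return  # nothing left to read
        yield page
        if len(page) < page_size:
            return  # a short page means this was the last one
        offset += page_size
```

With a loaded document, this could be used as: for page in iter_pages(document, (None, FOAF.name, None), page_size=1000): ...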

An HDT document also provides support for evaluating joins over a set of triple patterns.

from rdflib_hdt import HDTDocument
from rdflib import Variable
from rdflib.namespace import FOAF

document = HDTDocument("test.hdt")

# Find the names of two entities that know each other
tp_a = (Variable("a"), FOAF.knows, Variable("b"))
tp_b = (Variable("a"), FOAF.name, Variable("name"))
tp_c = (Variable("b"), FOAF.name, Variable("friend"))
query = {tp_a, tp_b, tp_c}

iterator = document.search_join(query)
print(f"Estimated join cardinality: {len(iterator)}")

# Join results are produced as ResultRow, like in the RDFlib SPARQL API
for row in iterator:
    print(f"{row.name} knows {row.friend}")
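
To make the meaning of a join over triple patterns concrete, here is a naive nested-loop evaluation over an in-memory list of triples. It is an illustration only and makes no claim about how rdflib-hdt evaluates joins internally. The names match and naive_join, the plain-string terms, and the "?" prefix for variables are all assumptions made for this sketch.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[str, str, str]  # terms starting with "?" are variables
Binding = Dict[str, str]


def match(pattern: Pattern, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its existing value.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None  # a constant term does not match
    return result


def naive_join(patterns: List[Pattern], triples: Iterable[Triple]) -> Iterator[Binding]:
    """Evaluate a conjunction of triple patterns with nested loops."""
    data = list(triples)
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in data
            for extended in [match(pattern, triple, binding)]
            if extended is not None
        ]
    return iter(bindings)


data = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
]
query = [
    ("?a", "foaf:knows", "?b"),
    ("?a", "foaf:name", "?name"),
    ("?b", "foaf:name", "?friend"),
]

for row in naive_join(query, data):
    print(f"{row['?name']} knows {row['?friend']}")  # prints "Alice knows Bob"
```

Each pattern narrows the set of candidate bindings produced by the previous one, which is why shared variables such as ?a and ?b tie the three patterns together.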

Handling non-UTF-8 strings in Python

If the HDT document has been encoded with a non-UTF-8 encoding, the previous code won't work correctly and will raise a UnicodeDecodeError. The error occurs when C++ strings are converted to Python str.

To handle this, the HDT document API provides a bytes-returning counterpart for each method:

  • search_triples_bytes(...) returns an iterator of triples as (py::bytes, py::bytes, py::bytes)

  • search_join_bytes(...) returns an iterator of sets of solution mappings as py::set(py::bytes, py::bytes)

  • convert_tripleid_bytes(...) returns a triple as (py::bytes, py::bytes, py::bytes)

  • convert_id_bytes(...) returns a py::bytes

Parameters and documentation are the same as for the standard versions.

from rdflib_hdt import HDTDocument

document = HDTDocument("test.hdt")
it = document.search_triples_bytes("", "", "")

for s, p, o in it:
    print(s, p, o)  # prints b'...', b'...', b'...'
    # Now decode the terms, or handle any error
    try:
        s, p, o = s.decode('UTF-8'), p.decode('UTF-8'), o.decode('UTF-8')
    except UnicodeDecodeError as err:
        # Try other codecs, ignore the error, etc.
        pass
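
The "try other codecs" step can be factored into a small helper. This is a sketch using only the standard library; decode_term and decode_triple are hypothetical names, and the fallback order (UTF-8, then Latin-1) is an assumption you should adapt to the encodings you expect in your data. Note that Latin-1 accepts any byte sequence, so it always succeeds but may produce the wrong characters.

```python
from typing import Iterable, Tuple


def decode_term(raw: bytes, encodings: Iterable[str] = ("utf-8", "latin-1")) -> str:
    """Decode `raw` with the first codec that succeeds.

    If every codec fails, fall back to UTF-8 with replacement characters
    so that the caller always gets a str back.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def decode_triple(triple: Tuple[bytes, bytes, bytes]) -> Tuple[str, str, str]:
    """Decode the subject, predicate and object of a bytes triple."""
    s, p, o = triple
    return decode_term(s), decode_term(p), decode_term(o)
```

It can then be applied to each triple yielded by the bytes API, for example: s, p, o = decode_triple((s, p, o)).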
