De-duplicate RDF triples with a SPARQL query. Subjects returned by the SELECT query are replaced by the hash of their sorted '{predicate} {object}. ' pairs.
rdfhash: RDF Graph Hashing/Compression Tool
rdfhash is a utility for RDF graph compression that works by hashing RDF subjects based on a checksum of their triples, effectively minimizing the size of RDF graphs by consolidating subjects that have identical definitions.
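The idea can be illustrated with a minimal sketch using only the Python standard library. It is not rdfhash's actual implementation: the function names `definition_hash` and `hash_subjects_sketch` are hypothetical, terms are plain strings rather than parsed RDF terms, and references to renamed subjects in object position are ignored, so the digests will not match those produced by the tool.

```python
import hashlib
from collections import defaultdict


def definition_hash(pairs, method="sha256"):
    """Hash a subject's definition from its (predicate, object) pairs."""
    # Render each pair as '{predicate} {object}. ' and sort the results,
    # so the order in which triples were stated does not affect the digest.
    rendered = sorted(f"{p} {o}. " for p, o in pairs)
    return hashlib.new(method, "".join(rendered).encode("utf-8")).hexdigest()


def hash_subjects_sketch(triples, selected, method="sha256"):
    """Replace each selected subject with '{method}:{digest}' (illustrative only)."""
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))
    rename = {
        s: f"{method}:{definition_hash(by_subject[s], method)}" for s in selected
    }
    # A set drops triples that become identical after renaming.
    return {(rename.get(s, s), p, o) for s, p, o in triples}


# Two blank nodes with the same definition, stated in a different order.
triples = [
    ("_:b1", "rdf:type", "hash:Attribute"),
    ("_:b1", "hash:value", "5.38"),
    ("_:b2", "hash:value", "5.38"),
    ("_:b2", "rdf:type", "hash:Attribute"),
]
result = hash_subjects_sketch(triples, {"_:b1", "_:b2"})
print(len(triples), "->", len(result))  # 4 -> 2
```

Because both blank nodes render to the same sorted string, they receive the same identifier and their triples collapse into one definition.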
Installation
Using pip
You can install rdfhash using pip, a package manager for Python. Ensure python and pip are properly installed on your system, then run the following command:
pip install rdfhash
# Test the installation
rdfhash --help
Using UV (Recommended for Development)
If you're working with this repository for development or want faster dependency management, you can use UV, a fast Python package installer and resolver:
# Install UV if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install the package and its dependencies
uv pip install -e .
# For development with additional dev dependencies
uv pip install -e ".[dev]"
# Or use UV to manage the project directly
uv sync
# Test the installation
uv run rdfhash --help
# To run pytest
uv run pytest
Usage
Command Line Interface (CLI)
Basic Usage
By default, all blank nodes in a text/turtle file or string are replaced by their hashed definition:
rdfhash '
@prefix hash: <http://rdfhash.com/ontology/> .
[ ] a hash:Attribute ;
hash:unit hash:unit:Centimeters ;
hash:value 5.38 .'
Output:
@prefix hash: <http://rdfhash.com/ontology/> .
<sha256:960891b4b1856b4d2c24b977f75d497e4da9e6f147a292524ae51db5fd0e864e>
a hash:Attribute ;
hash:unit <http://rdfhash.com/ontology/unit:Centimeters> ;
hash:value 5.38 .
Advanced Usage
The rdfhash tool is highly customizable and can be tailored to fit the requirements of any organization:
rdfhash '
@prefix hash: <http://rdfhash.com/ontology/> .
@prefix md5: <http://rdfhash.com/instances/md5/> .
[ ] a hash:Contact ;
hash:phone "487-538-2824" ;
hash:email "johnsmith@example.com" ;
hash:name [
a hash:LegalName ;
hash:firstName "John" ;
hash:lastName "Smith" ;
] ;
hash:address [
a hash:Address ;
hash:street "4567 Mountain Peak Way" ;
hash:city "Denver" ;
hash:state "CO" ;
hash:zip "80202" ;
hash:country "USA" ;
] ;
.' \
--method md5 \
--template 'http://rdfhash.com/instances/{method}/{value}' \
--sparql '
prefix hash: <http://rdfhash.com/ontology/>
select ?s where {
?s a ?type .
VALUES ?type {
hash:Contact
hash:LegalName
hash:Address
}
}'
- --method specifies the hashing algorithm to use. The default is sha256.
- --template specifies the URI template to use for hashed subjects. The default is {method}:{value}.
- --sparql specifies the SPARQL query to use for selecting subjects to hash. The default is SELECT ?s WHERE { ?s ?p ?o . FILTER(isBlank(?s))} (selecting all Blank Node subjects).
- Run rdfhash --help for more information on available parameters.
Output:
@prefix hash: <http://rdfhash.com/ontology/> .
@prefix md5: <http://rdfhash.com/instances/md5/> .
md5:8fc18e400ff531e5cbe02fef751662ba
a hash:Contact ;
hash:phone "487-538-2824" ;
hash:email "johnsmith@example.com" ;
hash:name md5:5fd42f2c072c80e3db760c3fc69b91b8 ;
hash:address md5:9a3e3ce644e2c5271015d9665675a8e5 .
md5:5fd42f2c072c80e3db760c3fc69b91b8
a hash:LegalName ;
hash:firstName "John" ;
hash:lastName "Smith" .
md5:9a3e3ce644e2c5271015d9665675a8e5
a hash:Address ;
hash:street "4567 Mountain Peak Way" ;
hash:city "Denver" ;
hash:state "CO" ;
hash:zip "80202" ;
hash:country "USA" .
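How the --template value maps a digest to a subject URI can be sketched as follows. This assumes the {method} and {value} placeholders behave like Python `str.format` fields, which the documentation above does not confirm; `expand_template` is a hypothetical helper, not part of rdfhash.

```python
def expand_template(template, method, value):
    """Fill the {method} and {value} placeholders of a URI template."""
    # Assumption: placeholders are substituted like str.format fields.
    return template.format(method=method, value=value)


# The default template yields a compact 'method:digest' identifier.
default_uri = expand_template("{method}:{value}", "sha256", "abc123")
print(default_uri)  # sha256:abc123

# The template from the advanced example, with the Contact digest shown above.
uri = expand_template(
    "http://rdfhash.com/instances/{method}/{value}",
    "md5",
    "8fc18e400ff531e5cbe02fef751662ba",
)
print(uri)  # http://rdfhash.com/instances/md5/8fc18e400ff531e5cbe02fef751662ba
```

With the md5: prefix declared as <http://rdfhash.com/instances/md5/>, the second URI shortens to the md5:8fc18e400ff531e5cbe02fef751662ba form seen in the output.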
Import as a Python Module
from rdfhash import hash_subjects
data = '''
@prefix hash: <http://rdfhash.com/ontology/> .
@prefix sha1: <http://rdfhash.com/instances/sha1/> .
<http://rdfhash.com/instances/Meaning-of-Life>
a hash:Attribute ;
hash:value 42 .
'''
graph, subjects_replaced = hash_subjects(
data,
method='sha1',
template='http://rdfhash.com/instances/{method}/{value}',
sparql_select_subjects='''
prefix hash: <http://rdfhash.com/ontology/>
SELECT ?s WHERE { ?s a hash:Attribute. }
'''
)
print(graph.serialize(format='turtle'))
Output:
@prefix hash: <http://rdfhash.com/ontology/> .
@prefix sha1: <http://rdfhash.com/instances/sha1/> .
sha1:4afe716d630b17d5a5d06f0901800e16f3e8c9a4
a hash:Attribute ;
hash:value 42 .
Limitations
It's important to note where rdfhash is limited in its functionality. These limitations are expected to be addressed in future versions.
- The rdfhash tool does not yet fully support Named Graphs (e.g. text/trig or application/n-quads).
  - Users can still attempt to pass RDF data containing Named Graphs, although the expected output has not yet been tested.
- Circular dependencies between selected subjects are currently not allowed (e.g. inverse properties). A Directed Acyclic Graph (DAG) is required at the moment.
  - Best practice to follow is prioritizing broader-to-narrower relationships (e.g. a person Contact points to LegalName and Address and not inversely; multiple contacts can point to the same LegalName or Address).
  - Future rdfhash versions will support ignoring specific properties used in a subject's hash, allowing the use of inverse properties.
- Currently, selected subjects are expected to be fully defined in the input graph.
  - Future rdfhash versions will support connections to a SPARQL endpoint to fetch full context for hashing.
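The DAG requirement can be checked before hashing with a short sketch like the one below. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+) and is an illustration, not the check rdfhash itself performs; `hashing_order` and the subject names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def hashing_order(references):
    """Return selected subjects ordered so referenced subjects come first.

    `references` maps each selected subject to the set of selected subjects
    it points to. Raises graphlib.CycleError if the references are circular.
    """
    # static_order() yields each node only after all nodes it depends on.
    return list(TopologicalSorter(references).static_order())


# Broader-to-narrower: the contact points to its name and address.
acyclic = {"contact": {"name", "address"}, "name": set(), "address": set()}
order = hashing_order(acyclic)
print(order[-1])  # contact

# An inverse property makes the two subjects depend on each other.
cyclic = {"contact": {"name"}, "name": {"contact"}}
try:
    hashing_order(cyclic)
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)  # True
```

Ordering narrower subjects first matters because a subject's hash depends on the identifiers of the subjects it references, which must already be final.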