Developer-friendly Queries over RDF triples

Bikidata

Queries over Wikidata dumps. Some background info in these slides.

Wikidata is an invaluable resource in our research activities, but in many cases we need to run complex queries or extract large swathes of data. This is an example of how we can convert offline dumps of Wikidata into a format that we can query in a structured manner, or run extractions on.

Summary

We download a "truthy" dump in NT (N-Triples) format and split the uncompressed file into smaller chunks. We then index the RDF structure by hashing each term with a fast hash algorithm, writing parquet files of three unsigned 64-bit columns (subject, predicate, object). The IRIs and Literals are indexed into another parquet file of two columns (hash, literal/IRI). The resulting files are queried using DuckDB.

Download and Split the files

The downloaded RDF dump files are very large, so we first split them into smaller chunks. This can be done with a command in a Unix shell such as:
bzcat latest-truthy.nt.bz2 | split -u --lines=500000000

In a dump from June 2023, this results in 16 file chunks, each containing 500 million lines of text (except for the last chunk, which is smaller). The split command names the text file chunks xaa, xab ... xap.
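Where a shell is not available, the same split can be sketched in Python. This is a minimal illustration and not part of the project: the function names split_dump and chunk_names are hypothetical, and the naming only imitates the xaa, xab ... pattern for the first 676 chunks.

```python
import bz2
import itertools
import string
from pathlib import Path


def chunk_names():
    """Yield xaa, xab, ... in the same pattern as the chunk names above."""
    for a, b in itertools.product(string.ascii_lowercase, repeat=2):
        yield f"x{a}{b}"


def split_dump(dump_path, out_dir, lines_per_chunk=500_000_000):
    """Stream a bz2-compressed dump into files of lines_per_chunk lines each."""
    out_dir = Path(out_dir)
    names = chunk_names()
    written = []
    out = None
    # bz2.open decompresses on the fly, so the full dump never sits in memory.
    with bz2.open(dump_path, "rb") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every lines_per_chunk lines.
                if out is not None:
                    out.close()
                path = out_dir / next(names)
                out = open(path, "wb")
                written.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return written
```

Calling split_dump("latest-truthy.nt.bz2", ".") would write the chunks to the current directory; a smaller lines_per_chunk is useful for trying it out on a sample.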

Index the RDF structure

For each chunked text file we run the command python index.py xaa, python index.py xab, and so on, passing the chunk name as the argument. The index.py command calculates a hash of each part of the triple (subject, predicate, object), each of which is a large integer. A Bloom filter tracks whether a given triple has already been indexed, so a triple is added to the output parquet file only the first time it is seen.

For each chunked file, a separate xa?.parquet is produced.
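The indexing step can be illustrated with a small, self-contained sketch. It is not the project's index.py: the hash function (BLAKE2b truncated to 64 bits), the BloomFilter class, the naive N-Triples parsing, and the names hash64, parse_nt_line and index_lines are all assumptions made for illustration, and writing the rows to parquet is left out.

```python
import hashlib


def hash64(term: str) -> int:
    """Map a term to an unsigned 64-bit integer (BLAKE2b is an assumption)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting the same hash function.
        for i in range(self.num_hashes):
            d = hashlib.blake2b(key, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(d, "big") % self.size

    def add(self, key: bytes) -> bool:
        """Insert key; return True if it was (possibly) present already."""
        present = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present


def parse_nt_line(line: str):
    """Naively split one N-Triples line into (subject, predicate, object)."""
    s, p, rest = line.strip().split(" ", 2)
    o = rest.rstrip()
    if o.endswith("."):
        o = o[:-1].rstrip()  # drop the statement terminator
    return s, p, o


def index_lines(lines, seen: BloomFilter):
    """Return (s, p, o) hash rows for triples not seen before."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        s, p, o = (hash64(t) for t in parse_nt_line(line))
        key = b"".join(x.to_bytes(8, "big") for x in (s, p, o))
        if not seen.add(key):
            rows.append((s, p, o))
    return rows
```

Because a Bloom filter can report false positives, a sketch like this may occasionally drop a triple it has not actually seen; the filter size and number of hashes control how rare that is.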

Map the IRI and Literals

For each chunked text file, we also run the command python map.py xaa, and so on. This also uses a Bloom filter to check whether a given IRI or Literal has been seen, and only adds new items. However, as a given Literal or IRI can appear in many triples and in several chunks, only a single file, index.parquet, is produced, and it is read into the Bloom filter at the start of each run. This means that the map step cannot be run in parallel and must be run sequentially, one chunk at a time.
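The reason the map step is sequential can be seen in a minimal sketch. This is not the project's map.py: the hash function is an assumption, a plain Python set stands in for the Bloom filter, and the names hash64 and map_chunk are hypothetical. The point is that the "seen" state has to be carried from one chunk to the next.

```python
import hashlib


def hash64(term: str) -> int:
    """Map a term to an unsigned 64-bit integer (BLAKE2b is an assumption)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def map_chunk(lines, seen_hashes):
    """Return (hash, term) rows for terms that are new to seen_hashes."""
    new_rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        s, p, rest = line.split(" ", 2)
        o = rest[:-1].rstrip() if rest.endswith(".") else rest
        for term in (s, p, o):
            h = hash64(term)
            if h not in seen_hashes:
                seen_hashes.add(h)  # a set stands in for the Bloom filter here
                new_rows.append((h, term))
    return new_rows


# Two toy chunks that share a subject and a predicate.
chunk_a = ['<http://example.org/a> <http://example.org/p> "x" .']
chunk_b = ['<http://example.org/a> <http://example.org/p> "y" .']

seen = set()   # shared state, so the chunks must be processed one after another
mapping = []
for chunk in (chunk_a, chunk_b):
    mapping.extend(map_chunk(chunk, seen))
```

After both chunks are processed, each term appears in mapping exactly once, even though the subject and predicate occur in both chunks.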

Querying the data

Once the index and map steps have been done, we can query the data using DuckDB. In the DuckDB CLI, we can for example run queries like:

select count(*) from 'xa?.parquet';
   7494368474
select * from 'index.parquet' where hash = 12746726515823639617;
   <http://www.wikidata.org/entity/Q53592>

Or the request from Q30078997, to find items similar to the book above:

WITH Q53592_po AS (SELECT p,o FROM 'xa?.parquet' WHERE s = 12746726515823639617)
SELECT p_cnt, (SELECT iri FROM 'index.parquet' WHERE hash = s)
  FROM (SELECT t.s, count(t.p) p_cnt FROM 'xa?.parquet' t
   INNER JOIN Q53592_po ON t.p = Q53592_po.p AND t.o = Q53592_po.o
   GROUP BY t.s
   ORDER BY count(t.p) DESC)
WHERE p_cnt > 10;

which is an interpretation of:

SELECT ?book (COUNT(DISTINCT ?o) as ?score)
WHERE {
  wd:Q53592 ?p ?o .
  ?book wdt:P31 wd:Q571 ;
        ?p ?o .
} GROUP BY ?book
ORDER BY DESC(?score)

(which times out when you run it on the WDQS)
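The logic of the DuckDB query above can be mirrored on a small in-memory list of hashed triples. This is only an illustrative sketch: the function name similar_subjects and the toy integers are hypothetical, and real data would be far too large to process this way.

```python
from collections import Counter


def similar_subjects(triples, target, threshold=10):
    """Rank subjects by how many (predicate, object) pairs they share with target."""
    # Equivalent of the CTE: all (p, o) pairs of the target subject.
    target_po = {(p, o) for s, p, o in triples if s == target}
    # Equivalent of the join plus GROUP BY: count matching rows per subject.
    counts = Counter()
    for s, p, o in triples:
        if (p, o) in target_po:
            counts[s] += 1
    # Equivalent of ORDER BY ... DESC and WHERE p_cnt > threshold.
    return [(s, n) for s, n in counts.most_common() if n > threshold]


# Toy data: small integers stand in for the 64-bit hashes.
triples = [
    (1, 10, 100), (1, 11, 101), (1, 12, 102),
    (2, 10, 100), (2, 11, 101),
    (3, 10, 100), (3, 13, 103),
]
result = similar_subjects(triples, target=1, threshold=1)
```

As in the SQL version, the target subject itself is included in the result, since it trivially shares all of its own pairs.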

TODO List

  • Add some explanation on: "Why not just use HDT?"
  • Sort the tri table to improve speed of queries. Solve OOM problem, sort has to be on-disk.
  • Make smaller extracts, like (P31 Q5)
  • A quick Property lookup, with labels
  • A "labels" service
  • A bikidata package on PyPI
  • Index literals using embeddings and a HNSW (SBERT + FAISS?)
  • Make a fast membership index for the large P31 sets using a Bloomfilter, and add it as a UDF to bikidata package
  • Add a SPARQL translation engine 🤓 Hah, ambitious.

Also see

qEndpoint for Wikidata

Triplestore Benchmarks for Wikidata

Python HDT
