Developer-friendly Queries over RDF triples

Bikidata

Queries over Wikidata dumps. Some background info in these slides.

Wikidata is an invaluable resource in our research activities. But in many cases we need to run complex queries, or extract large swathes of data. This is an example of how we can convert offline dumps of Wikidata into a format that we can query in a structured manner, or run extractions on.

Summary

We download a "truthy" dump in NT (N-Triples) format and split the uncompressed file into smaller chunks. We index the RDF structure by hashing each term with a fast hash algorithm, producing parquet files of three unsigned 64-bit columns (subject, predicate, object). We map the IRIs and Literals into another parquet file of two columns (hash, literal/IRI), and query the resulting files using DuckDB.

Download and Split the files

The downloaded RDF dump files are very large, so we first split them into smaller chunks. This can be done with a command in a Unix shell like:
bzcat latest-truthy.nt.bz2 | split -u --lines=500000000

In a dump from June 2023, this results in 16 file chunks, each containing 500 million lines of text (except for the last chunk, which is smaller). The text file chunks are named xaa, xab ... xap by the split command.
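For readers without GNU split at hand, the same chunking can be sketched in Python. This is a minimal sketch, not part of the project: the function names split_stream and chunk_names are hypothetical, and the naming scheme only covers the first 676 two-letter suffixes (xaa to xzz), which is more than enough for the 16 chunks mentioned above.

```python
import itertools
import string
from pathlib import Path
from typing import Iterable, Iterator, List


def chunk_names() -> Iterator[str]:
    """Yield split-style names with two-letter suffixes: xaa, xab, ..., xzz."""
    for a, b in itertools.product(string.ascii_lowercase, repeat=2):
        yield f"x{a}{b}"


def split_stream(lines: Iterable[str], lines_per_chunk: int, out_dir: str) -> List[str]:
    """Write lines into consecutive chunk files and return the file names.

    Lines are streamed one at a time, so memory use does not grow with
    the size of the input.
    """
    names = chunk_names()
    written: List[str] = []
    out = None
    try:
        for i, line in enumerate(lines):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every lines_per_chunk lines.
                if out is not None:
                    out.close()
                name = next(names)
                out = (Path(out_dir) / name).open("w", encoding="utf-8")
                written.append(name)
            out.write(line)
    finally:
        if out is not None:
            out.close()
    return written


# Usage on the real dump (assumes the file is in the working directory):
#   import bz2
#   with bz2.open("latest-truthy.nt.bz2", "rt", encoding="utf-8") as dump:
#       split_stream(dump, 500_000_000, ".")
```

Decompressing in Python is likely to be slower than bzcat piped into split, so the shell command remains the practical choice for a full dump.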

Index the RDF structure

For each chunked text file we run the command python index.py xaa, python index.py xab, and so on, passing one chunk name at a time. The index.py command calculates a hash of each part of the triple (subject, predicate and object), each of which is a large integer. A Bloom filter tracks whether a given triple has already been indexed, so a triple is only added to the output parquet file the first time it is seen.

For each chunked file, a separate xa?.parquet is produced.
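The indexing idea can be illustrated with a small, standard-library-only sketch. It is not the project's index.py: the hash function (BLAKE2b truncated to 64 bits), the naive N-Triples line splitting, the BloomFilter class and the index_lines helper are all assumptions made for illustration, and the rows are returned as a list rather than written to parquet.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple


def hash64(term: str) -> int:
    """Map a term to an unsigned 64-bit integer (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def parse_nt_line(line: str) -> Tuple[str, str, str]:
    """Naively split one N-Triples line into subject, predicate and object."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    s, p, o = body.split(None, 2)
    return s, p, o


class BloomFilter:
    """A small Bloom filter using double hashing over one BLAKE2b digest."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes) -> Iterator[int]:
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # odd step, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> bool:
        """Insert key and return True if it was possibly present before."""
        seen = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def index_lines(lines: Iterable[str], bloom: BloomFilter) -> List[Tuple[int, int, int]]:
    """Return (s, p, o) hash rows for triples the filter has not seen yet."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        triple = tuple(hash64(term) for term in parse_nt_line(line))
        key = b"".join(h.to_bytes(8, "big") for h in triple)
        if not bloom.add(key):
            rows.append(triple)
    return rows
```

A Bloom filter never reports a seen triple as new, but it can report a new triple as already seen, so a small fraction of triples may be skipped unless the filter is sized generously for the input.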

Map the IRI and Literals

For each chunked text file, we also run the command python map.py xaa, and so on. This also uses a Bloom filter to check whether a given IRI or Literal has been seen, and only adds new items. But because a given Literal or IRI can appear in many chunks, only a single file, index.parquet, is produced, and it is read into the Bloom filter at the start of each run. This means that the map step cannot be run in parallel; it must be run sequentially, one file at a time.
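Why the map step is sequential can be seen in a small sketch. This is not the project's map.py: it swaps the Bloom filter and index.parquet for a plain in-memory dict, and the names hash64, iter_terms and update_mapping, as well as the BLAKE2b hash, are assumptions for illustration.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def hash64(term: str) -> int:
    """Map a term to an unsigned 64-bit integer (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def iter_terms(lines: Iterable[str]) -> Iterator[str]:
    """Yield every subject, predicate and object term from N-Triples lines."""
    for line in lines:
        body = line.strip()
        if not body or body.startswith("#"):
            continue
        if body.endswith("."):
            body = body[:-1].rstrip()
        yield from body.split(None, 2)


def update_mapping(mapping: Dict[int, str], lines: Iterable[str]) -> int:
    """Add unseen terms to the shared hash -> term mapping; return how many were new."""
    added = 0
    for term in iter_terms(lines):
        h = hash64(term)
        if h not in mapping:
            mapping[h] = term
            added += 1
    return added


# Each chunk is processed against the mapping left behind by the previous one,
# which is the state dependency that rules out running chunks in parallel.
mapping: Dict[int, str] = {}
chunk_a = ["<http://example.org/s1> <http://example.org/p> <http://example.org/o1> .\n"]
chunk_b = ['<http://example.org/s1> <http://example.org/p> "a literal" .\n']
new_in_a = update_mapping(mapping, chunk_a)
new_in_b = update_mapping(mapping, chunk_b)
```

In the second chunk only the literal is new, because the subject and predicate were already recorded while processing the first chunk.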

Querying the data

Once the index and map steps are done, we can query the data using DuckDB. In the DuckDB CLI we can, for example, run queries like:

select count(*) from 'xa?.parquet';
   7494368474

select * from 'index.parquet' where hash = 12746726515823639617;
   <http://www.wikidata.org/entity/Q53592>

Or the request from Q30078997, to find items similar to the book above (Q53592):

WITH Q53592_po AS (SELECT p,o FROM 'xa?.parquet' WHERE s = 12746726515823639617)
SELECT p_cnt, (SELECT iri FROM 'index.parquet' WHERE hash = s)
  FROM (SELECT t.s, count(t.p) p_cnt FROM 'xa?.parquet' t
   INNER JOIN Q53592_po ON t.p = Q53592_po.p AND t.o = Q53592_po.o
   GROUP BY t.s
   ORDER BY count(t.p) DESC)
WHERE p_cnt > 10;

which is an interpretation of:

SELECT ?book (COUNT(DISTINCT ?o) as ?score)
WHERE {
  wd:Q53592 ?p ?o .
  ?book wdt:P31 wd:Q571 ;
        ?p ?o .
} GROUP BY ?book
ORDER BY DESC(?score)

(which times out when you run it on the WDQS)
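The join in the DuckDB query above boils down to counting, for every subject, how many (predicate, object) pairs it shares with the target item. A toy version over in-memory tuples may make that easier to see. The function name similar_subjects and the small integer identifiers are hypothetical, and like the SQL version this sketch does not apply the wdt:P31 wd:Q571 restriction from the SPARQL query.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Triple = Tuple[int, int, int]


def similar_subjects(triples: Iterable[Triple], target: int, threshold: int) -> List[Tuple[int, int]]:
    """Rank subjects by the number of (p, o) pairs they share with target.

    Only subjects sharing more than threshold pairs are returned, mirroring
    the "p_cnt > 10" filter in the SQL. The target itself is included, as it
    is in the SQL query.
    """
    triples = list(triples)
    # All (p, o) pairs attached to the target subject.
    target_po = {(p, o) for s, p, o in triples if s == target}
    # Count matching pairs per subject, equivalent to the join plus GROUP BY.
    counts = Counter(s for s, p, o in triples if (p, o) in target_po)
    return [(s, n) for s, n in counts.most_common() if n > threshold]


# Toy data: subject 1 is the target, 2 shares two pairs with it, 3 shares one.
toy = [
    (1, 10, 100), (1, 11, 101), (1, 12, 102),
    (2, 10, 100), (2, 11, 101),
    (3, 10, 100), (3, 13, 103),
]
ranked = similar_subjects(toy, target=1, threshold=1)
```

This assumes the triples are already deduplicated, as the index step arranges for the parquet files; duplicates would inflate the counts.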

TODO List

Add some explanation on: "Why not just use HDT?"
Sort the triple table to improve the speed of queries. To solve the OOM problem, the sort has to be on-disk.
Make smaller extracts, like (P31 Q5)
A quick Property lookup, with labels
A "labels" service
A bikidata package on PyPI
Index literals using embeddings and a HNSW (SBERT + FAISS?)
Make a fast membership index for the large P31 sets using a Bloom filter, and add it as a UDF to the bikidata package
Add a SPARQL translation engine 🤓 Hah, ambitious.

Also see

qEndpoint for Wikidata

Triplestore Benchmarks for Wikidata

Python HDT
