OWS-Curlie-2025 - The Open Web Search Curlie 2025 test collection
This repository contains the Open Web Search Curlie 2025 test collection (collection home page). The collection was created by taking six months' worth of crawl data (March 1st until August 31st, 2025) from the Open Web Search project and transforming it using the standard Open Web Search preprocessing and indexing pipelines. We have only included data from the curlie subcollection (webpages that have an associated label in the Curlie directory) and kept only English documents.
Contents
The collection contains three data splits: radboud, radboud-val and jena-kassel. For each split, students from the universities of Radboud, Jena and Kassel have provided topics and relevance assessments. The radboud-val split is a validation dataset that contains qrels; the other two splits are test collections and qrels are withheld. We have subsampled the corpus for each split to keep the size of the collection manageable.
Each split contains the following files:
- topics.tsv: the topics for the split
- subsample.parquet: the subsampled collection, created by running a number of (cheap) retrieval models for the topic set and only retaining those documents that were retrieved at least once. The Parquet files contain the standard Open Web Search preprocessing metadata.
- subsample.jsonl.gz: the same subsampled collection, but in a simplified JSON format.
- subsample_embeddings.parquet: Jina V3 embeddings derived in the Open Web Search GPU processing pipeline
- qrels.txt (for the validation split only): the qrels for the validation set
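As a minimal sketch of processing the simplified JSON version yourself, the snippet below streams records from a gzip-compressed JSON Lines file using only the standard library. It assumes one JSON object per line; the helper name `iter_jsonl_gz` is hypothetical, and no particular field names are assumed.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_gz(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Calling `next(iter_jsonl_gz(path))` on a split's subsample.jsonl.gz returns the first record, which is a convenient way to inspect the available keys before processing the whole file.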
For completeness, we also release the full (deduplicated) collection, alongside the subsampled versions. The full collection is distributed as 500 Parquet files in the documents directory, and does not contain embeddings.
The collection may be updated with more files, such as pre-computed CIFF indexes for sparse retrieval or anchor text data.
Usage
To use the dataset, you first have to download the data using owilix, our command line tool for access to the Open Web Index. Then, you can either process the data yourself, or use our custom ir-datasets-owi Python package to obtain an easy-to-use ir-datasets wrapper around the dataset.
Downloading the data with Owilix
First, download owilix. Our recommended installation approach is to use the one-line install script:
https://opencode.it4i.eu/openwebsearcheu-public/owi-cli
You then get access to an owilix command line application. Please refer to the owilix documentation for more information on how to use it. For now, issue the following command to list the dataset and its contents (be sure to accept the license agreement when it pops up):
owilix remote ls 'all/title=Open Web Search Curlie 2025' --file-details --files '*'
You can download the data to your local machine by using owilix remote pull:
owilix remote pull 'all/title=Open Web Search Curlie 2025'
Note that you can select only certain files (e.g. if you do not need the JSON version of the subsampled corpus, or are only interested in a certain split) by using the files option. This is especially useful if you only want to work with the subsamples and not the full ~70GB collection. For example:
owilix remote pull 'all/title=Open Web Search Curlie 2025' 'files=**/*.parquet'
The files will be downloaded to the following directory (with the default owilix configuration):
~/.owi/public/special/b510baa6-ebd2-11f0-8c43-02a47ca5d9fd
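To check what was actually downloaded, a short sketch like the following counts the files under the download directory by extension. The helper name `summarize_download` is hypothetical, and nothing is assumed about the directory layout below the root; the search is simply recursive.

```python
from collections import Counter
from pathlib import Path


def summarize_download(root: Path) -> Counter:
    """Count the files below `root` by their full suffix (e.g. '.jsonl.gz')."""
    counts: Counter = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Join all suffixes so 'subsample.jsonl.gz' is counted as '.jsonl.gz'.
            counts["".join(path.suffixes) or "(no extension)"] += 1
    return counts
```

Pointing `root` at the download directory shown above (with `~` expanded, for example via `Path.home()`) gives a quick overview of how many Parquet, JSON and text files were pulled.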
Loading the collection with ir-datasets
We have created a custom ir-datasets integration for the OWS-Curlie-2025 collection (source code). To install the integration, simply issue:
pip install ir-datasets-owi
Then, make sure that the collection can be found in your ir-datasets home directory:
ln -s ~/.owi/public/special/b510baa6-ebd2-11f0-8c43-02a47ca5d9fd ~/.ir_datasets/ows-curlie-2025
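If you prefer to create the link from Python (for instance in a setup script), the following sketch does the same as the `ln -s` command above. The function name `link_collection` is hypothetical; it creates the parent directory if needed and leaves an existing target untouched.

```python
import os
from pathlib import Path


def link_collection(source: Path, target: Path) -> Path:
    """Create `target` as a symlink to `source`, unless it already exists."""
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.is_symlink() and not target.exists():
        os.symlink(source, target, target_is_directory=True)
    return target
```

With the default paths from this section, `source` would be the owilix download directory and `target` would be `Path.home() / ".ir_datasets" / "ows-curlie-2025"`.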
You can now use the ir-datasets integration as follows:
import ir_datasets_owi
import ir_datasets
ir_datasets_owi.register()
dataset = ir_datasets.load("ows-curlie-2025/radboud-val")
for doc in dataset.docs_iter():
    print(doc)
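Iterating over a full split prints many documents, so for a quick look it can help to take only the first few. The sketch below relies only on the `docs_iter()` method used above; `peek_docs` is a hypothetical helper name.

```python
from itertools import islice
from typing import Any, List


def peek_docs(dataset: Any, n: int = 3) -> List[Any]:
    """Return the first `n` documents without iterating the whole collection."""
    return list(islice(dataset.docs_iter(), n))
```

For example, `peek_docs(dataset, 5)` returns a list with at most five documents from the loaded split.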
The documents have the following fields:
- doc_id: the document identifier (a SHA-256 hash of the normalized URL)
- url: the page URL
- main_content: the body of the web page with minimal HTML structure
- title: the title of the web page
- description: the meta description of the web page
- plain_text: the main_content field with all HTML removed
- default_text(): a concatenation of title and plain_text
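As an illustration of how these fields relate, the sketch below models a document as a plain named tuple. It is not the package's actual class: the type name `SketchDoc` and the single-space separator in `default_text()` are assumptions made for this example.

```python
from typing import NamedTuple


class SketchDoc(NamedTuple):
    """Hypothetical stand-in for a document record; not the package's own type."""

    doc_id: str
    url: str
    main_content: str
    title: str
    description: str
    plain_text: str

    def default_text(self) -> str:
        # Assumption: title and plain_text are joined by a single space.
        return f"{self.title} {self.plain_text}"
```

A retrieval model that needs a single text field per document would typically index the result of `default_text()`.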
We also provide a short-hand version that combines ir_datasets_owi.register and ir_datasets.load:
import ir_datasets_owi
dataset = ir_datasets_owi.load("ows-curlie-2025/radboud-val")
Both the register and the load methods support an optional batch_size parameter, which indicates how many documents are materialized per batch (higher is faster, but consumes more memory). The default of 2048 should be sufficient for most cases.
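The memory/speed trade-off can be pictured with a generic batching helper. This is only an illustration of what a batch size controls, not the package's implementation; `batched` is a hypothetical helper written for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int = 2048) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items; only one batch is held at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Larger batches mean fewer round trips to the underlying storage, while each batch occupies proportionally more memory.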
The following collections are accessible through the ir-datasets integration:
| Dataset | #docs | Metadata | Embeddings | Topics | Qrels |
|---|---|---|---|---|---|
| ows-curlie-2025/all | 20,746,181 | ✓ | ✗ | ✗ | ✗ |
| ows-curlie-2025/radboud-val | 63,639 | ✓ | ✓ | ✓ | ✓ |
| ows-curlie-2025/radboud | 90,852 | ✓ | ✓ | ✓ | ✗ |
| ows-curlie-2025/jena-kassel | 77,382 | ✓ | ✓ | ✓ | ✗ |
Submitting runs
We accept either Docker submissions or run file submissions to Tira/TIREx. More information on submissions will follow soon.
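The run file format is not specified here. As a sketch under the assumption that the common six-column TREC run format (`query_id Q0 doc_id rank score tag`) is accepted, a ranking could be serialized as follows; `format_trec_run` and the tag `my-run` are hypothetical names chosen for this example.

```python
from typing import Dict, List


def format_trec_run(scores: Dict[str, Dict[str, float]], tag: str = "my-run") -> List[str]:
    """Turn {query_id: {doc_id: score}} into TREC-style run lines, best score first."""
    lines = []
    for query_id, doc_scores in scores.items():
        # Rank documents per query by descending score.
        ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{query_id} Q0 {doc_id} {rank} {score:.4f} {tag}")
    return lines
```

Writing `"\n".join(format_trec_run(scores))` to a text file yields one line per retrieved document; check the submission instructions, once published, before relying on this layout.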