OWS-Curlie-2025 - The Open Web Search Curlie 2025 test collection

This repository contains the Open Web Search Curlie 2025 test collection (collection home page). The collection was created by taking six months' worth of crawl data (March 1 to August 31, 2025) from the Open Web Search project and transforming it with the standard Open Web Search preprocessing and indexing pipelines. We have only included data from the curlie subcollection (webpages that have an associated label in the Curlie directory), restricted to English documents.

The collection contains three data splits: radboud, radboud-val and jena-kassel. For each split, students from the universities of Radboud, Jena and Kassel have provided topics and relevance assessments. The radboud-val split is a validation dataset that contains qrels; the other two splits are test collections and qrels are withheld. We have subsampled the corpus for each split to keep the size of the collection manageable.

Each split contains the following files:

topics.tsv: the topics for the split
subsample.parquet: the subsampled collection, created by running a number of (cheap) retrieval models for the topic set and only retaining those documents that were retrieved at least once. The Parquet files contain the standard Open Web Search preprocessing metadata.
subsample.jsonl.gz: the same subsampled collection, but in a simplified JSON format.
subsample_embeddings.parquet: Jina V3 embeddings derived in the Open Web Search GPU processing pipeline
(For the validation split only) qrels.txt: the qrels for the validation set
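
The subsampling described for subsample.parquet amounts to pooling: a document is kept if any retrieval model returned it for any topic. The sketch below illustrates that idea only. The function names, the run layout (model name to topic ID to ranked document IDs) and the toy identifiers are assumptions made for illustration, not the pipeline that was actually used to build the collection.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical layout: runs[model_name][topic_id] is a ranked list of doc IDs.
Runs = Dict[str, Dict[str, List[str]]]


def pool_documents(runs: Runs) -> Set[str]:
    """Return the IDs of all documents retrieved at least once by any model."""
    pooled: Set[str] = set()
    for rankings_per_topic in runs.values():
        for ranking in rankings_per_topic.values():
            pooled.update(ranking)
    return pooled


def subsample(corpus: Iterable[dict], pooled: Set[str]) -> List[dict]:
    """Keep only the corpus documents whose doc_id is in the pool."""
    return [doc for doc in corpus if doc["doc_id"] in pooled]


# Toy data, for illustration only.
toy_runs: Runs = {
    "bm25": {"t1": ["d1", "d2"], "t2": ["d3"]},
    "tfidf": {"t1": ["d2", "d4"]},
}
toy_corpus = [{"doc_id": f"d{i}"} for i in range(1, 7)]

kept = subsample(toy_corpus, pool_documents(toy_runs))
print([doc["doc_id"] for doc in kept])
```

Documents d5 and d6 are dropped in the toy example because no model retrieved them for any topic.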

For completeness, we also release the full (deduplicated) collection, alongside the subsampled versions. The full collection is distributed as 500 Parquet files in the documents directory, and does not contain embeddings.

The collection may be updated with more files, such as pre-computed CIFF indexes for sparse retrieval or anchor text data.

Usage

To use the dataset, you first have to download the data using owilix, our command line tool for access to the Open Web Index. Then, you can either process the data yourself or use our custom ir-datasets-owi Python package, which provides an easy-to-use ir-datasets wrapper around the dataset.

Downloading the data with Owilix

First, install owilix. Our recommended installation approach is to use the one-line install script from the owilix repository:

https://opencode.it4i.eu/openwebsearcheu-public/owi-cli

This gives you the owilix command line application. Please refer to the owilix documentation for more information on how to use it. For now, issue the following command to list the dataset and its contents (be sure to accept the license agreement when prompted):

owilix remote ls 'all/title=Open Web Search Curlie 2025' --file-details --files '*'

You can download the data to your local machine by using owilix remote pull:

owilix remote pull 'all/title=Open Web Search Curlie 2025'

Note that you can select only certain files (e.g. if you do not need the JSON version of the subsampled corpus, or are only interested in a certain split) by using the files option. This is especially useful if you only want to work with the subsamples and not the full ~70 GB collection. For example:

owilix remote pull 'all/title=Open Web Search Curlie 2025' 'files=**/*.parquet'

The files will be downloaded to the following directory (with the default owilix configuration):

~/.owi/public/special/b510baa6-ebd2-11f0-8c43-02a47ca5d9fd
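
To check what a pull actually fetched, it can help to list the downloaded files grouped by directory. The following is a minimal sketch using only the Python standard library. The helper name files_by_directory is hypothetical, and nothing is assumed about the directory layout beyond the default path quoted above.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# Default download location quoted above; adjust it if your owilix
# configuration uses a different directory.
DEFAULT_ROOT = "~/.owi/public/special/b510baa6-ebd2-11f0-8c43-02a47ca5d9fd"


def files_by_directory(root: Path) -> Dict[str, List[str]]:
    """Map each directory below root (as a relative path) to its file names."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            grouped[str(path.parent.relative_to(root))].append(path.name)
    return dict(grouped)


if __name__ == "__main__":
    root = Path(DEFAULT_ROOT).expanduser()
    if root.is_dir():
        for directory, names in files_by_directory(root).items():
            print(directory, names)
    else:
        print(f"{root} does not exist yet; run owilix remote pull first")
```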

Loading the collection with ir-datasets

We have created a custom ir-datasets integration for the OWS-Curlie-2025 collection (source code). To install the integration, run:

pip install ir-datasets-owi

Then, make sure that the collection can be found in your ir-datasets home directory:

ln -s ~/.owi/public/special/b510baa6-ebd2-11f0-8c43-02a47ca5d9fd ~/.ir_datasets/ows-curlie-2025
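
The ln -s command fails if ~/.ir_datasets does not exist yet, which can happen on a machine where ir-datasets has never been used. The sketch below performs the same linking step from Python and creates the parent directory first. The helper name link_collection is hypothetical; the two paths in the usage note are the ones from the command above.

```python
from pathlib import Path


def link_collection(source: Path, target: Path) -> Path:
    """Create target as a symlink to source, like `ln -s source target`."""
    if not source.is_dir():
        raise FileNotFoundError(f"collection directory not found: {source}")
    # Make sure the ir-datasets home directory exists before linking.
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        # Leave anything that already exists at the target untouched.
        return target
    target.symlink_to(source, target_is_directory=True)
    return target
```

A call would pass Path("~/.owi/public/special/b510baa6-ebd2-11f0-8c43-02a47ca5d9fd").expanduser() as source and Path("~/.ir_datasets/ows-curlie-2025").expanduser() as target.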

You can now use the ir-datasets integration as follows:

import ir_datasets_owi
import ir_datasets

ir_datasets_owi.register()

dataset = ir_datasets.load("ows-curlie-2025/radboud-val")

for doc in dataset.docs_iter():
    print(doc)

The documents have the following fields:

doc_id: the document identifier (a SHA-256 hash of the normalized URL)
url: the page URL
main_content: the body of the web page with minimal HTML structure
title: the title of the web page
description: the meta description of the web page
plain_text: the main_content field with all HTML removed
default_text(): a concatenation of title and plain_text
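
To make the relationship between these fields concrete, here is a small stand-in record. It is a sketch under stated assumptions, not the class that ir-datasets-owi returns: SketchDoc and sketch_doc_id are hypothetical names, the separator used in default_text() is assumed to be a single space, and URL normalization is reduced to trimming whitespace because the actual normalization rules are not given here. IDs produced by this sketch should therefore not be expected to match those in the collection.

```python
import hashlib
from typing import NamedTuple


class SketchDoc(NamedTuple):
    """Hypothetical stand-in mirroring the fields listed above."""

    doc_id: str
    url: str
    main_content: str
    title: str
    description: str
    plain_text: str

    def default_text(self) -> str:
        # Assumption: title and plain_text are joined by a single space.
        return f"{self.title} {self.plain_text}".strip()


def sketch_doc_id(url: str) -> str:
    # Assumption: "normalization" only trims surrounding whitespace here;
    # the collection's real normalization rules are not specified.
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


doc = SketchDoc(
    doc_id=sketch_doc_id("https://example.org/"),
    url="https://example.org/",
    main_content="<p>Hello</p>",
    title="Example",
    description="",
    plain_text="Hello",
)
print(doc.default_text())
```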

We also provide a short-hand version that combines ir_datasets_owi.register and ir_datasets.load:

import ir_datasets_owi

dataset = ir_datasets_owi.load("ows-curlie-2025/radboud-val")

Both the register and the load methods support an optional batch_size parameter, which indicates how many documents are materialized per batch (higher is faster, but consumes more memory). The default of 2048 should be sufficient for most cases.
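
The trade-off behind batch_size can be illustrated with a generic batching helper: larger batches mean fewer round trips but more items held in memory at once. This is a sketch of the general technique, not the package's implementation, and iter_batches is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int = 2048) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        # Only one batch is materialized in memory at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With the integration itself, a smaller batch would presumably be requested by passing the parameter as a keyword argument, for example ir_datasets_owi.load("ows-curlie-2025/radboud-val", batch_size=512).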

The following collections are accessible through the ir-datasets integration:

Dataset	#docs	Metadata	Embeddings	Topics	Qrels
`ows-curlie-2025/all`	20,746,181	✓	✗	✗	✗
`ows-curlie-2025/radboud-val`	63,639	✓	✓	✓	✓
`ows-curlie-2025/radboud`	90,852	✓	✓	✓	✗
`ows-curlie-2025/jena-kassel`	77,382	✓	✓	✓	✗
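
Since only the validation split ships qrels, code that runs over several splits should check for them before iterating. The sketch below assumes the integration follows the standard ir-datasets interface (has_qrels(), and qrels_iter() yielding records with a query_id attribute); this is an assumption about ir-datasets-owi, and judged_per_topic is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict


def judged_per_topic(dataset) -> Dict[str, int]:
    """Count relevance judgements per topic; empty if the split has no qrels."""
    if not dataset.has_qrels():
        return {}
    counts: Counter = Counter()
    for qrel in dataset.qrels_iter():
        counts[qrel.query_id] += 1
    return dict(counts)
```

Under the assumption above, calling it on ows-curlie-2025/radboud-val would return one count per judged topic, while the two test splits would yield an empty dictionary.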

Submitting runs

We accept either Docker submissions or run file submissions to Tira/TIREx. More information on submissions will follow soon.
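
Until the submission details are published, the sketch below shows how a run file could be produced, assuming the common six-column TREC run format (query ID, the literal Q0, document ID, rank, score, run tag). Whether this is the exact format expected for submissions is an assumption; trec_run_lines and the tag value are hypothetical names.

```python
from typing import Dict, Iterator, List, Tuple


def trec_run_lines(
    rankings: Dict[str, List[Tuple[str, float]]], tag: str = "my-run"
) -> Iterator[str]:
    """Yield 'query_id Q0 doc_id rank score tag' lines, best score first."""
    for query_id, scored_docs in rankings.items():
        ordered = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ordered, start=1):
            yield f"{query_id} Q0 {doc_id} {rank} {score:.4f} {tag}"


# Toy scores, for illustration only.
example = {"q1": [("d2", 0.5), ("d1", 1.5)]}
for line in trec_run_lines(example, tag="demo"):
    print(line)
```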

Release history

0.1.2 (this version): Feb 16, 2026
0.1.1: Jan 9, 2026
0.1.0: Jan 9, 2026
