OWS-Curlie-2025 - The Open Web Search Curlie 2025 test collection

This repository contains the Open Web Search Curlie 2025 test collection (collection home page). The collection was created by taking six months' worth of crawl data (March 1st until August 31st, 2025) from the Open Web Search project and transforming it using the standard Open Web Search preprocessing and indexing pipelines. We have only included data from the curlie subcollection (webpages that have an associated label in the Curlie directory), and have kept only English documents.

Contents

The collection contains three data splits: radboud, radboud-val and jena-kassel. For each split, students from the universities of Radboud, Jena and Kassel have provided topics and relevance assessments. The radboud-val split is a validation dataset that contains qrels; the other two splits are test collections and qrels are withheld. We have subsampled the corpus for each split to keep the size of the collection manageable.

Each split contains the following files:

  • topics.tsv: the topics for the split
  • subsample.parquet: the subsampled collection, created by running a number of (cheap) retrieval models for the topic set and only retaining those documents that were retrieved at least once. The Parquet files contain the standard Open Web Search preprocessing metadata.
  • subsample.jsonl.gz: the same subsampled collection, but in a simplified JSON format.
  • subsample_embeddings.parquet: Jina V3 embeddings derived in the Open Web Search GPU processing pipeline
  • (For the validation split only) qrels.txt: the qrels for the validation set
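As a minimal sketch of how subsample.jsonl.gz could be inspected without extra dependencies, the helper below streams a gzip-compressed JSON Lines file record by record. The function names iter_jsonl_gz and peek_keys are hypothetical, and no field names are assumed: the record layout of the simplified JSON format is not specified here, so the sketch only reports whatever keys the first record carries.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def iter_jsonl_gz(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    # "rt" decompresses on the fly and decodes to text, so the whole
    # file is never held in memory at once.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek_keys(path: Path) -> List[str]:
    """Return the sorted keys of the first record, or [] for an empty file."""
    for record in iter_jsonl_gz(path):
        return sorted(record)
    return []
```

Calling peek_keys on a downloaded subsample.jsonl.gz is a quick way to see which fields the simplified format actually exposes before writing any further processing code.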

For completeness, we also release the full (deduplicated) collection, alongside the subsampled versions. The full collection is distributed as 500 Parquet files in the documents directory, and does not contain embeddings.

The collection may be updated with additional files, such as pre-computed CIFF indexes for sparse retrieval or anchor text data.

Usage

To use the dataset, you first have to download the data using owilix, our command line tool for access to the Open Web Index. Then, you can either process the data yourself, or use our custom ir-datasets-owi Python package to obtain an easy-to-use ir-datasets wrapper around the dataset.

Downloading the data with Owilix

First, install owilix. Our recommended installation approach is the one-line install script, which can be found in the owi-cli repository:

https://opencode.it4i.eu/openwebsearcheu-public/owi-cli

You then get access to an owilix command line application. Please refer to the owilix documentation for more information on how to use it. For now, issue the following command to list the dataset and its contents (be sure to accept the license agreement when it pops up):

owilix remote ls 'all/title=Open Web Search Curlie 2025' --file-details --files '*'

You can download the data to your local machine with owilix remote pull:

owilix remote pull 'all/title=Open Web Search Curlie 2025'

Note that you can select only certain files (e.g. if you do not need the JSON version of the subsampled corpus, or are only interested in a certain split) by using the files option. This is especially useful if you only want to work with the subsamples and not the full ~70GB collection. For example:

owilix remote pull 'all/title=Open Web Search Curlie 2025' 'files=**/*.parquet'

The files will be downloaded to the following directory (with the default owilix configuration):

~/.owi/public/special/b510baa6-ebd2-11f0-8c43-02a47ca5d9fd
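After a pull, it can be useful to check which files actually arrived, especially when a files filter was used. The following is a small sketch using only the standard library; the helper name files_by_directory is hypothetical, and the layout inside the download directory (for example one sub-directory per split) is an assumption that the helper does not rely on, since it simply groups whatever it finds.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def files_by_directory(root: Path) -> Dict[str, List[str]]:
    """Map each directory below root (as a relative path) to its sorted file names."""
    grouped = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            # Files directly inside root are grouped under ".".
            rel_parent = path.parent.relative_to(root).as_posix()
            grouped[rel_parent].append(path.name)
    return {directory: sorted(names) for directory, names in sorted(grouped.items())}
```

With the default owilix configuration, this could be called as files_by_directory(Path.home() / ".owi/public/special/b510baa6-ebd2-11f0-8c43-02a47ca5d9fd").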

Loading the collection with ir-datasets

We have created a custom ir-datasets integration for the OWS-Curlie-2025 collection (source code). To install the integration, run:

pip install ir-datasets-owi

Then, make sure that the collection can be found in your ir-datasets home directory:

ln -s ~/.owi/public/special/b510baa6-ebd2-11f0-8c43-02a47ca5d9fd ~/.ir_datasets/ows-curlie-2025

You can now use the ir-datasets integration as follows:

import ir_datasets_owi
import ir_datasets

ir_datasets_owi.register()

dataset = ir_datasets.load("ows-curlie-2025/radboud-val")

for doc in dataset.docs_iter():
    print(doc)

The documents have the following fields:

  • doc_id: the document identifier (a SHA-256 hash of the normalized URL)
  • url: the page URL
  • main_content: the body of the web page with minimal HTML structure
  • title: the title of the web page
  • description: the meta description of the web page
  • plain_text: the main_content field with all HTML removed
  • default_text(): a concatenation of title and plain_text
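To illustrate how the fields listed above might be used, here is a minimal sketch that previews the first few documents of an iterator such as dataset.docs_iter(). The helper name preview_docs and its parameters n and width are hypothetical; the sketch relies only on the doc_id and title fields and the default_text() method described above.

```python
from itertools import islice
from typing import Any, Iterable, List, Tuple


def preview_docs(
    docs: Iterable[Any], n: int = 3, width: int = 80
) -> List[Tuple[str, str, str]]:
    """Return (doc_id, title, text snippet) for the first n documents."""
    rows = []
    # islice stops after n documents, so a large corpus is not read in full.
    for doc in islice(docs, n):
        snippet = doc.default_text()[:width]
        rows.append((doc.doc_id, doc.title, snippet))
    return rows
```

With the dataset loaded as shown earlier, preview_docs(dataset.docs_iter()) would give a quick look at the first three documents.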

We also provide a short-hand version that combines ir_datasets_owi.register and ir_datasets.load:

import ir_datasets_owi

dataset = ir_datasets_owi.load("ows-curlie-2025/radboud-val")

Both the register and the load methods support an optional batch_size parameter, which indicates how many documents are materialized per batch (higher is faster, but consumes more memory). The default of 2048 should be sufficient for most cases.
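The trade-off behind batch_size can be sketched with a generic batching generator. This is an illustration of the idea only, not the package's actual implementation: the function name in_batches is hypothetical, and the default of 2048 merely mirrors the default mentioned above. Larger batches mean fewer iterations but more items held in memory at the same time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int = 2048) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items from items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # Each batch is materialized as a list; only one batch exists at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```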

The following collections are accessible through the ir-datasets integration:

Dataset #docs Metadata Embeddings Topics Qrels
ows-curlie-2025/all 20,746,181
ows-curlie-2025/radboud-val 63,639
ows-curlie-2025/radboud 90,852
ows-curlie-2025/jena-kassel 77,382

Submitting runs

We accept either Docker submissions or run file submissions to Tira/TIREx. More information on submissions will follow soon.
