
Project description

ir_datasets

ir_datasets is a Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc.

The package takes care of downloading datasets (including documents, queries, relevance judgments, etc.) when available from public sources. Instructions on how to obtain datasets are provided when they are not publicly available.

ir_datasets provides a common iterator format to allow them to be easily used in python. It attempts to provide the data in an unaltered form (i.e., keeping all fields and markup), while handling differences in file formats, encoding, etc. Adapters provide extra functionality, e.g., to allow quick lookups of documents by ID.

A command line interface is also available.

You can find a list of datasets and their features in the documentation at ir-datasets.com. Want a new dataset, added functionality, or a bug fixed? Feel free to post an issue or make a pull request!

Getting Started

For a quick start, check out our Colab tutorials covering the Python API and the command line interface.

Install via pip:

pip install ir_datasets

To install from the main branch:

pip install git+https://github.com/allenai/ir_datasets.git

If you want to build from source, use:

$ git clone https://github.com/allenai/ir_datasets
$ cd ir_datasets
$ python setup.py bdist_wheel
$ pip install dist/ir_datasets-*.whl

Tested with Python versions 3.7, 3.8, 3.9, and 3.10. (Minimum Python version is 3.7.)

Features

Python and Command Line Interfaces. Access datasets both through a simple Python API and via the command line.

import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
# Documents
for doc in dataset.docs_iter():
    print(doc)
# GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equa...
# GenericDoc(doc_id='1', text='The Manhattan Project and its atomic bomb helped bring an en...
# ...

ir_datasets export msmarco-passage/train docs | head -n2
0 The presence of communication amid scientific minds was equally important to the success of the Manh...
1 The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peacefu...
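
Queries and relevance judgments follow the same iterator pattern as documents. A minimal sketch, assuming a subset that provides both (the dataset ID and namedtuple fields below are illustrative and vary by dataset; see the docs for the exact types each one yields):

import ir_datasets
# 'msmarco-passage/dev/small' is used here only as an example of a subset
# that provides queries and qrels; field names differ across datasets.
dataset = ir_datasets.load('msmarco-passage/dev/small')
for query in dataset.queries_iter():
    print(query.query_id, query.text)                  # GenericQuery-style fields
for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)  # TrecQrel-style fields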

Automatically downloads source files (when available). Will download and verify the source files for queries, documents, qrels, etc. when they are publicly available, as they are needed. A CI build runs weekly to ensure that all the downloadable content is available and correct (see the Downloadable Content page of the documentation). We mirror some troublesome files on mirror.ir-datasets.com and automatically switch to the mirror when the original source is not available.

import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
for doc in dataset.docs_iter(): # Will download and extract MS-MARCO's collection.tar.gz the first time
    ...
for query in dataset.queries_iter(): # Will download and extract MS-MARCO's queries.tar.gz the first time
    ...

Instructions for dataset access (when not publicly available). Provides instructions on how to get a copy of the data when it is not publicly available online (e.g., when it requires a data usage agreement).

import ir_datasets
dataset = ir_datasets.load('trec-arabic')
for doc in dataset.docs_iter():
    ...
# Provides the following instructions:
# The dataset is based on the Arabic Newswire corpus. It is available from the LDC via: <https://catalog.ldc.upenn.edu/LDC2001T55>
# To proceed, symlink the source file here: [gives path]

Support for datasets big and small. By using iterators, supports large datasets that may not fit into system memory, such as ClueWeb.

import ir_datasets
dataset = ir_datasets.load('clueweb09')
for doc in dataset.docs_iter():
    ... # will iterate through all ~1B documents

Fixes known dataset issues. For instance, automatically corrects the document UTF-8 encoding problem in the MS-MARCO passage collection.

import ir_datasets
dataset = ir_datasets.load('msmarco-passage')
docstore = dataset.docs_store()
docstore.get('243').text
# "John Maynard Keynes, 1st Baron Keynes, CB, FBA (/ˈkeɪnz/ KAYNZ; 5 June 1883 – 21 April [SNIP]"
# Naïve UTF-8 decoding yields double-encoding artifacts like:
# "John Maynard Keynes, 1st Baron Keynes, CB, FBA (/Ë\x88keɪnz/ KAYNZ; 5 June 1883 â\x80\x93 21 April [SNIP]"
#                                                  ~~~~~~  ~~                       ~~~~~~~~~

Fast Random Document Access. Builds data structures that allow fast and efficient lookup of document content. For large datasets, such as ClueWeb, uses checkpoint files to load documents from source 40x faster than normal. Results are cached for even faster subsequent accesses.

import ir_datasets
dataset = ir_datasets.load('clueweb12')
docstore = dataset.docs_store()
docstore.get_many(['clueweb12-0000tw-05-00014', 'clueweb12-0000tw-05-12119', 'clueweb12-0106wb-18-19516'])
# {'clueweb12-0000tw-05-00014': ..., 'clueweb12-0000tw-05-12119': ..., 'clueweb12-0106wb-18-19516': ...}

Fancy Iter Slicing. Sometimes it's helpful to be able to select ranges of data (e.g., for processing document collections in parallel on multiple devices). Efficient implementations of slicing operations allow for much faster dataset partitioning than using itertools.islice.

import ir_datasets
dataset = ir_datasets.load('clueweb12')
dataset.docs_iter()[500:1000] # normal slicing behavior
# WarcDoc(doc_id='clueweb12-0000tw-00-00502', ...), WarcDoc(doc_id='clueweb12-0000tw-00-00503', ...), ...
dataset.docs_iter()[-10:-8] # includes negative indexing
# WarcDoc(doc_id='clueweb12-1914wb-28-24245', ...), WarcDoc(doc_id='clueweb12-1914wb-28-24246', ...)
dataset.docs_iter()[::100] # includes support for skip (only positive values)
# WarcDoc(doc_id='clueweb12-0000tw-00-00000', ...), WarcDoc(doc_id='clueweb12-0000tw-00-00100', ...), ...
dataset.docs_iter()[1/3:2/3] # supports proportional slicing (this takes the middle third of the collection)
# WarcDoc(doc_id='clueweb12-0605wb-28-12714', ...), WarcDoc(doc_id='clueweb12-0605wb-28-12715', ...), ...
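
Proportional slicing also makes it easy to shard a collection across parallel workers. A minimal sketch, assuming a hypothetical helper (shard_docs, the worker index, and the shard count below are illustrative, not part of the package):

import ir_datasets

def shard_docs(dataset_id, shard, num_shards):
    # Illustrative helper: returns the documents in the shard-th of
    # num_shards roughly equal, contiguous slices of the collection.
    dataset = ir_datasets.load(dataset_id)
    return dataset.docs_iter()[shard / num_shards : (shard + 1) / num_shards]

# e.g., worker 2 of 8 touches only its own portion of ClueWeb12
for doc in shard_docs('clueweb12', shard=2, num_shards=8):
    ...  # process this worker's share of the documents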

Datasets

Available datasets include a wide range of standard IR benchmarks and collections, such as MS MARCO, ClueWeb, and various TREC tracks.

There are "subsets" under each dataset. For instance, clueweb12/b13/trec-misinfo-2019 provides the queries and judgments from the 2019 TREC misinformation track, and msmarco-document/orcas provides the ORCAS dataset. They tend to be organized with the document collection at the top level.

See the ir_datasets docs (ir-datasets.com) for details about each dataset, its available subsets, and the data each one provides.

Environment variables

  • IR_DATASETS_HOME: Home directory for ir_datasets data (default ~/.ir_datasets/). Contains directories for each top-level dataset.
  • IR_DATASETS_TMP: Temporary working directory (default /tmp/ir_datasets/).
  • IR_DATASETS_DL_TIMEOUT: Download stream read timeout, in seconds (default 15). If no data is received within this duration, the connection will be assumed to be dead, and another download may be attempted.
  • IR_DATASETS_DL_TRIES: Number of download attempts before an exception is thrown (default 3). When the server accepts Range requests, they are used to resume the download; otherwise, the entire file is downloaded again.
  • IR_DATASETS_DL_DISABLE_PBAR: Set to true to disable the progress bar for downloads. Useful in settings where an interactive console is not available.
  • IR_DATASETS_DL_SKIP_SSL: Set to true to disable checking SSL certificates when downloading files. Useful as a short-term solution when SSL certificates expire or are otherwise invalid. Note that this does not disable hash verification of the downloaded content.
  • IR_DATASETS_SKIP_DISK_FREE: Set to true to disable checks for enough free space on disk before downloading content or otherwise creating large files.
  • IR_DATASETS_SMALL_FILE_SIZE: The size of files that are considered "small", in bytes. Instructions for linking small files rather than downloading them are not shown. Defaults to 5000000 (5MB).
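
A brief sketch of overriding these settings for a single run (the paths below are placeholders; the variables should be set before ir_datasets needs them, e.g., before a dataset is loaded):

import os
# Placeholder paths: redirect the data home and temporary directory
os.environ['IR_DATASETS_HOME'] = '/data/ir_datasets'
os.environ['IR_DATASETS_TMP'] = '/data/ir_datasets_tmp'
os.environ['IR_DATASETS_DL_DISABLE_PBAR'] = 'true'  # e.g., in batch jobs

import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
# Downloaded and processed files are now stored under /data/ir_datasets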

Citing

When using datasets provided by this package, be sure to properly cite them. BibTeX for each dataset can be found on its documentation page.

If you use this tool, please cite our SIGIR resource paper:

@inproceedings{macavaney:sigir2021-irds,
  author = {MacAvaney, Sean and Yates, Andrew and Feldman, Sergey and Downey, Doug and Cohan, Arman and Goharian, Nazli},
  title = {Simplified Data Wrangling with ir_datasets},
  year = {2021},
  booktitle = {SIGIR}
}

Credits

Contributors to this repository:

  • Sean MacAvaney (University of Glasgow)
  • Shuo Sun (Johns Hopkins University)
  • Thomas Jänich (University of Glasgow)
  • Jan Heinrich Reimer (Martin Luther University Halle-Wittenberg)
  • Maik Fröbe (Martin Luther University Halle-Wittenberg)
  • Eugene Yang (Johns Hopkins University)
  • Augustin Godinot (NAVERLABS Europe, ENS Paris-Saclay)

Download files

Download the file for your platform.

Source Distribution

ir_datasets-0.5.9.tar.gz (267.9 kB)


Built Distribution

ir_datasets-0.5.9-py3-none-any.whl (347.9 kB)


File details

Details for the file ir_datasets-0.5.9.tar.gz.

File metadata

  • Download URL: ir_datasets-0.5.9.tar.gz
  • Size: 267.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for ir_datasets-0.5.9.tar.gz:

  • SHA256: 35c90980fbd0f4ea8fe22a1ab16d2bb6be3dc373cbd6dfab1d905f176a70e5ac
  • MD5: 710e3379d27f75d04c6bdad001ffd313
  • BLAKE2b-256: a0120f99bbd93c62b183d94b7b68ef570dae0cbc64b14e381e26d52aaa2f4827


File details

Details for the file ir_datasets-0.5.9-py3-none-any.whl.

File metadata

  • Download URL: ir_datasets-0.5.9-py3-none-any.whl
  • Size: 347.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for ir_datasets-0.5.9-py3-none-any.whl:

  • SHA256: 07c9bed07f31031f1da1bc02afc7a1077b1179a3af402d061f83bf6fb833b90a
  • MD5: 089a36347214bb1824cb9ed6647cd6c6
  • BLAKE2b-256: 1f7d14194ad38c5ad4a96f79a7aa1da97c2e8796c22d15ba1bfafcfe8948d49f

