ir_datasets
ir_datasets is a Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc. It was built as a fork of OpenNIR to allow easier integration with other systems.
The package takes care of downloading datasets (including documents, queries, relevance judgments, etc.) when available from public sources. Instructions on how to obtain datasets are provided when they are not publicly available.
ir_datasets provides a common iterator format to allow these datasets to be easily used in Python. It attempts to provide the data in an unaltered form (i.e., keeping all fields and markup), while handling differences in file formats, encoding, etc. Adapters provide extra functionality, e.g., to allow quick lookups of documents by ID.
A command line interface is also available.
You can find a list of datasets and their features here. Want a new dataset, added functionality, or a bug fixed? Feel free to post an issue or make a pull request!
Getting Started
For a quick start with the Python API, check out our Colab tutorial.
Install via pip:
pip install ir_datasets
If you want to build from source, use:
$ git clone https://github.com/allenai/ir_datasets
$ cd ir_datasets
$ python setup.py bdist_wheel
$ pip install dist/ir_datasets-*.whl
Tested with Python versions 3.6 and 3.7.
Python Interface
Load a dataset, such as the MS-MARCO passage ranking dataset, using:
import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
A dataset object lets you iterate through supported properties like docs (dataset.docs_iter()), queries (dataset.queries_iter()), and relevance judgments (dataset.qrels_iter()). Each iterator yields namedtuples, with fields based on the available data.
# Documents
for doc in dataset.docs_iter():
    print(doc)
# GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equa...
# GenericDoc(doc_id='1', text='The Manhattan Project and its atomic bomb helped bring an en...
# ...
# Queries
for query in dataset.queries_iter():
    print(query)
# GenericQuery(query_id='121352', text='define extreme')
# GenericQuery(query_id='634306', text='what does chattel mean on credit history')
# ...
# Query relevance judgments (qrels)
for qrel in dataset.qrels_iter():
    print(qrel)
# TrecQrel(query_id='1185869', doc_id='0', relevance=1, iteration='0')
# TrecQrel(query_id='1185868', doc_id='16', relevance=1, iteration='0')
# ...
# Look up queries and documents by ID
queries_store = dataset.queries_store()
queries_store.get("1185868")
# GenericQuery(query_id='1185868', text='_________ justice is designed to repair the harm to victim, the comm...
dataset = ir_datasets.wrappers.DocstoreWrapper(dataset)
doc_store = dataset.docs_store()
doc_store.get("16")
# GenericDoc(doc_id='16', text='The approach is based on a theory of justice that considers crime and wrongdoi...
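Because the yielded items are plain Python namedtuples, the standard namedtuple operations apply. A minimal sketch:
import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
doc = next(iter(dataset.docs_iter()))  # grab a single document
print(doc.doc_id, doc.text[:50])       # fields are available as attributes
print(doc._asdict())                   # or convert the record to a plain dict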
If you want to use your own dataset, you can construct an object with the same interface as the standard benchmarks as follows:
import ir_datasets
dataset = ir_datasets.create_dataset(
    docs_tsv="path/to/docs.tsv",
    queries_tsv="path/to/queries.tsv",
    qrels_trec="path/to/qrels.trec"
)
Here, documents and queries are represented in TSV format as [id]\t[text]. Query relevance judgments are provided in the standard TREC format: [query_id] [iteration] [doc_id] [rel].
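As a concrete illustration, the following sketch writes tiny files in these formats and loads them (the paths and file contents here are made up for illustration):
import ir_datasets

# hypothetical toy files illustrating the expected formats
with open('/tmp/docs.tsv', 'w') as f:
    f.write('D1\tThe Manhattan Project and its atomic bomb helped end World War II.\n')
with open('/tmp/queries.tsv', 'w') as f:
    f.write('Q1\twhat was the manhattan project\n')
with open('/tmp/qrels.trec', 'w') as f:
    f.write('Q1 0 D1 1\n')  # [query_id] [iteration] [doc_id] [rel]

dataset = ir_datasets.create_dataset(
    docs_tsv='/tmp/docs.tsv',
    queries_tsv='/tmp/queries.tsv',
    qrels_trec='/tmp/qrels.trec')
for doc in dataset.docs_iter():
    print(doc)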
Command Line Interface
Export data in various formats:
$ ir_datasets export [dataset-id] [docs/queries/qrels/scoreddocs/docpairs]
$ ir_datasets export msmarco-passage/train docs | head -n2
0 The presence of communication amid scientific minds was equally important to the success of the Manh...
1 The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peacefu...
$ ir_datasets export msmarco-passage/train docs --format jsonl | head -n2
{"doc_id": "0", "text": "The presence of communication amid scientific minds was equally important to the su...
{"doc_id": "1", "text": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its ...
--format specifies the output format (e.g., tsv or jsonl). --fields specifies which fields to include in the output. The available fields depend on the dataset, but most datasets try to include common fields, e.g., text in documents returns the text of the document, without any markup.
Look up documents and queries by ID:
$ ir_datasets lookup [dataset-id] [--qid] [ids...]
--format and --fields also work here. --qid indicates that queries should be looked up instead of documents (the default). This is much faster than using ir_datasets export ... | grep (or similar) because it indexes the documents/queries by ID.
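For example, to look up the two example queries shown earlier by their IDs (following the syntax above):
$ ir_datasets lookup msmarco-passage/train --qid 121352 634306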
Features
Automatically downloads source files (when available). Will download and verify the source files for queries, documents, qrels, etc. when they are publicly available, as they are needed. A CI build checks weekly to ensure that all the downloadable content is available and correct:
import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
for doc in dataset.docs_iter():  # Will download and extract MS-MARCO's collection.tar.gz the first time
    ...
for query in dataset.queries_iter():  # Will download and extract MS-MARCO's queries.tar.gz the first time
    ...
Instructions for dataset access (when not publicly available). Provides instructions on how to get a copy of the data when it is not publicly available online (e.g., when it requires a data usage agreement).
import ir_datasets
dataset = ir_datasets.load('trec-arabic')
for doc in dataset.docs_iter():
    ...
# Provides the following instructions:
# The dataset is based on the Arabic Newswire corpus. It is available from the LDC via: <https://catalog.ldc.upenn.edu/LDC2001T55>
# To proceed, symlink the source file here: [gives path]
Support for datasets big and small. By using iterators, supports large datasets that may not fit into system memory, such as ClueWeb.
import ir_datasets
dataset = ir_datasets.load('clueweb09')
for doc in dataset.docs_iter():
    ...  # will iterate through all ~1B documents
Fixes known dataset issues. For instance, automatically corrects the document UTF-8 encoding problem in the MS-MARCO passage collection.
import ir_datasets
dataset = ir_datasets.load('msmarco-passage')
docstore = dataset.docs_store()
docstore.get('243').text
# "John Maynard Keynes, 1st Baron Keynes, CB, FBA (/ˈkeɪnz/ KAYNZ; 5 June 1883 – 21 April [SNIP]"
# Naïve UTF-8 decoding yields double-encoding artifacts like:
# "John Maynard Keynes, 1st Baron Keynes, CB, FBA (/Ë\x88keɪnz/ KAYNZ; 5 June 1883 â\x80\x93 21 April [SNIP]"
# ~~~~~~ ~~ ~~~~~~~~~
Fast Random Document Access. Builds data structures that allow fast and efficient lookup of document content. For large datasets, such as ClueWeb, uses checkpoint files to load documents from source 40x faster than normal. Results are cached for even faster subsequent accesses.
import ir_datasets
dataset = ir_datasets.load('clueweb12')
docstore = dataset.docs_store()
docstore.get_many(['clueweb12-0000tw-05-00014', 'clueweb12-0000tw-05-12119', 'clueweb12-0106wb-18-19516'])
# {'clueweb12-0000tw-05-00014': ..., 'clueweb12-0000tw-05-12119': ..., 'clueweb12-0106wb-18-19516': ...}
Fancy Iter Slicing. Sometimes it's helpful to be able to select ranges of data (e.g., for processing document collections in parallel on multiple devices). Efficient implementations of slicing operations allow for much faster dataset partitioning than using itertools.islice.
import ir_datasets
dataset = ir_datasets.load('clueweb12')
dataset.docs_iter()[500:1000] # normal slicing behavior
# WarcDoc(doc_id='clueweb12-0000tw-00-00502', ...), WarcDoc(doc_id='clueweb12-0000tw-00-00503', ...), ...
dataset.docs_iter()[-10:-8] # includes negative indexing
# WarcDoc(doc_id='clueweb12-1914wb-28-24245', ...), WarcDoc(doc_id='clueweb12-1914wb-28-24246', ...)
dataset.docs_iter()[::100] # includes support for skip (only positive values)
# WarcDoc(doc_id='clueweb12-0000tw-00-00000', ...), WarcDoc(doc_id='clueweb12-0000tw-00-00100', ...), ...
dataset.docs_iter()[1/3:2/3] # supports proportional slicing (this takes the middle third of the collection)
# WarcDoc(doc_id='clueweb12-0605wb-28-12714', ...), WarcDoc(doc_id='clueweb12-0605wb-28-12715', ...), ...
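Proportional slicing makes it straightforward to partition a collection among several workers. A sketch (the worker/total values are placeholders for whatever sharding scheme you use):
import ir_datasets
dataset = ir_datasets.load('clueweb12')
worker, total = 2, 8  # e.g., this process handles the third of eight shards
for doc in dataset.docs_iter()[worker/total:(worker+1)/total]:
    ...  # process only this worker's share of the collection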
Datasets
Available datasets include:
- ANTIQUE
- AQUAINT
- TREC CAR
- ClueWeb09
- ClueWeb12
- CodeSearchNet
- CORD-19
- GOV
- GOV2
- Highwire (TREC Genomics 2006-07)
- Medline
- MSMARCO (document)
- MSMARCO (passage)
- MSMARCO (QnA)
- NFCorpus (NutritionFacts)
- NYT
- PubMed Central (TREC CDS)
- TREC Arabic
- TREC Mandarin
- TREC Robust 2004
- TREC Spanish
- Tweets 2013 (Internet Archive)
- Vaswani
- WikIR
There are "subsets" under each dataset. For instance, clueweb12/b13/trec-misinfo-2019
provides the
queries and judgments from the 2019 TREC misinformation track,
and msmarco-document/orcas
provides the ORCAS dataset. They
tend to be organized with the document collection at the top level.
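Subsets are loaded with the same load() call as top-level datasets, e.g.:
import ir_datasets
dataset = ir_datasets.load('msmarco-document/orcas')  # the ORCAS subset of the MS-MARCO document collection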
See the ir_datasets docs (ir_datasets.com) for details about each dataset, its available subsets, and what data they provide.
Environment variables
- IR_DATASETS_HOME: Home directory for ir_datasets data (default: ~/.ir_datasets/). Contains directories for each top-level dataset.
- IR_DATASETS_TMP: Temporary working directory (default: /tmp/ir_datasets/).
- IR_DATASETS_DL_TIMEOUT: Download stream read timeout, in seconds (default: 15). If no data is received within this duration, the connection is assumed to be dead and another download may be attempted.
- IR_DATASETS_DL_TRIES: Number of download attempts before an exception is thrown (default: 3). When the server accepts Range requests, retries use them to resume; otherwise, the entire file is downloaded again.
- IR_DATASETS_DL_DISABLE_PBAR: Set to true to disable the progress bar for downloads. Useful in settings where an interactive console is not available.
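For example, to keep the data on a larger disk, IR_DATASETS_HOME can be overridden when invoking the CLI (the path here is illustrative):
$ IR_DATASETS_HOME=/data/ir_datasets ir_datasets export msmarco-passage/train docs | head -n2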
Citing
When using datasets provided by this package, be sure to properly cite them. Bibtex for each dataset
can be found on the datasets documentation page,
or in the python interface via dataset.documentation()['bibtex']
(when available).
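For example, a minimal sketch using the interface above (assumes the loaded dataset actually provides a bibtex entry in its documentation):
import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
print(dataset.documentation()['bibtex'])  # BibTeX for this dataset, when available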
The ir_datasets package was released as part of ABNIRML, so please cite the following if you use this package:
@article{macavaney:arxiv2020-abnirml,
  author = {MacAvaney, Sean and Feldman, Sergey and Goharian, Nazli and Downey, Doug and Cohan, Arman},
  title = {ABNIRML: Analyzing the Behavior of Neural IR Models},
  year = {2020},
  url = {https://arxiv.org/abs/2011.00696},
  journal = {arXiv},
  volume = {abs/2011.00696}
}