ir_datasets
ir_datasets is a Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc. It was built as a fork of OpenNIR to allow easier integration with other systems.
The package takes care of downloading datasets (including documents, queries, relevance judgments, etc.) when available from public sources. Instructions on how to obtain datasets are provided when they are not publicly available.
ir_datasets provides a common iterator format so that datasets can be used easily in Python. It attempts to provide the data in an unaltered form (i.e., keeping all fields and markup), while handling differences in file formats, encoding, etc. Adapters provide extra functionality, e.g., to allow quick lookups of documents by ID.
A command line interface is also available.
You can find a list of datasets and their features here. Want a new dataset, added functionality, or a bug fixed? Feel free to post an issue or make a pull request!
Getting Started
Install locally with:
$ git clone https://github.com/allenai/ir_datasets
$ cd ir_datasets
$ python setup.py bdist_wheel
$ pip install dist/ir_datasets-*.whl
Tested with Python versions 3.6 and 3.7.
Python Interface
Load a dataset, such as the MS-MARCO passage ranking dataset, using:
import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
A dataset object lets you iterate through supported properties like docs (dataset.docs_iter()), queries (dataset.queries_iter()), and relevance judgments (dataset.qrels_iter()). Each iterator yields namedtuples, with fields based on the available data.
# Documents
for doc in dataset.docs_iter():
    print(doc)
# GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equa...
# GenericDoc(doc_id='1', text='The Manhattan Project and its atomic bomb helped bring an en...
# ...
# Queries
for query in dataset.queries_iter():
    print(query)
# GenericQuery(query_id='121352', text='define extreme')
# GenericQuery(query_id='634306', text='what does chattel mean on credit history')
# ...
# Query relevance judgments (qrels)
for qrel in dataset.qrels_iter():
    print(qrel)
# TrecQrel(query_id='1185869', doc_id='0', relevance=1, iteration='0')
# TrecQrel(query_id='1185868', doc_id='16', relevance=1, iteration='0')
# ...
# Look up queries and documents by ID
queries_store = dataset.queries_store()
queries_store.get("1185868")
# GenericQuery(query_id='1185868', text='_________ justice is designed to repair the harm to victim, the comm...
dataset = ir_datasets.wrappers.DocstoreWrapper(dataset)
doc_store = dataset.docs_store()
doc_store.get("16")
# GenericDoc(doc_id='16', text='The approach is based on a theory of justice that considers crime and wrongdoi...
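Because each iterator yields plain namedtuples, the records can be processed with ordinary Python. The following is a minimal sketch that groups relevance judgments by query. To stay runnable without downloading anything, it uses a hand-written stand-in for the TrecQrel records shown above; the helper relevant_docs_by_query and the third sample record are assumptions made for illustration, not part of the package.

```python
from collections import defaultdict, namedtuple

# Stand-in for the qrel records shown above. With the real package these
# would come from dataset.qrels_iter() rather than a hand-written list.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

sample_qrels = [
    TrecQrel("1185869", "0", 1, "0"),
    TrecQrel("1185868", "16", 1, "0"),
    TrecQrel("1185868", "17", 0, "0"),  # hypothetical non-relevant judgment
]


def relevant_docs_by_query(qrels, min_relevance=1):
    """Map each query_id to the set of doc_ids judged at or above min_relevance."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            relevant[qrel.query_id].add(qrel.doc_id)
    return dict(relevant)


relevant = relevant_docs_by_query(sample_qrels)
print(relevant)
```

The same function should accept dataset.qrels_iter() directly, since it only relies on the query_id, doc_id, and relevance fields.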
To use your own dataset, construct an object with the same interface as the standard benchmarks:
import ir_datasets
dataset = ir_datasets.create_dataset(
    docs_tsv="path/to/docs.tsv",
    queries_tsv="path/to/queries.tsv",
    qrels_trec="path/to/qrels.trec"
)
Here, documents and queries are represented in TSV format, with one record per line in the form [id]\t[text]. Query relevance judgments are provided in the standard TREC format: [query_id] [iteration] [doc_id] [rel].
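To make the two file formats concrete, here is a small sketch that writes a toy collection in the layouts described above and parses it back using only the standard library. The file contents, the IDs (d1, d2, q1), and the helpers read_tsv and read_trec_qrels are hypothetical and are not part of ir_datasets.

```python
import os
import tempfile


def read_tsv(path):
    """Yield (id, text) pairs from a tab-separated file, one record per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                record_id, text = line.split("\t", 1)
                yield record_id, text


def read_trec_qrels(path):
    """Yield (query_id, iteration, doc_id, rel) tuples from a TREC qrels file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 4:
                query_id, iteration, doc_id, rel = parts
                yield query_id, iteration, doc_id, int(rel)


# Hypothetical toy collection, written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
docs_path = os.path.join(tmp_dir, "docs.tsv")
queries_path = os.path.join(tmp_dir, "queries.tsv")
qrels_path = os.path.join(tmp_dir, "qrels.trec")

with open(docs_path, "w", encoding="utf-8") as f:
    f.write("d1\tThe first toy document.\n")
    f.write("d2\tThe second toy document.\n")
with open(queries_path, "w", encoding="utf-8") as f:
    f.write("q1\ttoy document\n")
with open(qrels_path, "w", encoding="utf-8") as f:
    f.write("q1 0 d1 1\n")
    f.write("q1 0 d2 0\n")

docs = list(read_tsv(docs_path))
queries = list(read_tsv(queries_path))
qrels = list(read_trec_qrels(qrels_path))
print(docs, queries, qrels)
```

Files laid out this way are what the docs_tsv, queries_tsv, and qrels_trec arguments in the example above are expected to point to.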
Command Line Interface
Export data in various formats:
$ ir_datasets export [dataset-id] [docs/queries/qrels/scoreddocs/docpairs]
$ ir_datasets export msmarco-passage/train docs | head -n2
0 The presence of communication amid scientific minds was equally important to the success of the Manh...
1 The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peacefu...
$ ir_datasets export msmarco-passage/train docs --format jsonl | head -n2
{"doc_id": "0", "text": "The presence of communication amid scientific minds was equally important to the su...
{"doc_id": "1", "text": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its ...
--format specifies the output format (e.g., tsv or jsonl). --fields specifies which fields to include in the output. The available fields depend on the dataset, but most datasets try to include common ones; e.g., the text field of a document returns the text of the document, without any markup.
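The effect of these two options can be illustrated with a short sketch that renders one record either as a TSV line or as a JSON line, optionally restricted to selected fields. This is not the package's implementation; format_record and the stand-in GenericDoc type are assumptions that only mimic the output shown in the export examples above.

```python
import json
from collections import namedtuple

# Stand-in record type mirroring the GenericDoc namedtuples shown earlier.
GenericDoc = namedtuple("GenericDoc", ["doc_id", "text"])


def format_record(record, fmt="tsv", fields=None):
    """Render a namedtuple as one output line, keeping only the requested fields."""
    names = list(fields) if fields else list(record._fields)
    values = {name: getattr(record, name) for name in names}
    if fmt == "tsv":
        return "\t".join(str(values[name]) for name in names)
    if fmt == "jsonl":
        return json.dumps(values)
    raise ValueError("unsupported format: " + fmt)


doc = GenericDoc("0", "The presence of communication")  # shortened sample text
print(format_record(doc))                   # tab-separated: all fields
print(format_record(doc, fmt="jsonl"))      # one JSON object per line
print(format_record(doc, fields=["text"]))  # only the text field
```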
Look up documents and queries by ID:
$ ir_datasets lookup [dataset-id] [--qid] [ids...]
--format and --fields also work here. --qid indicates that queries should be looked up instead of documents (the default).
This is much faster than using ir_datasets export ... | grep (or similar) because it indexes the documents/queries by ID.
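The difference between an indexed lookup and a grep-style scan can be sketched as follows. The text above does not describe how the package builds its index, so the byte-offset index here is purely an assumption chosen for illustration, as are the helper names and the toy file.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each record ID to the byte offset at which its line starts."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            record_id = line.split(b"\t", 1)[0].decode("utf-8")
            index[record_id] = offset
            offset += len(line)
    return index


def indexed_lookup(path, index, record_id):
    """Seek straight to the record instead of reading the whole file."""
    with open(path, "rb") as f:
        f.seek(index[record_id])
        line = f.readline().decode("utf-8").rstrip("\n")
    return tuple(line.split("\t", 1))


def scan_lookup(path, record_id):
    """Linear scan, comparable to piping an export through grep."""
    with open(path, "rb") as f:
        for line in f:
            key, _, text = line.decode("utf-8").rstrip("\n").partition("\t")
            if key == record_id:
                return key, text
    return None


# Hypothetical toy collection in [id]\t[text] form.
docs_path = os.path.join(tempfile.mkdtemp(), "docs.tsv")
with open(docs_path, "wb") as f:
    f.write("0\tfirst toy passage\n".encode("utf-8"))
    f.write("1\tsecond toy passage\n".encode("utf-8"))
    f.write("16\tthird toy passage\n".encode("utf-8"))

index = build_offset_index(docs_path)
print(indexed_lookup(docs_path, index, "16"))
print(scan_lookup(docs_path, "16"))
```

Building the index costs one pass over the file, after which each lookup reads a single line; the scan rereads the file on every lookup.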
Datasets
Available datasets include the following, each of which contains subsets:
antique
car-v1.5
cord19
msmarco-document
msmarco-passage
nyt
trec-arabic
trec-mandarin
trec-robust04
trec-spanish
See the datasets documentation page for details about each dataset, its available subsets, and what data they provide.
Citing
When using datasets provided by this package, be sure to properly cite them. BibTeX for each dataset can be found on the datasets documentation page, or in the Python interface via dataset.bibtex() (when available).
The ir_datasets package was released as part of ABNIRML, so please cite the following if you use this package:
@article{macavaney:arxiv2020-abnirml,
author = {MacAvaney, Sean and Feldman, Sergey and Goharian, Nazli and Downey, Doug and Cohan, Arman},
title = {ABNIRML: Analyzing the Behavior of Neural IR Models},
year = {2020},
url = {https://arxiv.org/abs/2011.00696},
journal = {arXiv},
volume = {abs/2011.00696}
}