
ir_datasets

ir_datasets is a python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc. It was built as a fork of OpenNIR to allow easier integration with other systems.

The package takes care of downloading datasets (including documents, queries, relevance judgments, etc.) when available from public sources. Instructions on how to obtain datasets are provided when they are not publicly available.

ir_datasets provides a common iterator format so that datasets can be easily used in Python. It attempts to provide the data in an unaltered form (i.e., keeping all fields and markup), while handling differences in file formats, encoding, etc. Adapters provide extra functionality, e.g., quick lookups of documents by ID.

A command line interface is also available.

You can find a list of datasets and their features on the datasets documentation page. Want a new dataset, added functionality, or a bug fixed? Feel free to post an issue or make a pull request!

Getting Started

Install locally with:

$ git clone https://github.com/allenai/ir_datasets
$ cd ir_datasets
$ python setup.py bdist_wheel
$ pip install dist/ir_datasets-*.whl

Tested with Python versions 3.6 and 3.7.
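
Since the package is published on PyPI, installing the released version directly should also work:

$ pip install ir_datasets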

Python Interface

Load a dataset, such as the MS-MARCO passage ranking dataset, using:

import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')

A dataset object lets you iterate through supported properties like docs (dataset.docs_iter()), queries (dataset.queries_iter()), and relevance judgments (dataset.qrels_iter()). Each iterator yields namedtuples, with fields based on the available data.

# Documents
for doc in dataset.docs_iter():
    print(doc)
# GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equa...
# GenericDoc(doc_id='1', text='The Manhattan Project and its atomic bomb helped bring an en...
# ...

# Queries
for query in dataset.queries_iter():
    print(query)
# GenericQuery(query_id='121352', text='define extreme')
# GenericQuery(query_id='634306', text='what does chattel mean on credit history')
# ...

# Query relevance judgments (qrels)
for qrel in dataset.qrels_iter():
    print(qrel)
# TrecQrel(query_id='1185869', doc_id='0', relevance=1, iteration='0')
# TrecQrel(query_id='1185868', doc_id='16', relevance=1, iteration='0')
# ...
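
Since each item is a namedtuple, individual fields can be accessed by name. For example, a minimal sketch using the standard itertools.islice to peek at the first few documents without iterating through the whole collection:

import itertools
for doc in itertools.islice(dataset.docs_iter(), 3):
    print(doc.doc_id, doc.text[:50])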

# Look up queries and documents by ID
queries_store = dataset.queries_store()
queries_store.get("1185868")
# GenericQuery(query_id='1185868', text='_________ justice is designed to repair the harm to victim, the comm...

dataset = ir_datasets.wrappers.DocstoreWrapper(dataset)
doc_store = dataset.docs_store()
doc_store.get("16")
# GenericDoc(doc_id='16', text='The approach is based on a theory of justice that considers crime and wrongdoi...

If you want to use your own dataset, you can construct an object with the same interface as the standard benchmarks by:

import ir_datasets
dataset = ir_datasets.create_dataset(
  docs_tsv="path/to/docs.tsv",
  queries_tsv="path/to/queries.tsv",
  qrels_trec="path/to/qrels.trec"
)

Here, documents and queries are represented in TSV format as [id]\t[text]. Query relevance judgments are provided in the standard TREC qrels format: [query_id] [iteration] [doc_id] [rel].
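
For example, the input files might look like the following (hypothetical contents, for illustration only; the columns in docs.tsv and queries.tsv are separated by tabs):

docs.tsv:
D1	This is the text of the first document.
D2	This is the text of the second document.

queries.tsv:
Q1	example query text

qrels.trec:
Q1 0 D1 1
Q1 0 D2 0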

Command Line Interface

Export data in various formats:

$ ir_datasets export [dataset-id] [docs/queries/qrels/scoreddocs/docpairs]

$ ir_datasets export msmarco-passage/train docs | head -n2
0	The presence of communication amid scientific minds was equally important to the success of the Manh...
1	The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peacefu...

$ ir_datasets export msmarco-passage/train docs --format jsonl | head -n2
{"doc_id": "0", "text": "The presence of communication amid scientific minds was equally important to the su...
{"doc_id": "1", "text": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its ...

--format specifies the output format (e.g., tsv or jsonl). --fields specifies which fields to include in the output. Available fields depend on the dataset, but most include common ones; e.g., the text field of a document contains the document text without any markup.
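
For example, assuming the field names match the namedtuple fields shown above (doc_id and text), the document IDs alone could be exported with:

$ ir_datasets export msmarco-passage/train docs --fields doc_id | head -n2
0
1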

Look up documents and queries by ID:

$ ir_datasets lookup [dataset-id] [--qid] [ids...]

--format and --fields also work here. --qid indicates that queries should be looked up instead of documents (the default).

This is much faster than using ir_datasets export ... | grep (or similar) because it indexes the documents/queries by ID.
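
For example (output shown here is illustrative, in the default tsv format):

$ ir_datasets lookup msmarco-passage/train 16
16	The approach is based on a theory of justice that considers crime and wrongdoi...

$ ir_datasets lookup msmarco-passage/train --qid 1185868
1185868	_________ justice is designed to repair the harm to victim, the comm...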

Datasets

Available datasets include (each containing subsets):

  • antique
  • car-v1.5
  • cord19
  • msmarco-document
  • msmarco-passage
  • nyt
  • trec-arabic
  • trec-mandarin
  • trec-robust04
  • trec-spanish

See the datasets documentation page for details about each dataset, its available subsets, and what data they provide.

Citing

When using datasets provided by this package, be sure to properly cite them. Bibtex for each dataset can be found on the datasets documentation page, or in the python interface via dataset.bibtex() (when available).
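
For example (assuming the loaded dataset provides citation information):

import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
print(dataset.bibtex())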

The ir_datasets package was released as part of ABNIRML, so please cite the following if you use this package:

@article{macavaney:arxiv2020-abnirml,
  author = {MacAvaney, Sean and Feldman, Sergey and Goharian, Nazli and Downey, Doug and Cohan, Arman},
  title = {ABNIRML: Analyzing the Behavior of Neural IR Models},
  year = {2020},
  url = {https://arxiv.org/abs/2011.00696},
  journal = {arXiv},
  volume = {abs/2011.00696}
}
