ir_datasets

ir_datasets is a Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc. It was built as a fork of OpenNIR to allow easier integration with other systems.

The package takes care of downloading datasets (including documents, queries, relevance judgments, etc.) when available from public sources. Instructions on how to obtain datasets are provided when they are not publicly available.

ir_datasets provides a common iterator format so that datasets can be used easily in Python. It attempts to provide the data in an unaltered form (i.e., keeping all fields and markup), while handling differences in file formats, encoding, etc. Adapters provide extra functionality, e.g., to allow quick lookups of documents by ID.

A command line interface is also available.

You can find a list of datasets and their features here. Want a new dataset, added functionality, or a bug fixed? Feel free to post an issue or make a pull request!

Getting Started

Install locally with:

$ git clone https://github.com/allenai/ir_datasets
$ cd ir_datasets
$ python setup.py bdist_wheel
$ pip install dist/ir_datasets-*.whl

Tested with Python versions 3.6 and 3.7.

Python Interface

Load a dataset, such as the MS-MARCO passage ranking dataset, using:

import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')

A dataset object lets you iterate through supported properties like docs (dataset.docs_iter()), queries (dataset.queries_iter()), and relevance judgments (dataset.qrels_iter()). Each iterator yields namedtuples, with fields based on the available data.

# Documents
for doc in dataset.docs_iter():
    print(doc)
# GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equa...
# GenericDoc(doc_id='1', text='The Manhattan Project and its atomic bomb helped bring an en...
# ...

# Queries
for query in dataset.queries_iter():
    print(query)
# GenericQuery(query_id='121352', text='define extreme')
# GenericQuery(query_id='634306', text='what does chattel mean on credit history')
# ...

# Query relevance judgments (qrels)
for qrel in dataset.qrels_iter():
    print(qrel)
# TrecQrel(query_id='1185869', doc_id='0', relevance=1, iteration='0')
# TrecQrel(query_id='1185868', doc_id='16', relevance=1, iteration='0')
# ...

# Look up queries and documents by ID
queries_store = dataset.queries_store()
queries_store.get("1185868")
# GenericQuery(query_id='1185868', text='_________ justice is designed to repair the harm to victim, the comm...

dataset = ir_datasets.wrappers.DocstoreWrapper(dataset)
doc_store = dataset.docs_store()
doc_store.get("16")
# GenericDoc(doc_id='16', text='The approach is based on a theory of justice that considers crime and wrongdoi...
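
Because each iterator yields namedtuples, records can be handled with ordinary Python tools. The following is a minimal sketch of that idea, not part of the package: GenericQuery and TrecQrel are re-declared here as stand-ins (with the field names shown in the output above) so the example runs without downloading a dataset. In real use the records would come from dataset.queries_iter() and dataset.qrels_iter().

```python
from collections import defaultdict, namedtuple

# Stand-ins for the record types shown above (assumption: same field names).
# In real use, these records come from dataset.queries_iter() / dataset.qrels_iter().
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

queries = [
    GenericQuery("121352", "define extreme"),
    GenericQuery("634306", "what does chattel mean on credit history"),
]
qrels = [
    TrecQrel("1185869", "0", 1, "0"),
    TrecQrel("1185868", "16", 1, "0"),
]

# Namedtuple fields can be read by name or by position.
first = queries[0]
same_field = first.query_id == first[0]

# Group judgments by query: {query_id: {doc_id: relevance}}.
qrels_by_query = defaultdict(dict)
for qrel in qrels:
    qrels_by_query[qrel.query_id][qrel.doc_id] = qrel.relevance

print(dict(qrels_by_query))
```

A nested dictionary like this is a common shape for evaluation code, since it gives the judged documents for a query in one lookup.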

If you want to use your own dataset, you can construct an object with the same interface as the standard benchmarks by:

import ir_datasets
dataset = ir_datasets.create_dataset(
  docs_tsv="path/to/docs.tsv",
  queries_tsv="path/to/queries.tsv",
  qrels_trec="path/to/qrels.trec"
)

Here, documents and queries are TSV files with one [id]\t[text] record per line. Query relevance judgments are provided in the standard TREC format: [query_id] [iteration] [doc_id] [rel].
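
As an illustration of these two file formats, the sketch below writes a tiny collection to disk using only the standard library. The helper names (write_tsv, write_qrels), the sample records, and the use of "0" as a placeholder iteration value are assumptions made for this example; they are not part of ir_datasets.

```python
import os
import tempfile


def write_tsv(path, rows):
    # One [id]<TAB>[text] record per line.
    with open(path, "w", encoding="utf-8") as f:
        for item_id, text in rows:
            # Collapse tabs and newlines inside the text so they cannot
            # be mistaken for field or record separators.
            clean = " ".join(text.split())
            f.write(f"{item_id}\t{clean}\n")


def write_qrels(path, judgments):
    # One "[query_id] [iteration] [doc_id] [rel]" record per line.
    with open(path, "w", encoding="utf-8") as f:
        for query_id, doc_id, rel in judgments:
            # "0" is a placeholder iteration value (assumption).
            f.write(f"{query_id} 0 {doc_id} {rel}\n")


out_dir = tempfile.mkdtemp()
docs_path = os.path.join(out_dir, "docs.tsv")
queries_path = os.path.join(out_dir, "queries.tsv")
qrels_path = os.path.join(out_dir, "qrels.trec")

write_tsv(docs_path, [("d1", "first example document"),
                      ("d2", "second document\twith a tab")])
write_tsv(queries_path, [("q1", "example query")])
write_qrels(qrels_path, [("q1", "d1", 1)])

with open(docs_path, encoding="utf-8") as f:
    docs_lines = f.read().splitlines()
with open(qrels_path, encoding="utf-8") as f:
    qrels_lines = f.read().splitlines()
```

The three resulting paths can then be passed as docs_tsv, queries_tsv, and qrels_trec to ir_datasets.create_dataset, as in the call shown above.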

Command Line Interface

Export data in various formats:

$ ir_datasets export [dataset-id] [docs/queries/qrels/scoreddocs/docpairs]

$ ir_datasets export msmarco-passage/train docs | head -n2
0	The presence of communication amid scientific minds was equally important to the success of the Manh...
1	The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peacefu...

$ ir_datasets export msmarco-passage/train docs --format jsonl | head -n2
{"doc_id": "0", "text": "The presence of communication amid scientific minds was equally important to the su...
{"doc_id": "1", "text": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its ...

--format specifies the output format (e.g., tsv or jsonl). --fields specifies which fields to include in the output. The available fields depend on the dataset, but most datasets try to include common fields; e.g., the text field of a document returns the document's text without any markup.
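
To make the relationship between the two output formats concrete, the sketch below converts TSV-style lines into JSONL records in plain Python. It only mirrors the shape of the export output shown above and is not how the CLI is implemented; the function name tsv_to_jsonl and the sample passages are made up for this example.

```python
import json


def tsv_to_jsonl(lines, fields=("doc_id", "text")):
    # Turn "[id]<TAB>[text]" lines into one JSON object per line,
    # keyed by the given field names.
    for line in lines:
        # Split on at most len(fields) - 1 tabs so that any extra tab
        # stays inside the last field.
        values = line.rstrip("\n").split("\t", len(fields) - 1)
        yield json.dumps(dict(zip(fields, values)))


# Made-up sample lines in the same shape as the TSV export.
sample = ["0\tfirst passage\n", "1\tsecond passage\n"]
records = list(tsv_to_jsonl(sample))

for record in records:
    print(record)
```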

Look up documents and queries by ID:

$ ir_datasets lookup [dataset-id] [--qid] [ids...]

--format and --fields also work here. --qid looks up queries instead of documents (documents are the default).

This is much faster than using ir_datasets export ... | grep (or similar) because it indexes the documents/queries by ID.
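
The difference between scanning and an ID index can be sketched in a few lines. This is only an in-memory illustration using a Python dict; the index that the package actually builds is not described here, and the function names and generated records are assumptions made for the example.

```python
def scan_lookup(records, wanted_id):
    # Linear scan, comparable to export ... | grep: every record may be read.
    for doc_id, text in records:
        if doc_id == wanted_id:
            return text
    return None


def build_index(records):
    # Build the ID -> text mapping once; later lookups avoid a full scan.
    return {doc_id: text for doc_id, text in records}


# Made-up records standing in for a document collection.
records = [(str(i), f"passage {i}") for i in range(100_000)]
index = build_index(records)

by_scan = scan_lookup(records, "99999")
by_index = index.get("99999")
missing = index.get("not-an-id")
```

Both approaches return the same record; the index pays a one-time build cost so that each later lookup no longer depends on the size of the collection.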

Datasets

Available datasets include the following, each of which contains subsets:

  • antique
  • car-v1.5
  • cord19
  • msmarco-document
  • msmarco-passage
  • nyt
  • trec-arabic
  • trec-mandarin
  • trec-robust04
  • trec-spanish

See the datasets documentation page for details about each dataset, its available subsets, and what data they provide.

Citing

When using datasets provided by this package, be sure to properly cite them. BibTeX for each dataset can be found on the datasets documentation page, or in the Python interface via dataset.bibtex() (when available).

The ir_datasets package was released as part of ABNIRML, so please cite the following if you use this package:

@article{macavaney:arxiv2020-abnirml,
  author = {MacAvaney, Sean and Feldman, Sergey and Goharian, Nazli and Downey, Doug and Cohan, Arman},
  title = {ABNIRML: Analyzing the Behavior of Neural IR Models},
  year = {2020},
  url = {https://arxiv.org/abs/2011.00696},
  journal = {arXiv},
  volume = {abs/2011.00696}
}
