
ir_datasets

ir_datasets is a python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc. It was built as a fork of OpenNIR to allow easier integration with other systems.

The package takes care of downloading datasets (including documents, queries, relevance judgments, etc.) when available from public sources. Instructions on how to obtain datasets are provided when they are not publicly available.

ir_datasets provides a common iterator format so that datasets can be easily used in Python. It attempts to provide the data in an unaltered form (i.e., keeping all fields and markup), while handling differences in file formats, encoding, etc. Adapters provide extra functionality, e.g., quick lookups of documents by ID.

A command line interface is also available.

You can find a list of datasets and their features on the datasets documentation page. Want a new dataset, added functionality, or a bug fixed? Feel free to post an issue or make a pull request!

Getting Started

Install locally with:

$ git clone https://github.com/allenai/ir_datasets
$ cd ir_datasets
$ python setup.py bdist_wheel
$ pip install dist/ir_datasets-*.whl

Tested with Python versions 3.6 and 3.7.
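
Since the package is published on PyPI, installing the released version directly should also work:

$ pip install ir_datasets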

Python Interface

Load a dataset, such as the MS-MARCO passage ranking dataset, using:

import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')

A dataset object lets you iterate through supported properties like docs (dataset.docs_iter()), queries (dataset.queries_iter()), and relevance judgments (dataset.qrels_iter()). Each iterator yields namedtuples, with fields based on the available data.

# Documents
for doc in dataset.docs_iter():
    print(doc)
# GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equa...
# GenericDoc(doc_id='1', text='The Manhattan Project and its atomic bomb helped bring an en...
# ...

# Queries
for query in dataset.queries_iter():
    print(query)
# GenericQuery(query_id='121352', text='define extreme')
# GenericQuery(query_id='634306', text='what does chattel mean on credit history')
# ...

# Query relevance judgments (qrels)
for qrel in dataset.qrels_iter():
    print(qrel)
# TrecQrel(query_id='1185869', doc_id='0', relevance=1, iteration='0')
# TrecQrel(query_id='1185868', doc_id='16', relevance=1, iteration='0')
# ...
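
Since each item is a namedtuple, individual fields can be accessed by name. For example, a minimal sketch using the standard itertools.islice to peek at the first few documents without iterating through the whole collection:

import itertools
for doc in itertools.islice(dataset.docs_iter(), 3):
    print(doc.doc_id, doc.text[:50])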

# Look up queries and documents by ID
queries_store = dataset.queries_store()
queries_store.get("1185868")
# GenericQuery(query_id='1185868', text='_________ justice is designed to repair the harm to victim, the comm...

dataset = ir_datasets.wrappers.DocstoreWrapper(dataset)
doc_store = dataset.docs_store()
doc_store.get("16")
# GenericDoc(doc_id='16', text='The approach is based on a theory of justice that considers crime and wrongdoi...

If you want to use your own dataset, you can construct an object with the same interface as the standard benchmarks by:

import ir_datasets
dataset = ir_datasets.create_dataset(
  docs_tsv="path/to/docs.tsv",
  queries_tsv="path/to/queries.tsv",
  qrels_trec="path/to/qrels.trec"
)

Here, documents and queries are represented in TSV format as [id]\t[text]. Query relevance judgments are provided in the standard TREC qrels format: [query_id] [iteration] [doc_id] [rel].
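
For example, the input files might look like the following (hypothetical contents, for illustration only; the columns in docs.tsv and queries.tsv are separated by tabs):

docs.tsv:
D1	This is the text of the first document.
D2	This is the text of the second document.

queries.tsv:
Q1	example query text

qrels.trec:
Q1 0 D1 1
Q1 0 D2 0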

Command Line Interface

Export data in various formats:

$ ir_datasets export [dataset-id] [docs/queries/qrels/scoreddocs/docpairs]

$ ir_datasets export msmarco-passage/train docs | head -n2
0	The presence of communication amid scientific minds was equally important to the success of the Manh...
1	The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peacefu...

$ ir_datasets export msmarco-passage/train docs --format jsonl | head -n2
{"doc_id": "0", "text": "The presence of communication amid scientific minds was equally important to the su...
{"doc_id": "1", "text": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its ...

--format specifies the output format (e.g., tsv or jsonl). --fields specifies which fields to include in the output. Available fields depend on the dataset, but most include common ones; e.g., the text field of a document contains the document text without any markup.
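
For example, assuming the field names match the namedtuple fields shown above (doc_id and text), the document IDs alone could be exported with:

$ ir_datasets export msmarco-passage/train docs --fields doc_id | head -n2
0
1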

Look up documents and queries by ID:

$ ir_datasets lookup [dataset-id] [--qid] [ids...]

--format and --fields also work here. --qid indicates that queries should be looked up instead of documents (the default).

This is much faster than using ir_datasets export ... | grep (or similar) because it indexes the documents/queries by ID.
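
For example (output shown here is illustrative, in the default tsv format):

$ ir_datasets lookup msmarco-passage/train 16
16	The approach is based on a theory of justice that considers crime and wrongdoi...

$ ir_datasets lookup msmarco-passage/train --qid 1185868
1185868	_________ justice is designed to repair the harm to victim, the comm...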

Datasets

Available datasets include (each containing subsets):

  • antique
  • car-v1.5
  • cord19
  • msmarco-document
  • msmarco-passage
  • nyt
  • trec-arabic
  • trec-mandarin
  • trec-robust04
  • trec-spanish

See the datasets documentation page for details about each dataset, its available subsets, and what data they provide.

Citing

When using datasets provided by this package, be sure to properly cite them. Bibtex for each dataset can be found on the datasets documentation page, or in the python interface via dataset.bibtex() (when available).
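
For example (assuming the loaded dataset provides citation information):

import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
print(dataset.bibtex())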

The ir_datasets package was released as part of ABNIRML, so please cite the following if you use this package:

@article{macavaney:arxiv2020-abnirml,
  author = {MacAvaney, Sean and Feldman, Sergey and Goharian, Nazli and Downey, Doug and Cohan, Arman},
  title = {ABNIRML: Analyzing the Behavior of Neural IR Models},
  year = {2020},
  url = {https://arxiv.org/abs/2011.00696},
  journal = {arXiv},
  volume = {abs/2011.00696}
}
