Datamaestro module for Information Retrieval datasets
Information Retrieval Datasets
This datamaestro plugin provides easy and systematic access to information retrieval datasets. It handles automated downloading and preparation of standard IR collections, exposes them through a typed Python API, and includes efficient document stores for fast text access (file, mmap, or in-memory).
Full documentation: datamaestro-ir.readthedocs.io
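To give an intuition for what an mmap-backed document store buys you, here is a minimal, self-contained sketch using only the Python standard library. It is not the plugin's implementation or API: the `MmapDocumentStore` class, the `write_collection` helper, and the one-document-per-line `id<TAB>text` layout are all assumptions made for illustration. The idea is to build a small in-memory index of byte offsets once, then fetch any document's text by slicing the memory-mapped file instead of loading the whole collection.

```python
import mmap
import os
import tempfile


def write_collection(path, docs):
    """Hypothetical helper: write documents as 'id<TAB>text' lines."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for doc_id, text in docs.items():
            f.write(f"{doc_id}\t{text}\n")


class MmapDocumentStore:
    """Toy store (illustrative only): looks up document text by ID via mmap."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Map the file read-only; the OS pages data in on demand.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._offsets = {}
        start, size = 0, len(self._mm)
        while start < size:
            end = self._mm.find(b"\n", start)
            if end == -1:
                end = size
            tab = self._mm.find(b"\t", start, end)
            if tab != -1:  # skip malformed lines without a tab separator
                doc_id = self._mm[start:tab].decode("utf-8")
                self._offsets[doc_id] = (tab + 1, end)
            start = end + 1

    def text(self, doc_id):
        """Return the text of one document by slicing the mapped bytes."""
        begin, end = self._offsets[doc_id]
        return self._mm[begin:end].decode("utf-8")

    def close(self):
        self._mm.close()
        self._file.close()


docs = {"d1": "neural ranking models", "d2": "conversational search"}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "collection.tsv")
    write_collection(path, docs)
    store = MmapDocumentStore(path)
    retrieved = store.text("d2")
    store.close()
```

Only the offset index lives in Python memory here, which is the usual trade-off between the mmap and fully in-memory variants: lookups stay fast while the resident footprint is left to the operating system's page cache.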
Available Datasets
Ad-hoc Retrieval
- TREC Ad-hoc (1–8), Robust 2004/2005 — classic TREC test collections over TIPSTER/AQUAINT corpora
- BEIR Benchmark — 15+ datasets: TrecCovid, NQ, ArguAna, Touché, ClimateFever, SciDocs, NFCorpus, HotpotQA, FiQA, Quora, DBpedia-Entity, FEVER, SciFact, CQADupStack (12 sub-forums)
- LoTTE — domain-specific retrieval across 6 domains (lifestyle, recreation, science, technology, writing, pooled) × dev/test × search/forum queries
- MS MARCO Passage & Document — passage ranking (8.8M passages) and document ranking (v1: 3.2M, v2: 12M documents)
- CORD-19 / TREC-COVID — COVID-19 research article retrieval (192K documents)
Conversational Search
- TREC CaST 2019–2022 — conversational passage retrieval with decontextualized queries, tree-structured conversations (2022), and segmented passages
- iKAT 2023–2025 — interactive knowledge-seeking over ClueWeb22
Query Rewriting
- CANARD — context-aware query rewriting (train/dev/test)
- QReCC — question rewriting in conversational context (14K conversations, 81K QA pairs)
- OrConvQA — open-retrieval conversational QA over 11M Wikipedia passages
Knowledge Distillation & Training Data
- MS MARCO Ensemble/BERT Teacher — 40M triples with teacher scores
- rank-distillm — BM25/ColBERTv2/RankZephyr annotated passages
- MS MARCO Hard Negatives — hard negatives mined from multiple retrieval models
- Neural Ranking KD — knowledge distillation teacher scores
Base Document Collections
- TIPSTER (AP, FT, WSJ, ZIFF, …), AQUAINT, TREC CAR (29.8M paragraphs), WAPO v2/v4, KILT (42M Wikipedia articles)
Download files
File details
Details for the file datamaestro_ir-0.2.1.tar.gz.
File metadata
- Download URL: datamaestro_ir-0.2.1.tar.gz
- Upload date:
- Size: 65.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6a0fc7279738dc0b703eeb7c02c790ed9cb0affe2d51ee2633cb76bbb7174e72 |
| MD5 | 1e2ffead4bc3fa44d711f863d609136d |
| BLAKE2b-256 | d5cac2bfed6daa23fafd4b7d3e0c2af3b6a21c6b978d9f4d53b8c9cb7eb2cf63 |
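If you download the archive manually, you can check it against the published SHA256 digest before installing. The sketch below uses only `hashlib` from the standard library. The helper names `sha256_of_file` and `matches_published_digest` are made up for illustration, and the demo hashes a small temporary file rather than the real archive, since the location of your downloaded copy is not something this page can know.

```python
import hashlib
import os
import tempfile

# SHA256 digest listed above for datamaestro_ir-0.2.1.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "6a0fc7279738dc0b703eeb7c02c790ed9cb0affe2d51ee2633cb76bbb7174e72"
)


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demo on a stand-in file; for the real check, pass the path of the
# downloaded archive together with EXPECTED_SDIST_SHA256.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"hello")
    sample_ok = matches_published_digest(
        sample, hashlib.sha256(b"hello").hexdigest()
    )
    sample_vs_sdist = matches_published_digest(sample, EXPECTED_SDIST_SHA256)
```

For automated installs, pip's hash-checking mode (`--require-hashes` with a pinned requirements file) performs the same comparison without custom code.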
Provenance
The following attestation bundles were made for datamaestro_ir-0.2.1.tar.gz:
Publisher: python-publish.yml on xpmir/datamaestro_ir
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datamaestro_ir-0.2.1.tar.gz
- Subject digest: 6a0fc7279738dc0b703eeb7c02c790ed9cb0affe2d51ee2633cb76bbb7174e72
- Sigstore transparency entry: 1225519737
- Sigstore integration time:
- Permalink: xpmir/datamaestro_ir@9fba504bc3446dd6ad41ef0c63b41ca0f6033812
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/xpmir
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@9fba504bc3446dd6ad41ef0c63b41ca0f6033812
- Trigger Event: release
File details
Details for the file datamaestro_ir-0.2.1-py3-none-any.whl.
File metadata
- Download URL: datamaestro_ir-0.2.1-py3-none-any.whl
- Upload date:
- Size: 93.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 776abbd7e36c3a80bb9c8f01507435085c4a1093592eeb07dae0129de2eac851 |
| MD5 | 59edc2d5a4876617078289f10de35485 |
| BLAKE2b-256 | 3d3360e86ff5800855fde1021454844a81587d911715e81e7dff48ad734734e9 |
Provenance
The following attestation bundles were made for datamaestro_ir-0.2.1-py3-none-any.whl:
Publisher: python-publish.yml on xpmir/datamaestro_ir
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datamaestro_ir-0.2.1-py3-none-any.whl
- Subject digest: 776abbd7e36c3a80bb9c8f01507435085c4a1093592eeb07dae0129de2eac851
- Sigstore transparency entry: 1225519833
- Sigstore integration time:
- Permalink: xpmir/datamaestro_ir@9fba504bc3446dd6ad41ef0c63b41ca0f6033812
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/xpmir
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@9fba504bc3446dd6ad41ef0c63b41ca0f6033812
- Trigger Event: release