
Datamaestro module for Information Retrieval datasets


Information Retrieval Datasets

This datamaestro plugin provides easy and systematic access to information retrieval datasets. It handles automated downloading and preparation of standard IR collections, exposes them through a typed Python API, and includes efficient document stores for fast text access (file, mmap, or in-memory).
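
To make the idea of an mmap-backed document store concrete, here is a minimal, self-contained sketch using only the Python standard library. It is an illustration of the general technique, not the plugin's implementation: the class name `MmapDocStore` and its methods `build`, `text`, and `close` are hypothetical and are not part of the datamaestro API.

```python
import mmap
import tempfile
from pathlib import Path


class MmapDocStore:
    """Toy read-only store (hypothetical, for illustration only).

    Document texts are concatenated into one binary file; an in-memory
    index maps each document ID to its (byte offset, byte length).
    """

    def __init__(self, data_path: Path, offsets: dict[str, tuple[int, int]]):
        self._offsets = offsets
        self._file = open(data_path, "rb")
        # Map the whole file read-only; the OS pages data in on demand.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    @classmethod
    def build(cls, data_path: Path, docs) -> "MmapDocStore":
        """Write (doc_id, text) pairs to data_path and open a store on it."""
        offsets = {}
        with open(data_path, "wb") as out:
            for doc_id, text in docs:
                raw = text.encode("utf-8")
                offsets[doc_id] = (out.tell(), len(raw))
                out.write(raw)
        return cls(data_path, offsets)

    def text(self, doc_id: str) -> str:
        """Return the text of one document by slicing the mapped file."""
        start, length = self._offsets[doc_id]
        return self._mm[start:start + length].decode("utf-8")

    def close(self) -> None:
        self._mm.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    store = MmapDocStore.build(
        Path(tmp) / "docs.bin",
        [("d1", "first passage"), ("d2", "second passage")],
    )
    print(store.text("d2"))  # prints "second passage"
    store.close()
```

The trade-off this sketch illustrates is that only the offset index lives in process memory, while the text itself stays on disk and is read lazily through the mapping. An in-memory store would instead keep every text in a dictionary, and a plain file store would seek and read on each lookup.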

Full documentation: datamaestro-ir.readthedocs.io

Available Datasets

Ad-hoc Retrieval

  • TREC Ad-hoc (1–8), Robust 2004/2005 — classic TREC test collections over TIPSTER/AQUAINT corpora
  • BEIR Benchmark — 15+ datasets: TrecCovid, NQ, ArguAna, Touché, ClimateFever, SciDocs, NFCorpus, HotpotQA, FiQA, Quora, DBpedia-Entity, FEVER, SciFact, CQADupStack (12 sub-forums)
  • LoTTE — domain-specific retrieval across 6 domains (lifestyle, recreation, science, technology, writing, pooled) × dev/test × search/forum queries
  • MS MARCO Passage & Document — passage ranking (8.8M passages) and document ranking (v1: 3.2M, v2: 12M documents)
  • CORD-19 / TREC-COVID — COVID-19 research article retrieval (192K documents)

Conversational Search

  • TREC CaST 2019–2022 — conversational passage retrieval with decontextualized queries, tree-structured conversations (2022), and segmented passages
  • iKAT 2023–2025 — interactive knowledge-seeking over ClueWeb22

Query Rewriting

  • CANARD — context-aware query rewriting (train/dev/test)
  • QReCC — question rewriting in conversational context (14K conversations, 81K QA pairs)
  • OrConvQA — open-retrieval conversational QA over 11M Wikipedia passages

Knowledge Distillation & Training Data

  • MS MARCO Ensemble/BERT Teacher — 40M triples with teacher scores
  • rank-distillm — BM25/ColBERTv2/RankZephyr annotated passages
  • MS MARCO Hard Negatives — hard negatives mined from multiple retrieval models
  • Neural Ranking KD — knowledge distillation teacher scores
  • LightOn embeddings-pre-training — 73-config variant family ai.lighton.embeddings_pre_training[...] + DenseON-LateON mGTE recipe ...denseon_lateon.

Base Document Collections

  • TIPSTER (AP, FT, WSJ, ZIFF, …), AQUAINT, TREC CAR (29.8M paragraphs), WAPO v2/v4, KILT (42M Wikipedia articles)


Source Distribution

datamaestro_ir-0.4.1.tar.gz (87.1 kB view details)

Uploaded Source

Built Distribution

datamaestro_ir-0.4.1-py3-none-any.whl (121.0 kB view details)

Uploaded Python 3

File details

Details for the file datamaestro_ir-0.4.1.tar.gz.

File metadata

  • Download URL: datamaestro_ir-0.4.1.tar.gz
  • Upload date:
  • Size: 87.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datamaestro_ir-0.4.1.tar.gz
  • SHA256: 0a0e3c0723483dc6623bb613aae606e0c2c073fbfac0a8e8e63b5cbec0c2f7ac
  • MD5: be9af83dff56d639dadba8687bdf423c
  • BLAKE2b-256: e35954b5993f1532590ad50a6dab66f602e025a0df52d4ce8a73bf86d9f9c28a
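
A downloaded file can be checked against a published digest before installing it. The following is a minimal sketch using Python's standard `hashlib` and `hmac` modules; the helper names `sha256_of` and `matches_published` are hypothetical, and the local file path in the commented usage is an assumption about where the archive was saved.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex: str) -> bool:
    """Compare the file's SHA256 digest with a published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Usage (assumes the source distribution was saved in the current directory):
# matches_published(
#     "datamaestro_ir-0.4.1.tar.gz",
#     "0a0e3c0723483dc6623bb613aae606e0c2c073fbfac0a8e8e63b5cbec0c2f7ac",
# )
```

Reading in chunks keeps memory use constant regardless of file size. SHA256 is used here because MD5 is not collision-resistant and is unsuitable for integrity checks against tampering.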

Provenance

The following attestation bundles were made for datamaestro_ir-0.4.1.tar.gz:

Publisher: python-publish.yml on xpmir/datamaestro_ir

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file datamaestro_ir-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: datamaestro_ir-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 121.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for datamaestro_ir-0.4.1-py3-none-any.whl
  • SHA256: cff6333f7c828af21a119c638ffb1df128e992c10182ba501ead6dcffaae0fc4
  • MD5: 3f7721c9312d8cabea4df8a1cb4b31fd
  • BLAKE2b-256: 2f374fec9b9fca46e9ccc7cd60ed966179f650fa8ec080ecea2f3e617feb3b42

Provenance

The following attestation bundles were made for datamaestro_ir-0.4.1-py3-none-any.whl:

Publisher: python-publish.yml on xpmir/datamaestro_ir
