
Trove

A Flexible Toolkit for Dense Retrieval




Trove is a lightweight toolkit for training and evaluating transformer-based dense retrievers. It aims to keep the codebase simple and hackable, while offering a clean, unified interface for quickly experimenting with new ideas.

Key features:

  • Well-documented and easy-to-understand codebase
  • Simple, modular design that's easy to extend and integrate into different workflows
  • Minimal, consistent interface for evaluation and hard negative mining
  • Built to work seamlessly with the Hugging Face ecosystem (e.g., PEFT methods, distributed training/inference)
  • Effortless manipulation and combination of multiple datasets on-the-fly

Quick Tour

Install Trove using pip:

pip install git+https://github.com/BatsResearch/trove

Training

Train with binary labels:

from transformers import AutoTokenizer, HfArgumentParser
from trove import (
    BiEncoderRetriever,
    BinaryDataset,
    DataArguments,
    MaterializedQRelConfig,
    ModelArguments,
    RetrievalCollator,
    RetrievalTrainer,
    RetrievalTrainingArguments,
)

parser = HfArgumentParser((RetrievalTrainingArguments, ModelArguments, DataArguments))
train_args, model_args, data_args = parser.parse_args_into_dataclasses()

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
model = BiEncoderRetriever.from_model_args(args=model_args)

pos = MaterializedQRelConfig(
    min_score=1,  # qrel entries selected as positives
    qrel_path="train_qrel.tsv",
    corpus_path="corpus.jsonl",
    query_path="queries.jsonl",
)
neg = MaterializedQRelConfig(
    max_score=1,  # qrel entries selected as negatives
    qrel_path="train_qrel.tsv",
    corpus_path="corpus.jsonl",
    query_path="queries.jsonl",
)
dataset = BinaryDataset(
    data_args=data_args,
    positive_configs=pos,
    negative_configs=neg,
    format_query=model.format_query,
    format_passage=model.format_passage,
)
data_collator = RetrievalCollator(data_args=data_args, tokenizer=tokenizer, append_eos=model.append_eos_token)

trainer = RetrievalTrainer(
    args=train_args,
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
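The snippet above expects qrel and collection files on disk. Trove defines their exact schema; purely as a sketch, assuming JSONL records with "_id" and "text" fields and a three-column TSV qrel (query id, document id, relevance score) — all of which are assumptions here, not Trove's documented format — toy inputs could be generated like this:

```python
import json

# Hypothetical toy corpus and queries in JSONL form; the field names
# ("_id", "text") are an assumption, not Trove's documented schema.
corpus = [
    {"_id": "d1", "text": "Dense retrievers embed queries and passages."},
    {"_id": "d2", "text": "Cats sleep most of the day."},
]
queries = [{"_id": "q1", "text": "how do dense retrievers work?"}]

with open("corpus.jsonl", "w") as f:
    for rec in corpus:
        f.write(json.dumps(rec) + "\n")

with open("queries.jsonl", "w") as f:
    for rec in queries:
        f.write(json.dumps(rec) + "\n")

# Assumed TSV layout: query_id <tab> doc_id <tab> relevance score.
with open("train_qrel.tsv", "w") as f:
    f.write("q1\td1\t1\n")
    f.write("q1\td2\t0\n")
```

Check the Trove documentation for the actual expected file layout before training on real data.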

To train with graduated relevance labels (e.g., {0, 1, 2, 3}), you just need to change a few lines:

...
conf = MaterializedQRelConfig(qrel_path="train_qrel.tsv", corpus_path="corpus.jsonl", query_path="queries.jsonl")
dataset = MultiLevelDataset(
    data_args=data_args,
    qrel_config=conf,
    format_query=model.format_query,
    format_passage=model.format_passage,
)
...
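To see how the binary thresholds relate to graded labels, here is a plain-Python illustration, independent of Trove's API, that partitions graded qrel rows the way min_score/max_score filters would (bound semantics are an assumption: min_score inclusive, max_score exclusive):

```python
# Illustration only (not Trove code): graded qrel rows as
# (query_id, doc_id, relevance grade) tuples.
qrels = [("q1", "d1", 3), ("q1", "d2", 1), ("q1", "d3", 0)]

# Binary view: min_score=1 keeps grades >= 1 as positives.
positives = [(q, d) for q, d, s in qrels if s >= 1]
# Mirroring max_score=1, assuming an exclusive upper bound.
negatives = [(q, d) for q, d, s in qrels if s < 1]

# Multi-level view: the grade itself is the training signal.
graded = {(q, d): s for q, d, s in qrels}
```

MultiLevelDataset consumes the grades directly, so no thresholds are needed in the graded case.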

Inference

Evaluation: Calculate IR metrics

from transformers import AutoTokenizer, HfArgumentParser
from trove import (
    BiEncoderRetriever,
    DataArguments,
    EvaluationArguments,
    MaterializedQRelConfig,
    ModelArguments,
    MultiLevelDataset,
    RetrievalCollator,
    RetrievalEvaluator,
)

parser = HfArgumentParser((EvaluationArguments, ModelArguments, DataArguments))
eval_args, model_args, data_args = parser.parse_args_into_dataclasses()

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
model = BiEncoderRetriever.from_model_args(args=model_args)

conf = MaterializedQRelConfig(qrel_path="test_qrel.tsv", corpus_path="corpus.jsonl", query_path="queries.jsonl")
dataset = MultiLevelDataset(
    data_args=data_args,
    qrel_config=conf,
    format_query=model.format_query,
    format_passage=model.format_passage,
)
data_collator = RetrievalCollator(data_args=data_args, tokenizer=tokenizer, append_eos=model.append_eos_token)

evaluator = RetrievalEvaluator(
    args=eval_args,
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    eval_dataset=dataset,
)
evaluator.evaluate()
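For intuition about what "IR metrics" means here, a minimal nDCG@k computation (the standard formula with linear gains, not Trove's implementation) looks like:

```python
import math

def dcg(gains):
    # DCG = sum of gain_i / log2(i + 1), with ranks starting at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k):
    # Normalize the DCG of this ranking by the DCG of the ideal ordering.
    ideal = sorted(ranked_gains, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_gains[:k]) / denom if denom > 0 else 0.0

# Relevance grades of retrieved docs, in ranked order.
print(round(ndcg_at_k([3, 0, 1], k=3), 4))  # → 0.9639
```

Evaluation toolkits differ on details (exponential vs. linear gain, tie handling), so reported numbers may vary slightly across libraries.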

Hard Negative Mining: With a very minor change, you can reuse the snippet above to mine hard negatives for the given queries; only the last line differs:

...
evaluator.mine_hard_negatives()
...
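Conceptually, hard negative mining ranks the corpus with the current model and keeps top-scoring passages that are not labeled positive. A Trove-independent sketch of that selection step:

```python
# Illustration only (not Trove's implementation): pick the top-k
# scoring passages that are not known positives for the query.
def mine_hard_negatives(scores, positives, k):
    # scores: {doc_id: similarity}; positives: set of relevant doc_ids.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ranked if d not in positives][:k]

scores = {"d1": 0.91, "d2": 0.88, "d3": 0.40, "d4": 0.86}
print(mine_hard_negatives(scores, positives={"d1"}, k=2))  # → ['d2', 'd4']
```

Such high-scoring non-positives are "hard" because the model already confuses them with true positives, which makes them effective contrastive training signal.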

Distributed Environments

Trove supports both training and inference in multi-GPU and multi-node environments. You just need to run your script with a distributed launcher:

accelerate launch --multi_gpu {train.py or eval.py} {script arguments}

You can also use DeepSpeed for training. Since Trove wraps around and is fully compatible with Hugging Face Transformers, you only need to pass your DeepSpeed config file as a command-line argument. See the Hugging Face Transformers documentation for more details.

Citation

If you use this software, please cite us.

@misc{esfandiarpoortrove,
  author = {Reza Esfandiarpoor and Stephen H. Bach},
  title = {Trove: A Flexible Toolkit for Dense Retrieval},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/BatsResearch/trove}}
}

Acknowledgment

Some of the high-level design choices are inspired by the Tevatron library, and Trove adapts some of its implementation details. Some data manipulations are inspired by ideas in the Hugging Face Datasets source code.

This material is based upon work supported by the National Science Foundation under Grant No. RISE-2425380. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Disclosure: Stephen Bach is an advisor to Snorkel AI, a company that provides software and services for data-centric artificial intelligence.
