
Trove

A Flexible Toolkit for Dense Retrieval




Trove is a lightweight toolkit for training and evaluating transformer-based dense retrievers. It aims to keep the codebase simple and hackable, while offering a clean, unified interface for quickly experimenting with new ideas.

Key features:

  • Well-documented and easy-to-understand codebase
  • Simple, modular design that's easy to extend and integrate into different workflows
  • Minimal, consistent interface for evaluation and hard negative mining
  • Built to work seamlessly with the Hugging Face ecosystem (e.g., PEFT methods, distributed training/inference)
  • Effortless manipulation and combination of multiple datasets on-the-fly

🎓 Documentation

📚 Examples

⭐ Check out our recent paper (and code) to see how Trove's data manipulation capabilities enable us to train retrievers with synthetic multi-level ranking contexts.

Quick Tour

Install Trove from PyPI:

pip install ir-trove

To get the latest changes, install from source:

pip install git+https://github.com/BatsResearch/trove

Training

Documentation

Train with binary labels:

from transformers import AutoTokenizer, HfArgumentParser
from trove import *

parser = HfArgumentParser((RetrievalTrainingArguments, ModelArguments, DataArguments))
train_args, model_args, data_args = parser.parse_args_into_dataclasses()

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
model = BiEncoderRetriever.from_model_args(args=model_args)

pos = MaterializedQRelConfig(
    min_score=1,
    qrel_path="train_qrel.tsv",
    corpus_path="corpus.jsonl",
    query_path="queries.jsonl",
)
neg = MaterializedQRelConfig(
    max_score=1,
    qrel_path="train_qrel.tsv",
    corpus_path="corpus.jsonl",
    query_path="queries.jsonl",
)
dataset = BinaryDataset(
    data_args=data_args,
    positive_configs=pos,
    negative_configs=neg,
    format_query=model.format_query,
    format_passage=model.format_passage,
)
data_collator = RetrievalCollator(
    data_args=data_args,
    tokenizer=tokenizer,
    append_eos=model.append_eos_token,
)

trainer = RetrievalTrainer(
    args=train_args,
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
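The snippet above references `corpus.jsonl`, `queries.jsonl`, and `train_qrel.tsv` without showing their contents. Trove's documentation is the authority on the exact schema it expects; as an illustration only, a BEIR-style layout (an assumption, not Trove's confirmed format) for toy versions of these files could be written like this:

```python
import json

# Toy versions of the files referenced above, in a BEIR-style layout.
# NOTE: this schema is an assumption for illustration; check Trove's
# documentation for the format it actually expects.
corpus = [
    {"_id": "d1", "title": "Dense retrieval", "text": "Bi-encoders embed queries and passages into one vector space."},
    {"_id": "d2", "title": "Sparse retrieval", "text": "BM25 scores passages by term overlap."},
]
with open("corpus.jsonl", "w") as f:
    for doc in corpus:
        f.write(json.dumps(doc) + "\n")

with open("queries.jsonl", "w") as f:
    f.write(json.dumps({"_id": "q1", "text": "what is a bi-encoder?"}) + "\n")

# train_qrel.tsv: TREC-style rows of (query_id, passage_id, relevance score).
with open("train_qrel.tsv", "w") as f:
    f.write("q1\td1\t1\n")
    f.write("q1\td2\t0\n")
```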

To train with graduated relevance labels (e.g., {0, 1, 2, 3}), you just need to change a few lines:

...
conf = MaterializedQRelConfig(
    qrel_path="train_qrel.tsv",
    corpus_path="corpus.jsonl",
    query_path="queries.jsonl",
)
dataset = MultiLevelDataset(
    data_args=data_args,
    qrel_config=conf,
    format_query=model.format_query,
    format_passage=model.format_passage,
)
...

Data Manipulation

Documentation

Manipulate and combine multiple data sources with just a few lines of code. The following snippet combines a multi-level synthetic dataset (with labels {0, 1, 2, 3}) with real annotated positives and two mined hard negatives per query. Before merging, it also reassigns the label values: real positives are labeled 3, and mined negatives are labeled 1.

...
real_pos = MaterializedQRelConfig(
    min_score=1,
    score_transform=3,
    corpus_path="real_corpus.jsonl",
    qrel_path="qrels/train.tsv",
    query_path="queries.jsonl",
)
mined_neg = MaterializedQRelConfig(
    group_random_k=2,
    score_transform=1,
    corpus_path="real_corpus.jsonl",
    qrel_path="mined_qrel.tsv",
    query_path="queries.jsonl",
)
synth_data = MaterializedQRelConfig(
    corpus_path="corpus_multilevel_synth.jsonl",
    qrel_path="qrel_multilevel_synth.tsv",
    query_path="queries.jsonl",
)

dataset = MultiLevelDataset(
    qrel_config=[real_pos, mined_neg, synth_data],
    data_args=data_args,
    format_query=model.format_query,
    format_passage=model.format_passage,
)
...
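Conceptually, each config applies a per-source filter and label rewrite before the sources are merged. A plain-Python sketch of that merge logic, on hypothetical toy qrels (this is not Trove's internal implementation):

```python
import random

# Toy qrels as {query_id: {passage_id: score}} per source (hypothetical data).
real_qrels = {"q1": {"d1": 1, "d2": 1}}
mined_qrels = {"q1": {"d3": 0, "d4": 0, "d5": 0}}
synth_qrels = {"q1": {"s1": 3, "s2": 2, "s3": 1, "s4": 0}}

merged = {}

# real_pos: keep passages with score >= 1, rewrite their label to 3.
for q, docs in real_qrels.items():
    for d, s in docs.items():
        if s >= 1:
            merged.setdefault(q, {})[d] = 3

# mined_neg: randomly sample 2 passages per query, rewrite their label to 1.
rng = random.Random(0)
for q, docs in mined_qrels.items():
    for d in rng.sample(sorted(docs), k=2):
        merged.setdefault(q, {})[d] = 1

# synth_data: pass the multi-level labels through unchanged.
for q, docs in synth_qrels.items():
    merged.setdefault(q, {}).update(docs)
```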

Inference

Documentation

Evaluation: Calculate IR metrics

from transformers import AutoTokenizer, HfArgumentParser
from trove import *

parser = HfArgumentParser((EvaluationArguments, ModelArguments, DataArguments))
eval_args, model_args, data_args = parser.parse_args_into_dataclasses()

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
model = BiEncoderRetriever.from_model_args(args=model_args)

conf = MaterializedQRelConfig(
    qrel_path="test_qrel.tsv",
    corpus_path="corpus.jsonl",
    query_path="queries.jsonl",
)
dataset = MultiLevelDataset(
    data_args=data_args,
    qrel_config=conf,
    format_query=model.format_query,
    format_passage=model.format_passage,
)
data_collator = RetrievalCollator(
    data_args=data_args,
    tokenizer=tokenizer,
    append_eos=model.append_eos_token,
)

evaluator = RetrievalEvaluator(
    args=eval_args,
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    eval_dataset=dataset,
)
evaluator.evaluate()
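`evaluate()` reports standard IR metrics. For reference, here is what one such metric, nDCG@k, computes over a single ranked list; this is an illustrative hand-rolled version, not Trove's implementation:

```python
import math

def dcg(rels):
    # Discounted cumulative gain: graded relevance discounted by log2 of rank.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal = sorted(ranked_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / denom if denom > 0 else 0.0

# Graded relevance of the top-4 retrieved passages for one query.
print(ndcg_at_k([3, 0, 2, 1], k=4))  # ≈ 0.93
```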

Hard Negative Mining: With very minor changes, you can use the above snippet to mine hard negatives for the given queries. You only need to change the last line:

...
evaluator.mine_hard_negatives()
...
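Under the hood, hard negative mining generally means: embed the queries and the corpus, rank passages by similarity, and keep the top-ranked passages that are not annotated positives. A toy sketch of that idea with precomputed scores (illustrative only; `mine_hard_negatives` here is a hypothetical helper, not Trove's API):

```python
# Hypothetical similarity scores of each corpus passage for one query.
scores = {"d1": 0.91, "d2": 0.87, "d3": 0.80, "d4": 0.42, "d5": 0.95}
positives = {"d5"}  # annotated relevant passages to exclude

def mine_hard_negatives(scores, positives, k):
    """Top-k highest-scoring passages that are not known positives."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ranked if d not in positives][:k]

print(mine_hard_negatives(scores, positives, k=2))  # → ['d1', 'd2']
```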

Distributed Environments

Trove supports both training and inference in multi-GPU and multi-node environments. You only need to run your script with a distributed launcher:

accelerate launch --multi_gpu {train.py or eval.py} {script arguments}

You can also use DeepSpeed for training. Since Trove wraps around and is fully compatible with Hugging Face Transformers, you only need to pass your DeepSpeed config file as a command-line argument. See the Hugging Face Transformers documentation for more details.
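For example, following the standard Hugging Face Transformers integration, the config file is passed through the `--deepspeed` training argument (the paths below are placeholders):

```shell
# ds_config.json is an ordinary DeepSpeed config file (e.g. ZeRO stage 2);
# train.py is the training script from the snippet above.
deepspeed train.py --deepspeed ds_config.json {script arguments}
```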

Citation

If you use this software, please cite us.

@misc{esfandiarpoortrove,
  author = {Reza Esfandiarpoor and Stephen H. Bach},
  title = {Trove: A Flexible Toolkit for Dense Retrieval},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/BatsResearch/trove}}
}

Acknowledgment

Some of the high-level design choices are inspired by the Tevatron library, and Trove adapts some of Tevatron's implementation details. Some data manipulation techniques are inspired by ideas in the Hugging Face Datasets source code.

This material is based upon work supported by the National Science Foundation under Grant No. RISE-2425380. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Disclosure: Stephen Bach is an advisor to Snorkel AI, a company that provides software and services for data-centric artificial intelligence.
