
Domain Adaptation for Memory-Efficient Dense Retrieval

Project description

:dollar: What is it?

The Index Compression Methods (INCOME) repository helps you easily train and evaluate different memory-efficient dense retrievers on any custom dataset. Pre-trained dense retrievers produce float embeddings with dimensions ranging from 512 up to 1024, so storing a large number of embeddings requires a lot of memory / storage. We focus on index compression: the final embeddings are binary and lower-dimensional, which saves both storage and money when hosting such models in a practical setup.
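
As a rough illustration of where the savings come from (plain NumPy, not the income API), binarizing a float32 embedding by sign and packing the bits shrinks the index by a factor of 32; the supervised methods below learn such compact codes end-to-end instead of applying this naive thresholding:

```python
# Illustration only, NOT the income API: shows why binary codes are ~32x
# smaller than float32 embeddings of the same dimensionality.
import numpy as np

num_passages, dim = 1_000_000, 768
float_embeddings = np.random.randn(num_passages, dim).astype(np.float32)

# Naive binarization: keep only the sign of each dimension and pack
# 8 dimensions into one byte. Models like BPR learn these codes during training.
binary_codes = np.packbits(float_embeddings > 0, axis=1)

print(f"float32 index: {float_embeddings.nbytes / 1e9:.2f} GB")  # ~3.07 GB
print(f"binary index:  {binary_codes.nbytes / 1e9:.2f} GB")      # ~0.10 GB (32x smaller)
```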

We currently support the following memory-efficient dense retriever architectures:

- BPR: Binary Passage Retriever (Yamada et al., 2021)
- JPQ: Joint optimization of query encoding and Product Quantization (Zhan et al., 2021)

For more information, check out our publication: Domain Adaptation for Memory-Efficient Dense Retrieval (Thakur et al., 2022).

:dollar: Installation

One can install income either via pip

pip install income

or via source using git clone

$ git clone https://github.com/Nthakur20/income.git
$ cd income
$ pip install -e .

With that, you should be ready to go!

:dollar: Models Supported

We currently support training and inference of these compressed dense retrievers within our repository:

| Model | HF Checkpoint | BEIR (Avg. NDCG@10) | Memory Size | Query Time | GCP Instance | Cost per Month (in $) |
|:---|:---|:---:|:---:|:---:|:---|:---:|
| **No Compression** | | | | | | |
| TAS-B (Hofstätter et al., 2021) | TAS-B | 0.415 | 65 GB (1x) | 456.9 ms | n2-highmem-8 | $306.05 |
| TAS-B + HNSW (Hofstätter et al., 2021) | TAS-B | 0.415 | 151 GB (1x) | 1.8 ms | n2-highmem-32 | $1224.19 |
| TAS-B + PQ (Hofstätter et al., 2021) | TAS-B | 0.361 | 2 GB (32x) | 44.0 ms | n1-standard-1 | $24.27 |
| **Supervised Compression** | | | | | | |
| BPR (Yamada et al., 2021) | NQ (DPR) | 0.201 | 2.2 GB (32x) | 38.1 ms | n1-standard-1 | $24.27 |
| BPR (Thakur et al., 2022) | TAS-B | 0.357 | 2.2 GB (32x) | 38.1 ms | n1-standard-1 | $24.27 |
| JPQ (Zhan et al., 2021) | STAR (query) (doc) | 0.389 | 2.2 GB (32x) | 44.0 ms | n1-standard-1 | $24.27 |
| JPQ (Thakur et al., 2022) | TAS-B (query) (doc) | 0.402 | 2.2 GB (32x) | 44.0 ms | n1-standard-1 | $24.27 |

The index sizes and costs are estimated for a user who wants to build a semantic search engine over English Wikipedia, which contains about 21 million passages to encode. Using float32 (and no further compression) with 768 dimensions, the resulting embeddings have a size of about 65 GB. The n2-highmem-8 server provides up to 64 GB of memory, whereas the n1-standard-1 server provides up to 3.75 GB of memory.
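
A back-of-the-envelope check of the numbers quoted above (assuming 21 million passages and 768-dimensional vectors):

```python
# Back-of-the-envelope check of the index sizes quoted above.
num_passages = 21_000_000   # English Wikipedia passages
dim = 768                   # TAS-B embedding dimension

float32_bytes = num_passages * dim * 4    # 4 bytes per float32 value
binary_bytes = num_passages * dim // 8    # 1 bit per dimension

print(f"float32 index: {float32_bytes / 1e9:.0f} GB")  # ~65 GB, needs n2-highmem-8
print(f"binary index:  {binary_bytes / 1e9:.1f} GB")    # ~2 GB, fits on n1-standard-1
```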

:dollar: Reproduction Scripts with TAS-B

| Method | Script | BEIR (Avg. NDCG@10) | Memory Size |
|:---|:---|:---:|:---:|
| **Baselines** | | | |
| fp-16 | evaluate_fp16.py | 0.414 | 33 GB (2x) |
| fp-8 | evaluate_fp16.py | 0.407 | 16 GB (4x) |
| PCA | evaluate_pca.py | 0.235 | 22 GB (3x) |
| TLDR | evaluate_pca.py | 0.240 | 22 GB (3x) |
| PQ | evaluate_pq.py | 0.361 | 2.2 GB (32x) |
| **Supervised Compression** | | | |
| BPR | bpr_beir_evaluation.py | 0.357 | 2.2 GB (32x) |
| JPQ | jpq_beir_evaluation.py | 0.402 | 2.2 GB (32x) |
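
To get a feel for what the PQ baseline is doing, here is a minimal product-quantization sketch with faiss (this assumes `faiss-cpu` is installed and uses random vectors as stand-ins for real embeddings; the actual evaluate_pq.py script handles the BEIR corpora and evaluation on top of such an index):

```python
# Minimal product-quantization (PQ) sketch with faiss, a rough stand-in for
# the kind of index evaluate_pq.py builds, using random vectors for brevity.
import faiss
import numpy as np

dim = 768
corpus_emb = np.random.randn(100_000, dim).astype(np.float32)  # document embeddings
query_emb = np.random.randn(10, dim).astype(np.float32)        # query embeddings

# 96 sub-vectors x 8 bits each = 96 bytes per document (~32x smaller than float32).
index = faiss.IndexPQ(dim, 96, 8, faiss.METRIC_INNER_PRODUCT)
index.train(corpus_emb)   # learn the codebooks
index.add(corpus_emb)     # encode and store the compressed corpus

scores, doc_ids = index.search(query_emb, 10)  # top-10 documents per query
```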

:dollar: Why should we do domain adaptation?

| Method | Script | BEIR (Avg. NDCG@10) | Memory Size |
|:---|:---|:---:|:---:|
| **Supervised Compression** | | | |
| BPR+GenQ | train_bpr_genq.sh | 0.377 | 2.2 GB (32x) |
| BPR+GPL | train_bpr_gpl.sh | 0.398 | 2.2 GB (32x) |
| JPQ+GenQ | train_jpq_genq.sh | 0.417 | 2.2 GB (32x) |
| JPQ+GPL | train_jpq_gpl.sh | 0.435 | 2.2 GB (32x) |
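
Both GenQ and GPL adapt the compressed retriever to a new domain by first generating synthetic queries for the target corpus with a T5 query generator. A rough sketch of that generation step, using the checkpoint referenced in the training command further below (the train_*.sh scripts wrap this together with the actual training):

```python
# Sketch of the synthetic query generation step behind GenQ/GPL, using the
# T5 query generator referenced in the BPR training command below.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "BeIR/query-gen-msmarco-t5-base-v1"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

passage = "Dense retrievers map queries and passages into a shared vector space."
inputs = tokenizer(passage, return_tensors="pt")

# Sample a few synthetic queries for this passage; these become (query, passage)
# training pairs for domain adaptation.
outputs = model.generate(**inputs, max_length=64, do_sample=True, top_p=0.95,
                         num_return_sequences=3)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```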


:dollar: Inference

:dollar: Training

:dollar: BPR

export dataset="nfcorpus"

python -m income.bpr.train \
    --path_to_generated_data "generated/$dataset" \
    --base_ckpt "msmarco-distilbert-base-tas-b" \
    --gpl_score_function "dot" \
    --batch_size_gpl 32 \
    --gpl_steps 10000 \
    --new_size -1 \
    --queries_per_passage -1 \
    --output_dir "output/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-tas-b" "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
    --retriever_score_functions "dot" "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --qgen_prefix "gen-t5-base-2-epoch-default-lr-3-ques" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --do_evaluation \
    --use_amp   # Use this for efficient training if the machine supports AMP
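
Once trained, BPR-style retrieval is typically two-stage: cheap candidate generation with Hamming distance over the binary document codes, followed by reranking of the candidates with the float query embedding against the ±1 codes. A minimal NumPy sketch of that idea (illustrative only, not the repository's inference code):

```python
# Illustrative two-stage BPR-style search, NOT the repository's inference code.
import numpy as np

dim, num_docs, num_candidates = 768, 100_000, 1000

# Pretend these codes came from a trained encoder; only the packed bits are stored.
doc_codes = np.packbits(np.random.randn(num_docs, dim) > 0, axis=1)

query = np.random.randn(dim).astype(np.float32)   # float query embedding
query_code = np.packbits(query > 0)                # binary query hash

# Stage 1: candidate generation via Hamming distance (popcount of XOR).
hamming = np.unpackbits(doc_codes ^ query_code, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:num_candidates]

# Stage 2: rerank candidates with the float query against the {-1, +1} doc codes
# (the float document embeddings are never stored, keeping the index small).
candidate_vecs = 2.0 * np.unpackbits(doc_codes[candidates], axis=1).astype(np.float32) - 1.0
top10 = candidates[np.argsort(-(candidate_vecs @ query))[:10]]
print(top10)
```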

:dollar: JPQ

:dollar: Disclaimer

For reproducibility purposes, we work with the original code repositories and modify them within INCOME where they are available, e.g. BPR and JPQ. It remains the user's responsibility to determine whether they have permission to use the original models and to cite the rightful owner of each model. Check the table below for reference.

If you're a model owner and wish to update any part of it, or do not want your model to be included in this library, feel free to post an issue here or make a pull request!

| Model/Method | Citation | GitHub |
|:---|:---|:---|
| BPR | Yamada et al., 2021 | https://github.com/studio-ousia/bpr |
| JPQ | Zhan et al., 2021 | https://github.com/jingtaozhan/JPQ |
| GPL | Wang et al., 2021 | https://github.com/UKPLab/gpl |

:dollar: Citing & Authors

If you find this repository helpful, feel free to cite our recent publication: Domain Adaptation for Memory-Efficient Dense Retrieval:

@article{thakur2022domain,
  title={Domain Adaptation for Memory-Efficient Dense Retrieval},
  author={Thakur, Nandan and Reimers, Nils and Lin, Jimmy},
  journal={arXiv preprint arXiv:2205.11498},
  year={2022},
  url={https://arxiv.org/abs/2205.11498/}
}

The main contributors of this repository are:

Contact person: Nandan Thakur, nandant@gmail.com

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

income-0.0.2.tar.gz (70.8 kB)

Uploaded Source

Built Distribution

income-0.0.2-py3-none-any.whl (107.3 kB)

Uploaded Python 3

File details

Details for the file income-0.0.2.tar.gz.

File metadata

  • Download URL: income-0.0.2.tar.gz
  • Upload date:
  • Size: 70.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.8.3 keyring/23.4.0 rfc3986/1.5.0 colorama/0.4.4 CPython/3.6.12

File hashes

Hashes for income-0.0.2.tar.gz

| Algorithm | Hash digest |
|:---|:---|
| SHA256 | 7f72491d8a1fcf1c5068709934128182a755271fe7039130d5f8b6bfa317d7b3 |
| MD5 | 06089adf84925f3d109ce504b0e192b2 |
| BLAKE2b-256 | 169e502817e269ce8943ead9e64f940774fc11f3b03cf3174a2def209461bf50 |

See more details on using hashes here.
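
If you want to verify a downloaded archive against the SHA256 digest above, a short check like the following works (the filename is whatever path you saved the archive to):

```python
# Verify a downloaded archive against the SHA256 digest listed above.
import hashlib

expected = "7f72491d8a1fcf1c5068709934128182a755271fe7039130d5f8b6bfa317d7b3"
with open("income-0.0.2.tar.gz", "rb") as f:   # path to the downloaded file
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == expected, "SHA256 mismatch, do not install this file"
```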

File details

Details for the file income-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: income-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 107.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.8.3 keyring/23.4.0 rfc3986/1.5.0 colorama/0.4.4 CPython/3.6.12

File hashes

Hashes for income-0.0.2-py3-none-any.whl

| Algorithm | Hash digest |
|:---|:---|
| SHA256 | ae1c027a72785ad6658d43fcd9f9a5145497ebe92ccb305d0282d24c2fa2ceef |
| MD5 | 87175d982883e933b3165ed7177fd1d0 |
| BLAKE2b-256 | cd0158d5733a866a0f8876c3dd065d91c45f8a8dc6d98b6b33002e4185321698 |

See more details on using hashes here.
