
Project description

RGPL

This repository contains the code for the paper: [url].

Pretrained models can be downloaded from our Hugging Face repository: [URL].

Results on the BEIR and LoTTE benchmarks are shown below.

[Screenshots: BEIR and LoTTE results tables]

Installation

We provide two ways of installing from source: using poetry or using pip. The released package is also available on PyPI.

Installation from source using poetry

  1. Clone the repository:

    git clone https://github.com/your-username/DenseIG.git
    
  2. Install the project using poetry install:

    poetry install
    

Installation from source using pip

  1. Clone the repository:

    git clone https://github.com/your-username/DenseIG.git
    
  2. Install the dependencies using pip install:

    pip install -r requirements.txt
    

Installation from PyPI

Additionally, if you want to build on our code, you can install the released package directly from PyPI, as shown below.
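
For example (the package name rgpl matches the distribution files listed further below):

    pip install rgpl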

Usage

R-GPL works in three stages. First, we generate queries from the source documents. Then, initial hard negatives for these queries are mined with a pre-trained dense retriever. Finally, we use this generated data to distill knowledge from a cross-encoder model into the dense retriever.

R-GPL re-mines the hard negatives every k steps with the model that is undergoing domain adaptation! You can also distill knowledge from multiple cross-encoders and experiment with the reducer function.

  1. Generate Pseudo Queries from the corpus.
python3 gpl_query_writer.py data.given_path="PATH_TO_BEIR_DATA" query_writer.batch_size=128

To change the config, override these arguments.

query_writer:
  queries_per_passage: -1
  batch_size: 8
  augmented: no_aug
  use_train_qrels: False
  top_p: 0.95
  top_k: 25
  max_length: 64
  augment_probability: 1.0
  forward_model_path: Helsinki-NLP/opus-mt-en-fr
  back_model_path: Helsinki-NLP/opus-mt-fr-en
  augment_per_query: 2
  augment_temperature: 2.0

When queries_per_passage is -1, we use the predetermined number of queries per passage. Augmentation is not used in our paper, but you are free to experiment with it. The augmentation first translates the generated query into French and then back-translates it into English.
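
For example, to generate longer queries from a larger sampling pool, you can override the sampling parameters from the config above in the same way (the values here are only illustrative):

python3 gpl_query_writer.py data.given_path="PATH_TO_BEIR_DATA" \
    query_writer.top_k=50 query_writer.max_length=96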

  2. Mine the initial hard negatives
python3 gpl_hard_negative_miner.py data.given_path="PATH_TO_BEIR_DATA"

To change the config, override these arguments.

hard_negative_miner:
  negatives_per_query: 50
  query_augment_mod: ${query_writer.augmented}
  models: ["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"]
  score: [cos_sim, cos_sim]
  use_train_qrels: False
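
For instance, to mine negatives with a single retriever and fewer negatives per query, the list-valued options can be overridden in the same style as the commands above (the values are only illustrative; any bi-encoder supported by the miner should work):

python3 gpl_hard_negative_miner.py data.given_path="PATH_TO_BEIR_DATA" \
    hard_negative_miner.models='["msmarco-distilbert-base-v3"]' \
    hard_negative_miner.score='[cos_sim]' \
    hard_negative_miner.negatives_per_query=30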

  3. Start the training procedure
python3 gpl_trainer.py data.given_path="PATH_TO_BEIR_DATA"

To change the config, override these arguments.

trainer:
  cross_encoders: ["cross-encoder/ms-marco-MiniLM-L-6-v2"]
  bi_retriver: GPL/msmarco-distilbert-margin-mse
  bi_retriver_name: gpl
  reducer: average
  t_total: 140000
  eval_every: 25000
  remine_hard_negatives_every: 25000
  batch_size: 32
  warmup_steps: 1000
  amp_training: True
  evaluate_baseline: False
  load_test: True
  max_seq_length: 350
  seed: 1
  name: ${trainer.remine_hard_negatives_every}_${trainer.bi_retriver_name}_${trainer.reducer}

We extend the GPL training pipeline so that multiple cross-encoders can be used as teachers. This feature was not used in our paper; however, you are free to experiment with it!

To reproduce the setup of the GPL paper, set remine_hard_negatives_every to a number greater than 140000 (i.e., larger than t_total), so that the hard negatives are never re-mined; see the example below.
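
For example, here is a sketch of a run that distills from two cross-encoder teachers while never re-mining the negatives (the second cross-encoder name is only an illustration; the list override syntax follows the commands above):

python3 gpl_trainer.py data.given_path="PATH_TO_BEIR_DATA" \
    trainer.cross_encoders='["cross-encoder/ms-marco-MiniLM-L-6-v2","cross-encoder/ms-marco-MiniLM-L-12-v2"]' \
    trainer.reducer=average \
    trainer.remine_hard_negatives_every=150000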

The trainer saves the models and logs in the directory from which the script was called. Moreover, the hard negatives from each refresh, the generated queries, and the generated qrels are saved in PATH_TO_BEIR_DATA.

Expected Results

Running the commands below on the provided test data should give results similar to the following:

python3 gpl_query_writer.py data.given_path=test_data/arguana query_writer.queries_per_passage=1
python3 gpl_hard_negative_miner.py data.given_path=test_data/arguana
python3 gpl_trainer.py data.given_path=test_data/arguana trainer.t_total=2000 trainer.remine_hard_negatives_every=1000 trainer.batch_size=8

Before: NDCG@10 = 0.3388
After: NDCG@10 = 0.3853

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rgpl-0.1.0.tar.gz (28.1 kB, Source)

Built Distribution

rgpl-0.1.0-py3-none-any.whl (32.4 kB, Python 3)

File details

Details for the file rgpl-0.1.0.tar.gz.

File metadata

  • Download URL: rgpl-0.1.0.tar.gz
  • Upload date:
  • Size: 28.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.9.12 Linux/4.18.0-372.80.1.el8_6.x86_64

File hashes

Hashes for rgpl-0.1.0.tar.gz

  • SHA256: 495be097f24c21f08e26f35980809d59a89deab482672d2b33674dda1505ce52
  • MD5: b288e76e01cf4c9b102a4470dd227be2
  • BLAKE2b-256: 91fb8a9c6f5ec79bfa0bbc2a9fbd6c96bf84cd7bb96b04ff6af3676c3466759a


File details

Details for the file rgpl-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: rgpl-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 32.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.9.12 Linux/4.18.0-372.80.1.el8_6.x86_64

File hashes

Hashes for rgpl-0.1.0-py3-none-any.whl

  • SHA256: 41fe73df7005617cf60c8451d7b5596e98c3ed9f764db58688b584a2e4e6e7bb
  • MD5: cd666d77463f2e49bbbaf593168a653f
  • BLAKE2b-256: 1ad467bffbec0f81ae4182948fa90d06f5159a9d1a5e49eee4112d8ff589ce14

