RGPL
This repository contains the code for the paper: [url].

Pretrained models can be downloaded from our Hugging Face repo: [URL].

Results for the BEIR and LoTTE sets are below!
Installation
We provide two ways to install the project: with poetry or with pip.
Installation from source using poetry
- Clone the repository:

```
git clone https://github.com/your-username/DenseIG.git
```

- Install the project with poetry:

```
poetry install
```
Installation from source using pip
- Clone the repository:

```
git clone https://github.com/your-username/DenseIG.git
```

- Install the dependencies with pip:

```
pip install -r requirements.txt
```
Install from PyPI
Additionally, if you want to extend our code, you can install the package from PyPI, as shown below.
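A minimal example, assuming the package is published under the name rgpl (the name used by the distribution files):

```
pip install rgpl
```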
Usage
R-GPL works in three stages. First, we generate queries from the source documents. Next, initial hard negatives for these queries are mined with a pre-trained dense retriever. Finally, we use this generated data to distill knowledge from a cross-encoder model into the dense retriever.

R-GPL re-mines the hard negatives every k steps with the model that is undergoing domain adaptation! You can also distill knowledge from multiple cross-encoders and play around with the reducer function.
- Generate pseudo queries from the corpus:

```
python3 gpl_query_writer.py data.given_path="PATH_TO_BEIR_DATA" query_writer.batch_size=128
```
To change the config, override these arguments:

```yaml
query_writer:
  queries_per_passage: -1
  batch_size: 8
  augmented: no_aug
  use_train_qrels: False
  top_p: 0.95
  top_k: 25
  max_length: 64
  augment_probability: 1.0
  forward_model_path: Helsinki-NLP/opus-mt-en-fr
  back_model_path: Helsinki-NLP/opus-mt-fr-en
  augment_per_query: 2
  augment_temperature: 2.0
```
When queries_per_passage is -1, we use the predetermined query amount. Augmentation is not used in our paper; however, you are free to experiment with it. The augmentation first translates the generated query into French and then back-translates it into English, as sketched below.
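For intuition, here is a minimal sketch of such round-trip augmentation using the MarianMT checkpoints named in the config above; this is an illustration, not the exact implementation in gpl_query_writer.py:

```python
# Hypothetical sketch of query back-translation (en -> fr -> en)
# with the forward_model_path / back_model_path checkpoints above.
from transformers import MarianMTModel, MarianTokenizer

fwd_name = "Helsinki-NLP/opus-mt-en-fr"   # forward_model_path
back_name = "Helsinki-NLP/opus-mt-fr-en"  # back_model_path

fwd_tok = MarianTokenizer.from_pretrained(fwd_name)
fwd = MarianMTModel.from_pretrained(fwd_name)
back_tok = MarianTokenizer.from_pretrained(back_name)
back = MarianMTModel.from_pretrained(back_name)

def back_translate(queries, n_aug=2, temperature=2.0):
    # en -> fr (greedy translation)
    fr_ids = fwd.generate(**fwd_tok(queries, return_tensors="pt", padding=True))
    fr = fwd_tok.batch_decode(fr_ids, skip_special_tokens=True)
    # fr -> en, sampling n_aug variants per query
    # (cf. augment_per_query and augment_temperature above)
    en_ids = back.generate(
        **back_tok(fr, return_tensors="pt", padding=True),
        do_sample=True, temperature=temperature, num_return_sequences=n_aug,
    )
    return back_tok.batch_decode(en_ids, skip_special_tokens=True)

print(back_translate(["what causes rainbow colours?"]))
```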
- Mine the initial hard negatives:

```
python3 gpl_hard_negative_miner.py data.given_path="PATH_TO_BEIR_DATA"
```
To change the config, override these arguments:

```yaml
hard_negative_miner:
  negatives_per_query: 50
  query_augment_mod: ${query_writer.augmented}
  models: ["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"]
  score: [cos_sim, cos_sim]
  use_train_qrels: False
```
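For intuition, a minimal sketch of what mining hard negatives with one of the retrievers above looks like in sentence-transformers; the actual miner aggregates several models and writes its output to disk:

```python
# Hypothetical sketch of hard-negative mining; not the actual miner code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-base-v3")  # one of `models` above

corpus = {"d1": "Paris is the capital of France.", "d2": "The Nile is a river in Africa."}
queries = {"q1": "what is the capital of france"}
qrels = {"q1": {"d1"}}  # known positives; these must not become negatives

doc_ids = list(corpus)
doc_emb = model.encode([corpus[d] for d in doc_ids], convert_to_tensor=True)
q_emb = model.encode(list(queries.values()), convert_to_tensor=True)

# Retrieve the top-scoring documents by cosine similarity (score: cos_sim),
# then drop the positives to keep up to negatives_per_query hard negatives.
hits = util.semantic_search(q_emb, doc_emb, top_k=50)
for qid, q_hits in zip(queries, hits):
    negatives = [doc_ids[h["corpus_id"]] for h in q_hits
                 if doc_ids[h["corpus_id"]] not in qrels[qid]]
    print(qid, negatives)
```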
- Start the training procedure:

```
python3 gpl_trainer.py data.given_path="PATH_TO_BEIR_DATA"
```
To change the config, override these arguments:

```yaml
trainer:
  cross_encoders: ["cross-encoder/ms-marco-MiniLM-L-6-v2"]
  bi_retriver: GPL/msmarco-distilbert-margin-mse
  bi_retriver_name: gpl
  reducer: average
  t_total: 140000
  eval_every: 25000
  remine_hard_negatives_every: 25000
  batch_size: 32
  warmup_steps: 1000
  amp_training: True
  evaluate_baseline: False
  load_test: True
  max_seq_length: 350
  seed: 1
  name: ${trainer.remine_hard_negatives_every}_${trainer.bi_retriver_name}_${trainer.reducer}
```
We extend the GPL training pipeline so that multiple cross-encoders can be used as teachers. This feature was not included in our paper; however, you are free to experiment with it! A sketch of the idea follows.
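A minimal sketch, assuming the standard GPL MarginMSE objective: each teacher scores the (query, positive) and (query, negative) pairs, the reducer (here, average) combines the per-teacher margins, and the student bi-encoder is trained to match the reduced margin. The names below are illustrative, not the actual trainer API:

```python
# Hypothetical sketch of multi-teacher margin distillation with an
# `average` reducer; not the actual trainer code.
import torch
from sentence_transformers import CrossEncoder

teachers = [CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")]  # cross_encoders

def reduced_teacher_margin(query, pos, neg, reducer=torch.mean):
    # margin = score(query, pos) - score(query, neg), computed per teacher
    margins = torch.tensor([
        t.predict([(query, pos)])[0] - t.predict([(query, neg)])[0]
        for t in teachers
    ])
    return reducer(margins)  # `average` reducer collapses the teacher margins

# The student margin comes from the bi-encoder (e.g. dot products of its
# query/passage embeddings); the training loss is then
#   loss = torch.nn.functional.mse_loss(student_margin, reduced_teacher_margin(q, p, n))
```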
To have the same setup as the GPL paper, bump remine_hard_negatives_every to a number greater than 140000 (i.e. above t_total), so the hard negatives are never refreshed; see the example below.
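For example, using the same override syntax as above (150000 is just an illustrative value above t_total):

```
python3 gpl_trainer.py data.given_path="PATH_TO_BEIR_DATA" trainer.remine_hard_negatives_every=150000
```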
The trainer saves the models and logs in the directory it was called from. Moreover, the hard negatives from each refresh, the generated queries, and the generated qrels are saved under PATH_TO_BEIR_DATA.
Expected Results
Running the repo with the commands below on the provided test data should result in:
```
python3 gpl_query_writer.py data.given_path=test_data/arguana query_writer.queries_per_passage=1
python3 gpl_hard_negative_miner.py data.given_path=test_data/arguana
python3 gpl_trainer.py data.given_path=test_data/arguana trainer.t_total=2000 trainer.remine_hard_negatives_every=1000 trainer.batch_size=8
```
Before: NDCG@10: 0.3388
After: NDCG@10: 0.3853