Official implementation of "Referral Augmentation for Zero-Shot Information Retrieval"

Referral-augmented retrieval (RAR)

Installation

Install with pip:

pip install referral-augment

Alternatively, install from source:

git clone https://github.com/michaelwilliamtang/referral-augment
cd referral-augment
pip install -r requirements.txt
pip install -e .

Overview

Simple, general implementations of referral-augmented retrieval are provided in rar.retrievers. We support three aggregation methods — concatenation, mean, and shortest path — as described in the paper, which can be specified via an AggregationType constructor argument.

Under our framework, retrieval with BM25 is as simple as:

from rar.retrievers import BM25Retriever
retriever = BM25Retriever(docs, referrals)
retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=10)

Similarly, retrieval with any dense embedding model on HuggingFace:

from rar.retrievers import DenseRetriever, AggregationType
from rar.encoders import HuggingFaceEncoder
encoder = HuggingFaceEncoder('facebook/contriever')
retriever = DenseRetriever(encoder, docs, referrals, aggregation=AggregationType.MEAN)
retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=10)
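
Continuing from the snippet above, the other aggregation strategies are selected through the same constructor argument. Only MEAN appears in this document, so the CONCATENATION and SHORTEST_PATH member names in this sketch are assumptions:

# Reuses encoder, docs, and referrals from the snippet above.
# CONCATENATION and SHORTEST_PATH are assumed enum member names for the
# other two aggregation methods described above; only MEAN is confirmed.
retriever = DenseRetriever(encoder, docs, referrals, aggregation=AggregationType.CONCATENATION)
retriever = DenseRetriever(encoder, docs, referrals, aggregation=AggregationType.SHORTEST_PATH)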

For convenience, we also include direct implementations of SimCSEEncoder and SpecterEncoder.
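
For example, a minimal sketch using SpecterEncoder, assuming it takes no constructor arguments (not confirmed here; check rar.encoders if it differs):

from rar.retrievers import DenseRetriever
from rar.encoders import SpecterEncoder
# Assumption: SpecterEncoder needs no constructor arguments.
encoder = SpecterEncoder()
retriever = DenseRetriever(encoder, docs, referrals)
retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=10)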

Example replications of paper results, showing the advantage of referral augmentation and demonstrating a full, concise retrieval and evaluation pipeline, can be found in examples.ipynb.

Optional: install with SimCSE support

Note that currently the only stable way to use SimCSE is to include its source as a module, which requires building rar from source. Thus, to optionally install with SimCSEEncoder support:

git clone https://github.com/michaelwilliamtang/referral-augment
cd referral-augment
pip install -r requirements.txt
cd src/rar/encoders
git clone https://github.com/princeton-nlp/SimCSE
cd SimCSE
pip install -r requirements.txt
cd ../../../..

Data

We provide sample data in zipped form here — to use, unzip and place data/ under the repository's root directory.

Our sample data covers two domains, each with a corpus of documents and referrals and an evaluation dataset of queries and ground truth documents. Under the paper_retrieval domain, we include the acl, acl_small, and arxiv corpuses and datasets, and under the entity_retrieval domain, we include the dbpedia_small corpus and dataset.

Construction details:

  • The acl_small, acl, and arxiv corpuses are constructed from the rich paper metadata parses provided by Allen AI's S2ORC project. Documents consist of concatenated paper titles and abstracts from up-to-2017 ACL and ArXiv papers, respectively, and referrals consist of in-text citations between up-to-2017 papers. The respective evaluation datasets are also from the parses, consisting of in-text citations from 2018-and-on papers citing the up-to-2017 papers in the corpus -- this time-based split prevents data leakage and mirrors deployment conditions.
  • The dbpedia_small corpus and dataset are sampled from the DBPedia task in the BEIR benchmark. Referrals are mined from Wikipedia HTML using WikiExtractor.

Data can be loaded via our utility functions at rar.utils:

from rar.utils import load_corpus, load_eval_dataset
docs, referrals = load_corpus(domain='paper_retrieval', corpus='acl_small')
queries, ground_truth = load_eval_dataset(domain='paper_retrieval', dataset='acl_small')
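
The entity_retrieval domain loads the same way, e.g. for the dbpedia_small corpus and dataset described above:

docs, referrals = load_corpus(domain='entity_retrieval', corpus='dbpedia_small')
queries, ground_truth = load_eval_dataset(domain='entity_retrieval', dataset='dbpedia_small')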

Our data representations are simple and intuitive:

  • A corpus is a list of document strings
  • A set of referrals is a list of lists of document strings (one list of referrals per document)

Similarly:

  • A set of queries is a list of query strings
  • The corresponding ground_truth is either a list of document strings (one ground truth document per query, e.g. the cited paper in paper retrieval) or a list of lists of document strings (multiple relevant ground truth documents per query, e.g. all relevant Wikipedia pages for a given dbpedia_small query)
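
As a minimal illustration of these shapes, reusing the acl_small data loaded above:

# One list of referral strings per document.
assert len(docs) == len(referrals)
print(docs[0])        # a single document string
print(referrals[0])   # the list of referrals for that document

# One ground truth entry per query; for paper_retrieval this is a single
# document string (the cited paper).
assert len(queries) == len(ground_truth)
print(queries[0])
print(ground_truth[0])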

Custom data

Creating a corpus is as simple as constructing these lists (referrals are optional). For example:

docs = ['Steve Jobs was a revolutionary technological thinker and designer', "Bill Gates founded the world's largest software company"]
referrals = [['Apple CEO', 'Magic Leap founder'], ['Microsoft CEO', 'The Giving Pledge co-founder']]

retriever = BM25Retriever(docs, referrals)

Creating an evaluation dataset is similarly easy:

queries = ['Who built the Apple Macintosh?']
ground_truth = [docs[0]]

Evaluation

We implement the Recall@k and MRR metrics under rar.metrics, which can be used standalone or with our utility functions at rar.utils:

from rar.utils import evaluate_retriever
evaluate_retriever(retriever, queries, ground_truth)

By default, evaluate_retriever attempts to compute the MRR, Recall@1, and Recall@10 metrics. Setting the keyword parameter multiple_correct=True removes MRR, since MRR does not support multiple ground truth documents per query (e.g. for dbpedia_small). See examples.ipynb for example outputs.
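
For instance, a sketch of evaluating on dbpedia_small, where each query can have multiple relevant documents:

from rar.utils import evaluate_retriever
# multiple_correct=True signals that each query may have several relevant
# ground truth documents, so only Recall@k is reported.
evaluate_retriever(retriever, queries, ground_truth, multiple_correct=True)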

If you find this repository helpful, feel free to cite our publication Referral Augmentation for Zero-Shot Information Retrieval:

@misc{tang2023referral,
      title={Referral Augmentation for Zero-Shot Information Retrieval}, 
      author={Michael Tang and Shunyu Yao and John Yang and Karthik Narasimhan},
      year={2023},
      eprint={2305.15098},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
