
Efficient and Effective Passage Search via Contextualized Late Interaction over BERT


ColBERT (v2)

ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.

Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.

As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings (shown above in blue). Then at search time, it embeds every query into another matrix (shown in green) and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
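For intuition, here is a minimal PyTorch sketch of the MaxSim scoring step described above: each query token embedding is matched against its most similar passage token embedding, and those maxima are summed. The function name and the random toy embeddings are illustrative only; this is not the library's internal implementation.

import torch

def maxsim_score(Q, D):
    # Q: (num_query_tokens, dim) query token embeddings
    # D: (num_doc_tokens, dim) passage token embeddings
    # Both are assumed L2-normalized, so dot products are cosine similarities.
    sim = Q @ D.T                       # token-level similarity matrix
    return sim.max(dim=1).values.sum()  # best passage token per query token, summed

# toy example with random embeddings
Q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
D = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(Q, D))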

These rich interactions allow ColBERT to surpass the quality of single-vector representation models, while scaling efficiently to large corpora. You can read more in our papers.


🚨 Announcements

  • (1/28/24) One of the easiest ways to use ColBERT in applications nowadays is the semi-official, fast-growing RAGatouille library.
  • (1/29/23) We have merged a new index updater feature and support for additional Hugging Face models! These are in beta so please give us feedback as you try them out.
  • (1/24/23) If you're looking for the DSPy framework for composing retrievers like ColBERTv2 and LLMs, it's at: https://github.com/stanfordnlp/dspy

ColBERTv1

The ColBERTv1 code from the SIGIR'20 paper is in the colbertv1 branch. See the Branches section below for more information on other branches.

Installation

(Update: nowadays you can typically do pip install colbert-ai[torch,faiss-gpu] to get things up and running, but if you face issues conda is always more reliable for faiss and torch.)

ColBERT requires Python 3.7+ and PyTorch 1.9+ and uses the Hugging Face Transformers library.

We strongly recommend creating a conda environment using the commands below. (If you don't have conda, follow the official conda installation guide.)

We have also included a new environment file specifically for CPU-only environments (conda_env_cpu.yml). If you are testing CPU execution on a machine that includes GPUs, you might need to set CUDA_VISIBLE_DEVICES="" as part of your command. Note that a GPU is still required for training and indexing.

conda env create -f conda_env[_cpu].yml
conda activate colbert

If you face any problems, please open a new issue and we'll help you promptly!

Overview

Using ColBERT on a dataset typically involves the following steps.

Step 0: Preprocess your collection. At its simplest, ColBERT works with tab-separated (TSV) files: a file (e.g., collection.tsv) will contain all passages and another (e.g., queries.tsv) will contain a set of queries for searching the collection.

Step 1: Download the pre-trained ColBERTv2 checkpoint. This checkpoint has been trained on the MS MARCO Passage Ranking task. You can also optionally train your own ColBERT model.
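One convenient way to fetch the checkpoint programmatically is via the Hugging Face Hub; the sketch below assumes the checkpoint is available as colbert-ir/colbertv2.0 (check the repository for the official download link).

from huggingface_hub import snapshot_download

# Downloads the checkpoint files into a local cache directory and returns its path.
checkpoint_path = snapshot_download("colbert-ir/colbertv2.0")
print(checkpoint_path)  # pass this path as `checkpoint=` to the Indexer or Trainer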

Step 2: Index your collection. Once you have a trained ColBERT model, you need to index your collection to permit fast retrieval. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.

Step 3: Search the collection with your queries. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.

Below, we illustrate these steps via an example run on the MS MARCO Passage Ranking task.

API Usage Notebook

NEW: We have an experimental notebook on Google Colab that you can use with free GPUs. Indexing 10,000 passages on the free Colab T4 GPU takes six minutes.

The Jupyter notebook docs/intro.ipynb illustrates the key features of ColBERT with the new Python API.

It covers downloading the ColBERTv2 model checkpoint trained on MS MARCO Passage Ranking as well as our new LoTTE benchmark.

Data

This repository works directly with a simple tab-separated file format to store queries, passages, and top-k ranked lists.

  • Queries: each line is qid \t query text.
  • Collection: each line is pid \t passage text.
  • Top-k Ranking: each line is qid \t pid \t rank.

This works directly with the data format of the MS MARCO Passage Ranking dataset. You will need the training triples (triples.train.small.tar.gz), the official top-1000 ranked lists for the dev set queries (top1000.dev), and the dev set relevant passages (qrels.dev.small.tsv). For indexing the full collection, you will also need the list of passages (collection.tar.gz).
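To make the file format concrete, here is a minimal sketch that writes a toy queries.tsv and collection.tsv in the layout described above (the example texts and the convention of using 0-based line numbers as pids are illustrative):

import csv

queries = {0: "what is late interaction in colbert"}
passages = ["ColBERT encodes each passage into a matrix of token-level embeddings."]

# Queries: qid \t query text
with open("queries.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for qid, text in queries.items():
        writer.writerow([qid, text])

# Collection: pid \t passage text
with open("collection.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for pid, text in enumerate(passages):
        writer.writerow([pid, text])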

Indexing

For fast retrieval, indexing precomputes the ColBERT representations of passages.

Example usage:

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            nbits=2,
            root="/path/to/experiments",
        )
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")

Retrieval

We typically recommend that you use ColBERT for end-to-end retrieval, where it directly finds its top-k passages from the full collection:

from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            root="/path/to/experiments",
        )
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        ranking = searcher.search_all(queries, k=100)
        ranking.save("msmarco.nbits=2.ranking.tsv")

You can optionally specify the ncells, centroid_score_threshold, and ndocs search hyperparameters to trade off between speed and result quality. Defaults for different values of k are listed in colbert/searcher.py.
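As a hedged sketch of tuning these knobs, the example below passes them through ColBERTConfig and runs a single query; it assumes ncells, centroid_score_threshold, and ndocs are accepted as ColBERTConfig fields, and that Searcher.search returns parallel lists of passage ids, ranks, and scores. The specific values shown are illustrative, not recommendations.

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        # Larger ncells/ndocs and a lower centroid_score_threshold consider more
        # candidates (higher quality, slower); smaller values are faster.
        config = ColBERTConfig(
            root="/path/to/experiments",
            ncells=4,
            centroid_score_threshold=0.4,
            ndocs=4096,
        )
        searcher = Searcher(index="msmarco.nbits=2", config=config)

        # Single-query search: top-k passage ids with their ranks and scores.
        pids, ranks, scores = searcher.search("what is late interaction?", k=10)
        for pid, rank, score in zip(pids, ranks, scores):
            print(rank, pid, score)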

We can evaluate the MS MARCO rankings using the following command:

python -m utility.evaluate.msmarco_passages --ranking "/path/to/msmarco.nbits=2.ranking.tsv" --qrels "/path/to/MSMARCO/qrels.dev.small.tsv"
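If you prefer to compute the metric yourself, dev-set MRR@10 boils down to a few lines. This sketch assumes the qid \t pid \t rank ranking format described above and the standard qrels layout (qid \t 0 \t pid \t relevance); it is a rough stand-in for the evaluation script, not a replacement.

from collections import defaultdict

def mrr_at_10(ranking_path, qrels_path):
    # qrels: qid \t 0 \t pid \t relevance
    relevant = defaultdict(set)
    with open(qrels_path) as f:
        for line in f:
            qid, _, pid, label = line.strip().split("\t")
            if int(label) > 0:
                relevant[qid].add(pid)

    # ranking: qid \t pid \t rank (extra columns, if any, are ignored)
    best_rank = {}
    with open(ranking_path) as f:
        for line in f:
            qid, pid, rank = line.strip().split("\t")[:3]
            if pid in relevant.get(qid, ()) and int(rank) <= 10:
                best_rank[qid] = min(best_rank.get(qid, int(rank)), int(rank))

    # average reciprocal rank over judged queries; queries with no hit contribute 0
    return sum(1.0 / r for r in best_rank.values()) / len(relevant)

print(mrr_at_10("/path/to/msmarco.nbits=2.ranking.tsv",
                "/path/to/MSMARCO/qrels.dev.small.tsv"))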

Basic Training (ColBERTv1-style)

We provide a pre-trained model checkpoint, but we also detail how to train from scratch here. Note that this example demonstrates the ColBERTv1 style of training, but the provided checkpoint was trained with ColBERTv2.

Training requires a JSONL triples file with a [qid, pid+, pid-] list per line. The query IDs and passage IDs correspond to the specified queries.tsv and collection.tsv files respectively.
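To make the expected triples format concrete, here is a minimal sketch that writes a toy JSONL file; the ids are illustrative and must refer to rows of your queries.tsv and collection.tsv.

import json

# Each line is a JSON list: [qid, positive_pid, negative_pid]
triples = [
    [0, 17, 4023],
    [1, 582, 91],
]

with open("triples.train.jsonl", "w") as f:
    for qid, pos_pid, neg_pid in triples:
        f.write(json.dumps([qid, pos_pid, neg_pid]) + "\n")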

Example usage (training on 4 GPUs):

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Trainer

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=4, experiment="msmarco")):

        config = ColBERTConfig(
            bsize=32,
            root="/path/to/experiments",
        )
        trainer = Trainer(
            triples="/path/to/MSMARCO/triples.train.small.tsv",
            queries="/path/to/MSMARCO/queries.train.small.tsv",
            collection="/path/to/MSMARCO/collection.tsv",
            config=config,
        )

        checkpoint_path = trainer.train()

        print(f"Saved checkpoint to {checkpoint_path}...")

Advanced Training (ColBERTv2-style)

from colbert.infra.run import Run
from colbert.infra.config import ColBERTConfig, RunConfig
from colbert import Trainer


def train():
    # Use 4 GPUs (e.g., four A100s); you can use fewer by reducing nway, accumsteps, or bsize.
    with Run().context(RunConfig(nranks=4)):
        triples = '/path/to/examples.64.json'  # `wget https://huggingface.co/colbert-ir/colbertv2.0_msmarco_64way/resolve/main/examples.json?download=true` (26GB)
        queries = '/path/to/MSMARCO/queries.train.tsv'
        collection = '/path/to/MSMARCO/collection.tsv'

        config = ColBERTConfig(
            bsize=32, lr=1e-05, warmup=20_000,
            doc_maxlen=180, dim=128, attend_to_mask_tokens=False,
            nway=64, accumsteps=1, similarity='cosine', use_ib_negatives=True,
        )
        trainer = Trainer(triples=triples, queries=queries, collection=collection, config=config)

        trainer.train(checkpoint='colbert-ir/colbertv1.9')  # or start from scratch, like `bert-base-uncased`


if __name__ == '__main__':
    train()

Running a lightweight ColBERTv2 server

We provide a script to run a lightweight server which serves the top-k results (up to 100) in ranked order for a given search query, in JSON format. This script can be used to power DSP programs.

To run the server, update the environment variables INDEX_ROOT and INDEX_NAME in the .env file to point to the appropriate ColBERT index. Then run the following command:

python server.py

A sample query:

http://localhost:8893/api/search?query=Who won the 2022 FIFA world cup&k=25
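From Python, you can hit the same endpoint with the requests library; a minimal sketch (the exact structure of the JSON response is not documented here, so we simply print it):

import requests

resp = requests.get(
    "http://localhost:8893/api/search",
    params={"query": "Who won the 2022 FIFA world cup", "k": 25},
)
resp.raise_for_status()
print(resp.json())  # ranked results in JSON format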

Branches

Supported branches

  • main: Stable branch with ColBERTv2 + PLAID.
  • colbertv1: Legacy branch for ColBERTv1.

Deprecated branches

  • new_api: Base ColBERTv2 implementation.
  • cpu_inference: ColBERTv2 implementation with CPU search support.
  • fast_search: ColBERTv2 implementation with PLAID.
  • binarization: ColBERT with a baseline binarization-based compression strategy (as opposed to ColBERTv2's residual compression, which we found to be more robust).

Acknowledgments

ColBERT logo designed by Chuyi Zhang.
