Your one-stop shop for fine-tuning and running neural ranking models.

Lightning IR

Lightning IR is a library for fine-tuning and running neural ranking models. It is built on top of PyTorch Lightning to provide a simple and flexible interface to interact with neural ranking models.

Want to:

  • fine-tune your own cross- or bi-encoder models?
  • index and search through a collection of documents with ColBERT or SPLADE?
  • re-rank documents with state-of-the-art models?

Lightning IR has you covered!

Installation

Lightning IR can be installed using pip:

pip install lightning-ir

Getting Started

See the Quickstart guide for an introduction to Lightning IR. The Documentation provides a detailed overview of the library's functionality.

The easiest way to use Lightning IR is via the CLI. It builds on the PyTorch Lightning CLI and adds options that provide a unified interface for fine-tuning and running neural ranking models.

The behavior of the CLI can be customized using YAML configuration files. See the configs directory for several example configuration files. For example, the following command re-ranks the official TREC DL 19/20 re-ranking set with an already fine-tuned cross-encoder model. It automatically downloads the model and data, runs the re-ranking, writes the results to a TREC-style run file, and reports the nDCG@10 score.

lightning-ir re_rank \
  --config ./configs/trainer/inference.yaml \
  --config ./configs/callbacks/rank.yaml \
  --config ./configs/data/re-rank-trec-dl.yaml \
  --config ./configs/models/monoelectra.yaml

For more details, see the Usage section.
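The TREC-style run file mentioned above is a plain-text format with six whitespace-separated columns per line: query ID, the literal `Q0`, document ID, rank, score, and a run tag. The following is a minimal sketch of writing and reading that format in plain Python. The helper names `format_run` and `parse_run` and the tag `my-run` are hypothetical and not part of Lightning IR, which may write its run files differently in detail.

```python
from collections import defaultdict


def format_run(scores, tag="my-run"):
    """Render {query_id: {doc_id: score}} as TREC run lines.

    Each line has six whitespace-separated columns:
    query_id Q0 doc_id rank score tag
    """
    lines = []
    for query_id, doc_scores in scores.items():
        # Rank documents by descending score; ranks start at 1.
        ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{query_id} Q0 {doc_id} {rank} {score} {tag}")
    return lines


def parse_run(lines):
    """Parse TREC run lines back into {query_id: {doc_id: score}}."""
    run = defaultdict(dict)
    for line in lines:
        query_id, _, doc_id, _, score, _ = line.split()
        run[query_id][doc_id] = float(score)
    return dict(run)


# Toy scores (made up for illustration).
scores = {"q1": {"d1": 0.2, "d2": 0.9}, "q2": {"d3": 1.5}}
run_lines = format_run(scores)
print("\n".join(run_lines))
```

Because the format is line-based, run files produced by one tool can be read by any other tool that follows the same six-column convention.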

Usage

Command Line Interface

The CLI offers four subcommands:

$ lightning-ir -h
Lightning Trainer command line tool

subcommands:
  For more details of each subcommand, add it as an argument followed by --help.

  Available subcommands:
    fit                 Runs the full optimization routine.
    index               Index a collection of documents.
    search              Search for relevant documents.
    re_rank             Re-rank a set of retrieved documents.

Configuration files need to be provided to specify the model, data, and fine-tuning/inference parameters. See the configs directory for examples. Four types of configuration exist:

  • trainer: Specifies the fine-tuning/inference parameters and callbacks.
  • model: Specifies the model to use and its parameters.
  • data: Specifies the dataset(s) to use and their parameters.
  • optimizer: Specifies the optimizer parameters (only needed for fine-tuning).
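Since each configuration type can live in its own file and several `--config` options can be passed in one command, it helps to have a mental model of how the fragments combine. The sketch below assumes that later files override earlier ones key by key, which is how the PyTorch Lightning CLI documents multiple `--config` arguments. The actual merging is done by the CLI and may differ in details, for example in how `class_path`/`init_args` pairs are handled. The function `merge_configs` and the fragment values are hypothetical.

```python
import copy


def merge_configs(base, override):
    """Return a new dict where values in `override` replace those in `base`.

    Nested dicts are merged key by key; any other value (including lists)
    is replaced wholesale. This is an assumed simplification of what the
    CLI does, not its actual implementation.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical fragments, one per configuration file.
trainer_cfg = {"trainer": {"max_steps": 100000}}
model_cfg = {
    "model": {
        "class_path": "BiEncoderModule",
        "init_args": {"model_name_or_path": "bert-base-uncased"},
    }
}
override_cfg = {"trainer": {"max_steps": 500}}

config = {}
for fragment in (trainer_cfg, model_cfg, override_cfg):
    config = merge_configs(config, fragment)

print(config)
```

In this model, a small override file placed last on the command line can change a single parameter without copying the rest of the configuration.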

Example

The following example demonstrates how to fine-tune a BERT-based single-vector bi-encoder model using the official MS MARCO triples. The fine-tuned model is then used to index the MS MARCO passage collection and search for relevant passages. Finally, we show how to re-rank the retrieved passages.

Fine-tuning

To fine-tune a bi-encoder model on the MS MARCO triples dataset, use the following configuration file and command:

bi-encoder-fit.yaml
trainer:
  callbacks:
  - class_path: ModelCheckpoint
  max_epochs: 1
  max_steps: 100000
data:
  class_path: LightningIRDataModule
  init_args:
    train_batch_size: 32
    train_dataset:
      class_path: TupleDataset
      init_args:
        tuples_dataset: msmarco-passage/train/triples-small
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: bert-base-uncased
    config:
      class_path: BiEncoderConfig
    loss_functions:
    - class_path: RankNet
optimizer:
  class_path: AdamW
  init_args:
    lr: 1e-5

lightning-ir fit --config bi-encoder-fit.yaml

The fine-tuned model is saved in the directory lightning_logs/version_X/huggingface_checkpoint/.
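The configuration above selects RankNet as the loss function. As a rough illustration of the idea behind it, the sketch below computes the standard pairwise logistic loss, log(1 + exp(-(s_pos - s_neg))), for pairs of relevant and non-relevant document scores in plain Python. It is an assumption that Lightning IR's RankNet class follows this textbook form; it may form pairs or reduce over a batch differently. The function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ranknet_loss(pos_scores, neg_scores):
    """Mean pairwise logistic loss over (positive, negative) score pairs."""
    losses = [softplus(neg - pos) for pos, neg in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# Equal scores give log(2); a well-separated pair gives a loss near zero.
print(ranknet_loss([0.0], [0.0]))
print(ranknet_loss([5.0], [-5.0]))
```

The loss only depends on the score difference within each pair, so the model is trained to order documents correctly rather than to predict absolute relevance values.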

Indexing

We now assume the model from the previous fine-tuning step was moved to the directory models/bi-encoder. To index the MS MARCO passage collection with faiss using the fine-tuned model, use the following configuration file and command:

bi-encoder-index.yaml
trainer:
  callbacks:
  - class_path: IndexCallback
    init_args:
      index_config:
        class_path: FaissFlatIndexConfig
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 256
    inference_datasets:
    - class_path: DocDataset
      init_args:
        doc_dataset: msmarco-passage

lightning-ir index --config bi-encoder-index.yaml

The index is saved in the directory models/bi-encoder/indexes/msmarco-passage.
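A flat index stores every document embedding uncompressed, and searching it means comparing the query embedding against all of them. The sketch below shows that exhaustive search in plain Python, assuming inner-product similarity and toy 2-dimensional vectors. It only illustrates the concept; faiss performs the same kind of search far more efficiently, and the function names and data here are hypothetical.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def flat_search(query_vec, doc_vecs, k):
    """Score every document against the query and return the top-k.

    `doc_vecs` maps doc_id -> embedding. Returns (doc_id, score) pairs
    sorted by descending score.
    """
    scored = ((doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-dimensional embeddings standing in for encoder outputs.
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
top = flat_search([1.0, 0.2], doc_vecs, k=2)
print(top)
```

Exhaustive search is exact but its cost grows linearly with the collection size, which is the trade-off a flat index makes compared to approximate index types.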

Searching

To search for relevant documents in the MS MARCO passage collection using the bi-encoder and index, use the following configuration file and command:

bi-encoder-search.yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
    index_dir: models/bi-encoder/indexes/msmarco-passage
    search_config:
      class_path: FaissFlatSearchConfig
      init_args:
        k: 100
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2019/judged
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2020/judged

lightning-ir search --config bi-encoder-search.yaml

The run files are saved as models/bi-encoder/runs/msmarco-passage-trec-dl-20XX.run. Additionally, the nDCG@10 scores are printed to the console.
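For readers unfamiliar with the reported metric, the following sketch computes nDCG@k for a single query in plain Python. It assumes the convention used by trec_eval: the graded relevance label itself is the gain and the discount is log2(rank + 1). The function names and toy relevance judgments are hypothetical, and the evaluation code used by Lightning IR may differ, for example in tie handling.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    `ranked_doc_ids` is the system ranking (best first); `qrels` maps
    doc_id -> graded relevance. Unjudged documents count as 0.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy relevance judgments (made up for illustration).
qrels = {"d1": 3, "d2": 0, "d3": 1}
print(ndcg_at_k(["d1", "d3", "d2"], qrels))  # ranking in ideal order
print(ndcg_at_k(["d2", "d3", "d1"], qrels))  # ranking in a worse order
```

The score for a test collection is typically the mean of the per-query values.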

Re-ranking

Assuming we've also fine-tuned a cross-encoder that is saved in the directory models/cross-encoder, we can re-rank the retrieved documents using the following configuration file and command:

cross-encoder-re-rank.yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: CrossEncoderModule
  init_args:
    model_name_or_path: models/cross-encoder
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2019.run
        depth: 100
        sample_size: 100
        sampling_strategy: top
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2020.run
        depth: 100
        sample_size: 100
        sampling_strategy: top

lightning-ir re_rank --config cross-encoder-re-rank.yaml

The run files are saved as models/cross-encoder/runs/msmarco-passage-trec-dl-20XX.run. Additionally, the nDCG@10 scores are printed to the console.
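Conceptually, re-ranking takes the top documents of a first-stage run, scores each query-document pair with the more expensive model, and sorts by the new scores. The sketch below illustrates this in plain Python. The `score_fn` callable and the toy scores stand in for a cross-encoder and are hypothetical. The `depth` parameter mirrors the option of the same name in the configuration above, under the assumption that it truncates each ranking to its top entries.

```python
def re_rank(run, score_fn, depth=100):
    """Re-score the top-`depth` documents per query and sort by the new score.

    `run` maps query_id -> list of doc_ids ordered by the first-stage ranker.
    `score_fn(query_id, doc_id)` is a stand-in for a cross-encoder forward pass.
    """
    re_ranked = {}
    for query_id, doc_ids in run.items():
        candidates = doc_ids[:depth]  # documents below `depth` are dropped
        scored = [(doc_id, score_fn(query_id, doc_id)) for doc_id in candidates]
        re_ranked[query_id] = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return re_ranked


# Hypothetical scores standing in for cross-encoder outputs.
toy_scores = {("q1", "d1"): 0.1, ("q1", "d2"): 0.8, ("q1", "d3"): 0.5}
first_stage = {"q1": ["d1", "d2", "d3"]}
result = re_rank(first_stage, lambda q, d: toy_scores[(q, d)], depth=2)
print(result)
```

Because only the top `depth` candidates are re-scored, a relevant document that the first stage ranks below that cutoff cannot be recovered by the re-ranker.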
