Lightning IR

Your one-stop shop for fine-tuning and running neural ranking models.


Lightning IR is a library for fine-tuning and running neural ranking models. It is built on top of PyTorch Lightning to provide a simple and flexible interface to interact with neural ranking models.

Want to:

  • fine-tune your own cross- or bi-encoder models?
  • index and search through a collection of documents with ColBERT or SPLADE?
  • re-rank documents with state-of-the-art models?

Lightning IR has you covered!

Installation

Lightning IR can be installed using pip:

pip install lightning-ir
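
To confirm the installation from Python, the installed distribution version can be looked up with the standard library. This is a minimal sketch; `installed_version` is a hypothetical helper, not part of Lightning IR, and it returns None when the distribution is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if lightning-ir is installed, otherwise None.
print(installed_version("lightning-ir"))
```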

Getting Started

See the Quickstart guide for an introduction to Lightning IR. The Documentation provides a detailed overview of the library's functionality.

The easiest way to use Lightning IR is via the CLI. It builds on the PyTorch Lightning CLI and adds options that provide a unified interface for fine-tuning and running neural ranking models.

The behavior of the CLI can be customized using YAML configuration files. See the configs directory for several example configuration files. For example, the following command re-ranks the official TREC DL 19/20 re-ranking set with an already fine-tuned cross-encoder model. It automatically downloads the model and data, runs the re-ranking, writes the results to a TREC-style run file, and reports the nDCG@10 score.

lightning-ir re_rank \
  --config ./configs/trainer/inference.yaml \
  --config ./configs/callbacks/rank.yaml \
  --config ./configs/data/re-rank-trec-dl.yaml \
  --config ./configs/models/monoelectra.yaml
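
Passing --config several times composes the final configuration from the individual files. As a rough mental model, and assuming later files take precedence over earlier ones, the composition behaves like a key-by-key merge of nested dictionaries. The sketch below illustrates that idea only; it is not the CLI's actual parsing code, and `deep_merge` and the example dictionaries are hypothetical stand-ins for file contents.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with `override` merged over `base`, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for the contents of separate config files.
trainer_cfg = {"trainer": {"logger": False}}
callbacks_cfg = {"trainer": {"callbacks": [{"class_path": "RankCallback"}]}}
model_cfg = {"model": {"class_path": "CrossEncoderModule"}}

config: dict = {}
for part in (trainer_cfg, callbacks_cfg, model_cfg):
    config = deep_merge(config, part)

print(config)
```

Splitting the configuration this way lets the same trainer or callback file be reused across different models and datasets.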

For more details, see the Usage section.

Usage

Command Line Interface

The CLI offers four subcommands:

$ lightning-ir -h
Lightning Trainer command line tool

subcommands:
  For more details of each subcommand, add it as an argument followed by --help.

  Available subcommands:
    fit                 Runs the full optimization routine.
    index               Index a collection of documents.
    search              Search for relevant documents.
    re_rank             Re-rank a set of retrieved documents.
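
When driving the CLI from a script, it can help to assemble the command programmatically. The sketch below only builds the argument list for the four subcommands listed above; `build_command` is a hypothetical helper, not part of Lightning IR, and actually running the result assumes that lightning-ir is installed and on the PATH.

```python
import shlex

SUBCOMMANDS = ("fit", "index", "search", "re_rank")


def build_command(subcommand: str, config_paths: list[str]) -> list[str]:
    """Build the argument list for a lightning-ir call with one --config per file."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand {subcommand!r}, expected one of {SUBCOMMANDS}")
    if not config_paths:
        raise ValueError("at least one configuration file is required")
    command = ["lightning-ir", subcommand]
    for path in config_paths:
        command += ["--config", path]
    return command


command = build_command("re_rank", ["cross-encoder-re-rank.yaml"])
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```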

Configuration files must be provided to specify the model, data, and fine-tuning/inference parameters. See the configs directory for examples. Four types of configuration exist:

  • trainer: Specifies the fine-tuning/inference parameters and callbacks.
  • model: Specifies the model to use and its parameters.
  • data: Specifies the dataset(s) to use and their parameters.
  • optimizer: Specifies the optimizer parameters (only needed for fine-tuning).
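
A quick pre-flight check on these sections can catch an incomplete configuration before a long run starts. The sketch below assumes the configuration has already been loaded into a dictionary (for example with a YAML parser). `missing_sections` is a hypothetical helper, and the rule that model and data are always required while optimizer is required only for fit is an assumption drawn from the list above, not a documented validation step.

```python
def missing_sections(config: dict, subcommand: str) -> list[str]:
    """List the top-level sections that are expected for a subcommand but absent."""
    required = ["model", "data"]
    if subcommand == "fit":
        # The optimizer section is only needed for fine-tuning.
        required.append("optimizer")
    return [name for name in required if name not in config]


# Hypothetical, heavily abbreviated configuration.
config = {
    "trainer": {"max_steps": 100000},
    "model": {"class_path": "BiEncoderModule"},
    "data": {"class_path": "LightningIRDataModule"},
}

print(missing_sections(config, "fit"))
print(missing_sections(config, "index"))
```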

Example

The following example demonstrates how to fine-tune a BERT-based single-vector bi-encoder model using the official MS MARCO triples. The fine-tuned model is then used to index the MS MARCO passage collection and search for relevant passages. Finally, we show how to re-rank the retrieved passages.

Fine-tuning

To fine-tune a bi-encoder model on the MS MARCO triples dataset, use the following configuration file and command:

bi-encoder-fit.yaml
trainer:
  callbacks:
  - class_path: ModelCheckpoint
  max_epochs: 1
  max_steps: 100000
data:
  class_path: LightningIRDataModule
  init_args:
    train_batch_size: 32
    train_dataset:
      class_path: TupleDataset
      init_args:
        tuples_dataset: msmarco-passage/train/triples-small
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: bert-base-uncased
    config:
      class_path: BiEncoderConfig
    loss_functions:
    - class_path: RankNet
optimizer:
  class_path: AdamW
  init_args:
    lr: 1e-5

lightning-ir fit --config bi-encoder-fit.yaml

The fine-tuned model is saved in the directory lightning_logs/version_X/huggingface_checkpoint/.
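
The configuration above trains with a RankNet loss. For intuition, RankNet is commonly formulated as a pairwise logistic loss on the score difference between a more relevant and a less relevant document. The sketch below is a plain-Python illustration of that textbook formulation for a single (positive, negative) pair, such as one MS MARCO triple. It is not Lightning IR's implementation, and `ranknet_pair_loss` is a hypothetical name.

```python
import math


def ranknet_pair_loss(pos_score: float, neg_score: float) -> float:
    """Pairwise logistic loss: log(1 + exp(-(pos_score - neg_score)))."""
    margin = pos_score - neg_score
    # Numerically stable form of softplus(-margin), avoids overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the relevant passage is scored further above the non-relevant one.
for pos, neg in [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]:
    print(pos, neg, round(ranknet_pair_loss(pos, neg), 4))
```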

Indexing

We now assume the model from the previous fine-tuning step was moved to the directory models/bi-encoder. To index the MS MARCO passage collection with faiss using the fine-tuned model, use the following configuration file and command:

bi-encoder-index.yaml
trainer:
  callbacks:
  - class_path: IndexCallback
    init_args:
      index_config:
        class_path: FaissFlatIndexConfig
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 256
    inference_datasets:
    - class_path: DocDataset
      init_args:
        doc_dataset: msmarco-passage

lightning-ir index --config bi-encoder-index.yaml

The index is saved in the directory models/bi-encoder/indexes/msmarco-passage.
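
Conceptually, a flat index stores every document vector and compares the query against all of them, which makes search exact but linear in the collection size. The sketch below illustrates that idea in plain Python with dot-product scoring. It does not use the Faiss API or Lightning IR's on-disk index format, and `FlatIndex` with its toy vectors is hypothetical.

```python
import heapq


class FlatIndex:
    """Toy exhaustive-search index over single-vector document embeddings."""

    def __init__(self) -> None:
        self.doc_ids: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, doc_id: str, vector: list[float]) -> None:
        self.doc_ids.append(doc_id)
        self.vectors.append(vector)

    def search(self, query: list[float], k: int) -> list[tuple[str, float]]:
        # Score every stored vector against the query with a dot product.
        scored = (
            (sum(q * d for q, d in zip(query, vector)), doc_id)
            for doc_id, vector in zip(self.doc_ids, self.vectors)
        )
        # Keep the k highest-scoring documents, best first.
        return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


index = FlatIndex()
index.add("d1", [1.0, 0.0])
index.add("d2", [0.0, 1.0])
index.add("d3", [0.7, 0.7])
print(index.search([1.0, 0.2], k=2))
```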

Searching

To search for relevant documents in the MS MARCO passage collection using the bi-encoder and index, use the following configuration file and command:

bi-encoder-search.yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
    index_dir: models/bi-encoder/indexes/msmarco-passage
    search_config:
      class_path: FaissFlatSearchConfig
      init_args:
        k: 100
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2019/judged
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2020/judged

lightning-ir search --config bi-encoder-search.yaml

The run files are saved as models/bi-encoder/runs/msmarco-passage-trec-dl-20XX.run. Additionally, the nDCG@10 scores are printed to the console.
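
A TREC-style run file conventionally has one line per retrieved document with six whitespace-separated columns: query ID, the literal Q0, document ID, rank, score, and a run tag. The sketch below reads such lines into a per-query ranking. `parse_run` and the sample lines are illustrative, and the files written by Lightning IR are assumed to follow this convention.

```python
from collections import defaultdict
from typing import Iterable


def parse_run(lines: Iterable[str]) -> dict[str, list[tuple[str, float]]]:
    """Parse TREC-style run lines into {query_id: [(doc_id, score), ...]}, best first."""
    run: dict[str, list[tuple[str, float]]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        query_id, _q0, doc_id, _rank, score, _tag = line.split()
        run[query_id].append((doc_id, float(score)))
    for ranking in run.values():
        ranking.sort(key=lambda item: item[1], reverse=True)
    return dict(run)


# Made-up sample lines; real files would be read with open(path).
sample = [
    "q1 Q0 d7 2 11.0 bi-encoder",
    "q1 Q0 d3 1 12.5 bi-encoder",
    "q2 Q0 d1 1 9.25 bi-encoder",
]
print(parse_run(sample))
```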

Re-ranking

Assuming we've also fine-tuned a cross-encoder that is saved in the directory models/cross-encoder, we can re-rank the retrieved documents using the following configuration file and command:

cross-encoder-re-rank.yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: CrossEncoderModule
  init_args:
    model_name_or_path: models/cross-encoder
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2019.run
        depth: 100
        sample_size: 100
        sampling_strategy: top
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2020.run
        depth: 100
        sample_size: 100
        sampling_strategy: top

lightning-ir re_rank --config cross-encoder-re-rank.yaml

The run files are saved as models/cross-encoder/runs/msmarco-passage-trec-dl-20XX.run. Additionally, the nDCG@10 scores are printed to the console.
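
For reference, nDCG@10 compares the discounted cumulative gain of the top 10 ranked documents with that of an ideal ordering. The sketch below uses linear gains and a log2 rank discount, which is one common formulation. Evaluation tools can differ in details such as the gain function, so treat it as an illustration rather than the exact computation behind the reported scores. `ndcg_at_k` and the toy relevance labels are hypothetical.

```python
import math


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids: list[str], qrels: dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; unjudged documents count as non-relevant."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy relevance judgments for a single query.
qrels = {"a": 3, "b": 1, "c": 0}
print(ndcg_at_k(["a", "b", "c"], qrels))
print(ndcg_at_k(["c", "b", "a"], qrels))
```

Averaging this value over all queries of a dataset gives a single score per run.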
