Fork of the original UniIR repository modified for easy and simple Pyserini integration

UniIR for Pyserini

🌐 Homepage | 🤗 Dataset(M-BEIR Benchmark) | 🤗 Checkpoints(UniIR models) | 📖 arXiv | Original UniIR GitHub

This repository contains a fork of the original UniIR codebase, modified for easy Pyserini integration and repackaged as a PyPI package.

current_version = "0.1.2"

Installation

Install the package directly from PyPI:

pip install uniir-for-pyserini

Or, install from source:

git clone https://github.com/castorini/UniIR-for-Pyserini.git
cd UniIR-for-Pyserini
pip install .

Then install OpenAI's CLIP package from GitHub:

pip install git+https://github.com/openai/CLIP.git
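
A quick way to confirm that both installs succeeded is to check that the modules can be located. The sketch below assumes the import names `uniir_for_pyserini` (as used in the Quick Start imports) and `clip` (the import name of OpenAI's CLIP package); `find_missing_modules` is a hypothetical helper, not part of this package.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names; adjust them if your environment differs.
    missing = find_missing_modules(["uniir_for_pyserini", "clip"])
    print("Missing modules:", missing if missing else "none")
```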

Quick Start

The following code snippet shows how UniIR models can be used with Pyserini's encoding and indexing pipeline. In this example, the clip_sf_large model encodes the cirr_task7 corpus into dense vector representations. Queries can be encoded on the fly in a similar way using the QueryEncoder.

For full compatibility and the complete feature set, use the UniIR wrapper classes in Pyserini.

# Encoding and Indexing Steps
from pyserini.encode import JsonlCollectionIterator
from pyserini.encode.optional import FaissRepresentationWriter
from uniir_for_pyserini.uniir_corpus_encoder import CorpusEncoder

MBEIR_FIELDS = ['img_path', 'txt', 'modality', 'did']

mbeir_corpus_encoder = CorpusEncoder("clip_sf_large")

collection_iterator = JsonlCollectionIterator(  
    'collections/M-BEIR/mbeir_cirr_task7_cand_pool.jsonl',  
    fields=MBEIR_FIELDS,
    docid_field='did'
)

embedding_writer = FaissRepresentationWriter(
    'indexes/cirr.clip-sf-large'
)

with embedding_writer:
    # Encode the corpus in batches of 32 documents.
    for batch_info in collection_iterator(32):
        # Pass each field under a pluralized keyword name (e.g. 'txt' -> 'txts').
        kwargs = {'fp16': True}
        for field_name in MBEIR_FIELDS:
            kwargs[f'{field_name}s'] = batch_info[field_name]

        embeddings = mbeir_corpus_encoder.encode(**kwargs)
        batch_info['vector'] = embeddings
        embedding_writer.write(batch_info, MBEIR_FIELDS)

# Searching Step
from pyserini.search.faiss import FaissSearcher
from pyserini.query_iterator import MBEIRQueryIterator
from uniir_for_pyserini.uniir_query_encoder import QueryEncoder

mbeir_query_encoder = QueryEncoder("clip_sf_large")

searcher = FaissSearcher(
    'indexes/cirr.clip-sf-large',
    mbeir_query_encoder
)

query_iterator = MBEIRQueryIterator.from_topics('mbeir_cirr_task7_test.jsonl')

results = {}    
for qid, query_data in query_iterator:  
    # query_data now contains the structured M-BEIR format:  
    # {'qid', 'query_txt', 'query_img_path', 'query_modality', 'pos_cand_list'}  
      
    hits = searcher.search(query_data, k=1000) 
    results[qid] = [(hit.docid, hit.score) for hit in hits]
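
To evaluate the retrieved results with standard tools, the `results` dictionary can be written out as a TREC-style run file. This is a minimal sketch rather than part of the package: `write_trec_run` and the run tag `uniir` are hypothetical names, and it assumes `results` maps each query ID to a list of `(docid, score)` pairs, as built in the loop above.

```python
import io


def write_trec_run(results, out, tag="uniir"):
    """Write results as TREC run lines: qid Q0 docid rank score tag."""
    for qid, hits in results.items():
        # Sort by descending score so ranks are consistent even if the input is unordered.
        ranked = sorted(hits, key=lambda pair: pair[1], reverse=True)
        for rank, (docid, score) in enumerate(ranked, start=1):
            out.write(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}\n")


# Toy data standing in for real search output.
example_results = {"q1": [("d7", 0.42), ("d3", 0.91)]}
buffer = io.StringIO()
write_trec_run(example_results, buffer)
print(buffer.getvalue(), end="")
```

With a real run, pass an open file handle instead of the in-memory buffer, for example `with open('run.cirr.txt', 'w') as f: write_trec_run(results, f)` (the file name here is only illustrative).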

Available Models

Note: This package does not apply L2 normalization during encoding; the UniIR wrapper classes in Pyserini apply it instead.
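
If you call the encoders in this package directly, outside the Pyserini wrappers, you may want to normalize the embeddings yourself. The sketch below shows row-wise L2 normalization with NumPy. It assumes the embeddings are available as a 2-D array of shape `(batch, dim)`; `l2_normalize` is a hypothetical helper, not part of this package.

```python
import numpy as np


def l2_normalize(embeddings, eps=1e-12):
    """Scale each row of a (batch, dim) array to unit L2 norm."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # eps guards against division by zero for all-zero rows.
    return embeddings / np.maximum(norms, eps)


# Toy vectors standing in for encoder output.
vectors = np.array([[3.0, 4.0], [0.0, 2.0]])
unit_vectors = l2_normalize(vectors)
print(np.linalg.norm(unit_vectors, axis=1))
```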

This package supports the following UniIR models from the TIGER-Lab UniIR Hugging Face Hub:

  • clip_sf_large
  • blip_ff_large

Contact

For questions about the Pyserini integration, please email Sahel Sharifymoghaddam or Daniel Guo.

For questions about the original UniIR codebase, please email the authors of the original UniIR repository.

Citation

If you use this work with Pyserini, please cite Pyserini in addition to the original UniIR paper:

@INPROCEEDINGS{wei2024uniir,
  author = "Cong Wei and Tang Chen and Haonan Chen and Hexiang Hu and Ge Zhang and Jie Fu and Alan Ritter and Wenhu Chen",
  title = "{UniIR}: Training and Benchmarking Universal Multimodal Information Retrievers",
  booktitle = "European Conference on Computer Vision",
  year = 2024,
  pages = "387--404",
}

@INPROCEEDINGS{Lin_etal_SIGIR2021_Pyserini,
   author = "Jimmy Lin and Xueguang Ma and Sheng-Chieh Lin and Jheng-Hong Yang and Ronak Pradeep and Rodrigo Nogueira",
   title = "{Pyserini}: A {Python} Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations",
   booktitle = "Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021)",
   year = 2021,
   pages = "2356--2362",
}

📄 License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
