Sparse Embeddings for Neural Search.

Neural-Cherche

Neural Search

Neural-Cherche is a library designed to fine-tune neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset. Neural-Cherche also provides classes to run efficient inference on a fine-tuned retriever or ranker. Neural-Cherche aims to offer a straightforward and effective method for fine-tuning and using neural search models in both offline and online settings. It also lets users save all computed embeddings to avoid redundant computations.

Neural-Cherche is compatible with CPU, GPU and MPS devices. We can fine-tune ColBERT from any Sentence Transformer pre-trained checkpoint. Splade and SparseEmbed are trickier to fine-tune and need an MLM pre-trained model.

Installation

We can install neural-cherche using:

pip install neural-cherche

If we plan to evaluate our model during training, we should install the eval extra:

pip install "neural-cherche[eval]"

Documentation

The complete documentation is available here.

Quick Start

The training dataset must consist of triples (anchor, positive, negative), where the anchor is a query, the positive is a document relevant to the anchor, and the negative is a document that is not relevant to the anchor.

X = [
    ("anchor 1", "positive 1", "negative 1"),
    ("anchor 2", "positive 2", "negative 2"),
    ("anchor 3", "positive 3", "negative 3"),
]
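
When only (query, relevant document) pairs are available, the triples described above have to be assembled first. The following is a minimal sketch of one way to do that, assuming negatives are drawn at random from the rest of the corpus. The helper `build_triples` and the sample data are hypothetical and not part of neural-cherche.

```python
import random


def build_triples(pairs, corpus, seed=42):
    """Turn (query, positive) pairs into (anchor, positive, negative) triples.

    Hypothetical helper: each negative is sampled at random among the
    documents that are not marked as positive for the same query.
    """
    rng = random.Random(seed)

    # Collect every positive per query so none is sampled as a negative.
    positives = {}
    for query, document in pairs:
        positives.setdefault(query, set()).add(document)

    triples = []
    for query, document in pairs:
        candidates = [doc for doc in corpus if doc not in positives[query]]
        if not candidates:
            continue  # no usable negative for this query
        triples.append((query, document, rng.choice(candidates)))
    return triples


pairs = [
    ("capital of France", "Paris is the capital of France."),
    ("largest city in Quebec", "Montreal is the largest city in Quebec."),
]
corpus = [
    "Paris is the capital of France.",
    "Montreal is the largest city in Quebec.",
    "Bordeaux in Southwestern France.",
]

X = build_triples(pairs, corpus)
```

Random sampling is only one option. Negatives mined from the top results of an existing retriever are another common choice.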

And here is how to fine-tune ColBERT from a Sentence Transformer pre-trained checkpoint using neural-cherche:

import torch

from neural_cherche import models, utils, train

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu" # or mps
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
]

for step, (anchor, positive, negative) in enumerate(utils.iter(
        X,
        epochs=1, # number of epochs
        batch_size=8, # number of triples per batch
        shuffle=True
    )):

    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )

    
    if (step + 1) % 1000 == 0:
        # Save the model every 1000 steps
        model.save_pretrained("checkpoint")
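
To make the shape of `anchor`, `positive` and `negative` in the loop above more concrete, here is a plain-Python stand-in for a batch iterator over triples. It is a sketch written for illustration, not the implementation of `utils.iter`. The name `iter_triples`, the `seed` argument and the handling of the last partial batch are assumptions.

```python
import random


def iter_triples(X, epochs, batch_size, shuffle=True, seed=0):
    """Yield (anchors, positives, negatives) lists, one tuple per batch.

    Hypothetical stand-in for a triple iterator; the last batch of an
    epoch may be smaller than batch_size.
    """
    rng = random.Random(seed)
    X = list(X)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        if shuffle:
            rng.shuffle(X)
        for start in range(0, len(X), batch_size):
            batch = X[start:start + batch_size]
            anchors, positives, negatives = zip(*batch)
            yield list(anchors), list(positives), list(negatives)


X = [(f"query {i}", f"positive {i}", f"negative {i}") for i in range(20)]
batches = list(iter_triples(X, epochs=1, batch_size=8))

# With batch_size=8 and gradient_accumulation_steps=50, and assuming one
# optimizer update per 50 accumulated batches, each update sees 8 * 50 triples.
triples_per_update = 8 * 50
```

Under that assumption, lowering `batch_size` to fit in memory while raising `gradient_accumulation_steps` keeps the number of triples per update constant.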

Retrieval

Here is how to pair a BM25 retriever with the fine-tuned ColBERT model, which re-ranks the retrieved documents:

import torch
from lenlp import sparse

from neural_cherche import models, rank, retrieve

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    {"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
    {"id": "doc3", "title": "Bordeaux", "text": "Bordeaux in Southwestern France."},
]

retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(
        normalize=True, ngram_range=(3, 5), analyzer="char_wb", stop_words=[]
    ),
    k1=1.5,
    b=0.75,
    epsilon=0.0,
)

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or mps
)

ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(
    documents=documents,
)

retriever.add(
    documents_embeddings=documents_embeddings,
)

Now we can retrieve candidates with the retriever and re-rank them with the fine-tuned model:

queries = ["Paris", "Montreal", "Bordeaux"]

queries_embeddings = retriever.encode_queries(
    queries=queries,
)

ranker_queries_embeddings = ranker.encode_queries(
    queries=queries,
)

candidates = retriever(
    queries_embeddings=queries_embeddings,
    batch_size=32,
    k=100,  # number of documents to retrieve
)

# Compute embeddings of the candidates with the ranker model.
# Note, we could also pre-compute all the embeddings.
ranker_documents_embeddings = ranker.encode_candidates_documents(
    candidates=candidates,
    documents=documents,
    batch_size=32,
)

scores = ranker(
    queries_embeddings=ranker_queries_embeddings,
    documents_embeddings=ranker_documents_embeddings,
    documents=candidates,
    batch_size=32,
)

scores
[[{'id': 0, 'similarity': 22.825355529785156},
  {'id': 1, 'similarity': 11.201947212219238},
  {'id': 2, 'similarity': 10.748161315917969}],
 [{'id': 1, 'similarity': 23.21628189086914},
  {'id': 0, 'similarity': 9.9658203125},
  {'id': 2, 'similarity': 7.308732509613037}],
 [{'id': 1, 'similarity': 6.4031805992126465},
  {'id': 0, 'similarity': 5.601611137390137},
  {'id': 2, 'similarity': 5.599479675292969}]]
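
The ranker returns one list of `{'id', 'similarity'}` dictionaries per query. A small helper can turn that structure into the best hits for each query. This is a sketch. The function `top_hits` is hypothetical, and the input below is a shortened, rounded version of the output above, used only for illustration.

```python
def top_hits(scores, queries, k=1):
    """Map each query to its k best (id, similarity) pairs."""
    results = {}
    for query, ranked in zip(queries, scores):
        # Sort defensively in case the lists are not already ordered.
        ordered = sorted(ranked, key=lambda hit: hit["similarity"], reverse=True)
        results[query] = [(hit["id"], hit["similarity"]) for hit in ordered[:k]]
    return results


queries = ["Paris", "Montreal"]
scores = [
    [{"id": 0, "similarity": 22.8}, {"id": 1, "similarity": 11.2}],
    [{"id": 1, "similarity": 23.2}, {"id": 0, "similarity": 9.9}],
]

best = top_hits(scores, queries)
```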

Neural-Cherche provides SparseEmbed, SPLADE, TFIDF, and BM25 retrievers, as well as a ColBERT ranker that can re-order the output of a retriever. For more information, please refer to the documentation.
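
For intuition about what a ColBERT ranker computes, the sketch below implements the late-interaction score from the ColBERT paper: for each query token, take the best dot product against any document token, then sum over query tokens. It uses made-up two-dimensional token vectors and plain Python lists. It does not claim to reproduce how neural-cherche batches or normalizes embeddings, and the names `dot` and `maxsim` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens, document_tokens):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(dot(q, d) for d in document_tokens)
        for q in query_tokens
    )


# Toy token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

score_a = maxsim(query, doc_a)  # each query token finds an exact match
score_b = maxsim(query, doc_b)  # each query token only finds a partial match
```

Because every query token contributes its own maximum, a document matching all query tokens well scores higher than one matching each of them only partially.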

Pre-trained Models

We provide pre-trained checkpoints specifically designed for neural-cherche: raphaelsty/neural-cherche-sparse-embed and raphaelsty/neural-cherche-colbert. These checkpoints are fine-tuned on a subset of the MS-MARCO dataset and would benefit from being fine-tuned on your specific dataset. You can fine-tune ColBERT from any Sentence Transformer pre-trained checkpoint in order to fit your specific language. You should use an MLM-based checkpoint to fine-tune SparseEmbed.

Results on the scifact dataset:

| Model | HuggingFace checkpoint | ndcg@10 | hits@10 | hits@1 |
|---|---|---|---|---|
| TfIdf | - | 0.62 | 0.86 | 0.50 |
| BM25 | - | 0.69 | 0.92 | 0.56 |
| SparseEmbed | raphaelsty/neural-cherche-sparse-embed | 0.62 | 0.87 | 0.48 |
| Sentence Transformer | sentence-transformers/all-mpnet-base-v2 | 0.66 | 0.89 | 0.53 |
| ColBERT | raphaelsty/neural-cherche-colbert | 0.70 | 0.92 | 0.58 |
| TfIdf Retriever + ColBERT Ranker | raphaelsty/neural-cherche-colbert | 0.71 | 0.94 | 0.59 |
| BM25 Retriever + ColBERT Ranker | raphaelsty/neural-cherche-colbert | 0.72 | 0.95 | 0.59 |
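
As a reminder of what the metrics in the table measure, here is a minimal sketch of hits@k and ndcg@k for a single query, assuming binary relevance (a document is either relevant or not). The helper names and the example ranking are hypothetical. This is not the evaluation code behind the reported numbers, which would also average over all queries.

```python
import math


def hits_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal = sum(
        1.0 / math.log2(rank + 2)
        for rank in range(min(k, len(relevant_ids)))
    )
    return dcg / ideal if ideal > 0 else 0.0


ranked = ["doc2", "doc1", "doc3"]  # ranking returned for one query
relevant = {"doc1"}                # ground-truth relevant documents

hit_1 = hits_at_k(ranked, relevant, k=1)
hit_10 = hits_at_k(ranked, relevant, k=10)
ndcg_10 = ndcg_at_k(ranked, relevant, k=10)
```

Here the only relevant document sits at rank 2, so hits@1 is 0, hits@10 is 1, and ndcg@10 is discounted by the logarithm of the rank.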

License

This Python library is licensed under the MIT open-source license. The SPLADE model is licensed by its authors for non-commercial use only. SparseEmbed and ColBERT are fully open-source, including for commercial use.
