Sparse Embeddings for Neural Search.
Neural-Cherche
Neural Search
Neural-Cherche is a library designed to fine-tune neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset. Neural-Cherche also provides classes to run efficient inference with a fine-tuned retriever or ranker. Neural-Cherche aims to offer a straightforward and effective way to fine-tune and use neural search models in both offline and online settings. It also enables users to save all computed embeddings to avoid redundant computations.
Neural-Cherche is compatible with CPU, GPU and MPS devices. We can fine-tune ColBERT from any Sentence Transformer pre-trained checkpoint, while Splade and SparseEmbed are trickier to fine-tune and require an MLM pre-trained model.
Installation
We can install neural-cherche using:
pip install neural-cherche
If we plan to evaluate our model during training, install:
pip install "neural-cherche[eval]"
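A quick way to check the installation is to import the modules used throughout this guide. This is only a sanity check; the module names are the ones that appear in the examples below.

# Sanity check: import the modules used in the examples below.
from neural_cherche import models, rank, retrieve, train, utils

print("neural-cherche imported successfully")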
Documentation
The complete documentation is available here.
Quick Start
Your training dataset must be made of triples (anchor, positive, negative), where the anchor is a query, the positive is a document directly relevant to the anchor, and the negative is a document that is not relevant to the anchor.
X = [
    ("anchor 1", "positive 1", "negative 1"),
    ("anchor 2", "positive 2", "negative 2"),
    ("anchor 3", "positive 3", "negative 3"),
]
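If your data is stored as (query, relevant document) pairs rather than triples, you can build the negatives yourself, for instance by sampling a random document that is not the positive. Below is a minimal sketch in plain Python; it does not use any Neural-Cherche API and the pair contents are placeholders.

import random

# Hypothetical (query, relevant document) pairs.
pairs = [
    ("anchor 1", "positive 1"),
    ("anchor 2", "positive 2"),
    ("anchor 3", "positive 3"),
]

documents = [positive for _, positive in pairs]

# Build (anchor, positive, negative) triples by sampling a random
# document that is not the positive as the negative.
X = [
    (anchor, positive, random.choice([doc for doc in documents if doc != positive]))
    for anchor, positive in pairs
]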
And here is how to fine-tune ColBERT from a Sentence Transformer pre-trained checkpoint using neural-cherche:
import torch

from neural_cherche import models, utils, train

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
]

for step, (anchor, positive, negative) in enumerate(utils.iter(
    X,
    epochs=1,  # number of epochs
    batch_size=8,  # number of triples per batch
    shuffle=True,
)):
    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )

    if (step + 1) % 1000 == 0:
        # Save the model every 1000 steps.
        model.save_pretrained("checkpoint")
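Once saved, the checkpoint directory can presumably be reloaded the same way as a pre-trained checkpoint, by passing its path as model_name_or_path. This is a sketch, assuming save_pretrained writes a directory that models.ColBERT can load from.

import torch

from neural_cherche import models

# Reload the fine-tuned weights from the "checkpoint" directory
# written by model.save_pretrained("checkpoint") above.
model = models.ColBERT(
    model_name_or_path="checkpoint",
    device="cuda" if torch.cuda.is_available() else "cpu",
)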
Retrieval
Here is how to use a BM25 retriever together with the fine-tuned ColBERT model to retrieve and re-rank documents:
import torch
from lenlp import sparse

from neural_cherche import models, rank, retrieve

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    {"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
    {"id": "doc3", "title": "Bordeaux", "text": "Bordeaux is in Southwestern France."},
]

retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(
        normalize=True, ngram_range=(3, 5), analyzer="char_wb", stop_words=[]
    ),
    k1=1.5,
    b=0.75,
    epsilon=0.0,
)

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(
    documents=documents,
)

retriever.add(
    documents_embeddings=documents_embeddings,
)
Now we can retrieve documents using the fine-tuned model:
queries = ["Paris", "Montreal", "Bordeaux"]

queries_embeddings = retriever.encode_queries(
    queries=queries,
)

ranker_queries_embeddings = ranker.encode_queries(
    queries=queries,
)

candidates = retriever(
    queries_embeddings=queries_embeddings,
    batch_size=32,
    k=100,  # number of documents to retrieve
)

# Compute embeddings of the candidates with the ranker model.
# Note: we could also pre-compute all the embeddings.
ranker_documents_embeddings = ranker.encode_candidates_documents(
    candidates=candidates,
    documents=documents,
    batch_size=32,
)

scores = ranker(
    queries_embeddings=ranker_queries_embeddings,
    documents_embeddings=ranker_documents_embeddings,
    documents=candidates,
    batch_size=32,
)
scores
[[{'id': 'doc1', 'similarity': 22.825355529785156},
  {'id': 'doc2', 'similarity': 11.201947212219238},
  {'id': 'doc3', 'similarity': 10.748161315917969}],
 [{'id': 'doc2', 'similarity': 23.21628189086914},
  {'id': 'doc1', 'similarity': 9.9658203125},
  {'id': 'doc3', 'similarity': 7.308732509613037}],
 [{'id': 'doc2', 'similarity': 6.4031805992126465},
  {'id': 'doc1', 'similarity': 5.601611137390137},
  {'id': 'doc3', 'similarity': 5.599479675292969}]]
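As the comment in the code above notes, the ranker embeddings for the whole corpus could also be pre-computed once and cached on disk so they are not recomputed at query time. Below is a minimal sketch reusing the ranker and documents defined above; it assumes ranker.encode_documents accepts the full document list like retriever.encode_documents does, and uses pickle for storage (pickle is our choice here, not a Neural-Cherche API).

import pickle

# Pre-compute ColBERT embeddings for the whole corpus once.
# Assumption: ranker.encode_documents mirrors retriever.encode_documents.
ranker_documents_embeddings = ranker.encode_documents(
    documents=documents,
    batch_size=32,
)

# Cache the embeddings on disk to avoid recomputing them later.
with open("ranker_documents_embeddings.pkl", "wb") as f:
    pickle.dump(ranker_documents_embeddings, f)

# Reload the cached embeddings instead of encoding the documents again.
with open("ranker_documents_embeddings.pkl", "rb") as f:
    ranker_documents_embeddings = pickle.load(f)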
Neural-Cherche provides a SparseEmbed, a Splade, a TfIdf and a BM25 retriever, as well as a ColBERT ranker which can be used to re-order the output of a retriever. For more information, please refer to the documentation.
Pre-trained Models
We provide pre-trained checkpoints designed specifically for neural-cherche: raphaelsty/neural-cherche-sparse-embed and raphaelsty/neural-cherche-colbert. These checkpoints are fine-tuned on a subset of the MS MARCO dataset and would benefit from being fine-tuned further on your specific dataset. You can fine-tune ColBERT from any Sentence Transformer pre-trained checkpoint in order to fit your specific language. You should use an MLM-based checkpoint to fine-tune SparseEmbed (see the sketch after the results table below).
SciFact dataset:

| model | HuggingFace Checkpoint | ndcg@10 | hits@10 | hits@1 |
|---|---|---|---|---|
| TfIdf | - | 0.62 | 0.86 | 0.50 |
| BM25 | - | 0.69 | 0.92 | 0.56 |
| SparseEmbed | raphaelsty/neural-cherche-sparse-embed | 0.62 | 0.87 | 0.48 |
| Sentence Transformer | sentence-transformers/all-mpnet-base-v2 | 0.66 | 0.89 | 0.53 |
| ColBERT | raphaelsty/neural-cherche-colbert | 0.70 | 0.92 | 0.58 |
| TfIdf Retriever + ColBERT Ranker | raphaelsty/neural-cherche-colbert | 0.71 | 0.94 | 0.59 |
| BM25 Retriever + ColBERT Ranker | raphaelsty/neural-cherche-colbert | 0.72 | 0.95 | 0.59 |
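As noted above, SparseEmbed (like Splade) should be fine-tuned from an MLM pre-trained checkpoint. Below is a minimal sketch of what that might look like, assuming models.SparseEmbed and train.train_sparse_embed mirror the ColBERT training API shown in the Quick Start; bert-base-uncased is only an example of an MLM checkpoint.

import torch

from neural_cherche import models, train, utils

# Assumption: models.SparseEmbed and train.train_sparse_embed mirror
# the ColBERT API used in the Quick Start above.
model = models.SparseEmbed(
    model_name_or_path="bert-base-uncased",  # any MLM pre-trained checkpoint
    device="cuda" if torch.cuda.is_available() else "cpu",
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("anchor 1", "positive 1", "negative 1"),
    ("anchor 2", "positive 2", "negative 2"),
]

for step, (anchor, positive, negative) in enumerate(
    utils.iter(X, epochs=1, batch_size=8, shuffle=True)
):
    loss = train.train_sparse_embed(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
    )

model.save_pretrained("sparse-embed-checkpoint")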
References
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2021.
- SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval, Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2022.
- SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval, Weize Kong, Jeffrey M. Dudek, Cheng Li, Mingyang Zhang, Mike Bendersky, SIGIR 2023.
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, Omar Khattab, Matei Zaharia, SIGIR 2020.
License
This Python library is licensed under the MIT open-source license. The Splade model is licensed by its authors for non-commercial use only, while SparseEmbed and ColBERT are fully open source, including for commercial usage.