BM25 Vectorizer (Scikit-learn Compatible)
Project description
BM25 vectorizer
BM25 Transformers and Vectorizer This Python package provides implementations of BM25, BM25L (two variants), BM25+, BM25-adpt, BM25T, and TF₁ₐₚ × IDF ranking functions as scikit-learn compatible transformers and a vectorizer. These are used for information retrieval and text processing, extending the traditional TF-IDF approach with document length normalization and term frequency saturation.
Installation
pip install git+https://github.com/imvladikon/bm25_vectorizer -q
Usage
Similar to tf-idf from sklearn,
BM25Vectorizer(transformer="bm25plus").fit(corpus)
where transformer can be one of the following: bm25, bm25l, bm25l_canonical, bm25plus, bm25adpt, bm25t, tfidf1ap
Feature Extraction
from bm25_vectorizer import BM25Vectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = BM25Vectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.data)
Similarity Calculation
from bm25_vectorizer import BM25Vectorizer
corpus = [
"the quick brown fox jumps over the lazy dog",
"never jump over the lazy dog quickly"
]
vec = BM25Vectorizer(transformer="bm25plus").fit(corpus)
print(vec.similarity("quick brown fox", "lazy dog", metric="cosine"))
print(vec.similarity("fox lazy", "lazy fox", metric="jaccard"))
Ranking
from bm25_vectorizer import BM25Vectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = BM25Vectorizer()
X = vectorizer.fit_transform(corpus)
query = 'first document'
query_vector = vectorizer.transform([query])
similarity = cosine_similarity(X, query_vector)
ranked_indices = np.argsort(similarity.flatten())[::-1]
print("Ranked documents:", ranked_indices)
Direct Retrieval Scores
Use score() when you want direct query-document BM25 scores rather than cosine
similarity over sparse feature vectors:
from bm25_vectorizer import BM25Vectorizer
import numpy as np
corpus = [
"the quick brown fox",
"the lazy dog",
"document retrieval system",
]
vectorizer = BM25Vectorizer(transformer="bm25plus").fit(corpus)
scores = vectorizer.score(["quick fox"])
ranked_indices = vectorizer.rank(["quick fox"])[0]
print("Ranked documents:", ranked_indices)
transform() intentionally returns a sparse document-term feature matrix for
scikit-learn interoperability. score() performs query-document scoring
directly. For BM25+, this includes the canonical delta * idf contribution for
query terms that are absent from a document, which cannot be represented exactly
in a sparse feature matrix without materializing dense absent-term weights.
rank() is a convenience wrapper around score() that returns descending
document indices, optionally with sorted scores. It ranks queries in batches;
lower batch_size to reduce peak memory on large query sets.
Classes
- BM25TransformerBase: Abstract base class for BM25 transformers.
- BM25Transformer: Implements the standard BM25 scoring function.
- BM25LTransformer: Implements BM25L in the
rank_bm25-compatible form (carries an extratf_tdmultiplier outside the normalised-tf factor; deviates from the original Lv & Zhai 2011 paper). - BM25LCanonicalTransformer: Implements canonical BM25L from Lv & Zhai (2011), as reviewed by
Kamphuis et al. (ECIR 2020) and used by the
bm25slibrary. No extratf_tdfactor; retrieval scoring (score()/rank()) includes the constant absent-term baselineidf · (k₁+1)·δ / (k₁+δ)for query terms not present in a document. - BM25PlusTransformer: Implements BM25+, which adds a constant boost to scores.
- BM25AdptTransformer: Implements BM25-adpt, using term-specific $k_1^t$ via information gain.
- BM25TTransformer: Implements BM25T, using term-specific $k_1^t$ via log-logistic estimation.
- TFIDFTransformer: Implements TF₁ₐₚ × IDF, using logarithmic term frequency transformation.
- BM25Vectorizer: Combines CountVectorizer with a BM25 transformer for end-to-end text processing.
Parameters
- k1: Controls term frequency saturation (float, default: 1.5).
- b: Controls document length normalization (float, default: 0.75).
- delta: Additional parameter for BM25L, BM25+, and TF₁ₐₚ × IDF (float, default: 1.0).
- epsilon: Minimum IDF value to prevent negative IDFs (float, default: 0.25).
- use_idf: Whether to apply IDF weighting (bool, default: True).
BM25 Formulas
Below are the formulas for the BM25 variants implemented in this package, provided for validation:
-
ATIRE BM25 IDF: $\text{idf}(t) = \log\left(\frac{N}{n(t)}\right)$
-
ATIRE BM25 Score: $\text{BM25}(t,d) = \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$
-
Standard BM25 IDF: $\text{idf}(t) = \log\left(\frac{N - n(t) + 0.5}{n(t) + 0.5}\right)$
-
Standard BM25 Score: $\text{BM25}(t,d) = \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$
-
BM25L IDF (both variants): $\text{idf}(t) = \log\left(\frac{N + 1}{n(t) + 0.5}\right)$
-
BM25L (
bm25l, rank_bm25-compatible) Score: $\text{BM25L}(t,d) = \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1) \cdot (c(t,d) + \delta)}{k_1 + c(t,d) + \delta}$ where $c(t,d) = \frac{f(t,d)}{1 - b + b \cdot \frac{|d|}{\text{avgdl}}}$ -
BM25L canonical (
bm25l_canonical, Lv & Zhai 2011 / Kamphuis 2020 /bm25s) Score: $\text{BM25L}(t,d) = \text{IDF}(t) \cdot \frac{(k_1 + 1) \cdot (c(t,d) + \delta)}{k_1 + c(t,d) + \delta}$ when $f(t,d) > 0$; $\text{IDF}(t) \cdot \frac{(k_1 + 1) \cdot \delta}{k_1 + \delta}$ when $f(t,d) = 0$ (absent-term baseline, included byscore()/rank()). -
BM25+ IDF: $\text{idf}(t) = \log\left(\frac{N + 1}{n(t)}\right)$
-
BM25+ Score: $\text{BM25+}(t,d) = \text{IDF}(t) \cdot \left( \delta + \frac{f(t,d) \cdot (k_1 + 1)}{k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}}) + f(t,d)} \right)$
-
BM25-adpt IDF: $\text{idf}(t) = -\log_2\left(\frac{n(t) + 0.5}{N + 1}\right)$
-
BM25-adpt Score: $\text{BM25-adpt}(t,d) = \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1^t + 1)}{k_1^t \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}}) + f(t,d)}$ where $k_1^t$ is a term-specific parameter computed via information gain.
-
BM25T IDF: $\text{idf}(t) = \log\left(\frac{N + 1}{n(t) + 0.5}\right)$
-
BM25T Score: $\text{BM25T}(t,d) = \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1^t + 1)}{f(t,d) + k_1^t \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$ where $k_1^t$ is a term-specific parameter computed via log-logistic estimation.
-
TF1ap × IDF IDF: $\text{idf}(t) = \ln\left(\frac{N + 1}{n(t)}\right)$
-
TF1ap × IDF Score: $\text{TF1ap}(t,d) = \text{IDF}(t) \cdot \left(1 + \ln\left(1 + \ln\left(\frac{f(t,d)}{1 - b + b \cdot \frac{|d|}{\text{avgdl}}} + \delta\right)\right)\right)$
Notation
$N$: Total number of documents.
$n(t)$: Number of documents containing term $t$.
$f(t,d)$: Frequency of term $t$ in document $d$.
$|d|$: Length of document $d$.
$\text{avgdl}$: Average document length across the collection.
$k_1$: Term frequency saturation parameter (default: 1.5).
$k_1^t$: Term-specific saturation parameter for BM25-adpt and BM25T.
$b$: Document length normalization parameter (default: 0.75).
$\delta$: Additional parameter for BM25L, BM25+, and TF₁ₐₚ × IDF (default: 1.0).
$\epsilon$: IDF smoothing parameter (default: 0.25).
Validation snapshot
Recent checks were run to verify practical retrieval behavior and formula consistency:
- Real retrieval test (Hugging Face
ag_news, 1000 documents): BM25-based retrieval producedtop-1 = 0.772andtop-5 hit = 0.953(random top-1 baseline:0.278). - Reference agreement (
rank_bm25): Forbm25,bm25l,bm25plus, scores matchrank_bm25exactly on shared corpora/queries (np.allcloseover all queries; rank correlation = 1.0). - Reference agreement (
bm25s): Forbm25l_canonical,score()matchesbm25s(method="bm25l") exactly on shared corpora/queries, including the absent-term baseline contribution. - Oracle and invariants checks:
Expected ranking behavior (coverage, rare-term preference, length normalization) and core invariants (TF saturation, IDF ordering,
b=0vsb=1) were confirmed.
Note on BM25-adpt
bm25adpt can produce negative raw term weights due to its information-gain formulation. This is expected for that variant and does not imply incorrect ranking behavior.
For retrieval, compare documents by ranking/similarity (e.g., cosine over transformed vectors) rather than by interpreting absolute score values across queries.
References
- https://github.com/dorianbrown/rank_bm25
- https://github.com/xhluca/bm25s
- http://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf "Improvements to BM25 and Language Models Examined", Trotman et al.
- Lv & Zhai (CIKM 2011), "When Documents Are Very Long, BM25 Fails!" (BM25L).
- Kamphuis, de Vries, Boytsov, Lin (ECIR 2020), "Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scalable Retrieval Libraries".
- https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file bm25_vectorizer-0.0.4.tar.gz.
File metadata
- Download URL: bm25_vectorizer-0.0.4.tar.gz
- Upload date:
- Size: 27.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7001310f7bd1b9e7042e7fb2bd3cf732a0afb90a589371fc1ab27be6eacc77dc
|
|
| MD5 |
a06fc2ce372d3e6ed1d8117ea435ff67
|
|
| BLAKE2b-256 |
b1d1d45c559e39c16de9810061f501dee5cd1a4099746f42b97c9d2e5e3d4347
|
File details
Details for the file bm25_vectorizer-0.0.4-py3-none-any.whl.
File metadata
- Download URL: bm25_vectorizer-0.0.4-py3-none-any.whl
- Upload date:
- Size: 15.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2fb75051f3d58decaf6bf44b36a42678019bdff4f5bdddf0b7b2e673655445b8
|
|
| MD5 |
4de25562e9b813fecb50fad8038e4aff
|
|
| BLAKE2b-256 |
b8b815a618dc30ed7e76d1b03deecaf426f9a434d3702ad44d7f35b04f322358
|