Hybrid document retrieval: BM25 + TF-IDF fused with Reciprocal Rank Fusion, with bilingual FR/EN stopwords.

Project description

hybrid-retrieval-scoring

Lightweight hybrid document retrieval for Python. It combines two complementary relevance signals — BM25 keyword scoring and TF-IDF cosine similarity — and merges their rankings with Reciprocal Rank Fusion (RRF). Stopword lists ship bilingual (French + English) so mixed-language corpora work out of the box.

Why these three pieces

BM25 is a strong sparse keyword ranker: it rewards rare query terms and saturates term frequency, which makes it robust for short documents and exact-match queries. It is provided by the fast bm25s library and can be built once and persisted to disk.
TF-IDF cosine similarity (via scikit-learn's TfidfVectorizer) gives a second sparse signal with different normalization characteristics. Where BM25 and TF-IDF disagree, the fusion step arbitrates.
Reciprocal Rank Fusion merges any number of ranked lists without needing to calibrate their raw scores onto a common scale. Each list contributes weight / (k + rank) per document, with k = 60 (the standard constant from Cormack et al.). Documents that rank high across multiple signals rise to the top.
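
The scoring rule described above can be sketched from scratch in a few lines. This is a minimal illustration, not the package's own reciprocal_rank_fusion: the name rrf_sketch, the 1-based rank convention, and the tie-break on document index are assumptions made for the example.

```python
from collections import defaultdict


def rrf_sketch(rankings, weights=None, k=60):
    """Fuse ranked lists of document ids; returns [(doc_id, score), ...], best first.

    Hypothetical helper for illustration only. Ranks are assumed to start at 1.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("need exactly one weight per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for every document it ranks.
            scores[doc_id] += weight / (k + rank)

    # Highest fused score first; ties broken by document id (an assumption).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


first_signal = [2, 0, 1]    # e.g. a keyword ranking, best first
second_signal = [0, 2, 1]   # e.g. a cosine-similarity ranking, best first
fused = rrf_sketch([first_signal, second_signal], weights=[1.0, 0.7])
print([doc_id for doc_id, _ in fused])   # [2, 0, 1]
```

Because only rank positions enter the formula, the raw scores of the two signals never need to share a scale. The weights simply tilt the result toward the signal that is trusted more, which is why document 2 (first in the heavier list) edges out document 0 here.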

Why bilingual FR/EN stopwords

Stopwords (function words like the, le, of, de) carry little retrieval signal and add term-frequency noise. A corpus that mixes French and English documents needs both lists applied: with only one of them, the other language's function words stay in the index and in the queries. The default STOPWORDS_BILINGUAL is the union of a French and an English list, so a French query against a mixed corpus is not diluted by untrimmed English function words (and the reverse).
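
To make the effect concrete, here is a small sketch of union-based filtering. The word sets FR_SAMPLE and EN_SAMPLE are tiny made-up samples, and content_terms is a hypothetical helper. Neither reflects the actual contents of STOPWORDS_FR / STOPWORDS_EN or the package's tokenizer.

```python
import re

# Tiny illustrative samples, NOT the package's real stopword lists.
FR_SAMPLE = frozenset({"le", "la", "les", "de", "du", "sur", "un", "une"})
EN_SAMPLE = frozenset({"the", "of", "over", "a", "an", "on"})
BILINGUAL_SAMPLE = FR_SAMPLE | EN_SAMPLE   # union of both languages


def content_terms(text, stopwords):
    """Lowercase, split on word characters, drop stopwords (hypothetical helper)."""
    return [tok for tok in re.findall(r"\w+", text.lower()) if tok not in stopwords]


mixed = "Le rapport of the committee sur la pluie"

# French list only: the English function words survive.
print(content_terms(mixed, FR_SAMPLE))         # ['rapport', 'of', 'the', 'committee', 'pluie']

# Union of both lists: only content words remain.
print(content_terms(mixed, BILINGUAL_SAMPLE))  # ['rapport', 'committee', 'pluie']
```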

Install

From PyPI:

pip install hybrid-retrieval-scoring

Or from source (GitHub):

pip install git+https://github.com/JohnLinotte/hybrid-retrieval-scoring.git

Runtime dependencies: bm25s, scikit-learn, numpy.

Usage

Build and persist a BM25 index

from hybrid_retrieval import BM25Scorer

corpus = [
    "Le chat dort sur le canapé du salon.",
    "The quick brown fox jumps over the lazy dog.",
    "La météo annonce de la pluie demain matin.",
]

scorer = BM25Scorer()          # bilingual FR+EN stopwords by default
scorer.build_index(corpus)
scorer.save_index()            # written to BM25_INDEX_PATH (default /tmp/hybrid-retrieval/bm25_index)

The index location is parameterizable. Pass it to the constructor, or set the BM25_INDEX_PATH environment variable:

from pathlib import Path
scorer = BM25Scorer(index_path=Path("./my_index"))

Score a query

from hybrid_retrieval import BM25Scorer

scorer = BM25Scorer()
scorer.load_index()            # loads the persisted index + corpus
hits = scorer.score("chat canapé", k=5)
# hits == [(0, 1.83), ...]  -> [(doc_index, bm25_score), ...] sorted desc

TF-IDF ranking

from hybrid_retrieval import tfidf_ranked

corpus = [
    "Le chat dort sur le canapé.",
    "The quick brown fox.",
    "La pluie tombe sur la ville.",
]
ranking = tfidf_ranked(corpus, "chat canapé", top_k=3)
# ranking == [0, ...]  -> document indices, best first
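
For readers curious what a TF-IDF cosine ranking computes, the following is a standard-library sketch of the idea. It is not the package's tfidf_ranked (which relies on scikit-learn's TfidfVectorizer). The name tfidf_rank_sketch, the regex tokenizer, and the smoothed IDF formula are assumptions chosen for the example, and no stopword removal is applied.

```python
import math
import re
from collections import Counter


def tfidf_rank_sketch(corpus, query, top_k=None):
    """Return document indices ordered by TF-IDF cosine similarity, best first."""
    docs = [re.findall(r"\w+", text.lower()) for text in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    def vectorize(tokens):
        # Raw term count times a smoothed IDF; terms unseen in the corpus are ignored.
        counts = Counter(tok for tok in tokens if tok in df)
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(re.findall(r"\w+", query.lower()))
    scores = [cosine(query_vec, vectorize(doc)) for doc in docs]
    # Sort by descending score; ties keep the original document order.
    order = sorted(range(n_docs), key=lambda i: (-scores[i], i))
    return order[:top_k]


corpus = [
    "Le chat dort sur le canapé.",
    "The quick brown fox.",
    "La pluie tombe sur la ville.",
]
print(tfidf_rank_sketch(corpus, "chat canapé", top_k=1))   # [0]
```

Only the first document shares any term with the query, so it is the sole document with a non-zero cosine similarity.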

Fuse two rankings with RRF

from hybrid_retrieval import BM25Scorer, tfidf_ranked, reciprocal_rank_fusion

corpus = [...]                 # placeholder: replace with your list of document strings
query = "chat canapé"

bm25 = BM25Scorer()
bm25.build_index(corpus)
bm25_ranking = [idx for idx, _score in bm25.score(query, k=None)]
tfidf_ranking = tfidf_ranked(corpus, query)

fused = reciprocal_rank_fusion(
    [bm25_ranking, tfidf_ranking],
    weights=[1.0, 0.7],        # optional per-signal weights
    k=60,                      # standard RRF constant
)
# fused == [(doc_index, rrf_score), ...] sorted by score descending

A document ranked highly by both BM25 and TF-IDF ends up on top, even if no single signal placed it first.
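
A quick worked check of that statement, assuming equal weights of 1, k = 60, and ranks counted from 1. The rank positions are made up for illustration: document A is ranked 2nd by both signals, while document B is ranked 1st by one signal and 5th by the other.

```latex
\begin{align*}
\mathrm{RRF}(A) &= \frac{1}{60 + 2} + \frac{1}{60 + 2} = \frac{2}{62} \approx 0.03226 \\
\mathrm{RRF}(B) &= \frac{1}{60 + 1} + \frac{1}{60 + 5} = \frac{1}{61} + \frac{1}{65} \approx 0.03178
\end{align*}
```

A outranks B even though neither signal placed A first. The margin is narrow, though: a document ranked 1st and 3rd would score 1/61 + 1/63 ≈ 0.03227, just above A, so consistency wins only when the alternative's weaker rank is sufficiently far down the list.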

Public API

Symbol	Description
`ScoredItem`	Dataclass for a scored document (`index`, `score`, `method`, `text`, `source_type`, `priority`).
`BM25Scorer`	BM25 scorer with `build_index` / `save_index` / `load_index` / `score`.
`reciprocal_rank_fusion`	Merge ranked lists with RRF (`k=60`).
`tfidf_ranked`	TF-IDF cosine-similarity ranking.
`STOPWORDS_FR`, `STOPWORDS_EN`, `STOPWORDS_BILINGUAL`	Stopword lists.
`INDEX_PATH`	Default BM25 index location.

License

MIT — see LICENSE.

Project details

This version

0.1.0

Jun 18, 2026

Download files

Source Distribution

hybrid_retrieval_scoring-0.1.0.tar.gz (12.1 kB view details)

Uploaded Jun 18, 2026 Source

Built Distribution

hybrid_retrieval_scoring-0.1.0-py3-none-any.whl (9.7 kB view details)

Uploaded Jun 18, 2026 Python 3

File details

Details for the file hybrid_retrieval_scoring-0.1.0.tar.gz.

File metadata

Download URL: hybrid_retrieval_scoring-0.1.0.tar.gz
Upload date: Jun 18, 2026
Size: 12.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for hybrid_retrieval_scoring-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`cb98614e801206a3019f04d76df26adedbaa472b5503dff6c0648e926a7d605b`
MD5	`ca2af17da3d3f6b9726bc9492b7111dc`
BLAKE2b-256	`f8e4cccb5fd4a23a09f5add1b14102ecd39e8f06cbfb406b8288acd324a43bed`

File details

Details for the file hybrid_retrieval_scoring-0.1.0-py3-none-any.whl.

File metadata

Download URL: hybrid_retrieval_scoring-0.1.0-py3-none-any.whl
Upload date: Jun 18, 2026
Size: 9.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for hybrid_retrieval_scoring-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d7dedc23a6004e1adef88a9b67089973979207dc8b472bc228ecc8096e743258`
MD5	`4d3b4ac4805306223929d3a85b796280`
BLAKE2b-256	`af96decfe5afe68530b8fd08d96e079c6a4501f7e0d5d5e72be4568792473e90`
