Skip to main content

Hybrid document retrieval: BM25 + TF-IDF fused with Reciprocal Rank Fusion, with bilingual FR/EN stopwords.

Project description

hybrid-retrieval-scoring

Lightweight hybrid document retrieval for Python. It combines two complementary relevance signals — BM25 keyword scoring and TF-IDF cosine similarity — and merges their rankings with Reciprocal Rank Fusion (RRF). Stopword lists ship bilingual (French + English) so mixed-language corpora work out of the box.

Why these three pieces

  • BM25 is a strong sparse keyword ranker: it rewards rare query terms and saturates term frequency, which makes it robust for short documents and exact-match queries. It is provided by the fast bm25s library and can be built once and persisted to disk.
  • TF-IDF cosine similarity (via scikit-learn's TfidfVectorizer) gives a second sparse signal with different normalization characteristics. Where BM25 and TF-IDF disagree, the fusion step arbitrates.
  • Reciprocal Rank Fusion merges any number of ranked lists without needing to calibrate their raw scores onto a common scale. Each list contributes weight / (k + rank) per document, with k = 60 (the standard constant from Cormack et al.). Documents that rank high across multiple signals rise to the top.

Why bilingual FR/EN stopwords

Stopwords (function words like the, le, of, de) carry little retrieval signal and inflate term-frequency noise. A corpus that mixes French and English documents needs both lists removed, otherwise English stopwords pollute French documents and vice versa. The default STOPWORDS_BILINGUAL is the union of a French and an English list, so a French query against a mixed corpus is not diluted by untrimmed English function words (and the reverse).

Install

From PyPI:

pip install hybrid-retrieval-scoring

Or from source (GitHub):

pip install git+https://github.com/JohnLinotte/hybrid-retrieval-scoring.git

Runtime dependencies: bm25s, scikit-learn, numpy.

Usage

Build and persist a BM25 index

from hybrid_retrieval import BM25Scorer

corpus = [
    "Le chat dort sur le canapé du salon.",
    "The quick brown fox jumps over the lazy dog.",
    "La météo annonce de la pluie demain matin.",
]

scorer = BM25Scorer()          # bilingual FR+EN stopwords by default
scorer.build_index(corpus)
scorer.save_index()            # written to BM25_INDEX_PATH (default /tmp/hybrid-retrieval/bm25_index)

The index location is parameterizable. Pass it to the constructor, or set the BM25_INDEX_PATH environment variable:

from pathlib import Path
scorer = BM25Scorer(index_path=Path("./my_index"))

Score a query

from hybrid_retrieval import BM25Scorer

scorer = BM25Scorer()
scorer.load_index()            # loads the persisted index + corpus
hits = scorer.score("chat canapé", k=5)
# hits == [(0, 1.83), ...]  -> [(doc_index, bm25_score), ...] sorted desc

TF-IDF ranking

from hybrid_retrieval import tfidf_ranked

corpus = [
    "Le chat dort sur le canapé.",
    "The quick brown fox.",
    "La pluie tombe sur la ville.",
]
ranking = tfidf_ranked(corpus, "chat canapé", top_k=3)
# ranking == [0, ...]  -> document indices, best first

Fuse two rankings with RRF

from hybrid_retrieval import BM25Scorer, tfidf_ranked, reciprocal_rank_fusion

corpus = [...]                 # your documents
query = "chat canapé"

bm25 = BM25Scorer()
bm25.build_index(corpus)
bm25_ranking = [idx for idx, _score in bm25.score(query, k=None)]
tfidf_ranking = tfidf_ranked(corpus, query)

fused = reciprocal_rank_fusion(
    [bm25_ranking, tfidf_ranking],
    weights=[1.0, 0.7],        # optional per-signal weights
    k=60,                      # standard RRF constant
)
# fused == [(doc_index, rrf_score), ...] sorted by score descending

A document ranked highly by both BM25 and TF-IDF ends up on top, even if no single signal placed it first.

Public API

Symbol Description
ScoredItem Dataclass for a scored document (index, score, method, text, source_type, priority).
BM25Scorer BM25 scorer with build_index / save_index / load_index / score.
reciprocal_rank_fusion Merge ranked lists with RRF (k=60).
tfidf_ranked TF-IDF cosine-similarity ranking.
STOPWORDS_FR, STOPWORDS_EN, STOPWORDS_BILINGUAL Stopword lists.
INDEX_PATH Default BM25 index location.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hybrid_retrieval_scoring-0.1.0.tar.gz (12.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hybrid_retrieval_scoring-0.1.0-py3-none-any.whl (9.7 kB view details)

Uploaded Python 3

File details

Details for the file hybrid_retrieval_scoring-0.1.0.tar.gz.

File metadata

  • Download URL: hybrid_retrieval_scoring-0.1.0.tar.gz
  • Upload date:
  • Size: 12.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for hybrid_retrieval_scoring-0.1.0.tar.gz
Algorithm Hash digest
SHA256 cb98614e801206a3019f04d76df26adedbaa472b5503dff6c0648e926a7d605b
MD5 ca2af17da3d3f6b9726bc9492b7111dc
BLAKE2b-256 f8e4cccb5fd4a23a09f5add1b14102ecd39e8f06cbfb406b8288acd324a43bed

See more details on using hashes here.

File details

Details for the file hybrid_retrieval_scoring-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for hybrid_retrieval_scoring-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d7dedc23a6004e1adef88a9b67089973979207dc8b472bc228ecc8096e743258
MD5 4d3b4ac4805306223929d3a85b796280
BLAKE2b-256 af96decfe5afe68530b8fd08d96e079c6a4501f7e0d5d5e72be4568792473e90

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page