Yet another BM25 algorithm implementation with an update method, so that you don't need to refit on old + new documents.

yeabm25

Yet another BM25 algorithm implementation that adds:

  1. the ability to update the index with the .update() method. In fact, you can use just .update().
  2. per-document vectors.
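
import re

import nltk
from nltk.corpus import stopwords
from yeabm25 import YeaBM25

# fetch the NLTK stopword list (no-op if it is already downloaded)
nltk.download('stopwords', quiet=True)
stopwords_en = set(stopwords.words('english'))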

def normalize_for_bm(text: str):
    # replace everything that is not an ASCII letter or digit with a space
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    words = text.lower().split()
    return [word for word in words if word not in stopwords_en]

corpus = ["The quick brown fox jumps over the lazy dog",
          "The lazy dog is brown",
          "The fox is brown",
          "Hello there good man!",
          "It is quite windy in London",
          "How is the weather today man?",
          ]
normalized_corpus = [normalize_for_bm(txt) for txt in corpus]

# fitting the whole corpus
yeabm = YeaBM25(epsilon=0.25)
yeabm.fit(normalized_corpus)

# fit and then update 
bm_update = YeaBM25(epsilon=0.25)
bm_update.fit(normalized_corpus[:3])
bm_update.update(normalized_corpus[3:])

assert yeabm.doc_len == bm_update.doc_len
assert yeabm.average_idf == bm_update.average_idf
assert yeabm.idf == bm_update.idf
assert (yeabm.get_scores(['fox', 'jump']) == bm_update.get_scores(['fox', 'jump'])).all()
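
The equality checks above can hold because the corpus statistics behind BM25 (number of documents, total length, per-term document frequency) are additive. The following is a minimal sketch of that idea, not YeaBM25's actual implementation; the class name IncrementalBM25Stats and the use of the classic Okapi IDF formula are assumptions made for illustration.

```python
import math
from collections import Counter


class IncrementalBM25Stats:
    """Hypothetical helper: keeps only additive corpus statistics."""

    def __init__(self):
        self.n_docs = 0            # number of documents seen so far
        self.total_len = 0         # sum of document lengths
        self.doc_freq = Counter()  # term -> number of documents containing it

    def update(self, tokenized_docs):
        # Every statistic is a running sum, so new documents can be folded in
        # without revisiting the old ones.
        for doc in tokenized_docs:
            self.n_docs += 1
            self.total_len += len(doc)
            self.doc_freq.update(set(doc))

    @property
    def avgdl(self):
        return self.total_len / self.n_docs

    def idf(self):
        # Classic Okapi IDF, recomputed from the current counts (assumption).
        return {
            term: math.log((self.n_docs - df + 0.5) / (df + 0.5))
            for term, df in self.doc_freq.items()
        }


docs = [["quick", "brown", "fox"], ["lazy", "brown", "dog"],
        ["fox", "brown"], ["hello", "good", "man"]]

full = IncrementalBM25Stats()
full.update(docs)

incremental = IncrementalBM25Stats()
incremental.update(docs[:2])
incremental.update(docs[2:])
```

Both objects end up with identical counts, so any IDF or length normalization derived from them is identical as well.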

This work is inspired by (and uses some code and ideas from) this great package: https://github.com/dorianbrown/rank_bm25/tree/master. The main focus is creating document and query vectors (sparse vectors are supported), which you can then use with your favourite vector DB.

How to get the document and query vectors:

# Recommended approach for a large corpus: returns an iterator whose elements are sparse vectors.
# A sparse vector can be represented as:
# - Dict[int, float] <--- this is currently the sparse format in YeaBM25
# - any of the scipy.sparse matrix classes with shape[0] == 1
# - Iterable[Tuple[int, float]]

# returns a generator object
yeabm.iter_document_vectors() # or
yeabm.iter_document_vectors_sparse() # <--- recommended for use with a vector DB
# use it in a loop
for vector in yeabm.iter_document_vectors_sparse():
    # dostuff is a placeholder, e.g. it could insert the vector into the DB
    dostuff(vector)

query = ...
yeabm.encode_query(query)
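
The sparse representations listed in the comments above carry the same information. As a rough sketch, converting the Dict[int, float] format into the other shapes could look like this. The helper names are hypothetical and not part of YeaBM25, and the input format expected by your vector DB client may differ.

```python
from typing import Dict, List, Tuple


def dict_to_pairs(vec: Dict[int, float]) -> List[Tuple[int, float]]:
    # (index, weight) pairs sorted by index.
    return sorted(vec.items())


def dict_to_indices_values(vec: Dict[int, float]) -> Tuple[List[int], List[float]]:
    # Parallel lists of indices and values.
    pairs = dict_to_pairs(vec)
    return [i for i, _ in pairs], [v for _, v in pairs]


def dict_to_dense(vec: Dict[int, float], dim: int) -> List[float]:
    # Dense list of length dim; positions missing from vec stay 0.0.
    dense = [0.0] * dim
    for i, v in vec.items():
        dense[i] = v
    return dense


example = {7: 0.72, 2: 1.3}  # made-up weights, for illustration only
pairs = dict_to_pairs(example)
indices, values = dict_to_indices_values(example)
dense = dict_to_dense(example, dim=8)
```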

Why would you want to do that? The BM25 score is a sum over the query terms, which makes it a perfect fit for a metric that most vector DBs support: the inner product (IP).

import numpy as np

# scores computed directly by the index
yeabm.get_scores(['quick', 'fox'])
# ~ [1.30, 0.0, 0.72, 0.0, 0.0, 0.0]

# you get the same scores like so:
yeabm.get_embeddings() @ np.asarray(yeabm.encode_query_dense(['fox', 'quick']))
# ~ [1.30, 0.0, 0.72, 0.0, 0.0, 0.0]

In practice, you would leave this last calculation to the vector DB.
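
To make the sum-as-inner-product argument concrete, here is a minimal plain-Python sketch. It is not YeaBM25's code: the constants K1 and B, the always-positive log(1 + ...) IDF variant, and all function names are assumptions chosen for illustration. Each document vector stores one BM25 term weight per vocabulary index, the query vector stores term counts, and their sparse inner product reproduces the direct BM25 sum.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # common BM25 defaults (assumed)

docs = [["quick", "brown", "fox"], ["lazy", "brown", "dog"],
        ["fox", "brown"], ["hello", "man"]]
vocab = {t: i for i, t in enumerate(sorted({t for d in docs for t in d}))}
n_docs = len(docs)
avgdl = sum(len(d) for d in docs) / n_docs
doc_freq = Counter(t for d in docs for t in set(d))
# Always-positive IDF variant (assumption, used only in this sketch).
idf = {t: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
       for t, df in doc_freq.items()}


def term_weight(term, doc):
    # BM25 contribution of one term to one document.
    tf = doc.count(term)
    norm = tf + K1 * (1 - B + B * len(doc) / avgdl)
    return idf[term] * tf * (K1 + 1) / norm


def document_vector(doc):
    # Sparse vector: vocabulary index -> BM25 term weight.
    return {vocab[t]: term_weight(t, doc) for t in set(doc)}


def query_vector(query):
    # Sparse vector: vocabulary index -> count of the term in the query.
    counts = Counter(t for t in query if t in vocab)
    return {vocab[t]: float(c) for t, c in counts.items()}


def sparse_dot(a, b):
    # Inner product over the indices the two sparse vectors share.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)


def bm25_score(query, doc):
    # Direct BM25 sum over the query terms.
    return sum(term_weight(t, doc) for t in query if t in vocab)


query = ["quick", "fox"]
q_vec = query_vector(query)
via_dot = [sparse_dot(document_vector(d), q_vec) for d in docs]
direct = [bm25_score(query, d) for d in docs]
```

The two lists agree up to floating-point rounding, which is why the scoring step can be delegated to any store that ranks by inner product.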

One more opinionated design choice: words that are found in more than half of the corpus do not get an IDF of 0. Their IDF is small but still positive. Compare with other implementations:

from rank_bm25 import BM25Okapi
okapi = BM25Okapi(normalized_corpus)
okapi.get_scores(['brown']) 
# [0. 0. 0. 0. 0. 0.]
# whereas
yeabm.get_scores(['brown'])
# [0.18 0.28 0.33 0. 0. 0.]
# This helps when the user searches for a term that is abundant in the corpus: they still get somewhat useful results,
# whereas BM25Okapi would return essentially random results (or no results).
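
The zero scores come from the classic Okapi IDF, which is exactly 0 for a term that appears in half of the documents ('brown' occurs in 3 of the 6 documents here). The sketch below shows that effect and one possible way to keep the IDF positive. The floor_idf helper and its epsilon * average rule are assumptions made for illustration, not necessarily what YeaBM25 does internally.

```python
import math


def okapi_idf(doc_freq, n_docs):
    # Classic Okapi IDF: log((N - n + 0.5) / (n + 0.5)).
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def floor_idf(idf, epsilon=0.25):
    # Hypothetical rule: replace every non-positive IDF by a fraction of the
    # average IDF. The floor is positive only if the average itself is.
    floor = epsilon * sum(idf.values()) / len(idf)
    return {term: (value if value > 0 else floor) for term, value in idf.items()}


n_docs = 6
doc_freq = {"brown": 3, "fox": 2, "quick": 1}  # documents containing each term
raw_idf = {t: okapi_idf(df, n_docs) for t, df in doc_freq.items()}
floored_idf = floor_idf(raw_idf)
```

With the raw values, 'brown' contributes nothing to any score; after flooring it contributes a small positive amount, so documents containing it can still be ranked.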
