
Yet Another BM25 algorithm implementation with an update method, so that you don't need to refit on old + new documents.

Project description

yeabm25

Yet Another BM25 algorithm implementation with two helpful additions:

  1. The ability to update the index with the .update() method. In fact, you can use just .update().
  2. A vector per document.

import re

import nltk
import numpy as np
from nltk.corpus import stopwords
from yeabm25 import YeaBM25

nltk.download('stopwords', quiet=True)
stopwords_en = set(stopwords.words('english'))

def normalize_for_bm(text: str):
    # replace everything except ASCII letters and digits with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    words = text.lower().split()
    return [word for word in words if word not in stopwords_en]

corpus = ["The quick brown fox jumps over the lazy dog",
          "The lazy dog is brown",
          "The fox is brown",
          "Hello there good man!",
          "It is quite windy in London",
          "How is the weather today man?",
          ]
normalized_corpus = [normalize_for_bm(txt) for txt in corpus]

# fitting the whole corpus
yeabm = YeaBM25(epsilon=0.25)
yeabm.fit(normalized_corpus)

# fit and then update 
bm_update = YeaBM25(epsilon=0.25)
bm_update.fit(normalized_corpus[:3])
bm_update.update(normalized_corpus[3:])

assert yeabm.doc_len == bm_update.doc_len
assert yeabm.average_idf == bm_update.average_idf
assert yeabm.idf == bm_update.idf
assert (yeabm.get_scores(['fox', 'jump']) == bm_update.get_scores(['fox', 'jump'])).all()
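
Why can fit-then-update match a full fit? A minimal sketch, assuming the index only needs per-document lengths and per-term document frequencies, and that idf is recomputed from those counts. The class `IncrementalStats` and its methods are hypothetical and only illustrate the idea; they are not the YeaBM25 internals.

```python
import math
from collections import Counter


class IncrementalStats:
    """Toy corpus statistics that can be extended without refitting (hypothetical)."""

    def __init__(self):
        self.doc_len = []          # token count per document
        self.doc_freq = Counter()  # number of documents containing each term

    def update(self, tokenized_docs):
        for tokens in tokenized_docs:
            self.doc_len.append(len(tokens))
            # set() so that each term is counted once per document
            self.doc_freq.update(set(tokens))

    @property
    def avgdl(self):
        return sum(self.doc_len) / len(self.doc_len)

    def idf(self, term):
        # Classic Okapi idf, recomputed on demand from the current counts.
        n = self.doc_freq[term]
        total = len(self.doc_len)
        return math.log(total - n + 0.5) - math.log(n + 0.5)


docs = [["quick", "brown", "fox"], ["lazy", "dog", "brown"],
        ["fox", "brown"], ["hello", "good", "man"]]

full = IncrementalStats()
full.update(docs)

stepwise = IncrementalStats()
stepwise.update(docs[:2])
stepwise.update(docs[2:])

# Both routes end with identical statistics, hence identical idf values.
assert full.doc_len == stepwise.doc_len
assert full.doc_freq == stepwise.doc_freq
assert full.idf("fox") == stepwise.idf("fox")
```

Because the counts are additive, the order and batching of documents do not matter; only values derived from them (idf, average document length) have to be refreshed after each update.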

This work is inspired by (and uses some code and ideas from) the great rank_bm25 package: https://github.com/dorianbrown/rank_bm25/tree/master. The main focus is creating document and query vectors (sparse vector support is coming soon, hopefully not "Blizzard soon"), which you can then use with your favourite vector DB.

How to get the document and query vectors:

# recommended approach for a large corpus: returns a generator,
# where each element is a list[float]
yeabm.iter_document_vectors()
# use it like so:
for vector in yeabm.iter_document_vectors():
    # dostuff is a placeholder, e.g. writing the vector to a DB
    dostuff(vector)

query = ...
yeabm.encode_query(query)

Why would you want to do that? The BM25 score is essentially a sum over the query terms, so it maps directly onto a metric that most vector DBs support: the inner product.

# score every document in the corpus against the query
yeabm.get_scores(['quick', 'fox'])
# ~ [1.30, 0.0, 0.72, 0.0, 0.0, 0.0]

# you get the same scores from an inner product (np is numpy):
yeabm.get_embeddings() @ np.asarray(yeabm.encode_query(['fox', 'quick']))
# ~ [1.30, 0.0, 0.72, 0.0, 0.0, 0.0]

Of course, you would leave this last calculation to the vector DB.
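
To see why the inner product reproduces the BM25 score, here is a small self-contained sketch. It assumes the classic Okapi formula with k1 = 1.5 and b = 0.75, puts the term-frequency part in the document vector and the idf part in the query vector, and uses the hypothetical helpers `doc_vector`, `query_vector` and `bm25_direct`. The package may split the factors differently.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # common BM25 defaults, assumed for this sketch

docs = [["quick", "brown", "fox"], ["lazy", "dog"], ["fox", "fox", "dog", "hello"]]
vocab = sorted({t for d in docs for t in d})
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))
# Classic Okapi idf; the identity below holds for any idf definition.
idf = {t: math.log(N - df[t] + 0.5) - math.log(df[t] + 0.5) for t in vocab}


def doc_vector(tokens):
    """One component per vocabulary term: the saturated, length-normalized tf."""
    tf = Counter(tokens)
    norm = K1 * (1 - B + B * len(tokens) / avgdl)
    return [tf[t] * (K1 + 1) / (tf[t] + norm) for t in vocab]


def query_vector(tokens):
    """idf at the positions of the query terms, zero elsewhere."""
    q = set(tokens)
    return [idf[t] if t in q else 0.0 for t in vocab]


def bm25_direct(query, tokens):
    """Textbook BM25: a sum over the distinct query terms."""
    tf = Counter(tokens)
    norm = K1 * (1 - B + B * len(tokens) / avgdl)
    return sum(idf[t] * tf[t] * (K1 + 1) / (tf[t] + norm)
               for t in set(query) if t in idf)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


query = ["fox", "dog"]
for d in docs:
    # The inner product and the direct sum agree up to floating-point rounding.
    assert math.isclose(dot(doc_vector(d), query_vector(query)), bm25_direct(query, d))
```

Each term of the BM25 sum is a product of a query-only factor and a document-only factor, which is exactly the shape of an inner product over the vocabulary.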

One more opinionated choice: words that are found in a large share of the corpus (for example "brown", which is in half of the documents here) do not end up with an idf of 0. The idf is small but still positive. Compare with another implementation:

from rank_bm25 import BM25Okapi
okapi = BM25Okapi(normalized_corpus)
okapi.get_scores(['brown']) 
# [0. 0. 0. 0. 0. 0.]
# whereas
yeabm.get_scores(['brown'])
# [0.18 0.28 0.33 0. 0. 0.]
# This helps when the user searches for a term that is abundant in the corpus:
# the results are still somewhat useful, whereas with BM25Okapi you would get
# essentially random results (or no results).
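
The zero score for "brown" follows from the classic Okapi idf: a term found in exactly half of the documents gets log(1) = 0. The sketch below shows that, plus one possible way to keep such terms positive, namely replacing every non-positive idf with epsilon times the average idf. This floor rule and the helper `floored_idf` are assumptions for illustration, not necessarily what YeaBM25 does internally.

```python
import math


def okapi_idf(n_docs, doc_freq):
    """Classic Okapi idf: log((N - n + 0.5) / (n + 0.5))."""
    return math.log(n_docs - doc_freq + 0.5) - math.log(doc_freq + 0.5)


def floored_idf(doc_freqs, n_docs, epsilon=0.25):
    """Replace non-positive idf values with epsilon * average idf (assumed rule)."""
    raw = {t: okapi_idf(n_docs, n) for t, n in doc_freqs.items()}
    average_idf = sum(raw.values()) / len(raw)
    floor = epsilon * average_idf  # positive as long as the average idf is positive
    return {t: (v if v > 0 else floor) for t, v in raw.items()}


# Document frequencies taken from the six-document corpus above.
doc_freqs = {"brown": 3, "fox": 2, "windy": 1}
idf = floored_idf(doc_freqs, n_docs=6)

assert okapi_idf(6, 3) == 0.0   # "brown" is in 3 of 6 documents
assert idf["brown"] > 0         # the floor keeps it small but positive
assert idf["brown"] < idf["fox"] < idf["windy"]
```

With a positive idf, documents containing the abundant term are still ranked against each other by term frequency and length instead of all scoring 0.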
