
yeabm25

Yet Another BM25 algorithm implementation, with:

  1. functionality to update the index with the .update() method (in fact, you can build the index using update alone).
  2. per-document vectors.
  3. sparse vector support.

Installation:

pip install yeabm25

Quickstart

from yeabm25 import YeaBM25
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stopwords_en = set(stopwords.words('english'))

def normalize_for_bm(text: str):
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    words = text.lower().split()
    return [word for word in words if word not in stopwords_en]

corpus = ["The quick brown fox jumps over the lazy dog",
          "The lazy dog is brown",
          "The fox is brown",
          "Hello there good man!",
          "It is quite windy in London",
          "How is the weather today man?",
          ]
normalized_corpus = [normalize_for_bm(txt) for txt in corpus]

# fitting the whole corpus
yeabm = YeaBM25(epsilon=0.25)
yeabm.fit(normalized_corpus)

# fit and then update 
bm_update = YeaBM25(epsilon=0.25)
bm_update.fit(normalized_corpus[:3])
bm_update.update(normalized_corpus[3:])

assert yeabm.doc_len == bm_update.doc_len
assert yeabm.average_idf == bm_update.average_idf
assert yeabm.idf == bm_update.idf
assert (yeabm.get_scores(['fox', 'jump']) == bm_update.get_scores(['fox', 'jump'])).all()

This work is inspired by (and borrows some code and ideas from) the great rank_bm25 package - https://github.com/dorianbrown/rank_bm25/tree/master. The main focus here is creating document and query vectors (with sparse vector support), so you can use them with your favourite vector DB.

How to get the document and query vectors:

# Recommended approach for a large corpus: returns an iterator where each
# element is a sparse vector. A sparse vector can be represented as:
# - Dict[int, float] <--- this is currently the sparse format in YeaBM25
# - any of the scipy.sparse matrix classes with shape[0] == 1
# - Iterable[Tuple[int, float]]

# these methods return generator objects
yeabm.iter_document_vectors() # or
yeabm.iter_document_vectors_sparse() # <--- recommended for use with a vector DB
# use it in a loop
for vector in yeabm.iter_document_vectors_sparse():
    # do_stuff could, for example, put the vector in a DB
    do_stuff(vector)
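To illustrate the sparse formats listed above, here is a minimal hypothetical sketch (not part of the yeabm25 API) converting the Dict[int, float] format into the other two shapes - a sorted list of (index, value) pairs and a dense NumPy row:

```python
import numpy as np

def dict_to_pairs(vec: dict) -> list:
    # Dict[int, float] -> sorted Iterable[Tuple[int, float]]
    return sorted(vec.items())

def dict_to_dense(vec: dict, dim: int) -> np.ndarray:
    # Dict[int, float] -> dense 1-D vector of length dim (the vocabulary size)
    out = np.zeros(dim)
    for idx, val in vec.items():
        out[idx] = val
    return out

doc_vec = {2: 0.5, 7: 1.3}             # hypothetical sparse document vector
pairs = dict_to_pairs(doc_vec)         # [(2, 0.5), (7, 1.3)]
dense = dict_to_dense(doc_vec, dim=10)
```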

query = ...
yeabm.encode_query(query)

Why would you want to do that? The BM25 score is essentially a sum over matching terms, which makes it a perfect fit for a metric every vector DB supports: the inner product (IP).
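As a toy illustration (pure Python with hypothetical weights, not the yeabm25 API): the inner product of a sparse document vector and a sparse query vector reduces to summing the weights of the shared term indices, which is exactly the BM25 sum over matching terms:

```python
def sparse_ip(doc_vec: dict, query_vec: dict) -> float:
    # Inner product of two Dict[int, float] sparse vectors:
    # iterate over the smaller dict and look up matching indices in the other.
    small, big = sorted((doc_vec, query_vec), key=len)
    return sum(val * big[idx] for idx, val in small.items() if idx in big)

doc = {0: 1.30, 3: 0.72}   # hypothetical per-term BM25 weights for one document
query = {0: 1.0, 5: 1.0}   # query indicator over the same vocabulary
sparse_ip(doc, query)      # -> 1.30: only the shared term index 0 contributes
```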

yeabm.get_scores(['quick', 'fox'])
# ~ [1.30, 0.0, 0.72, 0.0, 0.0, 0.0]

# you get the same scores like this:
yeabm.get_embeddings() @ np.asarray(yeabm.encode_query_dense(['fox', 'quick']))
# ~ [1.30, 0.0, 0.72, 0.0, 0.0, 0.0]

Of course, in practice you would leave this last calculation to the vector DB.

One more opinionated implementation detail: words that appear in more than half of the documents do not get an IDF of 0 - their IDF is small, but still positive. Compare with other implementations:

from rank_bm25 import BM25Okapi
okapi = BM25Okapi(normalized_corpus)
okapi.get_scores(['brown']) 
# [0. 0. 0. 0. 0. 0.]
# whereas
yeabm.get_scores(['brown'])
# [0.18 0.28 0.33 0. 0. 0.]
# This is helpful when the user searches for a term that is abundant in the corpus:
# you still get somewhat useful results, whereas with BM25Okapi you would get
# essentially random results (or no results at all).
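The idea can be sketched as follows - a simplified version of the negative-IDF handling used in rank_bm25, where non-positive IDFs are replaced by a small positive floor (the exact formula and the average_idf value here are illustrative):

```python
import math

def idf_with_floor(N: int, df: int, epsilon: float, average_idf: float) -> float:
    # Okapi BM25 IDF: terms appearing in more than half the documents
    # get a non-positive raw IDF, which is replaced by a small positive floor.
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return idf if idf > 0 else epsilon * average_idf

# 'brown' appears in 3 of 6 documents: raw IDF is log(3.5 / 3.5) = 0,
# so the floor epsilon * average_idf kicks in and keeps the term scorable.
idf_with_floor(N=6, df=3, epsilon=0.25, average_idf=1.1)  # -> 0.275
```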
