
Yet Another BM25 algorithm implementation with additional per document vectors

Project description

yeabm25

Yet Another BM25 algorithm implementation with two helpful additions:

  1. The index can be updated with the .update() method. In fact, you can build the index using .update() alone.
  2. Per-document vectors.
import re

import nltk
from nltk.corpus import stopwords
from yeabm25 import YeaBM25

nltk.download('stopwords', quiet=True)
stopwords_en = set(stopwords.words('english'))

def normalize_for_bm(text: str):
    # Replace every non-alphanumeric character with a space
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    words = text.lower().split()
    # Drop English stopwords
    return [word for word in words if word not in stopwords_en]

corpus = ["The quick brown fox jumps over the lazy dog",
          "The lazy dog is brown",
          "The fox is brown",
          "Hello there good man!",
          "It is quite windy in London",
          "How is the weather today man?",
          ]
normalized_corpus = [normalize_for_bm(txt) for txt in corpus]

# fitting the whole corpus
bm_index = YeaBM25(epsilon=1)
bm_index.fit(normalized_corpus)

# fit and then update 
bm_update = YeaBM25(epsilon=1)
bm_update.fit(normalized_corpus[:3])
bm_update.update(normalized_corpus[3:])

print(bm_index.doc_len == bm_update.doc_len)
print(bm_index.average_idf == bm_update.average_idf)
print(bm_index.idf == bm_update.idf)
print((bm_index.get_scores(['fox', 'jump']) == bm_update.get_scores(['fox', 'jump'])).all())
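
The comparison above works because BM25 statistics can be derived from raw counts. The block below is a minimal standalone sketch of that idea, not yeabm25's actual implementation: the class `TinyBM25Stats` and its methods are hypothetical, and the idf formula is the common "+1" variant, assumed here only to keep values non-negative.

```python
import math
from collections import Counter


class TinyBM25Stats:
    """Hypothetical helper (not part of yeabm25): stores only raw counts."""

    def __init__(self):
        self.doc_len = []          # token count of each document
        self.doc_freq = Counter()  # number of documents containing each term

    def update(self, tokenized_docs):
        # Adding documents only increments raw counts, so the order and
        # batching of updates does not change the final statistics.
        for tokens in tokenized_docs:
            self.doc_len.append(len(tokens))
            self.doc_freq.update(set(tokens))

    def idf(self):
        # idf is recomputed from the counts on demand (assumed "+1" variant).
        n = len(self.doc_len)
        return {
            term: math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            for term, df in self.doc_freq.items()
        }


docs = [
    ["quick", "brown", "fox"],
    ["lazy", "dog", "brown"],
    ["fox", "brown"],
    ["hello", "good", "man"],
]

full = TinyBM25Stats()
full.update(docs)            # everything in one batch

inc = TinyBM25Stats()
inc.update(docs[:2])         # first batch
inc.update(docs[2:])         # later batch

same_stats = full.doc_len == inc.doc_len and full.idf() == inc.idf()
print(same_stats)
```

Keeping counts rather than precomputed idf values is one possible design; it trades a recomputation of idf after each update for exact agreement with a full fit.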

This work is inspired by (and uses some code and ideas from) the great rank_bm25 package: https://github.com/dorianbrown/rank_bm25/tree/master. The main focus is creating document and query vectors (sparse vectors coming soon), which can then be used with your favourite vector DB.
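
To illustrate what document and query vectors can mean for BM25, here is a rough standalone sketch. It does not use the yeabm25 API (whose vector methods are not shown above); the functions `build_stats`, `doc_vector`, `query_vector` and `dot`, the constants `K1` and `B`, and the "+1" idf variant are all assumptions made for illustration. Each document becomes a sparse vector of per-term BM25 weights, the query becomes a binary vector, and their dot product gives the BM25 score.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # commonly used BM25 parameters, assumed here


def build_stats(docs):
    """Return (idf per term, average document length) for a tokenized corpus."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    n = len(docs)
    idf = {t: math.log((n - df + 0.5) / (df + 0.5) + 1.0) for t, df in doc_freq.items()}
    avgdl = sum(len(d) for d in docs) / n
    return idf, avgdl


def doc_vector(tokens, idf, avgdl):
    """Sparse document vector {term: BM25 weight of that term in this document}."""
    tf = Counter(tokens)
    norm = K1 * (1 - B + B * len(tokens) / avgdl)  # length normalization
    return {t: idf[t] * f * (K1 + 1) / (f + norm) for t, f in tf.items()}


def query_vector(tokens, idf):
    """Binary sparse query vector; terms unknown to the index are dropped."""
    return {t: 1.0 for t in set(tokens) if t in idf}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


docs = [["quick", "brown", "fox"], ["lazy", "dog", "brown"], ["fox", "brown"]]
idf, avgdl = build_stats(docs)
vectors = [doc_vector(d, idf, avgdl) for d in docs]

q = query_vector(["fox"], idf)
scores = [dot(q, v) for v in vectors]
print(scores)
```

Because the score is a plain dot product, vectors of this shape could in principle be stored in any vector DB that supports sparse inner-product search.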

Download files

Source Distribution

yeabm25-0.1.0.tar.gz (15.6 kB)

Uploaded Source

Built Distribution

yeabm25-0.1.0-py3-none-any.whl (13.6 kB)

Uploaded Python 3

File details

Details for the file yeabm25-0.1.0.tar.gz.

File metadata

  • Download URL: yeabm25-0.1.0.tar.gz
  • Size: 15.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.8

File hashes

Hashes for yeabm25-0.1.0.tar.gz
Algorithm     Hash digest
SHA256        1e828a3beb1a1abf40f016feaf64fef8777759c353a97763180348a560996a5d
MD5           58ab2357a0ccd171267452e5e4e28569
BLAKE2b-256   f5c6e8227d6b9c664fcc728731c06b63f858f239c5fddfc4badfd7409979c132

File details

Details for the file yeabm25-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: yeabm25-0.1.0-py3-none-any.whl
  • Size: 13.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.8

File hashes

Hashes for yeabm25-0.1.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        e2915685b9ffb558237f0df204c119eaea9e7a7b3e3453b22ad3ff9da6176441
MD5           4ba7e6c083324af16d76df0a91f94a7a
BLAKE2b-256   8358c33677358dfbcf5677e71c729ad11633e34f94021114cfc4c4cef90bf208
