Yet Another BM25 algorithm implementation with additional per document vectors
yeabm25
Yet another BM25 implementation, with two helpful additions:
- the ability to update an existing index with the .update() method. In fact, you can use just .update().
- per-document vectors.
import re
import nltk
from nltk.corpus import stopwords
from yeabm25 import YeaBM25

nltk.download('stopwords', quiet=True)
stopwords_en = set(stopwords.words('english'))

def normalize_for_bm(text: str):
    # replace everything except ASCII letters and digits with a space
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    words = text.lower().split()
    # drop English stopwords
    return [word for word in words if word not in stopwords_en]

corpus = ["The quick brown fox jumps over the lazy dog",
"The lazy dog is brown",
"The fox is brown",
"Hello there good man!",
"It is quite windy in London",
"How is the weather today man?",
]
normalized_corpus = [normalize_for_bm(txt) for txt in corpus]
# fitting the whole corpus
bm_index = YeaBM25(epsilon=1)
bm_index.fit(normalized_corpus)
# fit and then update
bm_update = YeaBM25(epsilon=1)
bm_update.fit(normalized_corpus[:3])
bm_update.update(normalized_corpus[3:])
# compare the incrementally updated index with the one fitted on the whole corpus
print(bm_index.doc_len == bm_update.doc_len)
print(bm_index.average_idf == bm_update.average_idf)
print(bm_index.idf == bm_update.idf)
print((bm_index.get_scores(['fox', 'jump']) == bm_update.get_scores(['fox', 'jump'])).all())
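
The comparison above works only if every corpus statistic can be maintained incrementally. The following is a minimal sketch of that idea, not yeabm25's actual code: the class `TinyBM25Stats` and the smoothed IDF formula are assumptions made for illustration. It keeps document lengths and document frequencies as running counts and recomputes IDF after each batch, so fitting in two steps ends up with the same statistics as fitting once.

```python
import math
from collections import Counter


class TinyBM25Stats:
    """Hypothetical helper (not part of yeabm25): incremental corpus statistics."""

    def __init__(self):
        self.doc_len = []          # token count of every indexed document
        self.doc_freq = Counter()  # number of documents containing each term
        self.idf = {}

    def update(self, tokenized_docs):
        for tokens in tokenized_docs:
            self.doc_len.append(len(tokens))
            self.doc_freq.update(set(tokens))  # count a term once per document
        self._recompute_idf()

    def _recompute_idf(self):
        n_docs = len(self.doc_len)
        # Smoothed IDF variant, assumed here for illustration only
        self.idf = {
            term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in self.doc_freq.items()
        }

    @property
    def avgdl(self):
        return sum(self.doc_len) / len(self.doc_len)


docs = [
    ["quick", "brown", "fox"],
    ["lazy", "dog", "brown"],
    ["fox", "brown"],
    ["hello", "good", "man"],
]

full = TinyBM25Stats()
full.update(docs)

incremental = TinyBM25Stats()
incremental.update(docs[:2])
incremental.update(docs[2:])

print(full.doc_len == incremental.doc_len)
print(full.idf == incremental.idf)
print(full.avgdl == incremental.avgdl)
```

Because IDF depends on the total number of documents, it has to be recomputed for every known term after each batch; only the raw counts can be carried over unchanged.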
This work is inspired by (and uses some code and ideas from) the great rank_bm25 package: https://github.com/dorianbrown/rank_bm25/tree/master. The main focus is on creating document and query vectors (sparse vectors are coming soon), which you can then use with your favourite vector DB.
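
To illustrate how document and query vectors can carry BM25 scores, here is a rough, self-contained sketch. It does not use yeabm25's API: the function names, the `K1` and `B` values, and the smoothed IDF formula are all assumptions. Each document becomes a dense vector with one BM25 weight per vocabulary term, the query becomes a binary indicator vector, and their dot product is the BM25 score, which is the operation a vector DB can compute.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # common BM25 defaults, assumed for this sketch

docs = [
    ["quick", "brown", "fox", "jumps", "lazy", "dog"],
    ["lazy", "dog", "brown"],
    ["fox", "brown"],
]

vocab = sorted({term for doc in docs for term in doc})
term_index = {term: i for i, term in enumerate(vocab)}
n_docs = len(docs)
avgdl = sum(len(doc) for doc in docs) / n_docs
doc_freq = Counter(term for doc in docs for term in set(doc))
# Smoothed IDF variant, assumed here for illustration only
idf = {t: math.log(1 + (n_docs - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()}


def document_vector(tokens):
    """One weight per vocabulary term: IDF times the saturated term frequency."""
    vec = [0.0] * len(vocab)
    norm = K1 * (1 - B + B * len(tokens) / avgdl)  # document length normalization
    for term, tf in Counter(tokens).items():
        vec[term_index[term]] = idf[term] * tf * (K1 + 1) / (tf + norm)
    return vec


def query_vector(tokens):
    """Binary indicator vector; terms outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for term in set(tokens):
        if term in term_index:
            vec[term_index[term]] = 1.0
    return vec


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


doc_vectors = [document_vector(doc) for doc in docs]
query = query_vector(["fox", "jump"])
scores = [dot(query, vec) for vec in doc_vectors]
print(scores)
```

Most entries of these vectors are zero, which is why a sparse representation is the natural next step for larger vocabularies.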