BM25 NLP model
BagModels
BagModels is a repository of various bag of words (BoW) algorithms in machine learning. Currently it includes OkapiBM25. More coming soon.
BM25 is a text retrieval function that can find similar documents or rank the documents in a collection against a search query. It scores each document by the query terms it contains, irrespective of how close those terms are to each other. It builds on the TF-IDF weighting scheme by adding term-frequency saturation and document-length normalisation.
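To make the scoring idea concrete, here is a minimal, self-contained sketch of the standard Okapi BM25 formula in plain Python. It is not the BagModels implementation: the function name `bm25_scores`, the whitespace tokenisation, the smoothed IDF variant, and the defaults `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` (illustrative only)."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF (always positive), one common BM25 variant.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = ["Yo I love NLP model", "I like algorithms", "I love ML"]
scores = bm25_scores("love NLP", corpus)
```

With this toy corpus, the first document matches both query terms and scores highest, the third matches only "love", and the second matches nothing and scores zero. The position of the terms inside each document plays no role, which is the bag-of-words property described above.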
Installation
It can be installed using pip:
pip install bagmodels
Getting started
Basic usage
import re
from bagmodels import BM25
# Load corpus
# A list keeps the documents in a fixed order (a set would not)
corpus = [
"Yo, I love NLP model",
"I like algorithms",
"I love ML!"
]
# Clean manually if needed or pass custom tokenizer to BM25
corpus = [re.sub(r",|!", " ", doc).strip() for doc in corpus]
# Initialize model
model = BM25(corpus=corpus)
# Similarity
model.similarity("I love NLP model", "I like NLP model") # 0.775
model.similarity("I love blah", "I love algorithms") # 0.446
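The comment in the example above mentions passing a custom tokenizer to BM25. This README does not show the parameter that accepts it, so the sketch below only illustrates what such a callable might look like. `simple_tokenizer` is a hypothetical name, and how it is handed to `BM25` should be checked against the package source.

```python
import re


def simple_tokenizer(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    # Punctuation such as "," and "!" is dropped because it never matches.
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = simple_tokenizer("Yo, I love NLP model")
```

A tokenizer like this would make the manual `re.sub` cleaning step unnecessary, since punctuation is removed while splitting.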
Save and reuse models
# libraries imported and corpus already loaded beforehand
model = BM25(corpus=corpus)
# write to save path
model.save("output/bm25_v1.jbl")
# load again
model = BM25.load("output/bm25_v1.jbl")
# add documents if required
model.resume(corpus=additional_corpus)
# predict / search / find / retrieve like
model.similarity(doc_a, doc_b)
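The save path above points into an `output/` directory. Whether `model.save` creates missing directories is not stated here, so a cautious sketch is to create the directory first. The `model.save` call is left as a comment because it needs a trained model.

```python
from pathlib import Path

save_path = Path("output/bm25_v1.jbl")

# Create the parent directory if it does not exist yet (no error if it does).
save_path.parent.mkdir(parents=True, exist_ok=True)

# Assumption: save() accepts a plain string path, as in the example above.
# model.save(str(save_path))
```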
Coming soon
Please feel free to open an issue to request a feature or discuss any changes. Pull requests are most welcome.
I am trying to actively add the following:
- OkapiBM25
- BM25 variations
- MultiThreading