Skip to main content

BM25 NLP model

Project description

BagModels

PyPI version Python tests

BagModels is a repository of various bag of words (BoW) algorithms in machine learning. Currently it includes OkapiBM25. More coming soon.

BM25 is a text retrieval function that can find similar documents or rank search in a set of documents based on the query terms appearing in each document irrespective of their proximity to each other. It is an improved and more generalised version of TF-IDF algorithm in NLP.

Installation

It can be installed using pip:

pip install bagmodels

Getting started

Basic usage

import re
from bagmodels import BM25

# Load corpus
corpus = list({
    "Yo, I love NLP model",
    "I like algorithms",
    "I love ML!"
})

# Clean manually if needed or pass custom tokenizer to BM25
corpus = [re.sub(r",|!", " ", doc).strip() for doc in corpus]

# Initialize model
model = BM25(corpus=corpus)

# Similarity
model.similarity("I love NLP model", "I like NLP model") # 0.775
model.similarity("I love blah", "I love algorithms") # 0.446

Save and reuse models

# libaries imported and corpus already loaded before it
model = BM25(corpus=corpus)

# write to save path
model.save("output/bm25_v1.jbl")

# load again
model = BM25.load("output/bm25_v1.jbl")

# add documents if required
model.resume(corpus=additonal_corpus)

# predict / search / find / retrieve like
model.similarity(doc_a, doc_b)

Coming soon

Please feel free to open an issue to request a feature or discuss any changes. Pull requests are most welcome.

I am trying to actively add the following:

  • OkapiBM25
  • BM25 variations
  • MultiThreading

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bagmodels-0.1.5.tar.gz (5.8 kB view hashes)

Uploaded Source

Built Distribution

bagmodels-0.1.5-py3-none-any.whl (6.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page