Project description

minsearch

Minimalistic text search engine that uses sklearn and pandas.

This is a simple search library implemented using sklearn and pandas.

It allows you to index documents with text and keyword fields and perform search queries with support for filtering and boosting.
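
Conceptually, boosted search with a keyword filter can be sketched with the same sklearn and pandas building blocks the library is based on. This is an illustration of the technique, not minsearch's actual code; the toy corpus and the boost value are made up for the example:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: one text field ("text") and one keyword field ("course").
docs = pd.DataFrame([
    {"text": "you can join the course at any time", "course": "de-zoomcamp"},
    {"text": "basic programming knowledge is required", "course": "de-zoomcamp"},
    {"text": "join our slack channel", "course": "ml-zoomcamp"},
])

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs["text"])        # one TF-IDF row per document

query_vec = vec.transform(["can I join late"])
scores = cosine_similarity(query_vec, matrix).flatten()

scores = scores * 2.0                           # boosting: weight this field's score
scores[(docs["course"] != "de-zoomcamp").values] = 0.0  # keyword filtering

best = docs.iloc[scores.argmax()]
print(best["text"])
```

The filter zeroes out documents from other courses before ranking, so the slack-channel document never surfaces even though it matches "join".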

Installation

pip install minsearch

Environment setup

To run it locally, make sure you have the required dependencies installed:

pip install pandas scikit-learn

Alternatively, use pipenv:

pipenv install --dev

Usage

Here's how you can use the library:

Define Your Documents

Prepare your documents as a list of dictionaries. Each dictionary should have the text and keyword fields you want to index.

docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available.",
        "section": "General Information",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "What are the prerequisites for the course?",
        "text": "You need to have basic knowledge of programming.",
        "section": "Course Requirements",
        "course": "data-engineering-zoomcamp"
    }
]

Create the Index

Create an instance of the Index class, specifying the text and keyword fields.

from minsearch import Index

index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

Fit the index with your documents:

index.fit(docs)

Perform a Search

Search the index with a query string, optional filter dictionary, and optional boost dictionary.

query = "Can I join the course if it has already started?"

filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3, "text": 1, "section": 1}

results = index.search(query, filter_dict, boost_dict)

for result in results:
    print(result)
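
As a rough illustration of what the boost weights do (an assumption about the scoring scheme, not a quote of minsearch's internals): each text field gets its own similarity score against the query, and the boosts weight those per-field scores in the combined total:

```python
# Hypothetical per-field similarity scores for one document vs. the query.
field_scores = {"question": 0.8, "text": 0.3, "section": 0.1}

# Same boosts as above: a match in "question" counts three times as much.
boost_dict = {"question": 3, "text": 1, "section": 1}

total = sum(boost_dict[f] * s for f, s in field_scores.items())
print(round(total, 2))  # 3*0.8 + 1*0.3 + 1*0.1
```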

Notebook

Run it in a notebook to test it yourself:

pipenv run jupyter notebook

File structure

There are a minsearch folder and a minsearch.py file in the root.

The file minsearch.py is kept in the root because it was used in the LLM Zoomcamp course, where we'd use wget to download it. We keep the file to avoid breaking those existing links.

Publishing

Use twine for publishing and build for building:

pipenv install --dev twine build

Generate a wheel:

pipenv run python -m build

Check the packages:

twine check dist/*

Upload the library to TestPyPI to verify that everything works:

twine upload --repository-url https://test.pypi.org/legacy/ dist/*

Upload to PyPI:

twine upload dist/*

Done!

Project details


Download files

Download the file for your platform.

Source Distribution

minsearch-0.0.1.tar.gz (4.3 kB)


Built Distribution

minsearch-0.0.1-py3-none-any.whl (4.1 kB)


File details

Details for the file minsearch-0.0.1.tar.gz.

File metadata

  • Download URL: minsearch-0.0.1.tar.gz
  • Upload date:
  • Size: 4.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for minsearch-0.0.1.tar.gz:

  • SHA256: 4b9bcfce81808e240ae924638b1f3f7c3b4200e959ea26caf55f24c3ebed8b04
  • MD5: 9a8b90700e548e95fc8a9859caaa4cf6
  • BLAKE2b-256: 34928c4baeb509cd1e5031f7217ff6c038ffd5f66cd1562f01e32a324d813a54
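
To check a download against the SHA256 digest above, Python's standard hashlib is enough. The file path is an assumption about where you saved the archive:

```python
import hashlib

EXPECTED_SHA256 = "4b9bcfce81808e240ae924638b1f3f7c3b4200e959ea26caf55f24c3ebed8b04"

def sha256_of(path):
    """Stream the file in chunks so large archives aren't loaded into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# sha256_of("minsearch-0.0.1.tar.gz") == EXPECTED_SHA256
```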


File details

Details for the file minsearch-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: minsearch-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 4.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for minsearch-0.0.1-py3-none-any.whl:

  • SHA256: f504a280606f524cb55e3909e95618669245ae9413d35d16faa5d2c9ef32a11c
  • MD5: cef1909e5d0f857613448562d543a476
  • BLAKE2b-256: 5d1228ba0dc4197f70a80f75a8eb550cb67ec87e186b15a3337a087ff8a69c56

