minsearch

A minimalistic text search engine that uses TF-IDF and cosine similarity for text fields and exact matching for keyword fields. The library provides two implementations:

  1. Index: A basic search index using scikit-learn's TF-IDF vectorizer
  2. AppendableIndex: A search index built on an inverted index, which supports incremental document addition
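The inverted-index approach is what makes incremental addition cheap: indexing a new document only touches that document's own tokens. The sketch below illustrates the idea in plain Python; it is illustrative only and is not minsearch's actual implementation (minsearch also ranks results with TF-IDF, which this sketch omits).

```python
# Minimal sketch of the inverted-index idea behind incremental addition.
# Illustrative only -- not minsearch's actual implementation.
from collections import defaultdict


class TinyInvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of doc ids
        self.docs = []

    def append(self, doc):
        """Add one document; only its own tokens need indexing."""
        doc_id = len(self.docs)
        self.docs.append(doc)
        for token in doc.lower().split():
            self.postings[token].add(doc_id)
        return doc_id

    def search(self, query):
        """Return documents containing every query token."""
        token_sets = [self.postings[t] for t in query.lower().split()]
        if not token_sets:
            return []
        ids = set.intersection(*token_sets)
        return [self.docs[i] for i in sorted(ids)]


idx = TinyInvertedIndex()
idx.append("python programming basics")
idx.append("data science with python")
print(idx.search("python"))  # both documents match
```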

Features

  • Text field indexing with TF-IDF and cosine similarity
  • Keyword field filtering with exact matching
  • Field boosting for fine-tuning search relevance
  • Stop word removal and custom tokenization
  • Support for incremental document addition (AppendableIndex)
  • Customizable tokenizer patterns and stop words
  • Efficient search with filtering and boosting
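To make the ranking model concrete: text fields are scored by the cosine similarity of TF-IDF vectors, so documents sharing rare query terms score higher. The stdlib-only sketch below shows the idea; minsearch itself delegates this to scikit-learn's `TfidfVectorizer`, and the exact weighting there differs in detail.

```python
# Sketch of TF-IDF + cosine similarity ranking (the idea behind text-field
# scoring). Illustrative only; minsearch uses scikit-learn for this.
import math
from collections import Counter


def build_idf(docs):
    """Inverse document frequency for each token across tokenized docs."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def vectorize(tokens, idf):
    """Term frequency weighted by idf; unknown tokens get zero weight."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


docs = [
    "you can join the course at any time".split(),
    "basic knowledge of programming is required".split(),
]
idf = build_idf(docs)
doc_vecs = [vectorize(d, idf) for d in docs]
query_vec = vectorize("can i join the course".split(), idf)
scores = [cosine(query_vec, dv) for dv in doc_vecs]
# The first document shares several terms with the query, so it scores higher.
```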

Installation

pip install minsearch

Environment setup

To run minsearch locally, make sure the required dependencies are installed:

pip install pandas scikit-learn

Alternatively, use pipenv:

pipenv install --dev

Usage

Basic Search with Index

from minsearch import Index

# Create documents
docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available.",
        "section": "General Information",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "What are the prerequisites for the course?",
        "text": "You need to have basic knowledge of programming.",
        "section": "Course Requirements",
        "course": "data-engineering-zoomcamp"
    }
]

# Create and fit the index
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

# Search with filters and boosts
query = "Can I join the course if it has already started?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3, "text": 1, "section": 1}

results = index.search(query, filter_dict=filter_dict, boost_dict=boost_dict)
# `results` is a list of the matching documents, ranked by relevance

Incremental Search with AppendableIndex

from minsearch import AppendableIndex

# Create the index
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"]
)

# Add documents one by one
doc1 = {"title": "Python Programming", "description": "Learn Python programming", "course": "CS101"}
index.append(doc1)

doc2 = {"title": "Data Science", "description": "Python for data science", "course": "CS102"}
index.append(doc2)

# Alternatively, create an index with custom stop words
# (note: this is a new, empty index -- documents must be appended again)
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    stop_words={"the", "a", "an"}  # Custom stop words
)

Advanced Features

Custom Tokenizer Pattern

from minsearch import AppendableIndex

# Create index with custom tokenizer pattern
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    tokenizer_pattern=r'[\s\W\d]+'  # Custom pattern to split on whitespace, non-word chars, and digits
)
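To see what that pattern actually does, you can try it directly with the standard `re` module (the sample string below is just an illustration):

```python
# Demonstrate the tokenizer pattern with the standard re module:
# it splits on runs of whitespace, non-word characters, and digits.
import re

pattern = r'[\s\W\d]+'
tokens = [t for t in re.split(pattern, "Python 3.12, rocks!") if t]
print(tokens)  # ['Python', 'rocks']
```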

Field Boosting

# Boost certain fields to increase their importance in search
boost_dict = {
    "title": 2.0,      # Title matches are twice as important
    "description": 1.0  # Normal importance for description
}
results = index.search("python", boost_dict=boost_dict)
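Conceptually, boosts act as per-field weights: each text field contributes its similarity score multiplied by its boost, and the contributions are summed into the final score. The numbers below are made up purely to illustrate the arithmetic; this is a sketch of the idea, not minsearch's exact code.

```python
# Conceptual sketch: boosts weight each field's similarity before summing.
# The similarity values here are invented for illustration.
field_similarities = {"title": 0.8, "description": 0.3}
boost_dict = {"title": 2.0, "description": 1.0}

score = sum(boost_dict.get(f, 1.0) * sim for f, sim in field_similarities.items())
print(score)  # 2.0 * 0.8 + 1.0 * 0.3 = 1.9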

Keyword Filtering

# Filter results by exact matches on keyword fields
# (each field used here must be declared in keyword_fields when creating the index)
filter_dict = {
    "course": "CS101",
    "level": "beginner"
}
results = index.search("python", filter_dict=filter_dict)

Examples

Interactive Notebook

The repository includes an interactive Jupyter notebook (minsearch_example.ipynb) that demonstrates the library's features using real-world data. The notebook shows:

  • Loading and preparing documents from a JSON source
  • Creating and configuring the search index
  • Performing searches with filters and boosts
  • Working with real course-related Q&A data

To run the notebook:

pipenv run jupyter notebook

Then open minsearch_example.ipynb in your browser.

Development

Running Tests

pipenv run pytest

Building and Publishing

  1. Install development dependencies:
pipenv install --dev twine build
  2. Build the package:
pipenv run python -m build
  3. Check the packages:
pipenv run twine check dist/*
  4. Upload to test PyPI:
pipenv run twine upload --repository-url https://test.pypi.org/legacy/ dist/*
  5. Upload to PyPI:
pipenv run twine upload dist/*
  6. Clean up:
rm -r build/ dist/ minsearch.egg-info/

Project Structure

  • minsearch/: Main package directory
    • minsearch.py: Core Index implementation using scikit-learn
    • append.py: AppendableIndex implementation with inverted index
  • tests/: Test suite
  • minsearch_example.ipynb: Example notebook
  • setup.py: Package configuration
  • Pipfile: Development dependencies

Note: The minsearch.py file in the root directory is maintained for backward compatibility with the LLM Zoomcamp course.
