minsearch

A minimalistic text search engine that uses TF-IDF and cosine similarity for text fields and exact matching for keyword fields. The library provides two implementations:

  1. Index: A basic search index using scikit-learn's TF-IDF vectorizer
  2. AppendableIndex: A search index built on an inverted index, which supports incremental document addition
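The inverted-index approach is what makes incremental addition cheap: indexing a new document only touches that document's own tokens. The sketch below illustrates the idea in plain Python; it is illustrative only and is not minsearch's actual implementation (minsearch also ranks results with TF-IDF, which this sketch omits).

```python
# Minimal sketch of the inverted-index idea behind incremental addition.
# Illustrative only -- not minsearch's actual implementation.
from collections import defaultdict


class TinyInvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of doc ids
        self.docs = []

    def append(self, doc):
        """Add one document; only its own tokens need indexing."""
        doc_id = len(self.docs)
        self.docs.append(doc)
        for token in doc.lower().split():
            self.postings[token].add(doc_id)
        return doc_id

    def search(self, query):
        """Return documents containing every query token."""
        token_sets = [self.postings[t] for t in query.lower().split()]
        if not token_sets:
            return []
        ids = set.intersection(*token_sets)
        return [self.docs[i] for i in sorted(ids)]


idx = TinyInvertedIndex()
idx.append("python programming basics")
idx.append("data science with python")
print(idx.search("python"))  # both documents match
```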

Features

  • Text field indexing with TF-IDF and cosine similarity
  • Keyword field filtering with exact matching
  • Field boosting for fine-tuning search relevance
  • Stop word removal and custom tokenization
  • Support for incremental document addition (AppendableIndex)
  • Customizable tokenizer patterns and stop words
  • Efficient search with filtering and boosting
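To make the ranking model concrete: text fields are scored by the cosine similarity of TF-IDF vectors, so documents sharing rare query terms score higher. The stdlib-only sketch below shows the idea; minsearch itself delegates this to scikit-learn's `TfidfVectorizer`, and the exact weighting there differs in detail.

```python
# Sketch of TF-IDF + cosine similarity ranking (the idea behind text-field
# scoring). Illustrative only; minsearch uses scikit-learn for this.
import math
from collections import Counter


def build_idf(docs):
    """Inverse document frequency for each token across tokenized docs."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def vectorize(tokens, idf):
    """Term frequency weighted by idf; unknown tokens get zero weight."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


docs = [
    "you can join the course at any time".split(),
    "basic knowledge of programming is required".split(),
]
idf = build_idf(docs)
doc_vecs = [vectorize(d, idf) for d in docs]
query_vec = vectorize("can i join the course".split(), idf)
scores = [cosine(query_vec, dv) for dv in doc_vecs]
# The first document shares several terms with the query, so it scores higher.
```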

Installation

pip install minsearch

Environment setup

To run minsearch locally, make sure the required dependencies are installed:

pip install pandas scikit-learn

Alternatively, use pipenv:

pipenv install --dev

Usage

Basic Search with Index

from minsearch import Index

# Create documents
docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available.",
        "section": "General Information",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "What are the prerequisites for the course?",
        "text": "You need to have basic knowledge of programming.",
        "section": "Course Requirements",
        "course": "data-engineering-zoomcamp"
    }
]

# Create and fit the index
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

# Search with filters and boosts
query = "Can I join the course if it has already started?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3, "text": 1, "section": 1}

results = index.search(query, filter_dict=filter_dict, boost_dict=boost_dict)
# `results` is a list of the matching documents, ranked by relevance

Incremental Search with AppendableIndex

from minsearch import AppendableIndex

# Create the index
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"]
)

# Add documents one by one
doc1 = {"title": "Python Programming", "description": "Learn Python programming", "course": "CS101"}
index.append(doc1)

doc2 = {"title": "Data Science", "description": "Python for data science", "course": "CS102"}
index.append(doc2)

# Alternatively, create an index with custom stop words
# (note: this is a new, empty index -- documents must be appended again)
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    stop_words={"the", "a", "an"}  # Custom stop words
)

Advanced Features

Custom Tokenizer Pattern

from minsearch import AppendableIndex

# Create index with custom tokenizer pattern
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    tokenizer_pattern=r'[\s\W\d]+'  # Custom pattern to split on whitespace, non-word chars, and digits
)
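To see what that pattern actually does, you can try it directly with the standard `re` module (the sample string below is just an illustration):

```python
# Demonstrate the tokenizer pattern with the standard re module:
# it splits on runs of whitespace, non-word characters, and digits.
import re

pattern = r'[\s\W\d]+'
tokens = [t for t in re.split(pattern, "Python 3.12, rocks!") if t]
print(tokens)  # ['Python', 'rocks']
```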

Field Boosting

# Boost certain fields to increase their importance in search
boost_dict = {
    "title": 2.0,      # Title matches are twice as important
    "description": 1.0  # Normal importance for description
}
results = index.search("python", boost_dict=boost_dict)
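Conceptually, boosts act as per-field weights: each text field contributes its similarity score multiplied by its boost, and the contributions are summed into the final score. The numbers below are made up purely to illustrate the arithmetic; this is a sketch of the idea, not minsearch's exact code.

```python
# Conceptual sketch: boosts weight each field's similarity before summing.
# The similarity values here are invented for illustration.
field_similarities = {"title": 0.8, "description": 0.3}
boost_dict = {"title": 2.0, "description": 1.0}

score = sum(boost_dict.get(f, 1.0) * sim for f, sim in field_similarities.items())
print(score)  # 2.0 * 0.8 + 1.0 * 0.3 = 1.9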

Keyword Filtering

# Filter results by exact matches on keyword fields
# (each field used here must be declared in keyword_fields when creating the index)
filter_dict = {
    "course": "CS101",
    "level": "beginner"
}
results = index.search("python", filter_dict=filter_dict)

Examples

Interactive Notebook

The repository includes an interactive Jupyter notebook (minsearch_example.ipynb) that demonstrates the library's features using real-world data. The notebook shows:

  • Loading and preparing documents from a JSON source
  • Creating and configuring the search index
  • Performing searches with filters and boosts
  • Working with real course-related Q&A data

To run the notebook:

pipenv run jupyter notebook

Then open minsearch_example.ipynb in your browser.

Development

Running Tests

pipenv run pytest

Building and Publishing

  1. Install development dependencies:
pipenv install --dev twine build
  2. Build the package:
pipenv run python -m build
  3. Check the packages:
pipenv run twine check dist/*
  4. Upload to test PyPI:
pipenv run twine upload --repository-url https://test.pypi.org/legacy/ dist/*
  5. Upload to PyPI:
pipenv run twine upload dist/*
  6. Clean up:
rm -r build/ dist/ minsearch.egg-info/

Project Structure

  • minsearch/: Main package directory
    • minsearch.py: Core Index implementation using scikit-learn
    • append.py: AppendableIndex implementation with inverted index
  • tests/: Test suite
  • minsearch_example.ipynb: Example notebook
  • setup.py: Package configuration
  • Pipfile: Development dependencies

Note: The minsearch.py file in the root directory is maintained for backward compatibility with the LLM Zoomcamp course.
