Minimalistic text search engine that uses sklearn and pandas

These details have not been verified by PyPI

Project links

Project description

minsearch

A minimalistic search engine that provides both text-based and vector-based search capabilities. The library provides three implementations:

Index: A basic search index using scikit-learn's TF-IDF vectorizer for text fields
AppendableIndex: An appendable search index using an inverted index implementation that allows for incremental document addition
VectorSearch: A vector search index using cosine similarity for pre-computed vectors

Features

Text field indexing with TF-IDF and cosine similarity
Vector search with cosine similarity for pre-computed embeddings
Field boosting for fine-tuning search relevance (text-based search)
Extensive filtering capabilities (exact match and ranges)
Support for incremental document addition (AppendableIndex and VectorSearch)
Customizable tokenizer patterns and stop words
Result highlighting with configurable formatting

Installation

We recommend to use uv:

uv add minsearch

Or, install with pip:

pip install minsearch

Note: minsearch requires Python 3.10 or later.

Environment setup

For development purposes, use uv:

# Install uv if you haven't already
pip install uv
uv sync --extra dev

Usage

Basic Search with Index

from minsearch import Index

# Create documents
docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available.",
        "section": "General Information",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "What are the prerequisites for the course?",
        "text": "You need to have basic knowledge of programming.",
        "section": "Course Requirements",
        "course": "data-engineering-zoomcamp"
    }
]

# Create and fit the index
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

# Search with filters and boosts
query = "Can I join the course if it has already started?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3, "text": 1, "section": 1}

results = index.search(query, filter_dict=filter_dict, boost_dict=boost_dict)

Incremental Search with AppendableIndex

from minsearch import AppendableIndex

# Create the index
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"]
)

# Add documents one by one
doc1 = {"title": "Python Programming", "description": "Learn Python programming", "course": "CS101"}
index.append(doc1)

doc2 = {"title": "Data Science", "description": "Python for data science", "course": "CS102"}
index.append(doc2)

# Search
results = index.search("python programming")

Vector Search with VectorSearch

from minsearch import VectorSearch
import numpy as np

# Create sample vectors and payload documents
vectors = np.random.rand(100, 768)  # 100 documents, 768-dimensional vectors
payload = [
    {"id": 1, "title": "Python Tutorial", "category": "programming", "level": "beginner"},
    {"id": 2, "title": "Data Science Guide", "category": "data", "level": "intermediate"},
    {"id": 3, "title": "Machine Learning Basics", "category": "ai", "level": "advanced"},
    # ... more documents
]

# Create and fit the vector search index
index = VectorSearch(keyword_fields=["category", "level"])
index.fit(vectors, payload)

# Search with a query vector
query_vector = np.random.rand(768)  # 768-dimensional query vector
filter_dict = {"category": "programming", "level": "beginner"}

results = index.search(query_vector, filter_dict=filter_dict, num_results=5)

Incremental Vector Search

VectorSearch also supports appending vectors incrementally:

from minsearch import VectorSearch
import numpy as np

# Create the index
index = VectorSearch(keyword_fields=["category", "level"])

# Append a single vector
vector = np.random.rand(768)
doc = {"id": 1, "title": "Python Tutorial", "category": "programming", "level": "beginner"}
index.append(vector, doc)

# Append multiple vectors in batch
vectors = np.random.rand(10, 768)
payload = [
    {"id": i+2, "title": f"Document {i+2}", "category": "data", "level": "intermediate"}
    for i in range(10)
]
index.append_batch(vectors, payload)

# Search works the same way
query_vector = np.random.rand(768)
results = index.search(query_vector, num_results=5)

Custom Tokenizer

from minsearch import AppendableIndex
from minsearch.tokenizer import Tokenizer

tokenizer = Tokenizer(
    stop_words='english',  # Use default English stop words
    stemmer='porter'       # Apply Porter stemming
)

index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    tokenizer=tokenizer
)

Field Boosting (Text-based Search)

# Boost certain fields to increase their importance in search
boost_dict = {
    "title": 2.0,       # Title matches are twice as important
    "description": 1.0  # Normal importance for description
}
results = index.search("python", boost_dict=boost_dict)

Filtering

All filter types (keyword, numeric range, date/time range) work with all search index types: Index, AppendableIndex, and VectorSearch.

Keyword Filtering

Filter results by exact keyword matches:

filter_dict = {
    "course": "CS101",
    "level": "beginner"
}
results = index.search("python", filter_dict=filter_dict)

Numeric Range Filtering

Filter results by numeric values using comparison operators:

from minsearch import AppendableIndex

docs = [
    {"title": "Python Basics", "price": 29.99, "rating": 4.5},
    {"title": "Advanced Python", "price": 49.99, "rating": 4.8},
    {"title": "Python Masterclass", "price": 99.99, "rating": 4.9},
]

index = AppendableIndex(
    text_fields=["title"],
    keyword_fields=[],
    numeric_fields=["price", "rating"]
)
index.fit(docs)

# Price greater than or equal to 40
results = index.search("python", filter_dict={"price": [(">=", 40)]})

# Rating between 4.5 and 4.9
results = index.search("python", filter_dict={"rating": [(">=", 4.5), ("<=", 4.9)]})

# Multiple numeric filters
results = index.search(
    "python",
    filter_dict={
        "price": [("<", 100)],
        "rating": [(">=", 4.7)]
    }
)

Supported operators: == (equals), != (not equals), > (greater than), >= (greater than or equal), < (less than), <= (less than or equal).

Date/Time Range Filtering

Filter results by date and time values:

from datetime import datetime, date
from minsearch import AppendableIndex

docs = [
    {
        "title": "Python Course",
        "start_date": date(2024, 1, 15),
        "created_at": datetime(2023, 12, 1, 10, 30)
    },
    {
        "title": "Data Science Course",
        "start_date": date(2024, 2, 1),
        "created_at": datetime(2023, 12, 15, 14, 0)
    },
]

index = AppendableIndex(
    text_fields=["title"],
    keyword_fields=[],
    date_fields=["start_date", "created_at"]
)
index.fit(docs)

# Courses starting after a specific date
results = index.search(
    "python",
    filter_dict={"start_date": [(">", date(2024, 1, 1))]}
)

# Courses created in a date range
results = index.search(
    "course",
    filter_dict={
        "created_at": [
            (">=", datetime(2023, 12, 1)),
            ("<=", datetime(2023, 12, 31))
        ]
    }
)

Date fields accept date, datetime, or pandas.Timestamp objects.

Combined Filtering

You can combine keyword, numeric, and date filters in a single query:

results = index.search(
    "python course",
    filter_dict={
        "level": "advanced",                      # Keyword filter
        "price": [("<", 100)],                    # Numeric filter
        "start_date": [(">", date(2024, 1, 1))]   # Date filter
    }
)

Result Highlighting

The Highlighter class works with search results from any index type (Index, AppendableIndex, or VectorSearch). It extracts highlighted snippets from search results, showing where the query terms match in the text:

from minsearch import AppendableIndex, Highlighter, Tokenizer

# Create documents
docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available for all sessions.",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "Can I get a refund if I drop the course?",
        "text": "Refunds are available within the first 30 days of enrollment.",
        "course": "data-engineering-zoomcamp"
    }
]

# Create and fit the index
tokenizer = Tokenizer(
    stop_words='english',
    stemmer='porter'
)

index = AppendableIndex(
    text_fields=["question", "text"],
    keyword_fields=["course"],
    tokenizer=tokenizer
)
index.fit(docs)

# Search
results = index.search("join course", num_results=1)

# Create highlighter
highlighter = Highlighter(
    highlight_fields=["question", "text"],
    skip_fields=["course"],
    max_matches=3,
    snippet_size=150,
    highlight_format="**",  # Bold with markdown
    tokenizer=tokenizer
)

# Highlight results
highlighted = highlighter.highlight("join course", results)

Example output:

{
    "question": {
        "matches": ["How do I **join** the **course** after it has started?"],
        "total_matches": 1
    },
    "text": {
        "matches": ["You can **join** the **course** at any time. We have recordings available for..."],
        "total_matches": 1
    },
    "course": "data-engineering-zoomcamp"
}

Highlighter options:

highlight_fields: List of field names to extract highlights from
skip_fields: List of field names to exclude from output (pass-through only)
max_matches: Maximum number of matches to return per field (default: 5)
snippet_size: Maximum characters per match snippet (default: 200)
highlight_format: Format for highlights - can be a string delimiter, tuple (open, close), or callable
tokenizer: Tokenizer to use (must match the index's tokenizer for best results)

Custom highlight formats:

# Markdown bold (default)
highlighter = Highlighter(..., highlight_format="**")  # **text**

# HTML
highlighter = Highlighter(..., highlight_format=("<b>", "</b>"))  # <b>text</b>

# ANSI for terminal
highlighter = Highlighter(..., highlight_format="\033[1m")  # text (bold)

# Custom function
highlighter = Highlighter(..., highlight_format=lambda t: f"[{t}]")  # [text]

Stemming

Stemming reduces words to their root form, improving search recall by matching different word forms. For example, "running", "runs", and "ran" all stem to "run".

from minsearch import AppendableIndex
from minsearch.tokenizer import Tokenizer

# Use stemming with the default English stop words
tokenizer = Tokenizer(
    stop_words='english',
    stemmer='snowball'  # Options: 'porter', 'snowball', 'lancaster', or None
)

index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    tokenizer=tokenizer
)

# Now "joining" will match "join", "joined", "joins", etc.
results = index.search("joining the course")

Stemmer Comparison

Porter - Original Porter algorithm (1980)
- Well-established, fast, good for English
- Some edge cases, less aggressive than Snowball
- Good default for general use
Snowball (Porter2) - Improved Porter algorithm
- Handles more edge cases, more accurate stemming, has official specification
- Slightly slower than Porter
- Recommended - best overall accuracy
Lancaster - Very aggressive stemming
- Reduces words to shortest stems, good for recall
- Can over-stem, may produce non-words
- Use when maximizing recall is critical
None - No stemming
- Preserves original words, fastest
- No morphological matching
- Use for exact matching only

Recommendation: Use the snowball stemmer for best overall accuracy. It's based on the official Snowball specification and handles more edge cases than Porter while being less aggressive than Lancaster.

Examples

Interactive Notebook

The repository includes an interactive Jupyter notebook (minsearch_example.ipynb) that demonstrates the library's features using real-world data. The notebook shows:

Loading and preparing documents from a JSON source
Creating and configuring the search index
Performing searches with filters and boosts
Working with real course-related Q&A data

To run the notebook:

uv run jupyter notebook

Then open minsearch_example.ipynb in your browser.

Development

Running Tests

uv run pytest

Building and Publishing

Install development dependencies:

uv sync --extra dev

Build the package:

uv run hatch build

Publish to test PyPI:

uv run hatch publish --repo test

Publish to PyPI:

uv run hatch publish

Clean up:

rm -r dist/

Or run

python publish.py

Note: For Hatch publishing, you'll need to configure your PyPI credentials in ~/.pypirc or use environment variables.

PyPI Credentials Setup

Create a .pypirc file in your home directory with your PyPI credentials:

[distutils]
index-servers =
    pypi
    testpypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = __token__
password = pypi-your-main-api-token-here

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = pypi-your-test-api-token-here

Important Notes:

Use __token__ as the username for API tokens
Get your tokens from PyPI and Test PyPI
Set file permissions: chmod 600 ~/.pypirc

Alternative: Environment Variables

export HATCH_INDEX_USER=__token__
export HATCH_INDEX_AUTH=your-pypi-token

Project Structure

minsearch/: Main package directory
- minsearch.py: Core Index implementation using scikit-learn
- append.py: AppendableIndex implementation with inverted index
- vector.py: VectorSearch implementation using cosine similarity
tests/: Test suite
minsearch_example.ipynb: Example notebook
setup.py: Package configuration
Pipfile: Development dependencies

Note: The minsearch.py file in the root directory is maintained for backward compatibility with the LLM Zoomcamp course.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.0.10

Feb 16, 2026

This version

0.0.9

Feb 10, 2026

0.0.8

Feb 6, 2026

0.0.7

Oct 4, 2025

0.0.6

Oct 4, 2025

0.0.5

Sep 26, 2025

0.0.4

Jul 11, 2025

0.0.3

Jun 19, 2025

0.0.2

Sep 30, 2024

0.0.1

Sep 30, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

minsearch-0.0.9.tar.gz (133.2 kB view details)

Uploaded Feb 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

minsearch-0.0.9-py3-none-any.whl (32.6 kB view details)

Uploaded Feb 10, 2026 Python 3

File details

Details for the file minsearch-0.0.9.tar.gz.

File metadata

Download URL: minsearch-0.0.9.tar.gz
Upload date: Feb 10, 2026
Size: 133.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

File hashes

Hashes for minsearch-0.0.9.tar.gz
Algorithm	Hash digest
SHA256	`68e36536605b0beaa90cfa0bf6632276ae26627fac5a8d8c17f642875066b064`
MD5	`55d337dbdb01bf5c17b2e94e87949672`
BLAKE2b-256	`24172c1634dc78dae8bf5bddeaa78f1fc7b7f70e877835f181e6cf794b9dde8f`

See more details on using hashes here.

File details

Details for the file minsearch-0.0.9-py3-none-any.whl.

File metadata

Download URL: minsearch-0.0.9-py3-none-any.whl
Upload date: Feb 10, 2026
Size: 32.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

File hashes

Hashes for minsearch-0.0.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`80cfbb8016a61be87c9228c94551b35f967bf2970bd6c965746c71322ed2abcb`
MD5	`adca73d619b64058cdf1de7d97552493`
BLAKE2b-256	`37e891178a16a9a33eea7729b97c759025d5eea6a1a852f7d4228b7a87c2c0cd`

See more details on using hashes here.

minsearch 0.0.9

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

minsearch

Features

Installation

Environment setup

Usage

Basic Search with Index

Incremental Search with AppendableIndex

Vector Search with VectorSearch

Incremental Vector Search

Custom Tokenizer

Field Boosting (Text-based Search)

Filtering

Keyword Filtering

Numeric Range Filtering

Date/Time Range Filtering

Combined Filtering

Result Highlighting

Stemming

Stemmer Comparison

Examples

Interactive Notebook

Development

Running Tests

Building and Publishing

PyPI Credentials Setup

Project Structure

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes