minsearch
A minimalistic text search engine that uses TF-IDF and cosine similarity for text fields and exact matching for keyword fields. The library provides two implementations:
- Index: A basic search index using scikit-learn's TF-IDF vectorizer
- AppendableIndex: An appendable search index built on an inverted index, which allows documents to be added incrementally
Features
- Text field indexing with TF-IDF and cosine similarity
- Keyword field filtering with exact matching
- Field boosting for fine-tuning search relevance
- Stop word removal and custom tokenization
- Support for incremental document addition (AppendableIndex)
- Customizable tokenizer patterns and stop words
- Efficient search with filtering and boosting
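The core scoring idea behind these features can be sketched in a few lines of plain Python. This is an illustrative toy, not minsearch's actual implementation (the library uses scikit-learn and an inverted index); the stop-word list and sample texts below are made up for the example:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "for"}  # illustrative stop words

def tokenize(text):
    # Lowercase, split on non-word characters, drop stop words
    return [t for t in re.split(r"\W+", text.lower())
            if t and t not in STOP_WORDS]

def build_vectors(texts):
    # TF-IDF weight dicts plus the IDF table used to vectorize queries
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{term: tf * idf[term] for term, tf in doc.items()}
               for doc in docs]
    return vectors, idf

def cosine(a, b):
    # Cosine similarity between two sparse term-weight dicts
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def search(query, vectors, idf):
    # Rank document ids by cosine similarity to the query vector
    q = Counter(tokenize(query))
    q_vec = {term: tf * idf.get(term, 0.0) for term, tf in q.items()}
    return sorted(range(len(vectors)),
                  key=lambda i: cosine(q_vec, vectors[i]), reverse=True)

texts = [
    "Learn Python programming from scratch",
    "Data engineering with SQL and Spark",
]
vectors, idf = build_vectors(texts)
ranking = search("python course", vectors, idf)  # doc 0 ranks first
```

Terms absent from the corpus (like "course" here) get zero weight, so they simply don't affect the ranking.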
Installation
```shell
pip install minsearch
```
Environment setup
To run it locally, make sure you have the required dependencies installed:
```shell
pip install pandas scikit-learn
```
Alternatively, use pipenv:
```shell
pipenv install --dev
```
Usage
Basic Search with Index
```python
from minsearch import Index

# Create documents
docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available.",
        "section": "General Information",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "What are the prerequisites for the course?",
        "text": "You need to have basic knowledge of programming.",
        "section": "Course Requirements",
        "course": "data-engineering-zoomcamp"
    }
]

# Create and fit the index
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

# Search with filters and boosts
query = "Can I join the course if it has already started?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3, "text": 1, "section": 1}

results = index.search(query, filter_dict=filter_dict, boost_dict=boost_dict)
```
Incremental Search with AppendableIndex
```python
from minsearch import AppendableIndex

# Create the index
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"]
)

# Add documents one by one
doc1 = {"title": "Python Programming", "description": "Learn Python programming", "course": "CS101"}
index.append(doc1)

doc2 = {"title": "Data Science", "description": "Python for data science", "course": "CS102"}
index.append(doc2)

# Create an index with custom stop words
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    stop_words={"the", "a", "an"}  # Custom stop words
)
```
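AppendableIndex can accept new documents at any time because an inverted index is just a mapping from terms to the documents that contain them. The toy class below shows that idea in plain Python; it is a simplified sketch, not the library's actual code:

```python
from collections import defaultdict

class ToyInvertedIndex:
    """Minimal appendable inverted index: term -> set of document ids."""

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.postings = defaultdict(set)
        self.docs = []

    def append(self, doc):
        # Register the new document's terms without rebuilding anything
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field in self.text_fields:
            for term in doc.get(field, "").lower().split():
                self.postings[term].add(doc_id)

    def lookup(self, term):
        # Return every document containing the term
        return [self.docs[i]
                for i in sorted(self.postings.get(term.lower(), set()))]

idx = ToyInvertedIndex(["title", "description"])
idx.append({"title": "Python Programming", "description": "Learn Python programming"})
idx.append({"title": "Data Science", "description": "Python for data science"})
hits = idx.lookup("python")  # both documents contain "python"
```

Because appending only touches the postings for the new document's terms, there is no need to re-fit the whole index after each addition.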
Advanced Features
Custom Tokenizer Pattern
```python
from minsearch import AppendableIndex

# Create an index with a custom tokenizer pattern:
# split on runs of whitespace, non-word characters, and digits
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    tokenizer_pattern=r'[\s\W\d]+'
)
```
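To see what the pattern matches, you can try it directly with Python's `re` module; the sample string here is just an illustration:

```python
import re

pattern = r'[\s\W\d]+'  # split on whitespace, non-word characters, and digits
tokens = [t for t in re.split(pattern, "Python 3.12: tips & tricks!") if t]
# tokens == ['Python', 'tips', 'tricks']
```

Note the filter on empty strings: `re.split` emits one when the pattern matches at the end of the input.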
Field Boosting
```python
# Boost certain fields to increase their importance in search
boost_dict = {
    "title": 2.0,        # Title matches are twice as important
    "description": 1.0   # Normal importance for description
}
results = index.search("python", boost_dict=boost_dict)
```
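A boost is just a weight applied when per-field scores are combined. The sketch below illustrates the arithmetic; the per-field scores are made-up numbers, and the library's exact scoring details may differ:

```python
def combined_score(field_scores, boost_dict):
    # Sum per-field similarity scores, weighting each by its boost
    # (fields without an explicit boost default to 1.0)
    return sum(score * boost_dict.get(field, 1.0)
               for field, score in field_scores.items())

# Hypothetical cosine similarities for one document
field_scores = {"title": 0.4, "description": 0.2}
score = combined_score(field_scores, {"title": 2.0, "description": 1.0})
# score == 0.4 * 2.0 + 0.2 * 1.0 == 1.0
```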
Keyword Filtering
```python
# Filter results by exact keyword matches; fields used here should be
# declared as keyword_fields when the index is created
filter_dict = {
    "course": "CS101",
    "level": "beginner"
}
results = index.search("python", filter_dict=filter_dict)
```
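Exact matching on keyword fields amounts to an equality check per field. A standalone sketch of that behavior (the documents and the level field are illustrative, not the library's internals):

```python
def matches(doc, filter_dict):
    # A document passes only if every keyword field equals the requested value
    return all(doc.get(field) == value
               for field, value in filter_dict.items())

docs = [
    {"title": "Python Programming", "course": "CS101", "level": "beginner"},
    {"title": "Data Science", "course": "CS102", "level": "beginner"},
]
filtered = [d for d in docs if matches(d, {"course": "CS101", "level": "beginner"})]
# only the CS101 document remains
```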
Examples
Interactive Notebook
The repository includes an interactive Jupyter notebook (minsearch_example.ipynb) that demonstrates the library's features using real-world data. The notebook shows:
- Loading and preparing documents from a JSON source
- Creating and configuring the search index
- Performing searches with filters and boosts
- Working with real course-related Q&A data
To run the notebook:
```shell
pipenv run jupyter notebook
```
Then open minsearch_example.ipynb in your browser.
Development
Running Tests
```shell
pipenv run pytest
```
Building and Publishing
- Install development dependencies:
```shell
pipenv install --dev twine build
```
- Build the package:
```shell
pipenv run python -m build
```
- Check the packages:
```shell
pipenv run twine check dist/*
```
- Upload to test PyPI:
```shell
pipenv run twine upload --repository-url https://test.pypi.org/legacy/ dist/*
```
- Upload to PyPI:
```shell
pipenv run twine upload dist/*
```
- Clean up:
```shell
rm -r build/ dist/ minsearch.egg-info/
```
Project Structure
- minsearch/: Main package directory
  - minsearch.py: Core Index implementation using scikit-learn
  - append.py: AppendableIndex implementation with inverted index
- tests/: Test suite
- minsearch_example.ipynb: Example notebook
- setup.py: Package configuration
- Pipfile: Development dependencies
Note: The minsearch.py file in the root directory is maintained for backward compatibility with the LLM Zoomcamp course.
Download files
Source Distribution
Built Distribution
File details
Details for the file minsearch-0.0.3.tar.gz.
File metadata
- Download URL: minsearch-0.0.3.tar.gz
- Upload date:
- Size: 14.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | b3644dcca421ecc36efdb0e5d56559aa6a14145804d155a2db416fa9a7590eda |
| MD5 | 2d78c0034bab021b1bf40c7f9fa7c196 |
| BLAKE2b-256 | db16076e26a29aff92bb59da3dee98a238e504739930b0285604445cd2c15c99 |
File details
Details for the file minsearch-0.0.3-py3-none-any.whl.
File metadata
- Download URL: minsearch-0.0.3-py3-none-any.whl
- Upload date:
- Size: 9.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | cbf898d7d436c443a2901e8119632c511752093b29830c944954512e24d2ba07 |
| MD5 | aea4f82e3a518e421740ad7b65aa5fb8 |
| BLAKE2b-256 | 7a099b401408e312da3fafc8a869979bef8acebd0c8fdd0fe2a29eb12b08bea4 |