Full-text search engine - LMDB/BM25 based
This project is a simple, yet powerful, full-text search engine written in Python. It's designed to be easy to use, thread-safe, and efficient for a variety of search tasks.
Key Features
- Easy to Use: The `SearchEngine` class provides a simple API for storing and searching documents.
- Metadata Support: Store and query documents based on metadata fields.
- Fast: Uses a combination of `FlashText` for quick keyword matching and a `BM25` vectorizer for more complex queries.
- Scalable: The sharded storage backend allows the engine to handle large amounts of data.
- Thread-Safe: The underlying LMDB storage is thread-safe, making it suitable for multi-threaded environments.
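As background on the ranking half of the feature list: BM25 scores a document by combining each query term's inverse document frequency with a length-normalized term frequency. The sketch below is a minimal, standard-library-only BM25 scorer to illustrate the formula; it is not this package's internal implementation, and the tokenization (simple `split()`) is an assumption for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    # Document frequency: in how many documents does each query term appear?
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency, saturated by k1 and normalized by document length via b
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "an apple a day keeps the doctor away".split(),
]
print(bm25_scores(["quick", "fox"], docs))  # first doc matches, second does not
```

The `k1` and `b` defaults (1.5 and 0.75) are the conventional BM25 starting points; real engines tune them per corpus.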
Usage
Here's a quick example of how to use the search engine:
```python
from engine import SearchEngine
import os
import shutil

# Define paths for the storage directories
storage_path = "./db"
metadata_path = "./db_metadata"
metadata_index_path = "./db_metadata_index"
matrix_path = "./matrix"

# Clean up previous runs if they exist
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)

# 1. Initialize the search engine
search_engine = SearchEngine(
    storage_base_path=storage_path,
    metadata_storage_base_path=metadata_path,
    metadata_index_storage_base_path=metadata_index_path,
    matrix_path=matrix_path,
)

# 2. Store some documents with metadata
docs = [
    ("The quick brown fox jumps over the lazy dog", {"source": "proverb"}),
    ("A journey of a thousand miles begins with a single step", {"source": "proverb"}),
    ("The early bird catches the worm", {"source": "proverb"}),
    ("An apple a day keeps the doctor away", {"source": "health"}),
]
for text, metadata in docs:
    search_engine.store_data(text, metadata)
print("Stored 4 documents.")

# 3. Pre-compute the index for optimal performance
print("Building search index...")
search_engine.index()
print("Index built.")

# 4. Perform a search
query = "quick fox"
results = search_engine.search(query, {})
print(f"\nSearching for: '{query}'")
for doc_id, text, metadata in results:
    print(f" - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
# - Found doc ... with metadata {'source': 'proverb'}: 'The quick brown fox jumps over the lazy dog'

# 5. Perform a search with a metadata filter
query = "apple"
metadata_query = {"source": "health"}
results = search_engine.search(query, metadata_query)
print(f"\nSearching for: '{query}' with metadata filter {metadata_query}")
for doc_id, text, metadata in results:
    print(f" - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
# - Found doc ... with metadata {'source': 'health'}: 'An apple a day keeps the doctor away'

# Clean up the storage directories
search_engine.storage.close()
search_engine.metadata_storage.close()
search_engine.metadata_index_storage.close()
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)
```
Low-Memory Architecture
This search engine is designed to be both fast and memory-efficient. It achieves this by using a memory-mapped sparse matrix for the search index.
- Indexing: The `index()` method builds the full document-term matrix in memory (a one-time cost) and then saves it to disk.
- Searching: For subsequent searches, the matrix is loaded back as a memory-mapped object. This allows the operating system to efficiently page the index between RAM and disk, providing the speed of an in-memory index without requiring the entire matrix to be loaded into RAM at once.
This approach provides a good balance between performance and memory usage, allowing the engine to handle large datasets with a small, predictable memory footprint.
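The save-then-memory-map pattern can be illustrated generically with NumPy and SciPy. This is a sketch of the technique, not this package's internal code: a CSR matrix's three underlying arrays are persisted with `np.save`, then reopened with `mmap_mode="r"` so the OS faults pages in on demand.

```python
import os
import tempfile
import numpy as np
from scipy import sparse

tmp = tempfile.mkdtemp()

# 1. Build the full document-term matrix in memory (the one-time cost)
mat = sparse.random(1000, 5000, density=0.01, format="csr", dtype=np.float32)

# 2. Persist the three arrays that define a CSR matrix
for name, arr in [("data", mat.data), ("indices", mat.indices), ("indptr", mat.indptr)]:
    np.save(os.path.join(tmp, name + ".npy"), arr)

# 3. Reopen them as memory maps: the OS pages chunks of the index
#    into RAM as they are touched, instead of loading everything up front.
mmap_mat = sparse.csr_matrix(
    (np.load(os.path.join(tmp, "data.npy"), mmap_mode="r"),
     np.load(os.path.join(tmp, "indices.npy"), mmap_mode="r"),
     np.load(os.path.join(tmp, "indptr.npy"), mmap_mode="r")),
    shape=mat.shape,
)

# Scoring a query only faults in the pages it actually touches
query = np.zeros(5000, dtype=np.float32)
query[:10] = 1.0
scores = mmap_mat @ query
print(scores.shape)
```

The memory-mapped matrix produces the same scores as the in-memory one; the difference is purely in how (and when) its pages reach RAM.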
Performance
The following performance metrics were collected on a standard machine. The use of a memory-mapped index allows for fast search performance while keeping RAM usage low.
| Metric | Value |
|---|---|
| Number of documents | 1000 |
| Document size (chars) | 500 |
| Storage throughput | 742.73 docs/sec |
| Search throughput | 76.27 queries/sec |
These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.
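Throughput figures like the ones above can be reproduced with a simple timing harness. The sketch below uses a list `append` as a hypothetical stand-in for `search_engine.store_data`, so it runs without the package installed; swap in the real call to benchmark the engine itself.

```python
import time

def throughput(fn, items):
    """Return items processed per second when fn is applied to each item."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")

# Hypothetical stand-in for search_engine.store_data:
stored = []
# 1000 documents of roughly 500 characters, matching the table above
docs = [f"document number {i} " * 25 for i in range(1000)]
rate = throughput(stored.append, docs)
print(f"Storage throughput: {rate:.2f} docs/sec")
```

The same harness applied to `search_engine.search` over a list of queries yields queries/sec.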
File details
Details for the file full_text_sparse_engine-0.1.0.tar.gz.
File metadata
- Download URL: full_text_sparse_engine-0.1.0.tar.gz
- Upload date:
- Size: 9.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0364bcd994a35004216f2db4c88f725252128f05c92b558652f88d50a5758799` |
| MD5 | `73f3661bec9660061d43e73c93b76fa5` |
| BLAKE2b-256 | `41c1e84b94d490806dbbd24ec4ca5d5f1b254b521a6d39cc5cb10c8cfb9c24f2` |
File details
Details for the file full_text_sparse_engine-0.1.0-py3-none-any.whl.
File metadata
- Download URL: full_text_sparse_engine-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `5c943bbf1839717860cc4526b660f51a88636067448ae3f4befd06971d0a8bbd` |
| MD5 | `b14505e4fd65ce8d4a59536674a7c19e` |
| BLAKE2b-256 | `5d8c2a57f81832df384513612f2de5dc23b4e6e910c7722402db5a223bc7a0d3` |