
Full-text search engine - LMDB/BM25 based

This project is a simple, yet powerful, full-text search engine written in Python. It's designed to be easy to use, thread-safe, and efficient for a variety of search tasks.

Key Features

  • Breaking Change in v1.1: This version introduces a new on-disk format for the metadata index to enable efficient filtering. If you are upgrading from a previous version, you must re-index your data.
  • Easy to Use: The SearchEngine class provides a simple API for storing and searching documents.
  • Metadata Support: Store and query documents based on metadata fields.
  • Fast: Uses a combination of FlashText for quick keyword matching and a BM25 vectorizer for more complex queries.
  • Scalable: The sharded storage backend allows the engine to handle large amounts of data.
  • Thread-Safe: The underlying LMDB storage is thread-safe, making it suitable for multi-threaded environments.
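The BM25 ranking used for the "more complex queries" path can be sketched in a few lines of pure Python. This is a minimal, self-contained illustration of the standard BM25 formula, not the package's actual vectorizer:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against query_terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many docs contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a journey of a thousand miles begins with a single step".split(),
]
print(bm25_scores(["quick", "fox"], docs))  # first doc scores highest
```

In practice the engine precomputes these statistics into a matrix at index time (see the Low-Memory Architecture section), so search only needs a sparse dot product rather than a per-query loop.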

Usage

Here's a quick example of how to use the search engine:

from engine import SearchEngine
import os
import shutil

# Define paths for the storage directories
storage_path = "./db"
metadata_path = "./db_metadata"
metadata_index_path = "./db_metadata_index"
matrix_path = "./matrix"

# Clean up previous runs if they exist
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)

# 1. Initialize the Search Engine
search_engine = SearchEngine(
    storage_base_path=storage_path,
    metadata_storage_base_path=metadata_path,
    metadata_index_storage_base_path=metadata_index_path,
    matrix_path=matrix_path
)

# 2. Store some documents with metadata
docs = [
    ("The quick brown fox jumps over the lazy dog", {"source": "proverb"}),
    ("A journey of a thousand miles begins with a single step", {"source": "proverb"}),
    ("The early bird catches the worm", {"source": "proverb"}),
    ("An apple a day keeps the doctor away", {"source": "health"}),
]

for text, metadata in docs:
    search_engine.store_data(text, metadata)

print("Stored 4 documents.")

# 3. Pre-compute the index for optimal performance
print("Building search index...")
search_engine.index()
print("Index built.")

# 4. Perform a search
query = "quick fox"
results = search_engine.search(query, {})

print(f"\nSearching for: '{query}'")
for doc_id, text, metadata in results:
    print(f"  - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
#   - Found doc ... with metadata {'source': 'proverb'}: 'The quick brown fox jumps over the lazy dog'

# 5. Perform a search with a metadata filter
query = "apple"
metadata_query = {"source": "health"}
results = search_engine.search(query, metadata_query)

print(f"\nSearching for: '{query}' with metadata filter {metadata_query}")
for doc_id, text, metadata in results:
    print(f"  - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
#   - Found doc ... with metadata {'source': 'health'}: 'An apple a day keeps the doctor away'

# 6. Perform a search with advanced metadata filtering
print("\n--- Advanced Metadata Filtering ---")

# Search for documents where the author is either "John" or "Sarah"
docs_authors = [
    ("Text by John", {"author": "John", "year": 2020}),
    ("Text by Sarah", {"author": "Sarah", "year": 2021}),
    ("Text by Mike", {"author": "Mike", "year": 2022}),
]
for text, metadata in docs_authors:
    search_engine.store_data(text, metadata)
search_engine.index()

# Using the $in operator
query_in = "Text"
metadata_in = {"author": {"$in": ["John", "Sarah"]}}
results_in = search_engine.search(query_in, metadata_in)
print(f"Searching for '{query_in}' with metadata {metadata_in}:")
for _, text, _ in results_in:
    print(f"  - Found: '{text}'")

# Using the $gte operator for a range query
query_gte = "Text"
metadata_gte = {"year": {"$gte": 2021}}
results_gte = search_engine.search(query_gte, metadata_gte)
print(f"\nSearching for '{query_gte}' with metadata {metadata_gte}:")
for _, text, _ in results_gte:
    print(f"  - Found: '{text}'")

# Example with datetime
import datetime
now = datetime.datetime.now()
docs_dates = [
    ("Event today", {"date": now}),
    ("Event tomorrow", {"date": (now + datetime.timedelta(days=1))}),
]
for text, metadata in docs_dates:
    search_engine.store_data(text, metadata)
search_engine.index()

query_date = "Event"
metadata_date = {"date": {"$gte": now}}
results_date = search_engine.search(query_date, metadata_date)
print(f"\nSearching for '{query_date}' with metadata {metadata_date}:")
for _, text, _ in results_date:
    print(f"  - Found: '{text}'")

# Clean up the storage directories
search_engine.storage.close()
search_engine.metadata_storage.close()
search_engine.metadata_index_storage.close()
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)
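The `$in` and `$eq`-style filters above can be understood as simple predicates over a document's metadata dict. The matcher below is a hypothetical sketch of those semantics (equality for plain values, `$in` for membership, `$gte` for ranges, including datetimes, which compare naturally in Python); it is not the package's implementation, which evaluates filters against an on-disk metadata index:

```python
def matches(metadata, query):
    """Return True if a document's metadata satisfies a query dict.

    Plain values require equality; {"$in": [...]} and {"$gte": x}
    mirror the operators used in the usage example above.
    """
    for field, cond in query.items():
        value = metadata.get(field)
        if isinstance(cond, dict):
            if "$in" in cond and value not in cond["$in"]:
                return False
            if "$gte" in cond and not (value is not None and value >= cond["$gte"]):
                return False
        elif value != cond:
            return False
    return True

print(matches({"author": "John", "year": 2020},
              {"author": {"$in": ["John", "Sarah"]}}))   # True
print(matches({"year": 2020}, {"year": {"$gte": 2021}}))  # False
```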

Low-Memory Architecture

This search engine is designed to be both fast and memory-efficient. It achieves this by using a memory-mapped sparse matrix for the search index.

  • Indexing: The index() method builds the full document-term matrix in memory (a one-time cost) and then saves it to disk.
  • Searching: For subsequent searches, the matrix is loaded back as a memory-mapped object. This allows the operating system to efficiently manage paging the index between RAM and disk, providing the speed of an in-memory index without requiring the entire matrix to be loaded into RAM at once.

This approach provides a good balance between performance and memory usage, allowing the engine to handle large datasets with a small, predictable memory footprint.
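The mmap idea can be demonstrated with the standard library alone. The sketch below writes a small row-major float64 matrix to disk and then reads back a single row through `mmap`, so the OS pages in only the bytes actually touched. For brevity it uses a dense layout, whereas the engine stores a sparse matrix, but the memory-mapping principle is the same (all names here are illustrative, not the package's API):

```python
import mmap
import os
import struct
import tempfile

def save_matrix(path, rows):
    """Write a row-major float64 matrix to disk (the one-time indexing cost)."""
    n_cols = len(rows[0])
    with open(path, "wb") as f:
        f.write(struct.pack("ii", len(rows), n_cols))  # 8-byte header
        for row in rows:
            f.write(struct.pack(f"{n_cols}d", *row))

def read_row(path, i):
    """Read a single row via mmap; untouched rows are never loaded into RAM."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_rows, n_cols = struct.unpack_from("ii", mm, 0)
        offset = 8 + i * n_cols * 8  # skip header, then i rows of float64
        return list(struct.unpack_from(f"{n_cols}d", mm, offset))

path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
save_matrix(path, [[0.0, 1.5, 0.0], [2.0, 0.0, 0.5]])
print(read_row(path, 1))  # [2.0, 0.0, 0.5]
```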

Performance

The following performance metrics were collected on a standard machine. The use of a memory-mapped index allows for fast search performance while keeping RAM usage low.

Metric                    Value
------                    -----
Number of documents       1000
Document size (chars)     500
Storage throughput        607.57 docs/sec
Search throughput         66.11 queries/sec

These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.
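To reproduce comparable numbers on your own hardware, a simple timing harness suffices. The sketch below uses only the standard library; `store` is a placeholder workload, which you would replace with calls such as `search_engine.store_data(...)` or `search_engine.search(...)`:

```python
import time

def measure_throughput(fn, items):
    """Return operations per second for applying fn to each item."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")

# Placeholder workload: swap in search_engine.store_data for real numbers.
store = lambda doc: sum(ord(c) for c in doc)
docs = ["x" * 500 for _ in range(1000)]  # 1000 docs of 500 chars, as above
print(f"{measure_throughput(store, docs):.2f} docs/sec")
```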
