
Project description

Full text search engine - LMDB/BM25 based

This project is a simple yet powerful full-text search engine written in Python. It is designed to be easy to use, thread-safe, and efficient across a variety of search tasks.

Key Features

  • Easy to Use: The SearchEngine class provides a simple API for storing and searching documents.
  • Metadata Support: Store and query documents based on metadata fields.
  • Fast: Uses a combination of FlashText for quick keyword matching and a BM25 vectorizer for more complex queries.
  • Scalable: The sharded storage backend allows the engine to handle large amounts of data.
  • Thread-Safe: The underlying LMDB storage is thread-safe, making it suitable for multi-threaded environments.
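To make the ranking behaviour concrete, here is a minimal, self-contained sketch of BM25 scoring. It illustrates the scoring formula that a BM25 vectorizer implements; it is not this engine's internal code, and the tokenization (lowercased whitespace split) is a simplifying assumption.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against query_terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: in how many documents each term appears
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        dl = len(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

docs = [text.lower().split() for text in [
    "the quick brown fox jumps over the lazy dog",
    "a journey of a thousand miles begins with a single step",
]]
print(bm25_scores(["quick", "fox"], docs))
```

Documents containing more of the query terms (weighted by rarity and normalized by document length) score higher; documents containing none score zero.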

Usage

Here's a quick example of how to use the search engine:

from engine import SearchEngine
import os
import shutil

# Define paths for the storage directories
storage_path = "./db"
metadata_path = "./db_metadata"
metadata_index_path = "./db_metadata_index"
matrix_path = "./matrix"

# Clean up previous runs if they exist
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)

# 1. Initialize the Search Engine
search_engine = SearchEngine(
    storage_base_path=storage_path,
    metadata_storage_base_path=metadata_path,
    metadata_index_storage_base_path=metadata_index_path,
    matrix_path=matrix_path
)

# 2. Store some documents with metadata
docs = [
    ("The quick brown fox jumps over the lazy dog", {"source": "proverb"}),
    ("A journey of a thousand miles begins with a single step", {"source": "proverb"}),
    ("The early bird catches the worm", {"source": "proverb"}),
    ("An apple a day keeps the doctor away", {"source": "health"}),
]

for text, metadata in docs:
    search_engine.store_data(text, metadata)

print("Stored 4 documents.")

# 3. Pre-compute the index for optimal performance
print("Building search index...")
search_engine.index()
print("Index built.")

# 4. Perform a search
query = "quick fox"
results = search_engine.search(query, {})

print(f"\nSearching for: '{query}'")
for doc_id, text, metadata in results:
    print(f"  - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
#   - Found doc ... with metadata {'source': 'proverb'}: 'The quick brown fox jumps over the lazy dog'

# 5. Perform a search with a metadata filter
query = "apple"
metadata_query = {"source": "health"}
results = search_engine.search(query, metadata_query)

print(f"\nSearching for: '{query}' with metadata filter {metadata_query}")
for doc_id, text, metadata in results:
    print(f"  - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
#   - Found doc ... with metadata {'source': 'health'}: 'An apple a day keeps the doctor away'

# 6. Perform a search with advanced metadata filtering
print("\n--- Advanced Metadata Filtering ---")

# Search for documents where the author is either "John" or "Sarah"
docs_authors = [
    ("Text by John", {"author": "John", "year": 2020}),
    ("Text by Sarah", {"author": "Sarah", "year": 2021}),
    ("Text by Mike", {"author": "Mike", "year": 2022}),
]
for text, metadata in docs_authors:
    search_engine.store_data(text, metadata)
search_engine.index()

# Using the $in operator
query_in = "Text"
metadata_in = {"author": {"$in": ["John", "Sarah"]}}
results_in = search_engine.search(query_in, metadata_in)
print(f"Searching for '{query_in}' with metadata {metadata_in}:")
for _, text, _ in results_in:
    print(f"  - Found: '{text}'")

# Using the $gte operator for a range query
query_gte = "Text"
metadata_gte = {"year": {"$gte": 2021}}
results_gte = search_engine.search(query_gte, metadata_gte)
print(f"\nSearching for '{query_gte}' with metadata {metadata_gte}:")
for _, text, _ in results_gte:
    print(f"  - Found: '{text}'")

# Example with datetime
import datetime
now = datetime.datetime.now()
docs_dates = [
    ("Event today", {"date": now}),
    ("Event tomorrow", {"date": (now + datetime.timedelta(days=1))}),
]
for text, metadata in docs_dates:
    search_engine.store_data(text, metadata)
search_engine.index()

query_date = "Event"
metadata_date = {"date": {"$gte": now}}
results_date = search_engine.search(query_date, metadata_date)
print(f"\nSearching for '{query_date}' with metadata {metadata_date}:")
for _, text, _ in results_date:
    print(f"  - Found: '{text}'")

# Clean up the storage directories
search_engine.storage.close()
search_engine.metadata_storage.close()
search_engine.metadata_index_storage.close()
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)
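The `$in` and `$gte` operators above follow a Mongo-style filter syntax. To clarify their semantics, here is a minimal, illustrative evaluator for such filters, written against plain dicts; it is not the engine's own implementation, which evaluates filters against its metadata index rather than in-memory dicts.

```python
def matches(metadata, query):
    """Return True if a metadata dict satisfies a Mongo-style filter."""
    for key, cond in query.items():
        value = metadata.get(key)
        if isinstance(cond, dict):
            # Operator form, e.g. {"$in": [...]} or {"$gte": x}
            for op, arg in cond.items():
                if op == "$in" and value not in arg:
                    return False
                if op == "$gte" and (value is None or value < arg):
                    return False
        elif value != cond:
            # Plain equality match
            return False
    return True

print(matches({"author": "John", "year": 2020},
              {"author": {"$in": ["John", "Sarah"]}}))  # True
print(matches({"year": 2020}, {"year": {"$gte": 2021}}))  # False
```

Plain values mean equality, while a nested dict switches to operator mode; because `$gte` relies on Python's ordering, it works uniformly for numbers and `datetime` objects, matching the usage shown above.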

Low-Memory Architecture

This search engine is designed to be both fast and memory-efficient. It achieves this by using a memory-mapped sparse matrix for the search index.

  • Indexing: The index() method builds the full document-term matrix in memory (a one-time cost) and then saves it to disk.
  • Searching: For subsequent searches, the matrix is loaded back as a memory-mapped object. This allows the operating system to efficiently manage paging the index between RAM and disk, providing the speed of an in-memory index without requiring the entire matrix to be loaded into RAM at once.

This approach provides a good balance between performance and memory usage, allowing the engine to handle large datasets with a small, predictable memory footprint.
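The technique can be sketched with NumPy and SciPy: persist the components of a CSR matrix as `.npy` files, then reload them with `mmap_mode="r"` so the OS pages them in on demand. This is an illustration of the memory-mapping approach, assuming a CSR layout; the engine's actual on-disk format may differ.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Build a small CSR document-term matrix in memory (the one-time indexing cost)
rows, cols, vals = [0, 0, 1, 2], [0, 2, 1, 2], [1.0, 2.0, 3.0, 4.0]
matrix = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

# Persist the three CSR component arrays as individual .npy files
d = tempfile.mkdtemp()
np.save(os.path.join(d, "data.npy"), matrix.data)
np.save(os.path.join(d, "indices.npy"), matrix.indices)
np.save(os.path.join(d, "indptr.npy"), matrix.indptr)

# Reload with mmap_mode="r": the arrays are paged from disk on demand
mm = sparse.csr_matrix(
    (np.load(os.path.join(d, "data.npy"), mmap_mode="r"),
     np.load(os.path.join(d, "indices.npy"), mmap_mode="r"),
     np.load(os.path.join(d, "indptr.npy"), mmap_mode="r")),
    shape=(3, 3),
)

# Score documents against a query vector without loading the whole matrix
query = np.array([0.0, 0.0, 1.0])
print(mm @ query)
```

Only the pages actually touched by a query need to reside in RAM, which is what keeps the memory footprint small and predictable as the matrix grows.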

Performance

The following performance metrics were collected on a standard machine. The use of a memory-mapped index allows for fast search performance while keeping RAM usage low.

Metric                 Value
Number of documents    1000
Document size (chars)  500
Storage throughput     607.57 docs/sec
Search throughput      66.11 queries/sec

These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.
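To reproduce throughput figures like these on your own hardware, a simple timing harness suffices. The sketch below uses a stand-in workload; in practice you would pass `search_engine.store_data` or `search_engine.search` as the measured operation.

```python
import time

def throughput(op, items):
    """Measure items processed per second for a callable applied to each item."""
    start = time.perf_counter()
    for item in items:
        op(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")

# Stand-in workload: replace with store_data/search calls to benchmark the engine
rate = throughput(lambda text: text.lower().split(), ["some document text"] * 1000)
print(f"{rate:.2f} docs/sec")
```

Using `time.perf_counter` (a monotonic, high-resolution clock) avoids the drift and coarse granularity of wall-clock timestamps for short runs.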

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

full_text_sparse_engine-0.2.1.tar.gz (10.4 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

full_text_sparse_engine-0.2.1-py3-none-any.whl (11.0 kB)

Uploaded Python 3

File details

Details for the file full_text_sparse_engine-0.2.1.tar.gz.

File metadata

  • Download URL: full_text_sparse_engine-0.2.1.tar.gz
  • Upload date:
  • Size: 10.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for full_text_sparse_engine-0.2.1.tar.gz
Algorithm Hash digest
SHA256 87e033fc7650fbd11ab17f51464582ef8fc5e91524c36c02371cef60bfb27928
MD5 fc42debfabe35f4109138dbe5b22e793
BLAKE2b-256 a5f2a42cdddacd04220d9d7389092b33595b72557d14fa6715a9626afc37a3f0

See more details on using hashes here.

File details

Details for the file full_text_sparse_engine-0.2.1-py3-none-any.whl.

File metadata

File hashes

Hashes for full_text_sparse_engine-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 80be2fd03394881977114a5ee16a412f0cbed2bc6487b6898f982eef24a8dba1
MD5 787fe3b9a302aa2984c528ef35ce1425
BLAKE2b-256 e4e50ae9569b2986f9f18b5c668a5fbbfe8ca49a5344fb1d51ca103a6a7722c0

