
Full-text search engine - LMDB/BM25 based

This project is a simple, yet powerful, full-text search engine written in Python. It's designed to be easy to use, thread-safe, and efficient for a variety of search tasks.

Key Features

  • Breaking Change in v1.1: This version introduces a new on-disk format for the metadata index to enable efficient filtering. If you are upgrading from a previous version, you must re-index your data.
  • Easy to Use: The SearchEngine class provides a simple API for storing and searching documents.
  • Metadata Support: Store and query documents based on metadata fields.
  • Fast: Uses a combination of FlashText for quick keyword matching and a BM25 vectorizer for more complex queries.
  • Scalable: The sharded storage backend allows the engine to handle large amounts of data.
  • Thread-Safe: The underlying LMDB storage is thread-safe, making it suitable for multi-threaded environments.
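The BM25 ranking used for the "more complex queries" path can be sketched in a few lines of pure Python. This is a minimal, self-contained illustration of the standard BM25 formula, not the package's actual vectorizer:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against query_terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many docs contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a journey of a thousand miles begins with a single step".split(),
]
print(bm25_scores(["quick", "fox"], docs))  # first doc scores highest
```

In practice the engine precomputes these statistics into a matrix at index time (see the Low-Memory Architecture section), so search only needs a sparse dot product rather than a per-query loop.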

Usage

Here's a quick example of how to use the search engine:

from engine import SearchEngine
import os
import shutil

# Define paths for the storage directories
storage_path = "./db"
metadata_path = "./db_metadata"
metadata_index_path = "./db_metadata_index"
matrix_path = "./matrix"

# Clean up previous runs if they exist
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)

# 1. Initialize the Search Engine
search_engine = SearchEngine(
    storage_base_path=storage_path,
    metadata_storage_base_path=metadata_path,
    metadata_index_storage_base_path=metadata_index_path,
    matrix_path=matrix_path
)

# 2. Store some documents with metadata
docs = [
    ("The quick brown fox jumps over the lazy dog", {"source": "proverb"}),
    ("A journey of a thousand miles begins with a single step", {"source": "proverb"}),
    ("The early bird catches the worm", {"source": "proverb"}),
    ("An apple a day keeps the doctor away", {"source": "health"}),
]

for text, metadata in docs:
    search_engine.store_data(text, metadata)

print("Stored 4 documents.")

# 3. Pre-compute the index for optimal performance
print("Building search index...")
search_engine.index()
print("Index built.")

# 4. Perform a search
query = "quick fox"
results = search_engine.search(query, {})

print(f"\nSearching for: '{query}'")
for doc_id, text, metadata in results:
    print(f"  - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
#   - Found doc ... with metadata {'source': 'proverb'}: 'The quick brown fox jumps over the lazy dog'

# 5. Perform a search with a metadata filter
query = "apple"
metadata_query = {"source": "health"}
results = search_engine.search(query, metadata_query)

print(f"\nSearching for: '{query}' with metadata filter {metadata_query}")
for doc_id, text, metadata in results:
    print(f"  - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
#   - Found doc ... with metadata {'source': 'health'}: 'An apple a day keeps the doctor away'

# 6. Perform a search with advanced metadata filtering
print("\n--- Advanced Metadata Filtering ---")

# Search for documents where the author is either "John" or "Sarah"
docs_authors = [
    ("Text by John", {"author": "John", "year": 2020}),
    ("Text by Sarah", {"author": "Sarah", "year": 2021}),
    ("Text by Mike", {"author": "Mike", "year": 2022}),
]
for text, metadata in docs_authors:
    search_engine.store_data(text, metadata)
search_engine.index()

# Using the $in operator
query_in = "Text"
metadata_in = {"author": {"$in": ["John", "Sarah"]}}
results_in = search_engine.search(query_in, metadata_in)
print(f"Searching for '{query_in}' with metadata {metadata_in}:")
for _, text, _ in results_in:
    print(f"  - Found: '{text}'")

# Using the $gte operator for a range query
query_gte = "Text"
metadata_gte = {"year": {"$gte": 2021}}
results_gte = search_engine.search(query_gte, metadata_gte)
print(f"\nSearching for '{query_gte}' with metadata {metadata_gte}:")
for _, text, _ in results_gte:
    print(f"  - Found: '{text}'")

# Example with datetime
import datetime
now = datetime.datetime.now()
docs_dates = [
    ("Event today", {"date": now}),
    ("Event tomorrow", {"date": (now + datetime.timedelta(days=1))}),
]
for text, metadata in docs_dates:
    search_engine.store_data(text, metadata)
search_engine.index()

query_date = "Event"
metadata_date = {"date": {"$gte": now}}
results_date = search_engine.search(query_date, metadata_date)
print(f"\nSearching for '{query_date}' with metadata {metadata_date}:")
for _, text, _ in results_date:
    print(f"  - Found: '{text}'")

# Clean up the storage directories
search_engine.storage.close()
search_engine.metadata_storage.close()
search_engine.metadata_index_storage.close()
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)
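The `$in` and `$eq`-style filters above can be understood as simple predicates over a document's metadata dict. The matcher below is a hypothetical sketch of those semantics (equality for plain values, `$in` for membership, `$gte` for ranges, including datetimes, which compare naturally in Python); it is not the package's implementation, which evaluates filters against an on-disk metadata index:

```python
def matches(metadata, query):
    """Return True if a document's metadata satisfies a query dict.

    Plain values require equality; {"$in": [...]} and {"$gte": x}
    mirror the operators used in the usage example above.
    """
    for field, cond in query.items():
        value = metadata.get(field)
        if isinstance(cond, dict):
            if "$in" in cond and value not in cond["$in"]:
                return False
            if "$gte" in cond and not (value is not None and value >= cond["$gte"]):
                return False
        elif value != cond:
            return False
    return True

print(matches({"author": "John", "year": 2020},
              {"author": {"$in": ["John", "Sarah"]}}))   # True
print(matches({"year": 2020}, {"year": {"$gte": 2021}}))  # False
```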

Low-Memory Architecture

This search engine is designed to be both fast and memory-efficient. It achieves this by using a memory-mapped sparse matrix for the search index.

  • Indexing: The index() method builds the full document-term matrix in memory (a one-time cost) and then saves it to disk.
  • Searching: For subsequent searches, the matrix is loaded back as a memory-mapped object. This allows the operating system to efficiently manage paging the index between RAM and disk, providing the speed of an in-memory index without requiring the entire matrix to be loaded into RAM at once.

This approach provides a good balance between performance and memory usage, allowing the engine to handle large datasets with a small, predictable memory footprint.
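The mmap idea can be demonstrated with the standard library alone. The sketch below writes a small row-major float64 matrix to disk and then reads back a single row through `mmap`, so the OS pages in only the bytes actually touched. For brevity it uses a dense layout, whereas the engine stores a sparse matrix, but the memory-mapping principle is the same (all names here are illustrative, not the package's API):

```python
import mmap
import os
import struct
import tempfile

def save_matrix(path, rows):
    """Write a row-major float64 matrix to disk (the one-time indexing cost)."""
    n_cols = len(rows[0])
    with open(path, "wb") as f:
        f.write(struct.pack("ii", len(rows), n_cols))  # 8-byte header
        for row in rows:
            f.write(struct.pack(f"{n_cols}d", *row))

def read_row(path, i):
    """Read a single row via mmap; untouched rows are never loaded into RAM."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_rows, n_cols = struct.unpack_from("ii", mm, 0)
        offset = 8 + i * n_cols * 8  # skip header, then i rows of float64
        return list(struct.unpack_from(f"{n_cols}d", mm, offset))

path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
save_matrix(path, [[0.0, 1.5, 0.0], [2.0, 0.0, 0.5]])
print(read_row(path, 1))  # [2.0, 0.0, 0.5]
```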

Performance

The following performance metrics were collected on a standard machine. The use of a memory-mapped index allows for fast search performance while keeping RAM usage low.

Metric                    Value
------                    -----
Number of documents       1000
Document size (chars)     500
Storage throughput        607.57 docs/sec
Search throughput         66.11 queries/sec

These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.
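To reproduce comparable numbers on your own hardware, a simple timing harness suffices. The sketch below uses only the standard library; `store` is a placeholder workload, which you would replace with calls such as `search_engine.store_data(...)` or `search_engine.search(...)`:

```python
import time

def measure_throughput(fn, items):
    """Return operations per second for applying fn to each item."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")

# Placeholder workload: swap in search_engine.store_data for real numbers.
store = lambda doc: sum(ord(c) for c in doc)
docs = ["x" * 500 for _ in range(1000)]  # 1000 docs of 500 chars, as above
print(f"{measure_throughput(store, docs):.2f} docs/sec")
```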
