
Full-text search engine - Hybrid LMDB/Inverted Index

This project is a high-performance, persistent full-text search engine written in Python. It uses a hybrid architecture that combines an inverted index for sub-100ms candidate filtering with an on-the-fly BM25 vectorizer for relevance ranking.

Key Features

  • Sub-100ms Search: Optimized for real-time querying using a custom inverted index.
  • Bidirectional Tree Lookup: Finds documents using smart prefix expansion ("exon" -> "exonerated") and recursive reduction ("exonerações" -> "exonera").
  • Metadata Filtering: Powerful filtering with support for exact matches, ranges ($gt, $lt), and sets ($in).
  • Atomic Batching: High-throughput indexing (thousands of docs/sec) via atomic batch commits to LMDB.
  • Thread-Safe: Built on LMDB (Lightning Memory-Mapped Database), which provides safe concurrent access.
  • CRUD Support: Fully supports Create, Read, Update, and Delete operations.

Architecture

Unlike traditional vector search engines, which rely on large pre-computed dense matrices, this engine uses a two-step process (sketched in code after the list):

  1. Filtering (Inverted Index): A sharded inverted index maps tokens to document IDs. It uses a bidirectional strategy:
    • Forward Scan: Finds words starting with the query token (e.g., query "work" matches "working").
    • Backward Scan: If the forward scan finds no match, it progressively reduces the query token to find root words (e.g., query "working" matches "work").
  2. Ranking (On-the-fly BM25): Once candidates are filtered (by metadata and text), a lightweight IncrementalBM25 model is built instantly in memory for just those candidates to score and rank them.
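To make this concrete, here is a minimal in-memory sketch of both steps. It is only an illustration of the technique under simplified assumptions (a plain dict index, whitespace tokenization); the names docs, inv, lookup, and bm25_rank are hypothetical, not part of the engine's API.

import bisect
import math
from collections import Counter, defaultdict

# Toy corpus standing in for the persistent document store.
docs = {
    1: "she is working late",
    2: "hard work pays off",
    3: "the quick brown fox",
}

# Step 1 state: inverted index mapping token -> set of doc IDs.
inv = defaultdict(set)
for doc_id, text in docs.items():
    for tok in text.split():
        inv[tok].add(doc_id)
vocab = sorted(inv)  # sorted vocabulary enables prefix range scans

def lookup(tok):
    """Bidirectional lookup: forward prefix scan, then backward reduction."""
    hits = set()
    # Forward scan: every indexed token that starts with the query token.
    i = bisect.bisect_left(vocab, tok)
    while i < len(vocab) and vocab[i].startswith(tok):
        hits |= inv[vocab[i]]
        i += 1
    # Backward scan: strip trailing characters until some prefix matches.
    t = tok
    while not hits and len(t) > 2:
        t = t[:-1]
        j = bisect.bisect_left(vocab, t)
        while j < len(vocab) and vocab[j].startswith(t):
            hits |= inv[vocab[j]]
            j += 1
    return hits

# Step 2: BM25 built in memory over just the filtered candidates.
def bm25_rank(query_toks, cand_ids, k1=1.5, b=0.75):
    if not cand_ids:
        return []
    corpus = {i: docs[i].split() for i in cand_ids}
    n = len(corpus)
    avgdl = sum(len(toks) for toks in corpus.values()) / n
    df = Counter(tok for toks in corpus.values() for tok in set(toks))
    scores = {}
    for i, toks in corpus.items():
        tf = Counter(toks)
        score = 0.0
        for q in query_toks:
            if q in tf:
                idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
                score += idf * tf[q] * (k1 + 1) / (
                    tf[q] + k1 * (1 - b + b * len(toks) / avgdl))
        scores[i] = score
    return sorted(scores, key=scores.get, reverse=True)

cands = lookup("works")            # no prefix hit, so "works" reduces to "work"
print(cands)                       # {1, 2} via "work" and "working"
print(bm25_rank(["work"], cands))  # [2, 1]: the exact "work" match ranks first

Since LMDB keeps its keys in sorted order, the same kind of prefix range scan stays cheap in the persistent, sharded index.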

Usage

Initialization

from engine import SearchEngine
import os, shutil

# Define paths
paths = ["./db", "./db_meta", "./db_meta_idx", "./db_text_idx", "./matrix"]

# Cleanup for fresh start
for p in paths:
    if os.path.exists(p):
        shutil.rmtree(p)

# Initialize
engine = SearchEngine(
    storage_base_path=paths[0],
    metadata_storage_base_path=paths[1],
    metadata_index_storage_base_path=paths[2],
    text_index_storage_base_path=paths[3],
    matrix_path=paths[4]
)

1. Create (Insert)

You can insert single documents or batches. Batching is recommended for speed.

# Single Insert
doc_id = engine.store_data("The quick brown fox", {"category": "animals"})
print(f"Inserted document ID: {doc_id}")

# Batch Insert (Faster)
docs = ["Apple pie recipe", "Banana bread recipe"]
metas = [{"type": "food"}, {"type": "food"}]
doc_ids = engine.store_data_batch(docs, metas)
print(f"Inserted batch: {doc_ids}")

2. Read (Search)

Search combines text queries with metadata filters.

# Simple Text Search
results = engine.search("apple", {}) 
# Returns: [(doc_id, text, metadata), ...]

# Metadata Filter + Text
# Finds "fox" only in docs where category == 'animals'
results = engine.search("fox", {"category": "animals"})

# Advanced Range Query
# Find "report" where year >= 2022
results = engine.search("report", {"year": {"$gte": 2022}})

# Set Query
# Find "recipe" where type is 'food' or 'dessert'
results = engine.search("recipe", {"type": {"$in": ["food", "dessert"]}})

3. Update

To update a document, store it again with the same doc_id. Note that this engine treats updates as upserts: calling store_data with an existing ID overwrites the stored document and adds index keys for the new tokens, but the old index keys remain (still pointing to the valid ID). For a clean update that removes stale index keys, it is often safer to delete and then re-insert, as sketched below after the upsert example.

# 1. Store original
uid = engine.store_data("Old text content", {"version": 1}, doc_id="custom-id-123")

# 2. Update (Overwrite)
# This updates the content and adds 'new' and 'content' to the index for this ID.
engine.store_data("New content", {"version": 2}, doc_id="custom-id-123")

4. Delete

Delete documents that match a metadata query. This removes them from storage and cleans up the associated index entries.

# Delete all documents with category 'animals'
deleted_count = engine.delete({"category": "animals"})
print(f"Deleted {deleted_count} documents.")

# Delete a specific document by ID (if ID is stored in metadata)
# Assuming you stored {"id": "custom-id-123"} in metadata:
engine.delete({"id": "custom-id-123"})

Performance Notes

  • Indexing: Use store_data_batch for mass ingestion. It groups writes into atomic batch transactions, drastically reducing disk I/O (see the chunked loop below).
  • Search: The engine is optimized for high-selectivity queries (where metadata or text filters reduce the candidate set to < 10,000 documents).
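
For example, mass ingestion can feed the corpus through store_data_batch in fixed-size chunks. The chunk size of 1000 and the all_docs/all_metas lists below are arbitrary placeholders:

# Hypothetical chunked ingestion loop over the documented batch API.
all_docs = [f"document body {i}" for i in range(10_000)]
all_metas = [{"n": i} for i in range(10_000)]
CHUNK = 1000
for start in range(0, len(all_docs), CHUNK):
    engine.store_data_batch(
        all_docs[start:start + CHUNK],
        all_metas[start:start + CHUNK],
    )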

The following performance metrics were collected on a standard machine. The use of a memory-mapped index allows for fast search performance while keeping RAM usage low.

Metric                  Value
Number of documents     1000
Document size (chars)   500
Storage throughput      22.63 docs/sec
Search throughput       6089.11 queries/sec

These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.

Download files

Source Distribution

full_text_sparse_engine-0.4.0.tar.gz (10.1 kB, source)

Built Distribution

full_text_sparse_engine-0.4.0-py3-none-any.whl (10.2 kB, Python 3)

File details

Details for the file full_text_sparse_engine-0.4.0.tar.gz.

File metadata

  • Download URL: full_text_sparse_engine-0.4.0.tar.gz
  • Size: 10.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for full_text_sparse_engine-0.4.0.tar.gz

Algorithm    Hash digest
SHA256       051100c22ec64030b82906250c6608a81303f977fec10183d8a69d551d188b81
MD5          0ea3d0a98ba7e66538ea14355a13a327
BLAKE2b-256  dc7b35eaaad4d48cc38e478713cb362dee57f45e40bf17825dcabe3a67e96538


File details

Details for the file full_text_sparse_engine-0.4.0-py3-none-any.whl.

File hashes

Hashes for full_text_sparse_engine-0.4.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       b78f34e2bc4fb162d240a83e576ab093d19fcde2bc24d5b7f0c06eb0d00e6a52
MD5          a69e762531dacdbf53af9947478bf931
BLAKE2b-256  40b9ff7a12ee79b79ce3b75194edd41f2e5c7629d60970a4c5cd9e37f6d5e3f5

