
Full-text search engine - Hybrid LMDB/Inverted Index

This project is a high-performance, persistent full-text search engine written in Python. It uses a hybrid architecture that combines an inverted index for sub-100ms candidate filtering with an on-the-fly BM25 vectorizer for relevance ranking.
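For reference, the term weight that a BM25 vectorizer computes can be sketched as follows. This is the textbook BM25 formula, not necessarily the engine's exact implementation; the parameter defaults (k1=1.5, b=0.75) are common conventions, not values taken from this project.

```python
import math

def bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Textbook BM25 score of one term for one document.

    tf: term frequency in the document
    df: number of documents containing the term
    n_docs: total number of documents in the candidate set
    doc_len / avg_doc_len: document-length normalization
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

# Length normalization: the same term frequency scores higher in a
# shorter document than in a longer one.
short = bm25_score(tf=3, df=2, n_docs=100, doc_len=50, avg_doc_len=100)
long_ = bm25_score(tf=3, df=2, n_docs=100, doc_len=200, avg_doc_len=100)
```

Building this model only over the filtered candidates (rather than the whole corpus) is what lets the engine avoid pre-computed dense matrices.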

Key Features

  • Sub-100ms Search: Optimized for real-time querying using a custom inverted index.
  • Bidirectional Tree Lookup: Finds documents via smart prefix expansion ("exon" -> "exonerated") and recursive reduction ("exonerações" -> "exonera").
  • Metadata Filtering: Supports exact matches, ranges ($gt, $lt), and sets ($in).
  • Atomic Batching: High-throughput indexing (thousands of docs/sec) via atomic batch commits to LMDB.
  • Thread-Safe: Built on LMDB (Lightning Memory-Mapped Database), whose transactions provide safe concurrent access.
  • CRUD Support: Full Create, Read, Update, and Delete operations.

Architecture

Unlike traditional vector search engines, which rely on large pre-computed dense matrices, this engine uses a two-step process:

  1. Filtering (Inverted Index): A sharded inverted index maps tokens to document IDs. It uses a bidirectional strategy:
    • Forward Scan: Finds words starting with the query token (e.g., query "work" matches "working").
    • Backward Scan: If no exact match is found, it reduces the query token to find root words (e.g., query "working" matches "work").
  2. Ranking (On-the-fly BM25): Once candidates are filtered (by metadata and text), a lightweight IncrementalBM25 model is built instantly in memory for just those candidates to score and rank them.
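The bidirectional strategy in step 1 can be sketched over a sorted token index. The function and index names below are hypothetical illustrations of the described behavior, not the engine's internal API.

```python
from bisect import bisect_left

def bidirectional_lookup(index, token):
    """Sketch of the forward/backward scan described above.

    `index` maps tokens to sets of document IDs; in the real engine this
    is a sharded, persistent structure rather than a dict.
    """
    tokens = sorted(index)
    # Forward scan: collect every indexed token that starts with the query.
    start = bisect_left(tokens, token)
    hits = set()
    for t in tokens[start:]:
        if not t.startswith(token):
            break
        hits |= index[t]
    if hits:
        return hits
    # Backward scan: shorten the query token until an indexed root is found.
    stem = token[:-1]
    while stem:
        if stem in index:
            return set(index[stem])
        stem = stem[:-1]
    return set()

index = {"work": {1}, "working": {2}, "worker": {3}}
bidirectional_lookup(index, "work")      # forward scan -> {1, 2, 3}
bidirectional_lookup(index, "workings")  # backward scan -> {2}
```

The `bisect` jump to the first candidate is what keeps the forward scan cheap on a sorted key space, which matches how ordered key-value stores like LMDB iterate.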

Usage

Initialization

from engine import SearchEngine
import os, shutil

# Define paths
paths = ["./db", "./db_meta", "./db_meta_idx", "./db_text_idx", "./matrix"]

# Cleanup for fresh start
for p in paths:
    if os.path.exists(p): shutil.rmtree(p)

# Initialize
engine = SearchEngine(
    storage_base_path=paths[0],
    metadata_storage_base_path=paths[1],
    metadata_index_storage_base_path=paths[2],
    text_index_storage_base_path=paths[3],
    matrix_path=paths[4]
)

1. Create (Insert)

You can insert single documents or batches. Batching is recommended for speed.

# Single Insert
doc_id = engine.store_data("The quick brown fox", {"category": "animals"})
print(f"Inserted document ID: {doc_id}")

# Batch Insert (Faster)
docs = ["Apple pie recipe", "Banana bread recipe"]
metas = [{"type": "food"}, {"type": "food"}]
doc_ids = engine.store_data_batch(docs, metas)
print(f"Inserted batch: {doc_ids}")

2. Read (Search)

Search combines text queries with metadata filters.

# Simple Text Search
results = engine.search("apple", {}) 
# Returns: [(doc_id, text, metadata), ...]

# Metadata Filter + Text
# Finds "fox" only in docs where category == 'animals'
results = engine.search("fox", {"category": "animals"})

# Advanced Range Query
# Find "report" where year >= 2022
results = engine.search("report", {"year": {"$gte": 2022}})

# Set Query
# Find "recipe" where type is 'food' or 'dessert'
results = engine.search("recipe", {"type": {"$in": ["food", "dessert"]}})
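The filter semantics above can be summarized with a small predicate matcher. This mirrors the documented operators ($gt, $gte, $lt, $lte, $in); the engine's own matcher may differ in details such as how missing keys are handled.

```python
def matches(meta, query):
    """Return True if a metadata dict satisfies a filter query."""
    ops = {
        "$gt":  lambda v, x: v > x,
        "$gte": lambda v, x: v >= x,
        "$lt":  lambda v, x: v < x,
        "$lte": lambda v, x: v <= x,
        "$in":  lambda v, x: v in x,
    }
    for key, cond in query.items():
        if key not in meta:
            return False
        if isinstance(cond, dict):
            # Operator form: every operator in the condition must hold.
            if not all(ops[op](meta[key], arg) for op, arg in cond.items()):
                return False
        elif meta[key] != cond:
            # Plain value: exact match.
            return False
    return True

matches({"year": 2023, "type": "food"}, {"year": {"$gte": 2022}})   # True
matches({"type": "drink"}, {"type": {"$in": ["food", "dessert"]}})  # False
```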

3. Update

To update a document, store it again with the same doc_id; the engine treats updates as upserts. Calling store_data with an existing ID overwrites the stored content and metadata and adds index keys for the new tokens, but old index keys are not removed (they still resolve to the valid ID). For a clean update that also drops stale index keys, it is safer to delete the document and then re-insert it.

# 1. Store original
uid = engine.store_data("Old text content", {"version": 1}, doc_id="custom-id-123")

# 2. Update (Overwrite)
# This updates the content and adds 'new' and 'content' to the index for this ID.
engine.store_data("New content", {"version": 2}, doc_id="custom-id-123")
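The delete-then-insert pattern for a clean update can be wrapped in a small helper. This sketch assumes each document stores its own ID in metadata (e.g. {"id": ...}) so that engine.delete can target it, as in the Delete section below; adjust to your schema.

```python
def clean_update(engine, doc_id, text, meta):
    """Clean update: remove old index keys, then re-insert the document.

    Assumes documents carry their own ID under the hypothetical "id"
    metadata key so metadata-based delete can find them.
    """
    engine.delete({"id": doc_id})      # drops storage and stale index keys
    meta = {**meta, "id": doc_id}      # keep the ID queryable after re-insert
    return engine.store_data(text, meta, doc_id=doc_id)
```

Usage: `clean_update(engine, "custom-id-123", "New content", {"version": 2})` replaces the document without leaving "old" and "text" as dangling index keys.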

4. Delete

Delete documents based on metadata queries. This cleans up the storage and the indexes.

# Delete all documents with category 'animals'
deleted_count = engine.delete({"category": "animals"})
print(f"Deleted {deleted_count} documents.")

# Delete a specific document by ID (if ID is stored in metadata)
# Assuming you stored {"id": "custom-id-123"} in metadata:
engine.delete({"id": "custom-id-123"})

Performance Notes

  • Indexing: Use store_data_batch for mass ingestion. It groups writes into a single atomic transaction, drastically reducing disk I/O.
  • Search: The engine is optimized for high-selectivity queries (where metadata or text filters reduce the candidate set to < 10,000 documents).
  • Cleanup: Always close the engine to release file locks.
    engine.close()
    

The following performance metrics were collected on a standard machine. The use of a memory-mapped index allows for fast search performance while keeping RAM usage low.

| Metric | Value |
| --- | --- |
| Number of documents | 1000 |
| Document size (chars) | 500 |
| Storage throughput | 22.63 docs/sec |
| Search throughput | 6089.11 queries/sec |

These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.

Download files

Source Distribution: full_text_sparse_engine-0.3.0.tar.gz (12.5 kB)

Built Distribution: full_text_sparse_engine-0.3.0-py3-none-any.whl (12.8 kB)
