# Full Text Search Engine - Hybrid LMDB/Inverted Index
This project is a high-performance, persistent full-text search engine written in Python. It uses a hybrid architecture combining an inverted index for sub-100ms candidate filtering with an on-the-fly BM25 vectorizer for relevance ranking.
## Key Features
- Sub-100ms Search: Optimized for real-time querying via a custom inverted index.
- Bidirectional Tree Lookup: Finds documents using smart prefix expansion ("exon" -> "exonerated") and recursive reduction ("exonerações" -> "exonera").
- Metadata Filtering: Powerful filtering with support for exact matches, ranges (`$gt`, `$lt`), and sets (`$in`).
- Atomic Batching: High-throughput indexing via atomic batch commits to LMDB.
- Thread-Safe: Built on LMDB (Lightning Memory-Mapped Database).
- CRUD Support: Full Create, Read, Update, and Delete operations.
## Architecture
Unlike traditional vector search engines that rely on large, pre-computed dense matrices, this engine uses a two-step process:
- Filtering (Inverted Index): A sharded inverted index maps tokens to document IDs. It uses a bidirectional strategy:
  - Forward Scan: Finds words starting with the query token (e.g., query "work" matches "working").
  - Backward Scan: If no exact match is found, the query token is progressively reduced to find root words (e.g., query "working" matches "work").
- Ranking (On-the-fly BM25): Once candidates are filtered (by metadata and text), a lightweight `IncrementalBM25` model is built instantly in memory for just those candidates to score and rank them.
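The two-step process can be sketched in plain Python. This is a toy in-memory model, not the engine's code: the real index is sharded across LMDB, and the `lookup`/`bm25_rank` helpers and sample documents here are illustrative assumptions.

```python
import bisect
import math
from collections import Counter

# Toy corpus and in-memory index: token -> set of doc IDs.
docs = {1: "exonerated official", 2: "exonera decree", 3: "hard work", 4: "working on work work"}
index: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for tok in text.split():
        index.setdefault(tok, set()).add(doc_id)
sorted_tokens = sorted(index)

def lookup(query: str) -> set[int]:
    """Bidirectional lookup: forward prefix scan, then backward reduction."""
    # Forward scan: collect every indexed token that starts with the query.
    i = bisect.bisect_left(sorted_tokens, query)
    hits: set[int] = set()
    while i < len(sorted_tokens) and sorted_tokens[i].startswith(query):
        hits |= index[sorted_tokens[i]]
        i += 1
    if hits:
        return hits
    # Backward scan: strip trailing characters until a root token matches.
    root = query[:-1]
    while root:
        if root in index:
            return set(index[root])
        root = root[:-1]
    return set()

def bm25_rank(query_tokens: list[str], candidates: list[int], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Build a throwaway BM25 model over just the candidate docs and rank them."""
    toks = {d: docs[d].split() for d in candidates}
    avgdl = sum(len(t) for t in toks.values()) / len(candidates)
    n = len(candidates)
    scores = {}
    for d, t in toks.items():
        tf = Counter(t)
        score = 0.0
        for q in query_tokens:
            df = sum(1 for u in toks.values() if q in u)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
        scores[d] = score
    return sorted(candidates, key=scores.get, reverse=True)
```

With this sketch, `lookup("exon")` expands forward to both "exonera" and "exonerated", while `lookup("workings")` falls back to the backward scan and reduces to "working"; the filtered candidates are then ranked only among themselves.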
## Usage

### Initialization
```python
from engine import SearchEngine
import os, shutil

# Define paths
paths = ["./db", "./db_meta", "./db_meta_idx", "./db_text_idx", "./matrix"]

# Cleanup for a fresh start
for p in paths:
    if os.path.exists(p):
        shutil.rmtree(p)

# Initialize
engine = SearchEngine(
    storage_base_path=paths[0],
    metadata_storage_base_path=paths[1],
    metadata_index_storage_base_path=paths[2],
    text_index_storage_base_path=paths[3],
    matrix_path=paths[4],
)
```
### 1. Create (Insert)
You can insert single documents or batches. Batching is recommended for speed.
```python
# Single insert
doc_id = engine.store_data("The quick brown fox", {"category": "animals"})
print(f"Inserted document ID: {doc_id}")

# Batch insert (faster)
docs = ["Apple pie recipe", "Banana bread recipe"]
metas = [{"type": "food"}, {"type": "food"}]
doc_ids = engine.store_data_batch(docs, metas)
print(f"Inserted batch: {doc_ids}")
```
### 2. Read (Search)
Search combines text queries with metadata filters.
```python
# Simple text search
results = engine.search("apple", {})
# Returns: [(doc_id, text, metadata), ...]

# Metadata filter + text: finds "fox" only in docs where category == 'animals'
results = engine.search("fox", {"category": "animals"})

# Range query: find "report" where year >= 2022
results = engine.search("report", {"year": {"$gte": 2022}})

# Set query: find "recipe" where type is 'food' or 'dessert'
results = engine.search("recipe", {"type": {"$in": ["food", "dessert"]}})
```
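For illustration, the filter semantics used above can be modeled with a small predicate. This `matches` helper is a hypothetical sketch of the assumed operator behaviour, not the engine's actual implementation:

```python
def matches(meta: dict, query: dict) -> bool:
    """Sketch of the filter semantics: exact match, $gt/$gte/$lt ranges, $in sets."""
    ops = {
        "$gt": lambda v, ref: v > ref,
        "$gte": lambda v, ref: v >= ref,
        "$lt": lambda v, ref: v < ref,
        "$in": lambda v, ref: v in ref,
    }
    for field, cond in query.items():
        value = meta.get(field)
        if isinstance(cond, dict):  # operator form, e.g. {"$gte": 2022}
            if value is None or not all(ops[op](value, ref) for op, ref in cond.items()):
                return False
        elif value != cond:  # a plain value means exact match
            return False
    return True
```

Under these semantics, `{"year": {"$gte": 2022}}` accepts `{"year": 2023}` and rejects `{"year": 2020}`, and `{"type": {"$in": ["food", "dessert"]}}` accepts any document whose `type` is in the set.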
### 3. Update
To update a document, store it again with the same doc_id; the engine treats updates as upserts. Note that `store_data` with an existing ID overwrites the stored document and adds new index keys, but the old index keys are not removed (they continue to point at the still-valid ID). For a clean update that clears stale keys, delete the document and then re-insert it.
```python
# 1. Store original
uid = engine.store_data("Old text content", {"version": 1}, doc_id="custom-id-123")

# 2. Update (overwrite): stores the new content and adds 'new' and 'content'
# to the index for this ID.
engine.store_data("New content", {"version": 2}, doc_id="custom-id-123")
```
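The upsert caveat is easy to see with a toy dict-based inverted index (a deliberate simplification; the real index lives in LMDB, and `index_doc`/`deindex_doc` are hypothetical helpers):

```python
# Toy inverted index: token -> set of doc IDs.
index: dict[str, set[str]] = {}

def index_doc(doc_id: str, text: str) -> None:
    for tok in text.lower().split():
        index.setdefault(tok, set()).add(doc_id)

def deindex_doc(doc_id: str, text: str) -> None:
    for tok in text.lower().split():
        index.get(tok, set()).discard(doc_id)

# Upsert path: new keys are added, but the old ones linger.
index_doc("custom-id-123", "old text content")
index_doc("custom-id-123", "new content")
assert "custom-id-123" in index["old"]  # stale key still resolves to the ID

# Clean update path: deindex the old text first, then index the new text.
index.clear()
index_doc("custom-id-123", "old text content")
deindex_doc("custom-id-123", "old text content")
index_doc("custom-id-123", "new content")
assert "custom-id-123" not in index.get("old", set())
assert "custom-id-123" in index["content"]
```

Deindexing before re-inserting matters: tokens shared by the old and new text (here, "content") would otherwise be removed after the new document was indexed.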
### 4. Delete
Delete documents based on metadata queries. This cleans up the storage and the indexes.
```python
# Delete all documents with category 'animals'
deleted_count = engine.delete({"category": "animals"})
print(f"Deleted {deleted_count} documents.")

# Delete a specific document by ID (works if the ID is stored in metadata).
# Assuming you stored {"id": "custom-id-123"} in the metadata:
engine.delete({"id": "custom-id-123"})
```
## Performance Notes
- Indexing: Use `store_data_batch` for mass ingestion. It groups writes into atomic transactions to drastically reduce disk I/O.
- Search: The engine is optimized for high-selectivity queries (where metadata or text filters reduce the candidate set to fewer than ~10,000 documents).
- Cleanup: Always close the engine to release file locks.
```python
engine.close()
```
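The batching advantage can be illustrated with `sqlite3` standing in for LMDB (an analogy, not the engine's code): grouping many inserts into a single transaction means one commit, and hence one disk sync, instead of one per document.

```python
import sqlite3

# sqlite3 plays the role of LMDB here; the principle is the same:
# one atomic commit per batch instead of one per document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

rows = [(f"document {i}",) for i in range(10_000)]

# Slow pattern (for contrast): commit after every insert.
#   for (body,) in rows:
#       conn.execute("INSERT INTO docs (body) VALUES (?)", (body,))
#       conn.commit()

# Fast pattern: one transaction for the whole batch.
with conn:  # the connection context manager commits on success
    conn.executemany("INSERT INTO docs (body) VALUES (?)", rows)

count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
```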
The following performance metrics were collected on a standard machine. The use of a memory-mapped index allows for fast search performance while keeping RAM usage low.
| Metric | Value |
|---|---|
| Number of documents | 1000 |
| Document size (chars) | 500 |
| Storage throughput | 22.63 docs/sec |
| Search throughput | 6089.11 queries/sec |
These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.
File details
Details for the file full_text_sparse_engine-0.3.0.tar.gz.
File metadata
- Download URL: full_text_sparse_engine-0.3.0.tar.gz
- Upload date:
- Size: 12.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `2851fb123da9c46f0db44bbc576c8c8afc5405c22c471b32ff8392dbb44fb759` |
| MD5 | `e33354bea00938f70996e6d9d5f924f3` |
| BLAKE2b-256 | `cd1dcd6b48a98cf067d59a02cdf6493b9db693ff315ad1ae6e21bb7e81cb9710` |
File details
Details for the file full_text_sparse_engine-0.3.0-py3-none-any.whl.
File metadata
- Download URL: full_text_sparse_engine-0.3.0-py3-none-any.whl
- Upload date:
- Size: 12.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `df6e90a861496a91877600bd1a271f346f36be46726508193903f4945ce22107` |
| MD5 | `0ef0f0cbc53d0cc3342a50a337820c28` |
| BLAKE2b-256 | `67e1b59ddda1de23e45ecc06924832f97e78ee68b63148eb1783838483144d34` |