
Full-text search engine - Hybrid LMDB/Inverted Index

This project is a high-performance, persistent full-text search engine written in Python. It uses a hybrid architecture that combines an inverted index for sub-100ms candidate filtering with an on-the-fly BM25 vectorizer for relevance ranking.

Key Features

  • Sub-100ms Search: Optimized for real-time querying using a custom inverted index.
  • Bidirectional Tree Lookup: Finds documents using smart prefix expansion ("exon" -> "exonerated") and recursive reduction ("exonerações" -> "exonera").
  • Metadata Filtering: Powerful filtering with support for exact matches, ranges ($gt, $lt), and sets ($in).
  • Atomic Batching: High-throughput indexing (thousands of docs/sec) via atomic batch commits to LMDB.
  • Thread-Safe: Built on LMDB (Lightning Memory-Mapped Database), which provides safe concurrent access.
  • CRUD Support: Fully supports Create, Read, Update, and Delete operations.

Architecture

Unlike traditional vector search engines, which rely on large pre-computed dense matrices, this engine uses a two-step process (sketched in code after the list):

  1. Filtering (Inverted Index): A sharded inverted index maps tokens to document IDs. It uses a bidirectional strategy:
    • Forward Scan: Finds words starting with the query token (e.g., query "work" matches "working").
    • Backward Scan: If the forward scan finds no match, it progressively reduces the query token to find root words (e.g., query "working" matches "work").
  2. Ranking (On-the-fly BM25): Once candidates are filtered (by metadata and text), a lightweight IncrementalBM25 model is built instantly in memory for just those candidates to score and rank them.
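To make this concrete, here is a minimal in-memory sketch of both steps. It is only an illustration of the technique under simplified assumptions (a plain dict index, whitespace tokenization); the names docs, inv, lookup, and bm25_rank are hypothetical, not part of the engine's API.

import bisect
import math
from collections import Counter, defaultdict

# Toy corpus standing in for the persistent document store.
docs = {
    1: "she is working late",
    2: "hard work pays off",
    3: "the quick brown fox",
}

# Step 1 state: inverted index mapping token -> set of doc IDs.
inv = defaultdict(set)
for doc_id, text in docs.items():
    for tok in text.split():
        inv[tok].add(doc_id)
vocab = sorted(inv)  # sorted vocabulary enables prefix range scans

def lookup(tok):
    """Bidirectional lookup: forward prefix scan, then backward reduction."""
    hits = set()
    # Forward scan: every indexed token that starts with the query token.
    i = bisect.bisect_left(vocab, tok)
    while i < len(vocab) and vocab[i].startswith(tok):
        hits |= inv[vocab[i]]
        i += 1
    # Backward scan: strip trailing characters until some prefix matches.
    t = tok
    while not hits and len(t) > 2:
        t = t[:-1]
        j = bisect.bisect_left(vocab, t)
        while j < len(vocab) and vocab[j].startswith(t):
            hits |= inv[vocab[j]]
            j += 1
    return hits

# Step 2: BM25 built in memory over just the filtered candidates.
def bm25_rank(query_toks, cand_ids, k1=1.5, b=0.75):
    if not cand_ids:
        return []
    corpus = {i: docs[i].split() for i in cand_ids}
    n = len(corpus)
    avgdl = sum(len(toks) for toks in corpus.values()) / n
    df = Counter(tok for toks in corpus.values() for tok in set(toks))
    scores = {}
    for i, toks in corpus.items():
        tf = Counter(toks)
        score = 0.0
        for q in query_toks:
            if q in tf:
                idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
                score += idf * tf[q] * (k1 + 1) / (
                    tf[q] + k1 * (1 - b + b * len(toks) / avgdl))
        scores[i] = score
    return sorted(scores, key=scores.get, reverse=True)

cands = lookup("works")            # no prefix hit, so "works" reduces to "work"
print(cands)                       # {1, 2} via "work" and "working"
print(bm25_rank(["work"], cands))  # [2, 1]: the exact "work" match ranks first

Since LMDB keeps its keys in sorted order, the same kind of prefix range scan stays cheap in the persistent, sharded index.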

Usage

Initialization

from engine import SearchEngine
import os, shutil

# Define paths
paths = ["./db", "./db_meta", "./db_meta_idx", "./db_text_idx", "./matrix"]

# Cleanup for fresh start
for p in paths:
    if os.path.exists(p):
        shutil.rmtree(p)

# Initialize
engine = SearchEngine(
    storage_base_path=paths[0],
    metadata_storage_base_path=paths[1],
    metadata_index_storage_base_path=paths[2],
    text_index_storage_base_path=paths[3],
    matrix_path=paths[4]
)

1. Create (Insert)

You can insert single documents or batches. Batching is recommended for speed.

# Single Insert
doc_id = engine.store_data("The quick brown fox", {"category": "animals"})
print(f"Inserted document ID: {doc_id}")

# Batch Insert (Faster)
docs = ["Apple pie recipe", "Banana bread recipe"]
metas = [{"type": "food"}, {"type": "food"}]
doc_ids = engine.store_data_batch(docs, metas)
print(f"Inserted batch: {doc_ids}")

2. Read (Search)

Search combines text queries with metadata filters.

# Simple Text Search
results = engine.search("apple", {}) 
# Returns: [(doc_id, text, metadata), ...]

# Metadata Filter + Text
# Finds "fox" only in docs where category == 'animals'
results = engine.search("fox", {"category": "animals"})

# Advanced Range Query
# Find "report" where year >= 2022
results = engine.search("report", {"year": {"$gte": 2022}})

# Set Query
# Find "recipe" where type is 'food' or 'dessert'
results = engine.search("recipe", {"type": {"$in": ["food", "dessert"]}})

3. Update

To update a document, store it again with the same doc_id. Note that this engine treats updates as upserts: calling store_data with an existing ID overwrites the stored document and adds index keys for the new tokens, but the old index keys remain (still pointing to the valid ID). For a clean update that removes stale index keys, it is often safer to delete and then re-insert, as sketched below after the upsert example.

# 1. Store original
uid = engine.store_data("Old text content", {"version": 1}, doc_id="custom-id-123")

# 2. Update (Overwrite)
# This updates the content and adds 'new' and 'content' to the index for this ID.
engine.store_data("New content", {"version": 2}, doc_id="custom-id-123")

4. Delete

Delete documents that match a metadata query. This removes them from storage and cleans up the associated index entries.

# Delete all documents with category 'animals'
deleted_count = engine.delete({"category": "animals"})
print(f"Deleted {deleted_count} documents.")

# Delete a specific document by ID (if ID is stored in metadata)
# Assuming you stored {"id": "custom-id-123"} in metadata:
engine.delete({"id": "custom-id-123"})

Performance Notes

  • Indexing: Use store_data_batch for mass ingestion. It groups writes into atomic batch transactions, drastically reducing disk I/O (see the chunked loop below).
  • Search: The engine is optimized for high-selectivity queries (where metadata or text filters reduce the candidate set to < 10,000 documents).
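
For example, mass ingestion can feed the corpus through store_data_batch in fixed-size chunks. The chunk size of 1000 and the all_docs/all_metas lists below are arbitrary placeholders:

# Hypothetical chunked ingestion loop over the documented batch API.
all_docs = [f"document body {i}" for i in range(10_000)]
all_metas = [{"n": i} for i in range(10_000)]
CHUNK = 1000
for start in range(0, len(all_docs), CHUNK):
    engine.store_data_batch(
        all_docs[start:start + CHUNK],
        all_metas[start:start + CHUNK],
    )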

The following performance metrics were collected on a standard machine. The use of a memory-mapped index allows for fast search performance while keeping RAM usage low.

Metric                  Value
Number of documents     1000
Document size (chars)   500
Storage throughput      22.63 docs/sec
Search throughput       6089.11 queries/sec

These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.

Download files

Source Distribution

full_text_sparse_engine-0.4.0.tar.gz (10.1 kB, source)

Built Distribution

full_text_sparse_engine-0.4.0-py3-none-any.whl (10.2 kB, Python 3)

File details

Details for the file full_text_sparse_engine-0.4.0.tar.gz.

File metadata

  • Download URL: full_text_sparse_engine-0.4.0.tar.gz
  • Size: 10.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for full_text_sparse_engine-0.4.0.tar.gz

Algorithm    Hash digest
SHA256       051100c22ec64030b82906250c6608a81303f977fec10183d8a69d551d188b81
MD5          0ea3d0a98ba7e66538ea14355a13a327
BLAKE2b-256  dc7b35eaaad4d48cc38e478713cb362dee57f45e40bf17825dcabe3a67e96538


File details

Details for the file full_text_sparse_engine-0.4.0-py3-none-any.whl.

File hashes

Hashes for full_text_sparse_engine-0.4.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       b78f34e2bc4fb162d240a83e576ab093d19fcde2bc24d5b7f0c06eb0d00e6a52
MD5          a69e762531dacdbf53af9947478bf931
BLAKE2b-256  40b9ff7a12ee79b79ce3b75194edd41f2e5c7629d60970a4c5cd9e37f6d5e3f5

