Search files with FTS5 and vector search, and get reranked results. Fast.
litesearch
NB: Reading this on GitHub? The formatted documentation is nicer.
litesearch puts full-text search + SIMD vector search in a single
SQLite database with automatic Reciprocal Rank Fusion (RRF)
reranking — no server, no new infra, no heavy dependencies.
| Module | What you get |
|---|---|
| `litesearch` (core) | `database()` · `get_store()` · `db.search()` · `rrf_merge()` · `vec_search()` |
| `litesearch.data` | PDF extraction · Python code chunking · FTS query preprocessing |
| `litesearch.utils` | ONNX text encoders (`FastEncode`) · `images_to_pdf` · `images_to_markdown` |
Install
# usearch SQLite extensions are configured automatically on first import
# (macOS needs one extra step — see litesearch.postfix)
uv add litesearch
Quick Start
Search your documents in a few lines of code:
from litesearch import *
from model2vec import StaticModel
import numpy as np
enc = StaticModel.from_pretrained("minishlab/potion-retrieval-32M") # fast static embeddings
db = database() # SQLite + usearch SIMD extensions loaded
store = db.get_store() # table with FTS5 index + embedding column
texts = ["attention is all you need",
"transformers replaced recurrent networks",
"gradient descent minimises the loss"]
embs = enc.encode(texts) # float32, shape (3, 512)
store.insert_all([dict(content=t, embedding=e.tobytes()) for t, e in zip(texts, embs)])
q = "self-attention mechanism"
db.search(q, enc.encode([q])[0].tobytes(), columns=['id','content'], dtype=np.float32)
[{'rowid': 1, 'id': 1, 'content': 'attention is all you need',
'_dist': 0.134, '_rrf_score': 0.0328},
{'rowid': 2, 'id': 2, 'content': 'transformers replaced recurrent networks',
'_dist': 0.264, '_rrf_score': 0.0161},
{'rowid': 3, 'id': 3, 'content': 'gradient descent minimises the loss',
'_dist': 0.482, '_rrf_score': 0.0161}]
`_rrf_score` is the fused rank score (higher = better). `_dist` is the cosine distance from the vector-search leg.
Core API
database() — SQLite + SIMD
`database()` returns a fastlite `Database` patched with usearch's SIMD distance functions. Pass a file path for persistence; omit it for an in-memory store.
from litesearch import *
import numpy as np
db = database() # ':memory:' by default; use database('my.db') for persistence
db.q('select sqlite_version() as sqlite_version')
[{'sqlite_version': '3.52.0'}]
The usearch extension adds SIMD-accelerated distance functions directly
into SQL. Four metrics are available: cosine, sqeuclidean, inner,
and divergence. All variants support f32, f16, f64, and i8
suffixes.
vecs = dict(
v1=np.ones((100,), dtype=np.float32).tobytes(), # ones
v2=np.zeros((100,), dtype=np.float32).tobytes(), # zeros
v3=np.full((100,), 0.25, dtype=np.float32).tobytes() # 0.25s (same direction as v1)
)
def dist_q(metric):
return db.q(f'''
select
distance_{metric}_f32(:v1,:v2) as {metric}_v1_v2,
distance_{metric}_f32(:v1,:v3) as {metric}_v1_v3,
distance_{metric}_f32(:v2,:v3) as {metric}_v2_v3
''', vecs)
for fn in ['sqeuclidean', 'divergence', 'inner', 'cosine']: print(dist_q(fn))
[{'sqeuclidean_v1_v2': 100.0, 'sqeuclidean_v1_v3': 56.25, 'sqeuclidean_v2_v3': 6.25}]
[{'divergence_v1_v2': 34.657352447509766, 'divergence_v1_v3': 12.046551704406738, 'divergence_v2_v3': 8.66433334350586}]
[{'inner_v1_v2': 1.0, 'inner_v1_v3': -24.0, 'inner_v2_v3': 1.0}]
[{'cosine_v1_v2': 1.0, 'cosine_v1_v3': 0.0, 'cosine_v2_v3': 1.0}]
Cosine distance between v1 (ones) and v3 (0.25s) is 0.0 — they point in the same direction. Both `inner` and `divergence` are also available for different retrieval trade-offs.
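The SQL results can be cross-checked in plain NumPy. A minimal sketch using the standard definitions of the two simplest metrics (stated here as assumptions about what the extension computes):

```python
import numpy as np

v1 = np.ones(100, dtype=np.float32)        # all ones
v3 = np.full(100, 0.25, dtype=np.float32)  # same direction as v1

# cosine distance: 1 - (a . b) / (|a| |b|)
cos_dist = 1 - v1 @ v3 / (np.linalg.norm(v1) * np.linalg.norm(v3))

# squared Euclidean distance: sum((a - b)^2)
sqeuclidean = np.sum((v1 - v3) ** 2)

print(round(float(cos_dist), 6))  # 0.0 — identical direction
print(float(sqeuclidean))         # 56.25 — matches the SQL output above
```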
get_store() — FTS5 + Embedding Table
db.get_store() creates (or opens) a table with a content TEXT
column, an embedding BLOB column, a JSON metadata column, and an
FTS5 full-text index that stays in sync automatically via triggers.
store = db.get_store() # idempotent — safe to call multiple times
store.schema
'CREATE TABLE [store] (\n [content] TEXT NOT NULL,\n [embedding] BLOB,\n [metadata] TEXT,\n [uploaded_at] FLOAT DEFAULT CURRENT_TIMESTAMP,\n [id] INTEGER PRIMARY KEY\n)'
Pass hash=True to use a content-addressed id (SHA-1 of the
content). Useful for code search and deduplication — re-inserting the
same content is a no-op:
code_store = db.get_store(name='code', hash=True)
code_store.insert_all([
dict(content='hello world', embedding=np.ones( (100,), dtype=np.float16).tobytes()),
dict(content='hi there', embedding=np.full( (100,), 0.5, dtype=np.float16).tobytes()),
dict(content='goodbye now', embedding=np.zeros((100,), dtype=np.float16).tobytes()),
], upsert=True, hash_id='id')
code_store(select='id,content')
[{'id': '250ce2bffa97ab21fa9ab2922d19993454a0cf28', 'content': 'hello world'},
{'id': 'c89f43361891bfab9290bcebf182fa5978f89700', 'content': 'hi there'},
{'id': '882293d5e5c3d3e04e8e0c4f7c01efba904d0932', 'content': 'goodbye now'}]
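For intuition, content-addressed deduplication can be sketched with `hashlib`. This is a hypothetical reconstruction: the exact bytes litesearch feeds to SHA-1 are not specified here, so these ids need not match the ones shown above, but the dedup property is the same.

```python
import hashlib

def content_id(text):
    # Hypothetical sketch: SHA-1 over the raw content bytes.
    # litesearch may hash additional fields before digesting.
    return hashlib.sha1(text.encode()).hexdigest()

# Identical content always maps to the same 40-hex-char id,
# so re-inserting the same row with upsert is a no-op.
assert content_id('hello world') == content_id('hello world')
print(content_id('hello world'))
```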
db.search() — Hybrid FTS + Vector with RRF
db.search() runs both an FTS5 keyword query and a vector
similarity search, then merges the ranked lists with Reciprocal Rank
Fusion. Documents that appear in both lists get a score boost — the
best of both worlds.
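The fusion step itself is small. A minimal sketch of standard Reciprocal Rank Fusion with the conventional k=60 (litesearch's exact constant and tie-breaking are assumptions, but the arithmetic below reproduces the 0.0328 score seen in the Quick Start output):

```python
def rrf(rank_lists, k=60):
    # Standard RRF: score(d) = sum over lists of 1 / (k + rank_of_d).
    scores = {}
    for ranked in rank_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

fts_leg = ['doc1', 'doc3']          # keyword-search ranking
vec_leg = ['doc1', 'doc2', 'doc3']  # vector-search ranking
merged = rrf([fts_leg, vec_leg])
print(merged)
# doc1 ranks first: 1/61 + 1/61 ≈ 0.0328, because it appears in both legs
```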
# Re-create a clean store for the search demo
db2 = database()
st2 = db2.get_store()
phrases = [
"attention mechanisms in neural networks",
"transformer architecture for sequence modelling",
"stochastic gradient descent and learning rate schedules",
"positional encoding and token embeddings",
"dropout regularisation reduces overfitting",
]
# use float32 vectors (matching dtype= below)
vecs2 = [np.random.default_rng(i).random(64, dtype=np.float32) for i in range(len(phrases))]
st2.insert_all([dict(content=p, embedding=v.tobytes()) for p, v in zip(phrases, vecs2)])
<Table store (content, embedding, metadata, uploaded_at, id)>
q2 = "attention"
q_vec = np.random.default_rng(42).random(64, dtype=np.float32).tobytes()
db2.search(q2, q_vec, columns=['id','content'], dtype=np.float32)
Pass rrf=False to see the raw FTS and vector legs separately — handy
for debugging relevance:
db2.search(q2, q_vec, columns=['id','content'], dtype=np.float32, rrf=False)
Tip — dtype matters. Always pass the same `dtype` used when encoding. `model2vec` and most ONNX models return `float32`; pass `dtype=np.float32`. The default is `float16` (matches `FastEncode`).
Tip — custom schemas. `get_store()` is a convenience. For custom schemas, call `db.t['my_table'].vec_search(emb, ...)` and `rrf_merge(fts_results, vec_results)` directly.
litesearch.data
Query Preprocessing
FTS5 is powerful, but raw natural-language queries often miss results.
litesearch.data ships helpers to transform queries before sending them
to FTS:
from litesearch.data import clean, add_wc, mk_wider, kw, pre
q = 'This is a sample query'
print('preprocessed q with defaults: `%s`' % pre(q))
print('keywords extracted: `%s`' % pre(q, wc=False, wide=False))
print('q with wild card: `%s`' % pre(q, extract_kw=False, wide=False, wc=True))
preprocessed q with defaults: `query* OR sample*`
keywords extracted: `query sample`
q with wild card: `This* is* a* sample* query*`
| Function | What it does |
|---|---|
| `clean(q)` | strips `*` and returns `None` for empty queries |
| `add_wc(q)` | appends `*` to each word for prefix matching |
| `mk_wider(q)` | joins words with `OR` for broader matching |
| `kw(q)` | extracts keywords via YAKE (removes stop-words) |
| `pre(q)` | applies all of the above in one call |
PDF Extraction
litesearch.data patches pdf_oxide.PdfDocument with bulk
page-extraction methods. All methods take optional st / end page
indices and return a fastcore L list:
| Method | Returns |
|---|---|
| `doc.pdf_texts(st, end)` | plain text per page |
| `doc.pdf_markdown(st, end)` | markdown with headings + tables detected |
| `doc.pdf_links(st, end)` | URI strings extracted from annotations |
| `doc.pdf_tables(st, end)` | structured rows / cells / bbox dicts |
| `doc.pdf_spans(st, end)` | text spans with font size, weight, bbox |
| `doc.pdf_images(st, end, output_dir)` | image metadata, or save to disk |
from litesearch.data import PdfDocument
doc = PdfDocument('pdfs/attention_is_all_you_need.pdf')
print(f'{doc.page_count()} pages, {len(doc.pdf_links())} links')
# plain text of page 1
doc.pdf_texts(0, 1)[0][:300]
15 pages, 44 links
'Abstract\nThe dominant sequence transduction models are based on complex recurrent...'
# markdown export — headings and tables are detected automatically
md = doc.pdf_markdown()
print(f'Page 1 (markdown):\n{md[0][:400]}')
Page 1 (markdown):
# arXiv:1706.03762v7 [cs.CL] 2 Aug 2023
Provided proper attribution is provided, Google hereby grants permission
to reproduce the tables and figures in this paper solely for use in
journalistic or scholarly works...
Code Ingestion
`pyparse` splits a Python file or string into top-level code chunks (functions, classes, assignments) with source-location metadata — ready to insert into a store:
from litesearch.data import pyparse
txt = """
from fastcore.all import *
a=1
class SomeClass:
def __init__(self,x): store_attr()
def method(self): return self.x + a
"""
pyparse(code=txt)
[{'content': 'a=1',
'metadata': {'path': None, 'uploaded_at': None, 'name': None, 'type': 'Assign', 'lineno': 3, 'end_lineno': 3}},
{'content': 'class SomeClass:\n def __init__(self,x): store_attr()\n def method(self): return self.x + a',
'metadata': {'path': None, 'uploaded_at': None, 'name': 'SomeClass', 'type': 'ClassDef', 'lineno': 4, 'end_lineno': 6}}]
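The core of this kind of chunking is a walk over the module's top-level AST nodes. A minimal stdlib-only sketch (the real `pyparse` records more metadata and handles imports and docstrings differently):

```python
import ast

def chunk_top_level(code):
    # Split source into top-level statements, keeping line spans,
    # roughly mirroring the chunk dicts shown above.
    tree = ast.parse(code)
    return [dict(content=ast.get_source_segment(code, node),
                 type=type(node).__name__,
                 lineno=node.lineno, end_lineno=node.end_lineno)
            for node in tree.body]

src = "a = 1\ndef f(x):\n    return x + a\n"
for c in chunk_top_level(src):
    print(c['type'], c['lineno'], c['end_lineno'])
# Assign 1 1
# FunctionDef 2 3
```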
`pkg2chunks` indexes an entire installed package in one call — great for building a semantic code-search store over your dependencies:
from litesearch.data import pkg2chunks
chunks = pkg2chunks('fastlite')
print(f'{len(chunks)} chunks from fastlite')
chunks.filter(lambda d: d['metadata']['type'] == 'FunctionDef')[0]
47 chunks from fastlite
{'content': 'def t(self:Database): return _TablesGetter(self)',
'metadata': {'path': '.../fastlite/core.py',
'name': 't', 'type': 'FunctionDef',
'lineno': 44, 'end_lineno': 44,
'package': 'fastlite', 'version': '0.2.4'}}
litesearch.utils
FastEncode — ONNX Text Encoder
`FastEncode` wraps any ONNX model from the Hugging Face Hub with an HF tokenizer. It
supports document and query prompt templates, automatic CUDA/CPU
provider selection, and multi-threaded inference. Two pre-configured
model dicts ship with litesearch:
| Config | Model | Embedding dim | Notes |
|---|---|---|---|
| `embedding_gemma` (default) | `onnx-community/embeddinggemma-300m-ONNX` | 768 | Strong retrieval, ~300M params |
| `modernbert` | `nomic-ai/modernbert-embed-base` | 768 | Modern BERT-style |
Pass dtype=np.float16 (default) to halve memory; use np.float32 for
full precision.
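The storage saving is easy to check with NumPy: a 768-dim vector costs 4 bytes per element in `float32` and 2 in `float16`.

```python
import numpy as np

emb = np.random.default_rng(0).random(768)
print(emb.astype(np.float32).nbytes)  # 3072 bytes per vector
print(emb.astype(np.float16).nbytes)  # 1536 bytes — half the storage
```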
from litesearch.utils import FastEncode, modernbert
# Downloads model on first run, cached thereafter
enc = FastEncode() # embedding_gemma by default
doc_embs = enc.encode_document(['Attention is all you need', 'Another document'])
q_emb = enc.encode_query(['what paper introduced transformers?'])
print('doc shape:', doc_embs.shape, 'dtype:', doc_embs.dtype)
print('q shape:', q_emb.shape)
# Switch to ModernBERT
modern = FastEncode(modernbert)
modern.encode_query(['search query here'])
doc shape: (2, 768) dtype: float16
q shape: (1, 768)
array([[-0.0503, -0.04352, -0.0171, ..., -0.04977, 0.01598, -0.0706]],
shape=(1, 768), dtype=float16)
Image Tools
`images_to_pdf` wraps a list of images (PIL, bytes, or file paths) into a conformant
multi-page PDF — useful for creating synthetic scanned documents for OCR
testing. images_to_markdown embeds them as base64 data URIs for LLM
input or notebook display.
from litesearch.utils import images_to_pdf
from PIL import Image
# Any mix of PIL Images, raw bytes, or file paths works
imgs = [Image.new('RGB', (200, 100), color=(73, 109, 137)),
Image.new('RGB', (200, 100), color=(255, 200, 50))]
pdf_bytes = images_to_pdf(imgs) # returns bytes
images_to_pdf(imgs, output='out.pdf') # or write to disk
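The data-URI embedding that `images_to_markdown` performs can be sketched with the stdlib. This is a hypothetical reconstruction showing only the core transformation; the real function also accepts PIL Images and file paths and may choose the MIME type per image:

```python
import base64

def img_to_md(png_bytes, alt='image'):
    # Encode raw PNG bytes as a base64 data URI inside a markdown image tag.
    b64 = base64.b64encode(png_bytes).decode()
    return f'![{alt}](data:image/png;base64,{b64})'

png_sig = b'\x89PNG\r\n\x1a\n'  # PNG file signature, stand-in for real image bytes
print(img_to_md(png_sig)[:48])
```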
Ideas for More Delight (Planned)
Things that would make litesearch even smoother to use:
| Idea | Why it helps |
|---|---|
| `Retriever` class — bundles encoder + db into `r.search(q)` | removes the manual encode → bytes → search boilerplate |
| `ingest(texts, encoder, store)` helper | one-liner for embed-and-insert loops |
| Auto dtype detection | `search()` could infer dtype from stored embedding size, removing the `dtype=np.float32` footgun |
| `from_pdf(path, encoder)` / `from_dir(dir, encoder)` | index a PDF or folder in one call |
| Rich / tabulate display for results | pretty-print search results in notebooks |
| Metadata filter sugar — `filters={'source': 'doc.pdf'}` | cleaner than writing raw SQL `where` strings |
| CLI — `litesearch index <dir>` / `litesearch search <q>` | quick ad-hoc search without writing Python |
Next Steps
- examples/01_simple_rag.ipynb — ingest a folder of PDFs, chunk with chonkie, rerank with FlashRank
- examples/02_tool_use.ipynb — wire litesearch into an LLM tool-use loop
- core docs — full API reference for `database`, `get_store`, `search`, `rrf_merge`, `vec_search`
- data docs — PDF methods, `pyparse`, `pkg2chunks`, query preprocessing
- utils docs — `FastEncode`, `download_model`, image tools
Acknowledgements
A big thank you to @yfedoseev for
pdf-oxide, which powers the
PDF extraction functionality in litesearch.data.