Search your files with FTS5 and vector search, and get reranked results. Fast.

Project description

litesearch

NB Reading this on GitHub? The formatted documentation is nicer.

litesearch puts full-text search + SIMD vector search in a single SQLite database with automatic Reciprocal Rank Fusion (RRF) reranking — no server, no new infra, no heavy dependencies.

Module What you get
litesearch (core) database() · get_store() · db.search() · rrf_merge() · vec_search()
litesearch.data PDF extraction · Python code chunking · FTS query preprocessing
litesearch.utils ONNX text encoders (FastEncode) · images_to_pdf · images_to_markdown

Install

# usearch SQLite extensions are configured automatically on first import
# (macOS needs one extra step — see litesearch.postfix)
!uv add litesearch

Quick Start

Search your documents in a few lines of code:

from litesearch import *
from model2vec import StaticModel
import numpy as np

enc   = StaticModel.from_pretrained("minishlab/potion-retrieval-32M")  # fast static embeddings
db    = database()          # SQLite + usearch SIMD extensions loaded
store = db.get_store()      # table with FTS5 index + embedding column

texts = ["attention is all you need",
         "transformers replaced recurrent networks",
         "gradient descent minimises the loss"]
embs  = enc.encode(texts)   # float32, shape (3, 512)
store.insert_all([dict(content=t, embedding=e.tobytes()) for t, e in zip(texts, embs)])

q = "self-attention mechanism"
db.search(q, enc.encode([q])[0].tobytes(), columns=['id','content'], dtype=np.float32)
[{'rowid': 1, 'id': 1, 'content': 'attention is all you need',
  '_dist': 0.134, '_rrf_score': 0.0328},
 {'rowid': 2, 'id': 2, 'content': 'transformers replaced recurrent networks',
  '_dist': 0.264, '_rrf_score': 0.0161},
 {'rowid': 3, 'id': 3, 'content': 'gradient descent minimises the loss',
  '_dist': 0.482, '_rrf_score': 0.0161}]

_rrf_score is the fused rank score (higher = better). _dist is the cosine distance from the vector search leg.

Core API

database() — SQLite + SIMD

database() returns a fastlite Database patched with usearch’s SIMD distance functions. Pass a file path for persistence; omit it for an in-memory store.

from litesearch import *
import numpy as np

db = database()   # ':memory:' by default; use database('my.db') for persistence
db.q('select sqlite_version() as sqlite_version')
[{'sqlite_version': '3.52.0'}]

The usearch extension adds SIMD-accelerated distance functions directly into SQL. Four metrics are available: cosine, sqeuclidean, inner, and divergence. All variants support f32, f16, f64, and i8 suffixes.

vecs = dict(
    v1=np.ones((100,),  dtype=np.float32).tobytes(),   # ones
    v2=np.zeros((100,), dtype=np.float32).tobytes(),   # zeros
    v3=np.full((100,), 0.25, dtype=np.float32).tobytes()  # 0.25s (same direction as v1)
)
def dist_q(metric):
    return db.q(f'''
        select
            distance_{metric}_f32(:v1,:v2) as {metric}_v1_v2,
            distance_{metric}_f32(:v1,:v3) as {metric}_v1_v3,
            distance_{metric}_f32(:v2,:v3) as {metric}_v2_v3
    ''', vecs)

for fn in ['sqeuclidean', 'divergence', 'inner', 'cosine']: print(dist_q(fn))
[{'sqeuclidean_v1_v2': 100.0, 'sqeuclidean_v1_v3': 56.25, 'sqeuclidean_v2_v3': 6.25}]
[{'divergence_v1_v2': 34.657352447509766, 'divergence_v1_v3': 12.046551704406738, 'divergence_v2_v3': 8.66433334350586}]
[{'inner_v1_v2': 1.0, 'inner_v1_v3': -24.0, 'inner_v2_v3': 1.0}]
[{'cosine_v1_v2': 1.0, 'cosine_v1_v3': 0.0, 'cosine_v2_v3': 1.0}]

Cosine distance between v1 (ones) and v3 (0.25s) is 0.0 — they point in the same direction. Both inner and divergence are also available for different retrieval trade-offs.
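The same-direction result is easy to verify in plain NumPy, without the SQL extension. A minimal sketch of cosine distance (the helper name `cosine_distance` is ours, not part of litesearch):

```python
import numpy as np

def cosine_distance(a, b):
    "Cosine distance: 1 - (a . b) / (|a| |b|)."
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.ones(100, dtype=np.float32)        # ones
v3 = np.full(100, 0.25, dtype=np.float32)  # scaled copy of v1, same direction
print(cosine_distance(v1, v3))  # → 0.0
```

Because v3 is just v1 scaled by 0.25, the angle between them is zero, so the cosine distance is exactly 0.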

get_store() — FTS5 + Embedding Table

db.get_store() creates (or opens) a table with a content TEXT column, an embedding BLOB column, a JSON metadata column, and an FTS5 full-text index that stays in sync automatically via triggers.

store = db.get_store()   # idempotent — safe to call multiple times
store.schema
'CREATE TABLE [store] (\n   [content] TEXT NOT NULL,\n   [embedding] BLOB,\n   [metadata] TEXT,\n   [uploaded_at] FLOAT DEFAULT CURRENT_TIMESTAMP,\n   [id] INTEGER PRIMARY KEY\n)'
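The trigger-based sync between the base table and its FTS5 index can be sketched with the stdlib `sqlite3` module. This is an illustration of the technique, not litesearch's actual DDL (table, index, and trigger names here are invented), and it assumes your SQLite build ships FTS5:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript('''
CREATE TABLE store (content TEXT NOT NULL, embedding BLOB, metadata TEXT,
                    id INTEGER PRIMARY KEY);
-- external-content FTS5 index over store.content
CREATE VIRTUAL TABLE store_fts USING fts5(content, content=store, content_rowid=id);
-- triggers keep the FTS index in sync automatically
CREATE TRIGGER store_ai AFTER INSERT ON store BEGIN
  INSERT INTO store_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER store_ad AFTER DELETE ON store BEGIN
  INSERT INTO store_fts(store_fts, rowid, content) VALUES ('delete', old.id, old.content);
END;
''')
con.execute("INSERT INTO store(content) VALUES ('attention is all you need')")
print(con.execute("SELECT rowid FROM store_fts WHERE store_fts MATCH 'attention'").fetchall())
# → [(1,)]
```

With this wiring, inserting into `store` makes the row findable via `MATCH` with no extra bookkeeping, which is what "stays in sync automatically via triggers" means in practice.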

Pass hash=True to use a content-addressed id (SHA-1 of the content). Useful for code search and deduplication — re-inserting the same content is a no-op:

code_store = db.get_store(name='code', hash=True)
code_store.insert_all([
    dict(content='hello world',  embedding=np.ones( (100,), dtype=np.float16).tobytes()),
    dict(content='hi there',     embedding=np.full( (100,), 0.5, dtype=np.float16).tobytes()),
    dict(content='goodbye now',  embedding=np.zeros((100,), dtype=np.float16).tobytes()),
], upsert=True, hash_id='id')
code_store(select='id,content')
[{'id': '250ce2bffa97ab21fa9ab2922d19993454a0cf28', 'content': 'hello world'},
 {'id': 'c89f43361891bfab9290bcebf182fa5978f89700', 'content': 'hi there'},
 {'id': '882293d5e5c3d3e04e8e0c4f7c01efba904d0932', 'content': 'goodbye now'}]
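The content-addressing idea can be sketched with `hashlib`. Note this is only the concept: the exact bytes litesearch feeds into SHA-1 may differ, so don't expect ids identical to the ones above:

```python
import hashlib

def content_id(content):
    "Content-addressed id: SHA-1 of the UTF-8 text (sketch; litesearch's exact input may differ)."
    return hashlib.sha1(content.encode()).hexdigest()

# identical content always maps to the same id, so re-inserting is a natural no-op
print(content_id('hello world') == content_id('hello world'))  # → True
print(content_id('hello world') == content_id('hi there'))     # → False
```

This is why `hash=True` plus `upsert=True` gives deduplication for free: a second insert of the same content targets the same primary key.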

db.search() — Hybrid FTS + Vector with RRF

db.search() runs both an FTS5 keyword query and a vector similarity search, then merges the ranked lists with Reciprocal Rank Fusion. Documents that appear in both lists get a score boost — the best of both worlds.
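Reciprocal Rank Fusion itself is tiny: each ranked list contributes 1/(k + rank) per document, with k = 60 by convention, and documents present in several lists accumulate score. A minimal sketch of the technique, independent of litesearch's actual `rrf_merge` signature:

```python
def rrf(rankings, k=60):
    "Fuse ranked lists of doc ids: each list adds 1/(k + rank) to a doc's score."
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

fts_leg = ['d1', 'd2', 'd3']  # keyword ranking
vec_leg = ['d2', 'd4', 'd1']  # vector ranking
print([doc for doc, score in rrf([fts_leg, vec_leg])])  # → ['d2', 'd1', 'd4', 'd3']
```

`d2` and `d1` appear in both legs, so they outrank `d4` and `d3`, which each appear in only one. That is the "score boost" for documents matched by both keyword and vector search.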

# Re-create a clean store for the search demo
db2  = database()
st2  = db2.get_store()

phrases = [
    "attention mechanisms in neural networks",
    "transformer architecture for sequence modelling",
    "stochastic gradient descent and learning rate schedules",
    "positional encoding and token embeddings",
    "dropout regularisation reduces overfitting",
]
# use float32 vectors (matching dtype= below)
vecs2 = [np.random.default_rng(i).random(64, dtype=np.float32) for i in range(len(phrases))]
st2.insert_all([dict(content=p, embedding=v.tobytes()) for p, v in zip(phrases, vecs2)])
<Table store (content, embedding, metadata, uploaded_at, id)>
q2    = "attention"
q_vec = np.random.default_rng(42).random(64, dtype=np.float32).tobytes()
db2.search(q2, q_vec, columns=['id','content'], dtype=np.float32)

Pass rrf=False to see the raw FTS and vector legs separately — handy for debugging relevance:

db2.search(q2, q_vec, columns=['id','content'], dtype=np.float32, rrf=False)

Tip — dtype matters. Always pass the same dtype used when encoding. model2vec and most ONNX models return float32; pass dtype=np.float32. The default is float16 (matches FastEncode).
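The footgun exists because `tobytes()` does not store the dtype: `np.frombuffer` with the wrong dtype silently reinterprets the same buffer as a different number of values. A quick demonstration:

```python
import numpy as np

emb  = np.random.default_rng(0).random(64).astype(np.float32)
blob = emb.tobytes()                        # 256 bytes; dtype is not recorded

ok  = np.frombuffer(blob, dtype=np.float32)  # correct: the original 64 values
bad = np.frombuffer(blob, dtype=np.float16)  # wrong: 128 meaningless half-floats
print(ok.shape, bad.shape)  # → (64,) (128,)
```

Distances computed against the misread buffer are garbage but raise no error, which is why matching the encoding dtype matters.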

Tip — custom schemas. get_store() is a convenience. For custom schemas, call db.t['my_table'].vec_search(emb, ...) and rrf_merge(fts_results, vec_results) directly.

litesearch.data

Query Preprocessing

FTS5 is powerful, but raw natural-language queries often miss results. litesearch.data ships helpers to transform queries before sending them to FTS:

from litesearch.data import clean, add_wc, mk_wider, kw, pre

q = 'This is a sample query'
print('preprocessed q with defaults: `%s`' % pre(q))
print('keywords extracted: `%s`'          % pre(q, wc=False, wide=False))
print('q with wild card: `%s`'            % pre(q, extract_kw=False, wide=False, wc=True))
preprocessed q with defaults: `query* OR sample*`
keywords extracted: `query sample`
q with wild card: `This* is* a* sample* query*`
Function What it does
clean(q) strips * and returns None for empty queries
add_wc(q) appends * to each word for prefix matching
mk_wider(q) joins words with OR for broader matching
kw(q) extracts keywords via YAKE (removes stop-words)
pre(q) applies all of the above in one call
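The wildcard and widening steps are simple string transforms. A sketch approximating `add_wc` and `mk_wider` (litesearch's real implementations may handle edge cases like punctuation or existing `*` differently):

```python
def add_wc(q):
    "Append * to each word for FTS5 prefix matching."
    return ' '.join(w + '*' for w in q.split())

def mk_wider(q):
    "Join words with OR so any single term can match."
    return ' OR '.join(q.split())

print(add_wc('sample query'))            # → sample* query*
print(mk_wider(add_wc('sample query')))  # → sample* OR query*
```

Composing them as `pre` does turns a strict implicit-AND query into a forgiving prefix-OR query, which is usually what you want for natural-language input.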

PDF Extraction

litesearch.data patches pdf_oxide.PdfDocument with bulk page-extraction methods. All methods take optional st / end page indices and return a fastcore L list:

Method Returns
doc.pdf_texts(st, end) plain text per page
doc.pdf_markdown(st, end) markdown with headings + tables detected
doc.pdf_links(st, end) URI strings extracted from annotations
doc.pdf_tables(st, end) structured rows / cells / bbox dicts
doc.pdf_spans(st, end) text spans with font size, weight, bbox
doc.pdf_images(st, end, output_dir) image metadata, or save to disk
from litesearch.data import PdfDocument

doc = PdfDocument('pdfs/attention_is_all_you_need.pdf')
print(f'{doc.page_count()} pages, {len(doc.pdf_links())} links')

# plain text of page 1
doc.pdf_texts(0, 1)[0][:300]
15 pages, 44 links
'Abstract\nThe dominant sequence transduction models are based on complex recurrent...'
# markdown export — headings and tables are detected automatically
md = doc.pdf_markdown()
print(f'Page 1 (markdown):\n{md[0][:400]}')
Page 1 (markdown):
# arXiv:1706.03762v7  [cs.CL]  2 Aug 2023

Provided proper attribution is provided, Google hereby grants permission
to reproduce the tables and figures in this paper solely for use in
journalistic or scholarly works...

Code Ingestion

pyparse splits a Python file or string into top-level code chunks (functions, classes, assignments) with source location metadata — ready to insert into a store:

from litesearch.data import pyparse

txt = """
from fastcore.all import *
a=1
class SomeClass:
    def __init__(self,x): store_attr()
    def method(self): return self.x + a
"""
pyparse(code=txt)
[{'content': 'a=1',
  'metadata': {'path': None, 'uploaded_at': None, 'name': None, 'type': 'Assign', 'lineno': 3, 'end_lineno': 3}},
 {'content': 'class SomeClass:\n    def __init__(self,x): store_attr()\n    def method(self): return self.x + a',
  'metadata': {'path': None, 'uploaded_at': None, 'name': 'SomeClass', 'type': 'ClassDef', 'lineno': 4, 'end_lineno': 6}}]
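A top-level chunker like this can be sketched with the stdlib `ast` module; the dict keys below are illustrative, not pyparse's exact metadata schema:

```python
import ast

def py_chunks(code):
    "Split Python source into top-level defs, classes, and assignments with line metadata."
    out = []
    for node in ast.parse(code).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Assign)):
            out.append({'content': ast.get_source_segment(code, node),
                        'type': type(node).__name__,
                        'lineno': node.lineno, 'end_lineno': node.end_lineno})
    return out

src = "a=1\ndef f(x): return x+a\n"
for c in py_chunks(src): print(c['type'], c['lineno'], repr(c['content']))
```

Walking only `ast.parse(code).body` is what restricts chunks to top level, and skipping `ast.Import` nodes mirrors how pyparse's output above omits the `from fastcore.all import *` line.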

pkg2chunks indexes an entire installed package in one call — great for building a semantic code-search store over your dependencies:

from litesearch.data import pkg2chunks

chunks = pkg2chunks('fastlite')
print(f'{len(chunks)} chunks from fastlite')
chunks.filter(lambda d: d['metadata']['type'] == 'FunctionDef')[0]
47 chunks from fastlite
{'content': 'def t(self:Database): return _TablesGetter(self)',
 'metadata': {'path': '.../fastlite/core.py',
              'name': 't', 'type': 'FunctionDef',
              'lineno': 44, 'end_lineno': 44,
              'package': 'fastlite', 'version': '0.2.4'}}

litesearch.utils

FastEncode — ONNX Text Encoder

FastEncode wraps any ONNX model from HuggingFace Hub with a HF tokenizer. It supports document and query prompt templates, automatic CUDA/CPU provider selection, and multi-threaded inference. Two pre-configured model dicts ship with litesearch:

Config Model Embedding dim Notes
embedding_gemma (default) onnx-community/embeddinggemma-300m-ONNX 768 Strong retrieval, ~300M params
modernbert nomic-ai/modernbert-embed-base 768 Modern BERT-style

Pass dtype=np.float16 (default) to halve memory; use np.float32 for full precision.

from litesearch.utils import FastEncode, modernbert

# Downloads model on first run, cached thereafter
enc = FastEncode()   # embedding_gemma by default

doc_embs = enc.encode_document(['Attention is all you need', 'Another document'])
q_emb    = enc.encode_query(['what paper introduced transformers?'])
print('doc shape:', doc_embs.shape, 'dtype:', doc_embs.dtype)
print('q   shape:', q_emb.shape)

# Switch to ModernBERT
modern = FastEncode(modernbert)
modern.encode_query(['search query here'])
doc shape: (2, 768) dtype: float16
q   shape: (1, 768)
array([[-0.0503, -0.04352, -0.0171, ..., -0.04977, 0.01598, -0.0706]],
      shape=(1, 768), dtype=float16)

Image Tools

images_to_pdf wraps a list of images (PIL, bytes, or file paths) into a conformant multi-page PDF — useful for creating synthetic scanned documents for OCR testing. images_to_markdown embeds them as base64 data URIs for LLM input or notebook display.

from litesearch.utils import images_to_pdf
from PIL import Image

# Any mix of PIL Images, raw bytes, or file paths works
imgs = [Image.new('RGB', (200, 100), color=(73, 109, 137)),
        Image.new('RGB', (200, 100), color=(255, 200, 50))]

pdf_bytes = images_to_pdf(imgs)                 # returns bytes
images_to_pdf(imgs, output='out.pdf')           # or write to disk

Ideas for More Delight (Planned)

Things that would make litesearch even smoother to use:

Idea Why it helps
Retriever class — bundles encoder + db into r.search(q) removes the manual encode → bytes → search boilerplate
ingest(texts, encoder, store) helper one-liner for embed-and-insert loops
Auto dtype detection search() could infer dtype from stored embedding size, removing the dtype=np.float32 footgun
from_pdf(path, encoder) / from_dir(dir, encoder) index a PDF or folder in one call
Rich / tabulate display for results pretty-print search results in notebooks
Metadata filter sugar filters={'source': 'doc.pdf'} cleaner than writing raw SQL where strings
CLI litesearch index <dir> / litesearch search <q> quick ad-hoc search without writing Python

Acknowledgements

A big thank you to @yfedoseev for pdf-oxide, which powers the PDF extraction functionality in litesearch.data.

Download files

Download the file for your platform.

Source Distribution

litesearch-0.0.13.tar.gz (22.2 kB)

Built Distribution

litesearch-0.0.13-py3-none-any.whl (24.2 kB)

File details

Details for the file litesearch-0.0.13.tar.gz.

File metadata

  • Download URL: litesearch-0.0.13.tar.gz
  • Size: 22.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for litesearch-0.0.13.tar.gz
Algorithm Hash digest
SHA256 3aa204990084ee0a5ad5bf109d45ede53d6b1c9fa44aefc703af0a7f8559f0e2
MD5 003886a70e6bd91c38a9f83590c07c77
BLAKE2b-256 b7c787a1117d1a6697db42c6e0927673d3a90a86f170dafc41d371ec1794a017


File details

Details for the file litesearch-0.0.13-py3-none-any.whl.

File metadata

  • Download URL: litesearch-0.0.13-py3-none-any.whl
  • Size: 24.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for litesearch-0.0.13-py3-none-any.whl
Algorithm Hash digest
SHA256 6afc6085f9bfe1950118e01e8b6a9c336557de16e1e8998e5cf65cfd388816f8
MD5 f0021c02241e1955f42dc79bc2c930d6
BLAKE2b-256 06697eeec05e09ddc1986834e20af3c091279a4d141e0f01bef7bd63332c0320

