sqlitesearch

A tiny, SQLite-backed search library for small-scale projects with up to 100,000 documents. It provides text search, vector search, and hybrid search, all stored in a single .db file with zero infrastructure.

sqlitesearch is the persistent sibling of minsearch: it offers the same API but stores data on disk.

Installation

uv add sqlitesearch

Text Search

Text search uses SQLite's FTS5 (Full-Text Search) extension with BM25 ranking.
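
As a rough illustration of what FTS5 with BM25 ranking looks like underneath, the sketch below talks to SQLite directly through Python's standard sqlite3 module. The table name docs and the query are made up for the example; this is not sqlitesearch's actual schema, and it assumes your SQLite build includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table for illustration; sqlitesearch's real schema may differ.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, description)")
conn.executemany(
    "INSERT INTO docs (title, description) VALUES (?, ?)",
    [
        ("Python Tutorial", "Learn Python basics"),
        ("Java Guide", "Java programming guide"),
    ],
)

# bm25() assigns numerically lower scores to better matches,
# so ascending order puts the best match first.
rows = conn.execute(
    "SELECT title, bm25(docs) AS score FROM docs WHERE docs MATCH ? ORDER BY score",
    ("python OR programming",),
).fetchall()

for title, score in rows:
    print(title, score)
```

Because FTS5 scores are "lower is better", a library built on top of it would typically flip or rescale them before exposing a score to the caller.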

Basic Usage

from sqlitesearch import TextSearchIndex

# Create an index
index = TextSearchIndex(
    text_fields=["title", "description"],
    keyword_fields=["category"],
    db_path="search.db"
)

# Index documents in bulk
documents = [
    {"id": 1, "title": "Python Tutorial", "description": "Learn Python basics", "category": "tutorial"},
    {"id": 2, "title": "Java Guide", "description": "Java programming guide", "category": "guide"},
]
index.fit(documents)

# Or add one at a time
index.add({"id": 3, "title": "Advanced Python", "description": "Deep dive into Python", "category": "tutorial"})

# Search
results = index.search("python programming")
for result in results:
    print(result["title"], result["score"])

Filtering

# Filter by keyword fields
results = index.search("python", filter_dict={"category": "tutorial"})

# Filter by numeric range
results = index.search("python", filter_dict={"price": [('>=', 50), ('<', 200)]})

# Exact numeric match
results = index.search("python", filter_dict={"price": 100})

# Filter by date range
from datetime import date
results = index.search("python", filter_dict={
    "created_at": [('>=', date(2024, 1, 1)), ('<', date(2024, 12, 31))]
})
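
To make the filter_dict shapes above concrete, here is a minimal sketch of how such a dictionary could be translated into a parameterized SQL WHERE clause. The build_where helper and its operator whitelist are assumptions made for illustration; the source does not describe how sqlitesearch implements filtering internally.

```python
from datetime import date

# Hypothetical whitelist; the operators sqlitesearch accepts may differ.
ALLOWED_OPS = {"=", "!=", ">", ">=", "<", "<="}

def build_where(filter_dict):
    """Turn a filter_dict into (sql_fragment, params) for a parameterized query."""
    clauses, params = [], []
    for field, condition in filter_dict.items():
        if not field.isidentifier():
            raise ValueError(f"unsafe field name: {field!r}")
        # A list of (operator, value) tuples is a range; anything else is an exact match.
        conditions = condition if isinstance(condition, list) else [("=", condition)]
        for op, value in conditions:
            if op not in ALLOWED_OPS:
                raise ValueError(f"unsupported operator: {op!r}")
            if isinstance(value, date):
                value = value.isoformat()  # ISO strings compare correctly as text
            clauses.append(f'"{field}" {op} ?')
            params.append(value)
    return " AND ".join(clauses), params

where, params = build_where({
    "category": "tutorial",
    "price": [(">=", 50), ("<", 200)],
})
print(where)
print(params)
```

Values always travel as bound parameters rather than being formatted into the SQL string, which keeps user-supplied filter values from being interpreted as SQL.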

Field Boosting

# Boost title matches higher than description
results = index.search("python", boost_dict={"title": 2.0, "description": 1.0})
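
One way per-field boosting can be expressed in plain FTS5 is through the optional column weights of the bm25() function. The sketch below assumes this mechanism purely for illustration (the source does not say how boost_dict is applied), uses a made-up docs table, and requires an SQLite build with FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, description)")
conn.executemany(
    "INSERT INTO docs (title, description) VALUES (?, ?)",
    [
        ("Overview", "general python"),   # match in description only
        ("Python", "general overview"),   # match in title only
    ],
)

# Weights follow column order: title counts 10x, description 1x.
# Lower bm25() values are better, so ascending order ranks the title match first.
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs, 10.0, 1.0)",
    ("python",),
).fetchall()
titles = [title for (title,) in rows]
print(titles)
```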

Tokenizer & Stemming

sqlitesearch uses a Tokenizer class for query processing (same interface as minsearch.Tokenizer). By default, English stop words are removed.

from sqlitesearch import TextSearchIndex, Tokenizer

# Built-in Porter stemming: "running" matches "run", "courses" matches "course"
index = TextSearchIndex(
    text_fields=["title", "description"],
    stemming=True,  # disabled by default to match minsearch behavior
    db_path="search.db"
)

# Custom tokenizer: no stop words
index = TextSearchIndex(
    text_fields=["title", "description"],
    tokenizer=Tokenizer(),
    db_path="search.db"
)

# Custom tokenizer: custom stop words + custom stemmer (any callable(str) -> str)
from minsearch.stemmers import porter_stemmer  # pip install minsearch

index = TextSearchIndex(
    text_fields=["title", "description"],
    tokenizer=Tokenizer(stop_words={"custom", "words"}, stemmer=porter_stemmer),
    db_path="search.db"
)
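
The tokenizer pipeline described above (split, drop stop words, optionally stem) can be sketched in a few lines. SketchTokenizer and strip_plural are hypothetical names invented for this example; they are not the sqlitesearch or minsearch classes, and the toy stemmer is far cruder than Porter stemming.

```python
import re

class SketchTokenizer:
    """Illustrative stand-in for a tokenizer with stop words and a stemmer."""

    def __init__(self, stop_words=None, stemmer=None):
        self.stop_words = set(stop_words or ())
        self.stemmer = stemmer  # any callable(str) -> str

    def tokenize(self, text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        tokens = [t for t in tokens if t not in self.stop_words]
        if self.stemmer is not None:
            tokens = [self.stemmer(t) for t in tokens]
        return tokens

def strip_plural(token):
    """Toy stemmer (not Porter): drop a single trailing 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

tokenizer = SketchTokenizer(stop_words={"the", "a"}, stemmer=strip_plural)
tokens = tokenizer.tokenize("The courses cover a Python basics")
print(tokens)
```

For matching to work, the same tokenizer has to be applied to the indexed text and to the query.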

Custom ID Field

index = TextSearchIndex(
    text_fields=["title", "description"],
    id_field="doc_id",
    db_path="search.db"
)

results = index.search("python", output_ids=True)
# Results will include 'id' field with the doc_id value

Vector Search

Vector search supports three modes for approximate nearest neighbor search, all followed by exact cosine similarity reranking:

| Mode | Best for | How it works |
| --- | --- | --- |
| LSH (default) | Up to 100K vectors | Random hyperplane projections + bucket lookup |
| IVF | 10K-500K vectors | K-means clustering + nearest-cluster probe |
| HNSW | 10K-1M+ vectors | Hierarchical proximity graph traversal |

LSH (default)

Each vector is hashed into one bucket per table using random hyperplane projections. At query time, LSH looks up the buckets that match the query's hash to collect candidates, then reranks them by exact cosine similarity. With n_probe > 0 (multi-probe), it also checks neighboring buckets whose hashes differ by 1 or 2 bits. This substantially improves recall, because a similar vector that landed in an adjacent bucket (one projection fell on the other side of a hyperplane) is still found.

import numpy as np
from sqlitesearch import VectorSearchIndex

index = VectorSearchIndex(
    keyword_fields=["category"],
    n_tables=8,      # Number of hash tables (more = better recall)
    hash_size=16,    # Bits per hash (more = better precision)
    n_probe=2,       # Multi-probe bit flips (0-2, higher = better recall)
    db_path="vectors.db"
)

vectors = np.random.rand(100, 384)
documents = [{"category": "test"} for _ in range(100)]
index.fit(vectors, documents)

query = np.random.rand(384)
results = index.search(query)
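
The bucket lookup and multi-probe steps described in this section can be sketched in plain Python. This is a simplified illustration under assumed names (make_planes, hash_vector, probe_keys, search), with in-memory dictionaries standing in for database tables; it is not the library's implementation, which the source says reranks with NumPy.

```python
import math
import random
from itertools import combinations

def make_planes(dim, hash_size, rng):
    """One random hyperplane (normal vector) per hash bit."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hash_size)]

def hash_vector(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (dot >= 0)
    return bits

def probe_keys(key, hash_size, n_probe):
    """The key itself plus every key reachable by flipping up to n_probe bits."""
    keys = [key]
    for flips in range(1, n_probe + 1):
        for positions in combinations(range(hash_size), flips):
            flipped = key
            for pos in positions:
                flipped ^= 1 << pos
            keys.append(flipped)
    return keys

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

rng = random.Random(0)
dim, hash_size, n_tables = 16, 8, 4
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(500)]

# Build: each table has its own hyperplanes and maps bucket key -> vector ids.
tables = []
for _ in range(n_tables):
    planes = make_planes(dim, hash_size, rng)
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(hash_vector(vec, planes), []).append(idx)
    tables.append((planes, buckets))

def search(query, n_probe=2, top_k=5):
    """Collect candidates from probed buckets, then rerank by exact cosine similarity."""
    candidates = set()
    for planes, buckets in tables:
        for key in probe_keys(hash_vector(query, planes), hash_size, n_probe):
            candidates.update(buckets.get(key, []))
    return sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)[:top_k]

results = search(vectors[7])
print(results)
```

With hash_size bits, probing up to 2 flips visits 1 + hash_size + hash_size*(hash_size-1)/2 buckets per table, which is why larger n_probe values trade speed for recall.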

IVF (Inverted File Index)

Clusters vectors using k-means, then searches only the nearest clusters at query time. Good balance of build speed and recall.

index = VectorSearchIndex(
    mode="ivf",
    n_clusters=None,        # Auto-scales (sqrt(n), capped at 256)
    n_probe_clusters=8,     # Clusters to search (more = better recall, slower)
    db_path="vectors.db"
)
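
The cluster-then-probe idea behind IVF can be sketched as follows. This is a toy version under stated assumptions: it uses Euclidean distance and a basic k-means in plain Python for readability, whereas the library reranks by cosine similarity; the function names are invented for the example.

```python
import math
import random

def nearest(point, centroids):
    """Centroid indexes ordered from closest to farthest."""
    return sorted(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)[0]].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids

def build_ivf(points, k):
    """Return centroids and, per centroid, the list of point ids assigned to it."""
    centroids = kmeans(points, k)
    lists = [[] for _ in range(k)]
    for idx, p in enumerate(points):
        lists[nearest(p, centroids)[0]].append(idx)
    return centroids, lists

def ivf_search(query, points, centroids, lists, n_probe_clusters, top_k):
    """Scan only the n_probe_clusters closest clusters, then rank candidates exactly."""
    probed = nearest(query, centroids)[:n_probe_clusters]
    candidates = [i for c in probed for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(query, points[i]))
    return candidates[:top_k]

rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(200)]
centroids, lists = build_ivf(points, k=8)

approx = ivf_search(points[0], points, centroids, lists, n_probe_clusters=2, top_k=5)
exact = ivf_search(points[0], points, centroids, lists, n_probe_clusters=8, top_k=5)
print(approx, exact)
```

Probing every cluster degenerates into an exhaustive scan, so n_probe_clusters is the knob that trades recall for speed.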

HNSW (Hierarchical Navigable Small World)

Builds a multi-layer proximity graph. Highest recall and fastest search, but slower to build.

index = VectorSearchIndex(
    mode="hnsw",
    m=16,                   # Max connections per node (more = better recall)
    ef_construction=200,    # Build-time beam width (more = better graph)
    ef_search=50,           # Search-time beam width (more = better recall)
    db_path="vectors.db"
)
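
To give a feel for what the search-time beam width controls, here is a heavily simplified sketch of best-first search over a single graph layer. Real HNSW adds multiple layers and a graph construction procedure, both omitted here; the beam_search function, the hand-built line graph, and the use of Euclidean distance are all assumptions for illustration.

```python
import heapq
import math

def beam_search(graph, vectors, query, entry, ef, k):
    """Best-first search over one graph layer, keeping the ef closest nodes seen."""
    start = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(start, entry)]   # min-heap: closest unexpanded node first
    best = [(-start, entry)]        # max-heap via negation: worst kept node on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break  # every remaining candidate is farther than the worst kept node
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(query, vectors[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the farthest kept node
    ranked = sorted((-neg_d, n) for neg_d, n in best)
    return [n for _, n in ranked[:k]]

# Toy data: 20 points on a line, each linked to its immediate neighbors.
vectors = [(float(i),) for i in range(20)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 20] for i in range(20)}

found = beam_search(graph, vectors, query=(13.2,), entry=0, ef=3, k=2)
print(found)
```

A wider beam (larger ef) keeps more fallback candidates alive, which helps the search escape local dead ends at the cost of more distance computations.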

Filtering works the same as in text search; see the Filtering section.

Hybrid Search

Text and vector indexes can share the same database file, enabling hybrid search.

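import numpy as np
from sqlitesearch import TextSearchIndex, VectorSearchIndex

# Both indexes point at the same database file
# (this snippet assumes they were already populated with fit())
text_index = TextSearchIndex(text_fields=["title", "description"], db_path="hybrid.db")
vector_index = VectorSearchIndex(db_path="hybrid.db")

# Placeholder query vector; in practice, embed the query text
# with the same model used for the indexed documents
query_vector = np.random.rand(384)

text_results = text_index.search("python tutorial")
vector_results = vector_index.search(query_vector)

# Combine and deduplicate results based on your ranking strategy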
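
The README leaves the combination strategy open. One common choice is reciprocal rank fusion (RRF), sketched below. It is not something sqlitesearch is stated to provide; the function name, the k=60 constant, and the assumption that each result is a dict with an "id" key are all illustrative.

```python
def reciprocal_rank_fusion(result_lists, k=60, id_key="id"):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores, docs = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            doc_id = doc[id_key]
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
            docs.setdefault(doc_id, doc)  # keep the first copy, which deduplicates
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [{**docs[doc_id], "rrf_score": scores[doc_id]} for doc_id in ranked]

# Stand-in results; real ones would come from text_index.search(...) and
# vector_index.search(...), assuming they carry an "id" key.
text_results = [{"id": 1}, {"id": 2}, {"id": 3}]
vector_results = [{"id": 3}, {"id": 1}, {"id": 4}]

fused = reciprocal_rank_fusion([text_results, vector_results])
print([doc["id"] for doc in fused])
```

RRF only looks at ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.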

Index Management

Both index types automatically persist to disk. To reopen an existing index, create it with the same db_path; it is ready to search immediately. Use index.clear() to remove all documents.

When to Use

sqlitesearch is ideal when you want:

  • Zero infrastructure (no external services)
  • Data persistence across restarts
  • Real search functionality for pet projects, demos, or prototypes
  • Simple deployment (just a Python file and a .db file)

| Use case | Recommendation |
| --- | --- |
| In-memory / experiments | minsearch (e.g., in notebooks) |
| Local projects, up to 100K docs | sqlitesearch |
| Production / high traffic / 1M+ | Elasticsearch, Qdrant, Milvus, etc. |

Benchmarks

We benchmarked sqlitesearch on Simple English Wikipedia (291K articles) for text search and the Cohere-1M dataset (768d vectors) for vector search.

| Metric | 1K | 10K | 100K |
| --- | --- | --- | --- |
| Text search QPS | 970 | 604 | 179 |
| Text search latency | 1ms | 2ms | 6ms |
| Vector search QPS | 333 | 39 | 6 |
| Vector search latency | 3ms | 26ms | 181ms |
| Vector recall@100 | 0.65 | 0.97 | 0.89 |

Vector search uses multi-probe LSH (n_probe=2) with in-memory vector cache for reranking. At 100K, recall (0.89) is competitive with cloud vector databases like ElasticCloud (0.90). For higher recall, use n_tables=16 (0.95 recall). See benchmark/WRITEUP.md for full results, recall tuning, and VDBBench leaderboard comparison.
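
For readers unfamiliar with the recall@100 metric, the sketch below shows the usual definition: the fraction of the true top-k neighbors that the approximate search also returned in its top k. The recall_at_k helper and the sample ID lists are invented for illustration and are not taken from the benchmark code.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Share of the exact top-k neighbors that appear in the retrieved top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        raise ValueError("ground_truth must not be empty")
    hits = len(truth.intersection(retrieved[:k]))
    return hits / len(truth)

exact_ids = [1, 2, 3, 4]    # from a brute-force search
approx_ids = [1, 3, 9, 4]   # from an approximate index

recall = recall_at_k(approx_ids, exact_ids, k=4)
print(recall)
```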

Architecture

Everything lives in a single SQLite database file. Text search uses FTS5 with BM25 ranking. Vector search retrieves candidates with an approximate index (by default Locality-Sensitive Hashing with random projections; IVF and HNSW are optional modes), then reranks them by exact cosine similarity via NumPy. There is no separate server process and no network communication: SQLite runs inside your Python process, reading and writing directly to the file.
