Skip to main content

SQLite + sqlite-vec semantic search datastore

Project description

4lt7ab-grimoire

The Python library behind grimoire — a single-file semantic datastore backed by SQLite and sqlite-vec. Entries hold metadata; keyword (FTS5) and semantic (vec0) indexing are independent, opt-in operations against the same entry id.

For the standalone CLI, see 4lt7ab-grimoire-cli.

Install

uv add '4lt7ab-grimoire[fastembed]'

The fastembed extra pulls the bundled FastembedEmbedder (ONNX-based, no service required). Drop the extra and implement the Embedder protocol to bring your own — see Custom embedders.

Mental model

A grimoire is a single SQLite file with three tables:

  • entry — metadata: (id, group_key, group_ref, payload, context). No searchable text lives here.
  • entry_fts — FTS5 row holding keyword_text + threshold_rank for one entry.
  • entry_vec — vec0 row holding embedding + semantic_text + partition + threshold_distance for one entry.

Plus a meta table that records the embedder's model and dimension at create time. Reopening with a mismatched embedder raises GrimoireMismatch.

Indexing is decoupled from creation. You add() an entry first, then keyword() and/or embed() it to make it searchable. An entry can have a row in zero, one, or both of the index tables. Re-indexing is explicit — call keyword() or embed() again on the same id and the existing row is replaced. The entry row is untouched.

This means:

  • Searchable text can change after an entry is created without affecting its id, group, or payload.
  • An entry can carry only a payload (no FTS, no vec) and still be addressable by id or (group_key, group_ref).
  • The same entry can move semantic partitions or change its FTS text without losing its identity.

Quickstart

from grimoire import grimoire
from grimoire.data.entry import Entry, Filters
from grimoire.embed import FastembedEmbedder

with grimoire.open("grimoire.db", embedder=FastembedEmbedder()) as g:
    [entry] = g.add([
        Entry(
            id=None,
            group_key="creature",
            group_ref="phoenix-001",
            payload={"habitat": "volcano"},
            context="discovered in the southern volcanic chain",
        ),
    ])

    g.keyword([(entry.id, "phoenix fire-bird ashes")])
    g.embed([(entry.id, "A solar phoenix reborn from its own ashes at dawn")])

    for hit in g.semantic_search("creatures that come back from the dead"):
        print(hit.entry.id, hit.distance, hit.semantic_text)

    for hit in g.keyword_search("phoenix"):
        print(hit.entry.id, hit.score, hit.keyword_text)

Imports

The library's surface lives across a few modules. Two common patterns:

# Module-style: pulls in `grimoire.open` and `grimoire.peek`.
from grimoire import grimoire
g = grimoire.open("grimoire.db", embedder=...)
stats = grimoire.peek("grimoire.db")

# Direct: useful when only one helper is needed.
from grimoire.grimoire import open as open_grimoire
g = open_grimoire("grimoire.db", embedder=...)

Data types and embedders are imported from their own modules:

from grimoire.data.entry import Entry, Filters, KeywordHit, SemanticHit
from grimoire.embed import Embedder, FastembedEmbedder, NoOpEmbedder
from grimoire.errors import GrimoireError, GrimoireMismatch, GrimoireNotFound, SchemaVersionError
from grimoire.mount import Mount

Public API

File lifecycle

grimoire.open(path, *, embedder) -> Grimoire

Open a SQLite file at path. An empty (or freshly-touched) file gets the schema installed and the embedder lock written. An initialized file is validated against the supplied embedder; GrimoireMismatch is raised on a different model or dimension. Returns a Grimoire ready to use as a context manager.

grimoire.peek(path) -> Peek

Read metadata and counts from a grimoire file without committing to it for use. Loads sqlite-vec to read the vec partition counts but does not require an embedder. Raises GrimoireNotFound if the path doesn't exist or the file lacks an embedder lock.

Grimoire(conn, embedder)

Direct constructor over an open SQLite connection. grimoire.open() is the normal entry point — this is exposed for callers that need to manage the connection themselves.

Context manager

__enter__ returns self; __exit__ commits on a clean exit and rolls back on an unhandled exception. The idiomatic form is:

with grimoire.open(path, embedder=...) as g:
    ...

Writing entries

add(entries: list[Entry]) -> list[Entry]

Insert entries. id on the input is ignored — a fresh ULID is assigned to each row. Returns the inserted entries with their assigned ids. Raises ValueError on a (group_key, group_ref) collision with an existing row or within the batch itself.

The embedder is not invoked. To make entries searchable, call keyword() or embed() after add.

update(entries: list[Entry]) -> list[Entry]

Rewrite group_key, group_ref, payload, and context on existing rows, identified by id. Wholesale: every supplied field replaces the stored value, including with None. Returns the entries that matched a row (silently skips ids that didn't). Raises ValueError on a (group_key, group_ref) collision.

For partial updates, fetch the entry first and replace only the fields you intend to change.

remove(ids: list[str]) -> list[str]

Delete entries and cascade to their FTS and vec rows. Returns the ids that were actually removed.

Indexing

keyword(items, *, threshold_rank=None) -> list[Entry]

Index (or re-index) entries for FTS5 keyword search. items is a list of (entry_id, keyword_text) tuples. An existing FTS row on the same id is replaced. threshold_rank is stored on every row written by this call.

Raises ValueError for unknown ids or for empty/whitespace keyword text.

embed(items, *, partition=None, threshold_distance=None) -> list[Entry]

Embed (or re-embed) entries for semantic search. items is a list of (entry_id, semantic_text) tuples. Issues one embed_many call across the batch. An existing vec row on the same id is replaced — useful for moving an entry to a different partition or updating its source text. threshold_distance is stored on every row written by this call.

Raises ValueError for unknown ids or for empty/whitespace semantic text.

keyword_remove(ids: list[str]) -> list[str]

Drop FTS rows for the given ids. Entries themselves are not affected. Returns the ids that had FTS rows.

embed_remove(ids: list[str]) -> list[str]

Drop vec rows for the given ids. Entries themselves are not affected. Returns the ids that had vec rows.

Reading

fetch(filters=None, limit=100, cursor=None) -> list[Entry]

Walk entries ordered by id (i.e. chronologically, since ids are ULIDs). filters is a Filters instance restricting by sets of id, group_key, and/or group_ref. cursor, if given, returns entries with id > cursor — pass the last id of the previous page to walk forward.

saved = g.fetch(limit=100)
next_page = g.fetch(limit=100, cursor=saved[-1].id)

keyword_search(query, filters=None, limit=None) -> list[KeywordHit]

Run an FTS5 BM25 search against entry_fts. query is passed straight to FTS5 — phrases ("exact phrase"), prefix (fire*), boolean operators (phoenix OR wyrm NOT egg). Malformed queries surface as sqlite3.OperationalError. Filters apply on the joined entry row. Empty/whitespace queries raise ValueError.

Returns KeywordHits carrying the entry, the indexed keyword_text, the stored threshold_rank, and a positive score (higher = better).

semantic_search(query, partition=None, limit=10) -> list[SemanticHit]

Embed query via embedder.embed, then run vec0 KNN. Pass partition to narrow KNN to one partition; omit it (or pass None) to span every partition. Returns SemanticHits carrying the entry, the source semantic_text, the stored threshold_distance, and the distance (lower = better).

Data shapes

Entry

@dataclass(frozen=True, slots=True)
class Entry:
    id: str | None        # None on input to `add`; assigned by the library
    group_key: str | None
    group_ref: str | None
    payload: dict[str, Any] | None
    context: str | None = None

Filters

@dataclass(frozen=True, slots=True)
class Filters:
    id: list[str] | None = None
    group_key: list[str] | None = None
    group_ref: list[str] | None = None

Each list, when given, restricts to entries whose field matches one of the listed values. Missing/None means no filter on that field.

KeywordHit

@dataclass(frozen=True, slots=True)
class KeywordHit:
    entry: Entry
    keyword_text: str | None
    threshold_rank: float | None
    score: float        # -bm25, so higher = better and non-negative

SemanticHit

@dataclass(frozen=True, slots=True)
class SemanticHit:
    entry: Entry
    semantic_text: str | None
    threshold_distance: float | None
    distance: float     # vec0 distance, lower = better

Peek

@dataclass(frozen=True, slots=True)
class Peek:
    model: str
    dimension: int
    schema_version: int
    entry_count: int
    group_counts: dict[str | None, int]       # by entry.group_key
    partition_counts: dict[str | None, int]   # by entry_vec.partition

Mount

grimoire.mount.Mount is a lightweight dataclass that publishes the on-disk layout convention shared with the CLI:

<path>/grimoire.db          # default DB
<path>/<name>/grimoire.db   # a named DB
<path>/__models__/          # shared embedder cache
<path>/grimoire.toml        # registry file (reserved, currently inert)
from grimoire.mount import Mount, create, destroy

m = Mount(path=Path("/some/dir"))
create(m)                       # idempotent; creates directories and touches the default DB file
m.exists()                      # all of registry + models dir + default DB exist?
m.db_path(None)                 # default DB path
m.db_path("notes")              # named DB path; validates name (lowercase alnum, `-`, `_`, no `__` prefix)
destroy(m)                      # `rm -rf` the entire mount, no undo

Names are normalized to lowercase and must match [a-z0-9_-]+. Names beginning with __ are reserved for grimoire's internal directories (__models__).

Custom embedders

Embedder is a Protocol. Implement four members:

class MyEmbedder:
    @property
    def model(self) -> str: ...
    @property
    def dimension(self) -> int: ...
    def embed(self, text: str) -> list[float]: ...
    def embed_many(self, texts: list[str]) -> list[list[float]]: ...

embed handles single-record paths (semantic_search). embed_many handles bulk paths (embed(items=...)) and is expected to amortize tokenization, model dispatch, or device transfers across the batch.

The model and dimension are written into the file on first create and locked. Reopening with a different model or dimension raises GrimoireMismatch.

Bundled embedders

  • FastembedEmbedder(model="BAAI/bge-small-en-v1.5", *, cache_folder=None) — ONNX-based local inference via fastembed. Requires the fastembed extra.
  • NoOpEmbedder — produces zero vectors with model="noop", dimension=1. For grimoires used only for keyword search, payload storage, or structured browsing — semantic_search against a NoOp grimoire returns entries in arbitrary order with distance near zero. The contract is satisfied structurally, but the result has no ranking value.

Errors

All errors derive from GrimoireError:

Error Raised when
GrimoireMismatch The provided embedder's model or dimension disagrees with the file's lock.
GrimoireNotFound A path was expected to be a grimoire and isn't (missing file, or a SQLite file without an embedder lock).
SchemaVersionError The file's PRAGMA user_version doesn't match the library's SCHEMA_VERSION. Pre-v1, recreate the file.

Concurrency

grimoire.open opens its SQLite connection in WAL mode with busy_timeout defaulted to SQLite's standard. Reads coexist with one writer; sustained high-concurrency writes still serialize at the SQLite level. The connection is bound to its constructing thread per Python's stdlib default.

Schema notes

Pre-v1, schema changes are not migrated in place. The library checks PRAGMA user_version against its expected SCHEMA_VERSION on every open; mismatches raise SchemaVersionError. The intended response is to recreate the file. Migration ergonomics get designed once v1 is on the table.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

4lt7ab_grimoire-0.0.16.tar.gz (16.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

4lt7ab_grimoire-0.0.16-py3-none-any.whl (16.1 kB view details)

Uploaded Python 3

File details

Details for the file 4lt7ab_grimoire-0.0.16.tar.gz.

File metadata

  • Download URL: 4lt7ab_grimoire-0.0.16.tar.gz
  • Upload date:
  • Size: 16.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.8 {"installer":{"name":"uv","version":"0.11.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for 4lt7ab_grimoire-0.0.16.tar.gz
Algorithm Hash digest
SHA256 da894ce266ec7c5ab808d3ae1fd3b2cf910ddb1bea9efd789a626baf6a6e7ec7
MD5 8d3150c5a3506883354e9c71fc8f6e59
BLAKE2b-256 98d8a112b90f62bee0d05c0bd060f653bac3eade2bfc23398185a551fe94b6e3

See more details on using hashes here.

File details

Details for the file 4lt7ab_grimoire-0.0.16-py3-none-any.whl.

File metadata

  • Download URL: 4lt7ab_grimoire-0.0.16-py3-none-any.whl
  • Upload date:
  • Size: 16.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.8 {"installer":{"name":"uv","version":"0.11.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for 4lt7ab_grimoire-0.0.16-py3-none-any.whl
Algorithm Hash digest
SHA256 ab15d27348efa2b2d305d2f0f62c8f5fc8f30db68d141c49647bb75f53ff678e
MD5 7b7131bc2689075aacdbccdd3bd1eb5a
BLAKE2b-256 7124c7a0fb758a1117dcd42200f8aad9456079a087d76375dd2b7f9b86ef5ac3

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page