SQLite + sqlite-vec semantic search datastore

4lt7ab-grimoire

The Python library behind grimoire — a single-file semantic datastore backed by SQLite and sqlite-vec. Entries hold metadata; keyword (FTS5) and semantic (vec0) indexing are independent, opt-in operations against the same entry id.

For the standalone CLI, see 4lt7ab-grimoire-cli.

Install

uv add '4lt7ab-grimoire[fastembed]'

The fastembed extra pulls in the dependencies for the bundled FastembedEmbedder (ONNX-based, no service required). To bring your own embedder, drop the extra and implement the Embedder protocol; see Custom embedders.

Mental model

A grimoire is a single SQLite file with three tables:

  • entry — metadata: (id, group_key, group_ref, payload, context). No searchable text lives here.
  • entry_fts — FTS5 row holding keyword_text + threshold_rank for one entry.
  • entry_vec — vec0 row holding embedding + semantic_text + partition + threshold_distance for one entry.

Plus a meta table that records the embedder's model and dimension at create time. Reopening with a mismatched embedder raises GrimoireMismatch.

Indexing is decoupled from creation. You add() an entry first, then keyword() and/or embed() it to make it searchable. An entry can have a row in zero, one, or both of the index tables. Re-indexing is explicit — call keyword() or embed() again on the same id and the existing row is replaced. The entry row is untouched.

This means:

  • Searchable text can change after an entry is created without affecting its id, group, or payload.
  • An entry can carry only a payload (no FTS, no vec) and still be addressable by id or (group_key, group_ref).
  • The same entry can move semantic partitions or change its FTS text without losing its identity.
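
The decoupling described above can be sketched with the standard library alone. This is a minimal illustration, not grimoire's actual schema: the entry_keyword table is a plain-table stand-in for the FTS5 index, and index_keyword is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id TEXT PRIMARY KEY, group_key TEXT, group_ref TEXT, payload TEXT);
    -- Assumption: a plain table standing in for the FTS5 index, one row per entry id.
    CREATE TABLE entry_keyword (entry_id TEXT PRIMARY KEY, keyword_text TEXT);
""")

# Creating the entry writes no searchable text.
conn.execute("INSERT INTO entry VALUES ('e1', 'creature', 'phoenix-001', '{}')")


def index_keyword(entry_id: str, text: str) -> None:
    """Hypothetical helper: index or re-index one entry, replacing any existing row."""
    conn.execute("INSERT OR REPLACE INTO entry_keyword VALUES (?, ?)", (entry_id, text))


index_keyword("e1", "phoenix fire-bird")
index_keyword("e1", "phoenix ashes rebirth")  # re-index the same id

entry_rows = conn.execute("SELECT * FROM entry").fetchall()
index_rows = conn.execute("SELECT * FROM entry_keyword").fetchall()
print(entry_rows)  # the entry row is unchanged
print(index_rows)  # only the latest keyword text remains
```

The entry row keeps its id, group, and payload across both indexing calls; only the index row is replaced.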

Quickstart

from grimoire import grimoire
from grimoire.data.entry import Entry
from grimoire.embed import FastembedEmbedder

with grimoire.open("grimoire.db", embedder=FastembedEmbedder()) as g:
    [entry] = g.add([
        Entry(
            id=None,
            group_key="creature",
            group_ref="phoenix-001",
            payload={"habitat": "volcano"},
            context="discovered in the southern volcanic chain",
        ),
    ])

    g.keyword([(entry.id, "phoenix fire-bird ashes")])
    g.embed([(entry.id, "A solar phoenix reborn from its own ashes at dawn")])

    for hit in g.semantic_search("creatures that come back from the dead"):
        print(hit.entry.id, hit.distance, hit.semantic_text)

    for hit in g.keyword_search("phoenix"):
        print(hit.entry.id, hit.score, hit.keyword_text)

Imports

The library's surface lives across a few modules. Two common patterns:

# Module-style: pulls in `grimoire.open` and `grimoire.peek`.
from grimoire import grimoire
g = grimoire.open("grimoire.db", embedder=...)
stats = grimoire.peek("grimoire.db")

# Direct: useful when only one helper is needed.
from grimoire.grimoire import open as open_grimoire
g = open_grimoire("grimoire.db", embedder=...)

Data types and embedders are imported from their own modules:

from grimoire.data.entry import Entry, Filters, KeywordHit, SemanticHit
from grimoire.embed import Embedder, FastembedEmbedder, NoOpEmbedder
from grimoire.errors import GrimoireError, GrimoireMismatch, GrimoireNotFound, SchemaVersionError
from grimoire.mount import Mount

Public API

File lifecycle

grimoire.open(path, *, embedder) -> Grimoire

Open a SQLite file at path. An empty (or freshly-touched) file gets the schema installed and the embedder lock written. An initialized file is validated against the supplied embedder; GrimoireMismatch is raised on a different model or dimension. Returns a Grimoire ready to use as a context manager.
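
As a rough illustration of the lock check described above, the sketch below writes the model and dimension on first use and compares them on later opens. The meta table layout, the check_lock helper, and the MismatchError class are assumptions made for this example; grimoire's real storage format and its GrimoireMismatch class may differ.

```python
import sqlite3


class MismatchError(Exception):
    """Local stand-in for GrimoireMismatch."""


def check_lock(conn: sqlite3.Connection, model: str, dimension: int) -> None:
    """Hypothetical helper: write the lock on an empty file, validate it otherwise."""
    conn.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
    stored = dict(conn.execute("SELECT key, value FROM meta"))
    if not stored:
        # First open: record the embedder identity.
        conn.executemany(
            "INSERT INTO meta VALUES (?, ?)",
            [("model", model), ("dimension", str(dimension))],
        )
        return
    if stored["model"] != model or int(stored["dimension"]) != dimension:
        raise MismatchError(
            f"file locked to {stored['model']}/{stored['dimension']}, "
            f"got {model}/{dimension}"
        )


conn = sqlite3.connect(":memory:")
check_lock(conn, "toy-model", 4)  # first open writes the lock
check_lock(conn, "toy-model", 4)  # same embedder passes
try:
    check_lock(conn, "other-model", 4)
    mismatch_raised = False
except MismatchError:
    mismatch_raised = True
```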

grimoire.peek(path) -> Peek

Read metadata and counts from a grimoire file without opening it for use. Loads sqlite-vec to read the vec partition counts but does not require an embedder. Raises GrimoireNotFound if the path doesn't exist or the file lacks an embedder lock.

Grimoire(conn, embedder)

Direct constructor over an open SQLite connection. grimoire.open() is the normal entry point — this is exposed for callers that need to manage the connection themselves.

Context manager

__enter__ returns self; __exit__ commits on a clean exit and rolls back on an unhandled exception. The idiomatic form is:

with grimoire.open(path, embedder=...) as g:
    ...
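
The commit-or-rollback behaviour can be reproduced with a small wrapper over a stdlib connection. This is a sketch of the pattern only; the Store class is hypothetical and is not grimoire's Grimoire class.

```python
import sqlite3


class Store:
    """Hypothetical wrapper showing commit on clean exit, rollback on exception."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def __enter__(self) -> "Store":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            self.conn.commit()
        else:
            self.conn.rollback()
        return False  # never swallow the exception


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")
conn.commit()

with Store(conn) as s:
    s.conn.execute("INSERT INTO t VALUES ('kept')")

try:
    with Store(conn) as s:
        s.conn.execute("INSERT INTO t VALUES ('discarded')")
        raise RuntimeError("boom")
except RuntimeError:
    pass

rows = [r[0] for r in conn.execute("SELECT v FROM t")]
print(rows)
```

Only the write from the block that exited cleanly survives.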

Writing entries

add(entries: list[Entry]) -> list[Entry]

Insert entries. id on the input is ignored — a fresh ULID is assigned to each row. Returns the inserted entries with their assigned ids. Raises ValueError on a (group_key, group_ref) collision with an existing row or within the batch itself.

The embedder is not invoked. To make entries searchable, call keyword() or embed() after add.

update(entries: list[Entry]) -> list[Entry]

Rewrite group_key, group_ref, payload, and context on existing rows, identified by id. Wholesale: every supplied field replaces the stored value, including with None. Returns the entries that matched a row (silently skips ids that didn't). Raises ValueError on a (group_key, group_ref) collision.

For partial updates, fetch the entry first and replace only the fields you intend to change.

remove(ids: list[str]) -> list[str]

Delete entries and cascade to their FTS and vec rows. Returns the ids that were actually removed.

Indexing

keyword(items, *, threshold_rank=None) -> list[Entry]

Index (or re-index) entries for FTS5 keyword search. items is a list of (entry_id, keyword_text) tuples. An existing FTS row on the same id is replaced. threshold_rank is stored on every row written by this call.

Raises ValueError for unknown ids or for empty/whitespace keyword text.

embed(items, *, partition=None, threshold_distance=None) -> list[Entry]

Embed (or re-embed) entries for semantic search. items is a list of (entry_id, semantic_text) tuples. Issues one embed_many call across the batch. An existing vec row on the same id is replaced — useful for moving an entry to a different partition or updating its source text. threshold_distance is stored on every row written by this call.

Raises ValueError for unknown ids or for empty/whitespace semantic text.

keyword_remove(ids: list[str]) -> list[str]

Drop FTS rows for the given ids. Entries themselves are not affected. Returns the ids that had FTS rows.

embed_remove(ids: list[str]) -> list[str]

Drop vec rows for the given ids. Entries themselves are not affected. Returns the ids that had vec rows.

Reading

fetch(filters=None, limit=100, cursor=None) -> list[Entry]

Walk entries ordered by id (i.e. chronologically, since ids are ULIDs). filters is a Filters instance restricting by sets of id, group_key, and/or group_ref. cursor, if given, returns entries with id > cursor — pass the last id of the previous page to walk forward.

saved = g.fetch(limit=100)
next_page = g.fetch(limit=100, cursor=saved[-1].id)

keyword_search(query, filters=None, limit=None) -> list[KeywordHit]

Run an FTS5 BM25 search against entry_fts. query is passed straight to FTS5, so phrases ("exact phrase"), prefix matches (fire*), and boolean operators (phoenix OR wyrm NOT egg) are all supported. Malformed queries surface as sqlite3.OperationalError. Filters apply on the joined entry row. Empty/whitespace queries raise ValueError.

Returns KeywordHits carrying the entry, the indexed keyword_text, the stored threshold_rank, and a positive score (higher = better).
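
To see why the score is the negated BM25 value, the sketch below runs a raw FTS5 query with the stdlib sqlite3 module. It assumes the local SQLite build includes FTS5, and the docs table is a throwaway example rather than grimoire's entry_fts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the SQLite library linked into Python was compiled with FTS5.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ("phoenix phoenix ashes",),
        ("phoenix egg nest",),
        ("wyrm cave hoard",),
        ("kraken deep sea",),
        ("griffin cliff nest",),
    ],
)

# FTS5's bm25() returns lower (more negative) values for better matches,
# so negating it gives a positive score where higher is better.
hits = conn.execute(
    "SELECT body, -bm25(docs) AS score FROM docs WHERE docs MATCH ? ORDER BY score DESC",
    ("phoenix",),
).fetchall()

for body, score in hits:
    print(f"{score:.4f}  {body}")
```

The row that mentions the term twice ranks above the row that mentions it once.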

semantic_search(query, partition=None, limit=10) -> list[SemanticHit]

Embed query via embedder.embed, then run vec0 KNN. Pass partition to narrow KNN to one partition; omit it (or pass None) to span every partition. Returns SemanticHits carrying the entry, the source semantic_text, the stored threshold_distance, and the distance (lower = better).

Data shapes

Entry

@dataclass(frozen=True, slots=True)
class Entry:
    id: str | None        # None on input to `add`; assigned by the library
    group_key: str | None
    group_ref: str | None
    payload: dict[str, Any] | None
    context: str | None = None

Filters

@dataclass(frozen=True, slots=True)
class Filters:
    id: list[str] | None = None
    group_key: list[str] | None = None
    group_ref: list[str] | None = None

Each list, when given, restricts to entries whose field matches one of the listed values. Missing/None means no filter on that field.

KeywordHit

@dataclass(frozen=True, slots=True)
class KeywordHit:
    entry: Entry
    keyword_text: str | None
    threshold_rank: float | None
    score: float        # -bm25, so higher = better and non-negative

SemanticHit

@dataclass(frozen=True, slots=True)
class SemanticHit:
    entry: Entry
    semantic_text: str | None
    threshold_distance: float | None
    distance: float     # vec0 distance, lower = better

Peek

@dataclass(frozen=True, slots=True)
class Peek:
    model: str
    dimension: int
    schema_version: int
    entry_count: int
    group_counts: dict[str | None, int]       # by entry.group_key
    partition_counts: dict[str | None, int]   # by entry_vec.partition

Mount

grimoire.mount.Mount is a lightweight dataclass that publishes the on-disk layout convention shared with the CLI:

<path>/grimoire.db          # default DB
<path>/<name>/grimoire.db   # a named DB
<path>/__models__/          # shared embedder cache
<path>/grimoire.toml        # registry file (reserved, currently inert)

from pathlib import Path

from grimoire.mount import Mount, create, destroy

m = Mount(path=Path("/some/dir"))
create(m)                       # idempotent; creates directories and touches the default DB file
m.exists()                      # all of registry + models dir + default DB exist?
m.db_path(None)                 # default DB path
m.db_path("notes")              # named DB path; validates name (lowercase alnum, `-`, `_`, no `__` prefix)
destroy(m)                      # `rm -rf` the entire mount, no undo

Names are normalized to lowercase and must match [a-z0-9_-]+. Names beginning with __ are reserved for grimoire's internal directories (__models__).

Custom embedders

Embedder is a Protocol. Implement four members:

class MyEmbedder:
    @property
    def model(self) -> str: ...
    @property
    def dimension(self) -> int: ...
    def embed(self, text: str) -> list[float]: ...
    def embed_many(self, texts: list[str]) -> list[list[float]]: ...

embed handles single-record paths (semantic_search). embed_many handles bulk paths (embed(items=...)) and is expected to amortize tokenization, model dispatch, or device transfers across the batch.

The model and dimension are written into the file on first create and locked. Reopening with a different model or dimension raises GrimoireMismatch.
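
As a concrete but deliberately toy example, the class below satisfies the four-member shape using a SHA-256 digest in place of a real model. The Protocol is redeclared locally so the sketch runs without grimoire installed; HashEmbedder and its model string are invented for illustration, and its vectors carry no semantic meaning.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):  # local mirror of the documented protocol
    @property
    def model(self) -> str: ...
    @property
    def dimension(self) -> int: ...
    def embed(self, text: str) -> list[float]: ...
    def embed_many(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Toy embedder: deterministic, dependency-free, not semantically useful."""

    def __init__(self, dimension: int = 8) -> None:
        if not 1 <= dimension <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dimension must be between 1 and 32")
        self._dimension = dimension

    @property
    def model(self) -> str:
        return f"toy-hash-{self._dimension}"

    @property
    def dimension(self) -> int:
        return self._dimension

    def embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i] / 255.0 for i in range(self._dimension)]

    def embed_many(self, texts: list[str]) -> list[list[float]]:
        # A real embedder would batch here; the toy version just loops.
        return [self.embed(t) for t in texts]


embedder = HashEmbedder()
print(isinstance(embedder, Embedder), embedder.model, embedder.dimension)
```

An instance like this could be passed as the embedder argument to grimoire.open, at which point its model and dimension would become the file's lock.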

Bundled embedders

  • FastembedEmbedder(model="BAAI/bge-small-en-v1.5", *, cache_folder=None): ONNX-based local inference via fastembed. Requires the fastembed extra.
  • NoOpEmbedder: produces zero vectors with model="noop", dimension=1. Intended for grimoires used only for keyword search, payload storage, or structured browsing. semantic_search against a NoOp grimoire returns entries in arbitrary order with distance near zero; the contract is satisfied structurally, but the result has no ranking value.

Errors

All errors derive from GrimoireError:

| Error | Raised when |
| --- | --- |
| GrimoireMismatch | The provided embedder's model or dimension disagrees with the file's lock. |
| GrimoireNotFound | A path was expected to be a grimoire and isn't (missing file, or a SQLite file without an embedder lock). |
| SchemaVersionError | The file's PRAGMA user_version doesn't match the library's SCHEMA_VERSION. Pre-v1, recreate the file. |

Concurrency

grimoire.open opens its SQLite connection in WAL mode, with busy_timeout left at SQLite's default. Readers can proceed alongside a single writer; sustained high-concurrency writes still serialize at the SQLite level. The connection is bound to the thread that created it, which is the stdlib sqlite3 default (check_same_thread=True).
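
For reference, this is how WAL mode is switched on and inspected with the stdlib sqlite3 module. It is a generic SQLite sketch on a temporary file, not a view into how grimoire.open configures its connection.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # WAL needs a file-backed database; in-memory databases cannot use it.
    conn = sqlite3.connect(os.path.join(tmp, "demo.db"))

    # The pragma returns the journal mode now in effect.
    journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

    # Current busy timeout in milliseconds for this connection.
    busy_timeout_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]

    conn.close()

print(journal_mode, busy_timeout_ms)
```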

Schema notes

Pre-v1, schema changes are not migrated in place. The library checks PRAGMA user_version against its expected SCHEMA_VERSION on every open; mismatches raise SchemaVersionError. The intended response is to recreate the file. Migration support will be designed as v1 approaches.
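
The version check itself is a one-line pragma read. The sketch below shows the pattern with a made-up SCHEMA_VERSION value and a local VersionError class standing in for SchemaVersionError; check_schema is a hypothetical helper.

```python
import sqlite3

SCHEMA_VERSION = 3  # hypothetical value, not grimoire's real schema version


class VersionError(Exception):
    """Local stand-in for SchemaVersionError."""


def check_schema(conn: sqlite3.Connection) -> None:
    """Compare the file's user_version with the version this code expects."""
    found = conn.execute("PRAGMA user_version").fetchone()[0]
    if found != SCHEMA_VERSION:
        raise VersionError(f"file is at schema {found}, expected {SCHEMA_VERSION}")


conn = sqlite3.connect(":memory:")
conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
check_schema(conn)  # matching version passes silently

conn.execute("PRAGMA user_version = 1")  # simulate a file from an older release
try:
    check_schema(conn)
    stale_rejected = False
except VersionError:
    stale_rejected = True
```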
