Skip to main content

The CockroachDB Vector Embedding Toolkit is a developer-focused set of tools for generating, storing, and querying vector embeddings directly within existing CockroachDB tables. It enables prototyping similarity search and AI-backed workflows without the need for a separate vector database or complex architectural changes.

Project description

CockroachDB Vector Embedding Toolkit

This repository provides a developer-focused toolkit for adding and experimenting with vector embeddings directly inside existing CockroachDB tables. It shows how to generate, store, index, and query embeddings alongside relational data—without introducing a separate vector database or re-architecting your system. The tools are designed to let engineers and data practitioners prototype similarity search and vector-backed workflows on real application data using pluggable embedding models.

These tools show how CRDB supports vector-backed workflows inside an existing system without architectural sprawl.

vectorize.py

At the center of the toolkit is the vectorize.py script. It exposes a simple CLI with three subcommands that map directly to the core functions of the toolkit: embedding data, inspecting models, and running similarity search.

[!NOTE] This toolkit is intended for experimentation and prototyping. It favors simplicity and transparency over production-grade automation.

instrument

The instrument sub-command automates "instrumenting" the specified column to enable semantic search on this column's content.

  1. Creates the vector column. The dimensionality is derived from the selected embedding model (see the model subcommand below).
  2. Creates a vector index on this new column, as well as other auxiliary indexes that accelerate embedding generation and vector searches.
  3. Wires a trigger that resets the vector column to NULL when the source column is updated. This flags the row for the embedding generation process so the embedding will be regenerated.
$ python3 vectorize.py instrument -u postgresql://<user>:<pass>@<dbhost>:26257/<database>?sslmode=verify-full -t passage -i passage -o passage_vector -m hf_st_all_minilm_l6 -v

Instrumentation is defined with the following scope and rules:

  1. Instrumentation targets a specific source column within a specific table.
  2. A single source column may have multiple vector “shadow” columns (for example, when using different embedding models). Each vector column must be instrumented separately.
  3. Multiple source columns within the same table may be instrumented independently.
  4. Trigger wiring is managed per table. A single trigger handles update detection and nullifies only the vector column(s) associated with the source column that was modified, leaving other vector columns unchanged.

embed

The embed sub-command generates vector embeddings for rows in an existing CockroachDB table and stores them alongside the original data.

At a high level, it:

  • Reads values from a specified input column
  • Generates embeddings using the selected encoding model
  • Writes the resulting vectors into a specified output column
$ python3 vectorize.py embed -u postgresql://<user>:<pass>@<dbhost>:26257/<database>?sslmode=verify-full -t passage -i passage -o passage_vector -m hf_st_all_minilm_l6 -b 10 -n 1 -w 2 -v
[INFO] Column passage_vector already exists
[INFO] Run 1, Batch 1 starting (10 rows)
[INFO] (batch 1, 1/5) Updating vector for row id 02e2e3af-821e-4a3e-8115-ccb9d7042df9: 'Since the most expensive collection acti'
[INFO] (batch 1, 2/5) Updating vector for row id 02e2e580-b496-4ea3-9a70-ca1e13ffe7a3: 'And Jesus, knowing their thoughts, said '
[INFO] (batch 1, 3/5) Updating vector for row id 02e2fc69-a896-4046-8f7d-92042d97b0c9: 'Further Reading. A few good sources for '
[INFO] (batch 1, 4/5) Updating vector for row id 02e3095f-c8cd-4dd6-91e5-2c924bc55a48: 'See more results ». More examples. Her a'
[INFO] (batch 1, 5/5) Updating vector for row id 02e3101c-442e-43cf-a019-3a35e832bdae: 'In folklore the markhor is known to kill'
[INFO] (batch 1, 1/5) Updating vector for row id 02e313c9-d053-4a34-a9ef-daba8ad15eaf: 'Problem: One problem Father had in the s'
[INFO] (batch 1, 2/5) Updating vector for row id 02e3162f-288f-4238-aaad-189f4baa6d50: 'Not looking for Miami 305 area code info'
[INFO] (batch 1, 3/5) Updating vector for row id 02e31780-cfcd-4ab7-b9cc-fae0c5ac0fcc: 'Despite getting sent-off in the 1998 Wor'
[INFO] (batch 1, 4/5) Updating vector for row id 02e334f8-1212-4656-83f1-1411e8411397: 'The extra rigidity afforded by the tubes'
[INFO] (batch 1, 5/5) Updating vector for row id 02e34693-4372-482b-a332-0e35a62feb22: 'CONCERT REVIEWS: Alice Cooper and Cheap '
Done in 13.912834882736206 seconds
[INFO] Vectorization complete.

To keep the workflow simple and generic, the script only processes rows where the output vector column is NULL. There is no built-in mechanism to detect whether embeddings are out of date. If the source data in a row is updated, the corresponding vector column must be explicitly cleared (set to NULL) in order for embed to regenerate the embedding.

model

The model subcommand provides visibility into the available embedding models supported by the toolkit. It does not interact with the database.

It supports two operations:

  • model list lists all available encoding models
  • model desc <model> describes a specific model in more detail, including properties such as embedding dimensionality
$ python3 vectorize.py model list
hf_st_all_minilm_l6   Hugging Face Sentence Transformer all-MiniLM-L6-v2
openai_text_embed     OpenAI Text Embedding API
takara_ds1_fukuro     Takara-DS1/ds1-fukuro
$ python3 vectorize.py model desc hf_st_all_minilm_l6
------------------------------------------------------
| Hugging Face Sentence Transformer all-MiniLM-L6-v2 |
------------------------------------------------------
General-purpose English sentence embedding model
based on MiniLM. Optimized for semantic similarity,
clustering, and retrieval tasks. Produces 384-dimensional
float vectors. Not multilingual.
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

The vectorize.py script uses a pluggable model architecture. Encoding models are implemented as interchangeable modules with a small, well-defined interface. Instructions for adding new models are provided in a later section.

embed Performance Options

The embed command is designed to efficiently generate embeddings while keeping database writes predictable and low-contention. To achieve this, embedding computation and database updates are intentionally decoupled.

At a high level:

  • Embedding computation can run in parallel across multiple CPUs.
  • Generated embeddings are collected into batches.
  • Database updates are executed sequentially, in batches, to minimize write contention and maximize database efficiency.

The following options control this behavior.

Batch size (-b, --batch-size)

The batch size defines how many rows are updated in a single database write batch. Each batch groups together a fixed number of newly generated embeddings before they are written back to CockroachDB.

Larger batches reduce the number of database write operations, while smaller batches reduce per-batch resource usage. Batch size affects database write behavior but does not control parallelism.

Parallel workers (-w, --workers)

The workers option controls how many embeddings are calculated in parallel. Each worker independently computes embeddings for input rows, allowing the embedding step to utilize multiple CPUs.

Parallelism applies only to embedding computation. Database updates remain single-threaded and batched to avoid write contention.

Number of batches (-n, --num-batches)

The number of batches option limits how many batches are processed during a single invocation of embed. This provides a simple way to bound the amount of work performed before the command exits.

The total number of rows processed in one run is approximately the batch size multiplied by the number of batches.

Continuous operation (-F, --follow)

When run with --follow, the embed command continues running until there are no remaining rows where the output vector column is NULL. The same batching and parallelism rules apply, but the process does not exit after a fixed number of batches.

This mode enables continuous vectorization as new rows are inserted, without introducing additional logic for detecting stale or out-of-date embeddings.

Running embed as a Background Service (Docker)

The embed sub-command can be run in continuous mode using --follow/-F CLI option. In this mode, the process behaves like a lightweight daemon:

  • It continuously processes rows where the output vector column is NULL.
  • It exits only after a prolonged idle period (if configured to do so).

When all rows have corresponding embeddings (i.e., no NULL values remain in the vector column), the process does not immediately terminate. Instead:

  1. It enters a sleep period.
  2. It periodically re-checks for new rows requiring backfill.
  3. It resumes embedding if new NULL rows are detected.

This model allows embed to run safely inside a long-lived container, continuously handling new rows or source column updates.

The image can be built with:

docker build -t vectorize .

A typical Docker invocation might look like:

docker run --rm -d \
  --name vectorizer \
  -v $HOME/.postgresql:/root/.postgresql:ro \
  -v $(pwd)/config.yaml:/app/config.yaml:ro \
  -v $(pwd)/logs:/logs \
  vectorize \
  -u postgresql://<user>:<pass>@<dbhost>:26257/<database>?sslmode=verify-full \
  -t passage \
  -i passage \
  -o passage_vector \
  -m hf_st_all_minilm_l6 \
  -b 100 \
  -w 2 \
  -F \
  -v

In this configuration:

  • The container runs detached (-d).
  • --follow keeps the embedding process active.
  • The container may be restarted externally (e.g., via --restart unless-stopped) if desired.
  • The process exits naturally if no new embeddings are required for the configured maximum idle duration.

This approach avoids additional scheduling infrastructure while enabling continuous vectorization alongside your application workload.

When running inside a container, embed expects the following paths to be available inside the container filesystem:

  1. /root/.postgresql/: Directory containing the CockroachDB client certificates (for example, root.crt when using sslmode=verify-full).
  2. /app/config.yaml: Model configuration file, if required by the selected embedding model.
  3. /logs/: Directory for log output.

These paths must be provided by mounting the corresponding host directories or files at runtime.

search

The search subcommand performs semantic similarity search over rows that have already been vectorized.

Given an input text value, it:

  • Encodes the text using the same embedding model used during embed
  • Executes a vector similarity query directly inside CockroachDB
  • Returns the closest matching rows based on vector distance

Similarity search runs entirely within the database and can be combined with standard SQL filtering and querying patterns.

$ python3 vectorize.py search -u postgresql://<user>:<pass>@<dbhost>:26257/<database>?sslmode=verify-full -t passage -i passage -o passage_vector -m hf_st_all_minilm_l6 -l 10 -v "New York City is the financial capital of the world!"
[INFO] PK: id (uuid)

0.5445506274700036 --> 01cef214-02ff-4648-91f4-ab55031d3223
It pointed to the fact that New York spends more money per student than any other state - more than $18,000 per year - and yet ranks 39th in high school graduation rates. This is the kind of inflammatory rhetoric that The Wall Street Journal and The Washington Times routinely engage in.

0.5668426841340002 --> 013d665f-e68a-476f-ae24-3f139834f899
TOP banks in USA. JPMorgan Chase Bank, NA. Wells Fargo Bank, NA. Bank of America, NA. Citibank, NA. U.S. Bank NA. PNC Bank, NA. The Bank of New York Mellon. Capital One, NA.

0.5668426841340002 --> 014939ea-a998-42bf-88da-fff6ee017cf9
TOP banks in USA. JPMorgan Chase Bank, NA. Wells Fargo Bank, NA. Bank of America, NA. Citibank, NA. U.S. Bank NA. PNC Bank, NA. The Bank of New York Mellon. Capital One, NA.

0.6034485219881293 --> 02c437f2-a021-4b63-9f0c-6f85bc123afa
granddaddy of the many fine art museums in New York. Fast Facts & Information. Safety: New York is the largest city in the United States, and also is distinguished as having the lowest crime rate among the 25 largest American cities, according to the FBI Crime Report.

0.6072394549846538 --> 001da25f-eb02-4426-9c3a-19ad1e314b0f
Merrill Lynch Corporate Office & Headquarters. 4 World Financial Center 250 Vesey Street New York NY 10080.

0.6090512806742557 --> 02082f83-2a20-416b-a42b-20ab9e79308e
According to the Notice as at Sept. 30, 2002, registered national exchanges, include the American Stock Exchange, the Boston Stock Exchange, the Cincinnati Stock Exchange, the Chicago Stock Exchange, the NYSE, the Philadelphia Stock Exchange, and the Pacific Exchange, Inc.

0.6157920680541831 --> 0021d323-319c-4935-bf20-6666c3d9232d
Finance capital, in the constant drive to increase profit, uses technology to lower production costs by replacing human labour with machines and other labour-saving processes. Scientific and technological progress has become the source of increased exploitation and alienation of the working class.

0.6191794517101485 --> 02659fb0-63c7-48ae-bd48-8df78537d102
Since Colonial times Wall Street was the province of an elite few. That changed in the 1980s. Millions of Americans hitched their dreams to the stock market. America's economic tide rose in the 1980s, but more of the nation's wealth flowed to those who were already well-off.

0.6356513093605897 --> 019ebf8a-3578-4667-9f91-bf71b2d3957f
As tax season kicks off, NYC Free Tax Prep is ready | New York Amsterdam News: The new Black view.

0.6496077582356925 --> 00e3cea8-a9f2-4772-86b1-cfbf4ee40537
Cost of living data is from the Missouri Economic Research and Information Center. This is 24/7 Wall St.’s states doing the most (and least) to spread the wealth. Everyone Who Believes in God Should Watch This.

cleanup

The cleanup sub-command reverses the effects of instrument, restoring the table to its pre-embedding state.

  1. Removes the mechanism that nullifies the vector column when its source column is updated.
  2. Drops all indexes created during instrumentation of the specified column.
  3. Drops the vector column.

[!WARNING] Dropping the vector column deletes all stored embeddings. cleanup prompts separately before dropping indexes and before dropping the vector column. You may choose to remove the indexes while retaining the vector column.

Embedding Models

vectorize.py is design to decouple the vector embedding from the data pipeline / database mechanics. The vector encoding is delegate to the embedding model wrappers located in the models directory:

models/hf_st_all_minilm_l6.py
models/openai_text_embed.py

The model will be referenced by the name of the wrapper file, i.e. hf_st_all_minilm_l6 or openai_text_embed.py.

A model wrapper has to implement the following functions:

def embedding_label() -> str

A short description of the model. This is what vectorize.py model list will show:

$ python3 vectorize.py model list
hf_st_all_minilm_l6	Hugging Face Sentence Transformer all-MiniLM-L6-v2
openai_text_embed	OpenAI Text Embedding API
def embedding_description() -> str

A long description of the model, including the dimensionality and anything else that's important to the user to know. This is displayed by vectorize.py model desc f_st_all_minilm_l6:

$ python3 vectorize.py model desc hf_st_all_minilm_l6
------------------------------------------------------
| Hugging Face Sentence Transformer all-MiniLM-L6-v2 |
------------------------------------------------------
General-purpose English sentence embedding model
based on MiniLM. Optimized for semantic similarity,
clustering, and retrieval tasks. Produces 384-dimensional
float vectors. Not multilingual.
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
def embedding_dim() -> int

Returns the dimensionality of the model. It's used by the script to create the vector output column if it doesn't exist.

def embedding_index_opclass() -> str:
    return "vector_cosine_ops"

def embedding_index_operator() -> str:
    return "<=>"

These two functions are closely related. The first is used by embed during creting the vector index on the new vector column. The second is called by search to ensure that the comparison operator used in the search query aligns with the vector index thus telling the query planner to use the index to speed up the search. The following option exist:

Pupose opclass operator
L2 distance Use when you want the true geometric distance, such as in spatial or physical models where absolute positioning matters. vector_l2_ops <->
Cosine distance Use when you only care about directional similarity, like in semantic text matching or clustering. Ideal for retrieval-augmented generation (RAG) use cases involving pretrained embedding models that either normalize vectors or are trained with a cosine similarity loss. vector_cosine_ops <=>
Negative inner product Use when both the magnitude and direction of vectors matter, such as in scoring or preference modeling. vector_ip_ops <#>
def embedding_encode(
        input_data: Any,
        verbose: bool = False
    ) -> List[float]

This function takes a single row's input column data to be encoded, and returns the resulting vector.

[!NOTE] The input data may be of any type, not just text. The model needs to be aware of the data type it's embedding.

def embedding_encode_batch(
        batch_index: int,
        batch: Iterable[Tuple[Any, Any]],
        verbose: bool = False
    ) -> List[Tuple[Any, List[float]]]

This function takes:

  1. a batch index - this is for information purposes only and used to generate INFO message
  2. a list of tuples where each tuple contains the primary key and the input column value. Both can be of any data type.

It returns a resulting list of PK-Embedding tuples.

Debugging Models

A simple script called model_test.py exists in this repo that you can use to check / debug a model wrapper in the model directory. It invokes the expected API functions described above. It would be nice to turn this into a real unit test :-) You can run it as:

python3 model_test.py hf_st_all_minilm_l6

or

python3 model_test.py openai_text_embed

Configuring Model Run-Time

Some models require configuration. For example, the OpenAI model needs the API key. The repo has a template configuration file config_tmpl.yaml:

models:

  - hf_st_all_minilm_l6:
    nuclio:
      url: http://localhost:32772
      username: <username>
      password: <password>
      verify: False

  - openai_text_embed:
      api_key: OpenAI_API_Key
      model: text-embedding-3-small | text-embedding-3-large

The actual file that the models expect should be called config.yaml. Models don't have to use this config, but if a model does, its parameters should be exclusively under models/<model_name> section. It is the responsibility of the model wrapper to parse and utilize its associated parameters. For example, models/openai_text_embed.py:

config_path = Path(__file__).resolve().parent.parent / "config.yaml"
config = None
with open("config.yaml", "r") as file:
    config = yaml.safe_load(file)

openai_settings = next(
    item[Path(__file__).stem] 
    for item in config['models'] 
    if isinstance(item, dict) and 'openai_text_embed' in item
)

_client = OpenAI(api_key=openai_settings['api_key'])
_MODEL = openai_settings['model']

External Compute Support

Embedding generation can become the dominant cost when working with large existing datasets. While the toolkit is designed for simplicity, generating embeddings for thousands to millions of rows—especially with CPU- or memory-intensive models—can quickly exceed the practical limits of running everything within the same process or container as vectorize.py. To address this, the model abstraction supports executing embedding workloads on external compute backends such as Nuclio. In this mode, the model wrapper transparently switches from local, in-process execution to remote function invocation, while preserving the same interface and behavior expected by the toolkit. This allows embed to remain a lightweight orchestrator of data movement and batching, while embedding computation is scaled independently across distributed resources. From the user’s perspective, the workflow and CLI remain unchanged; only the model configuration determines whether embeddings are generated locally or delegated to external compute.

The Hugging Face model (hf_st_all_minilm_l6) included in this repository implements this pattern and serves as a reference for adding external execution support to other models. By convention, each model is defined as a Python wrapper in models/hf_st_all_minilm_l6.py, while the corresponding Nuclio function deployment is defined separately as models/hf_st_all_minilm_l6.yaml. Together, these provide a consistent structure for supporting both local and distributed embedding execution without changing the toolkit’s core behavior.

Configuration for external execution is defined in config.yaml, using the structure shown in config_tmpl.yaml. Each model may include an optional nuclio section under its configuration block, specifying the endpoint and any required connection settings. When this section is present, the model wrapper automatically routes embedding requests to the configured Nuclio function instead of executing locally. This keeps execution control entirely within the model configuration, without requiring changes to the CLI or toolkit workflow.

Sizing

The sizing operation estimates the storage impact introduced by vectorization, based on the current table, model, and indexing configuration. It reports the additional footprint of the vector column, vector index, and toolkit indexes, along with the resulting total table size and relative overhead percentages. The estimate is intentionally conservative, reflecting a worst-case view of storage growth rather than exact usage. This allows users to quantify the cost of vectorization after validating its usefulness, and make an informed decision about whether to carry the approach forward into production.

$ python3 vectorize.py size -u postgresql://<user>:<pass>@<dbhost>:26257/<database>?sslmode=verify-full -t passage -i passage -o passage_vector
┌───────────────────────────┬───────────────────────────────────────────┬────────────┐
│ Initial table size         passage                                          39.9G│
├───────────────────────────┼───────────────────────────────────────────┼────────────┤
│ + Vector column            passage_vector                                    2.9G│
├───────────────────────────┼───────────────────────────────────────────┼────────────┤
│ + Vector index             passage_passage_vector_idx                        2.3G│
├───────────────────────────┼───────────────────────────────────────────┼────────────┤
│ + Toolkit indexes          passage_passage_vector_id_null_idx               18.7M│
│                            passage_passage_vector_id_not_null_idx          485.7M│
├───────────────────────────┼───────────────────────────────────────────┼────────────┤
│ = Resulting table size     passage                                          45.6G│
├───────────────────────────┼───────────────────────────────────────────┼────────────┤
│                       >>>                    Vector storage overhead        11.4%│
├───────────────────────────┼───────────────────────────────────────────┼────────────┤
│                       >>>                   Toolkit storage overhead         1.1%│
└───────────────────────────┴───────────────────────────────────────────┴────────────┘

Emitting search SQL

The sql operation generates the vector similarity search query for a given table, column, and model configuration. Unlike search, it does not execute the query — it outputs SQL ready to be integrated into application code or run directly in a SQL client.

It supports two modes:

  1. Without -s/--sample, it outputs a parameterized query template using positional placeholders (%s), intended for direct use in application code.
  2. With -s/--sample, it encodes the provided text using the selected model and outputs the complete query with the vector literal embedded inline, ready to execute in a SQL client.

In both modes, the vector operator and embedding dimensionality are derived from the selected model. The query uses AS OF SYSTEM TIME follower_read_timestamp() to enable follower reads, reducing latency for read-heavy similarity search workloads.

$ python3 vectorize.py size -u postgresql://<user>:<pass>@<dbhost>:26257/<database>?sslmode=verify-full -t passage -i passage -o passage_vector -m hf_st_all_minilm_l6
SELECT
    id,
    passage,
    ROUND(passage_vector <=> %s::VECTOR(1536), 6) AS distance
FROM passage
AS OF SYSTEM TIME follower_read_timestamp()
WHERE passage_vector IS NOT NULL
ORDER BY passage_vector <=> %s::VECTOR(1536)
LIMIT %s

Note:
'%s' are positional parameters. Bind in order:
    1) query vector,
    2) same query vector,
    3) limit.
Adjust syntax for your client library if needed.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cockroachdb_vectors-0.1.2.tar.gz (38.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cockroachdb_vectors-0.1.2-py3-none-any.whl (38.1 kB view details)

Uploaded Python 3

File details

Details for the file cockroachdb_vectors-0.1.2.tar.gz.

File metadata

  • Download URL: cockroachdb_vectors-0.1.2.tar.gz
  • Upload date:
  • Size: 38.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.0 CPython/3.13.7 Linux/6.14.0-1014-aws

File hashes

Hashes for cockroachdb_vectors-0.1.2.tar.gz
Algorithm Hash digest
SHA256 06bf278ea3c06793b19717e1593e77e5dc4d6baea57b642c07a1131a6e834a36
MD5 2494b8dedd315797d5b557f3954cf9a1
BLAKE2b-256 ea7abef10efdb391f6d68dc04ceb3e86f2c72c3e01bd798838c5e76802910fda

See more details on using hashes here.

File details

Details for the file cockroachdb_vectors-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: cockroachdb_vectors-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 38.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.0 CPython/3.13.7 Linux/6.14.0-1014-aws

File hashes

Hashes for cockroachdb_vectors-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 644f70d87824b1042edd185efc03a1894ce30b4ecf35a83f55a0f459f85644ce
MD5 e8d387bbfa7262796bed2d785f04d2b0
BLAKE2b-256 cde142d4087f8b52f72ca862df224bd2c70f61f7cea75342c6ac2e26a39d2ad6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page