dbt + Rust vectorization runner for pgvector

Maintainers

kraftaa

Environment
- Console
Intended Audience
- Developers
Programming Language
- Python :: 3
- Rust
Topic
- Database
- Scientific/Engineering :: Artificial Intelligence

Project description

dbt-vectors (prototype scaffold)

Make vector indexes a first-class materialization in dbt. This repo is an MVP scaffold to prove the concept.

Why

- dbt today only materializes SQL artifacts (table, view, incremental, ephemeral).
- Vector pipelines require SQL + embeddings + upsert to a vector DB; teams currently stitch that together with ad-hoc external scripts.
- A custom vector_index materialization can run inside dbt build, generating embeddings, handling incremental logic, and writing to pgvector/Pinecone/Qdrant.

What’s here

- dbt package skeleton with a vector_index materialization and dispatchable macros (pgvector is the only working target so far).
- Rust embedder (rust/embedding_engine) that can generate embeddings via OpenAI, Amazon Bedrock, or a local ONNX model (no Python needed).
- ./bin/vectorize runner: orchestrates dbt run for the model, then calls the Rust embedder to write embeddings into Postgres/pgvector.
- Examples that show how a model is defined and run.

Prerequisites

dbt-vectorize does not vendor dbt. It uses whatever dbt binary you point it to (DBT=...) or, failing that, the one it finds on PATH.
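
The lookup order described above can be sketched in a few lines of Python. This is only an illustration of "explicit DBT=... first, then PATH"; the function name resolve_dbt is hypothetical and this is not the code bin/vectorize actually runs.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_dbt(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the dbt executable to use, or None if none can be found.

    Hypothetical helper: mirrors the documented order, not the real runner.
    """
    explicit = env.get("DBT")
    if explicit:
        # An explicit DBT=... setting wins over whatever is on PATH.
        return explicit
    # Otherwise fall back to the first `dbt` found on PATH (may be None).
    return shutil.which("dbt", path=env.get("PATH"))
```

Calling resolve_dbt() with no arguments inspects the current process environment; passing a dict makes the behaviour easy to check in isolation.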

Verify your existing dbt + adapter:

dbt --version

You should see a plugin like postgres under "Plugins".
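
If you want to script this check, something like the following may help. It is a rough sketch that assumes the version text lists adapters as "- name: version" lines under a "Plugins" heading; the SAMPLE text is illustrative, not captured output, and the helper name installed_plugins is made up for this example.

```python
import re
from typing import List


def installed_plugins(version_output: str) -> List[str]:
    """Extract adapter names listed under the "Plugins" heading."""
    plugins = []
    in_plugins = False
    for line in version_output.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("plugins"):
            in_plugins = True
            continue
        if in_plugins:
            match = re.match(r"-\s*([A-Za-z0-9_]+)\s*:", stripped)
            if match:
                plugins.append(match.group(1))
            elif stripped:
                # A non-empty line that is not a "- name:" entry ends the section.
                in_plugins = False
    return plugins


# Illustrative text only (assumed layout), not real `dbt --version` output.
SAMPLE = """\
Core:
  - installed: 1.9.0

Plugins:
  - postgres: 1.9.0
"""
```

With the sample above, "postgres" appears in installed_plugins(SAMPLE); an empty list suggests the adapter still needs to be installed.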

If you do not have dbt + postgres adapter installed:

python -m pip install "dbt-core~=1.9" "dbt-postgres~=1.9"

You also need pgvector available in Postgres:

- Install the extension package on the Postgres server (vector.control must exist on that server).
- Enable it in each database you want to use:

CREATE EXTENSION IF NOT EXISTS vector;

(pgvector is the project name; the SQL extension name is vector.)

Repo layout

- dbt_project.yml – declares this as a dbt package and exposes macros.
- macros/materializations/vector_index.sql – Jinja materialization scaffold (pgvector first, adapters dispatchable).
- macros/adapters/vector_index_pgvector.sql – pgvector adapter macro that creates/loads the target table.
- bin/vectorize – orchestration command that runs dbt and then Rust embedding.
- rust/embedding_engine – Rust crate and pg_embedder binary used for embedding generation/upsert.

Next steps (MVP path)

- Harden Rust embedding provider support (OpenAI/Bedrock/local ONNX) with better diagnostics and retries. ⏳
- Expand adapter macros beyond pgvector (Pinecone/Qdrant). ⏳
- Add end-to-end integration tests for dbt + pgvector + pg_embedder. ⏳
- Publish package docs and a reproducible quickstart. ⏳

Example model (current)

{{ config(
    materialized='vector_index',
    vector_db='pgvector',
    index_name='knowledge_base',
    embedding_model='text-embedding-3-small',
    dimensions=(env_var('EMBED_DIMS', 1536) | int),
    metadata_columns=['source', 'created_at', 'doc_id']
) }}

select
    doc_id,
    chunk_text as text,
    source,
    created_at
from {{ ref('staging_documents') }}
where is_active = true

Running ./bin/vectorize --select vector_knowledge_base should:

- fetch incremental rows
- generate embeddings via the Rust engine
- upsert to pgvector (Pinecone/Qdrant adapters are planned, not yet implemented)
- emit metrics (processed, failed, latency) and freshness tests
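
To make the upsert step concrete, here is a minimal sketch of what a pgvector upsert statement could look like for the example model. It assumes a target table with doc_id, text, and embedding columns plus the metadata columns, and a unique constraint on doc_id. Both are assumptions for illustration; the SQL that pg_embedder really issues may differ, and the helper names are hypothetical.

```python
from typing import Sequence


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_upsert_sql(table: str, metadata_columns: Sequence[str]) -> str:
    """Build a parameterized INSERT ... ON CONFLICT upsert keyed on doc_id.

    Assumes doc_id is unique in the target table (hypothetical schema).
    The table name is interpolated as-is, so pass trusted values only.
    """
    columns = ["doc_id", "text", "embedding"] + [
        c for c in metadata_columns if c != "doc_id"
    ]
    placeholders = ", ".join(
        "%s::vector" if c == "embedding" else "%s" for c in columns
    )
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != "doc_id")
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT (doc_id) DO UPDATE SET {updates}"
    )
```

The %s placeholders follow the style used by common Python Postgres drivers; the embedding parameter would be passed as the string produced by to_vector_literal.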

Run locally (preferred: existing local Postgres)

Ensure Postgres is running, reachable (PGHOST/PGPORT/PGUSER/PGDATABASE), and has vector enabled:

CREATE EXTENSION IF NOT EXISTS vector;

Choose a provider and matching dimensions:

# Local ONNX (MiniLM, 384 dims)
EMBED_PROVIDER=local
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
EMBED_LOCAL_MODEL_PATH=$PWD/ml_model   # contains model.onnx + tokenizer.json
EMBED_DIMS=384

# OpenAI
EMBED_PROVIDER=openai
EMBED_MODEL=text-embedding-3-small
EMBED_DIMS=1536   # or a smaller dim if you request it from OpenAI

# Bedrock Titan v2 (defaults)
EMBED_PROVIDER=bedrock
EMBED_MODEL=amazon.titan-embed-text-v2:0
EMBED_DIMS=1024   # or 512/256 if you override

Run vectorization (dbt model + embedding upsert):

PGHOST=localhost PGPORT=5432 PGUSER=postgres PGDATABASE=postgres \
EMBED_PROVIDER=... EMBED_MODEL=... EMBED_DIMS=... \
./bin/vectorize --select vector_knowledge_base

Shortcut with env file:

cp .env.vectorize.example .env.vectorize
./bin/vectorize --select vector_knowledge_base

bin/vectorize auto-loads .env.vectorize if present. Use VECTORIZE_ENV_FILE=/path/to/file to load a different env file.
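
For readers curious how such a file is typically interpreted, the sketch below parses simple KEY=VALUE lines with # comments and picks the file path from VECTORIZE_ENV_FILE. It is an approximation under those assumptions: the real loader in bin/vectorize may handle quoting, variable expansion such as $PWD, and precedence over already-set variables differently. Both function names are hypothetical.

```python
import os
from typing import Dict, Mapping


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comment lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment" and any surrounding quotes.
        # No shell-style expansion is performed ($PWD stays literal).
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def env_file_path(env: Mapping[str, str] = os.environ) -> str:
    """Use VECTORIZE_ENV_FILE if set, else the default .env.vectorize."""
    return env.get("VECTORIZE_ENV_FILE", ".env.vectorize")
```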

Expected CLI output (example):

[vectorize] running dbt model vector_knowledge_base (provider=local, model=sentence-transformers/all-MiniLM-L6-v2)
...
Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
[vectorize] generating embeddings via Rust into public.knowledge_base
embedded 20 rows into public.knowledge_base
[vectorize] done.

Quick verification in Postgres:

SELECT count(*) AS rows FROM public.knowledge_base;

SELECT
  doc_id,
  (embedding::float4[])[1:8] AS first_8_dims,
  source,
  created_at
FROM public.knowledge_base
LIMIT 5;

Optional Docker Postgres

Use this only if you want a disposable local pgvector instance:

docker-compose up -d postgres

If Docker/Colima is not running, this will fail with a daemon connection error.

Build pip package (`dbt-vectorize`)

Build from repo root (factorlens-style, bundles Rust binary in wheel):

./scripts/build_wheel_with_binary.sh

Artifacts will be written to dist/. Install locally:

python -m pip install dist/dbt_vectorize-*.whl

CLI entrypoint after install:

dbt-vectorize --select vector_knowledge_base

CI release wheel build (macOS arm64 + Linux x86_64):

- workflow file: .github/workflows/release.yml
- trigger manually from Actions or push a v* tag
- outputs platform-specific wheels under workflow artifacts / GitHub release assets

Supported embedding dimensions (set `EMBED_DIMS` to match)

- OpenAI text-embedding-3-small: 1536 (can request smaller via API parameter)
- OpenAI text-embedding-3-large: 3072 (can request smaller)
- Bedrock Titan embed text v2: 1024 (or 512/256)
- Bedrock Titan embed text v1: 1536 (fixed size)
- Bedrock Cohere Embed v4: 1536 (or 1024/512/256)
- Local MiniLM (all-MiniLM-L6-v2 ONNX): 384
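
A mismatch between EMBED_DIMS and the model is easy to catch before running anything. The following is a small sketch of such a pre-flight check, covering only the models used in the environment examples earlier in this README. The table and the check_dims helper are hypothetical and not part of the package.

```python
# Models with a fixed set of output sizes (taken from the list above).
SUPPORTED_DIMS = {
    "sentence-transformers/all-MiniLM-L6-v2": {384},
    "amazon.titan-embed-text-v2:0": {1024, 512, 256},
}

# Models where a smaller size can be requested, up to this maximum.
MAX_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_dims(model: str, dims: int) -> bool:
    """Return True if `dims` is a plausible EMBED_DIMS value for `model`."""
    if model in SUPPORTED_DIMS:
        return dims in SUPPORTED_DIMS[model]
    if model in MAX_DIMS:
        return 1 <= dims <= MAX_DIMS[model]
    raise KeyError(f"unknown model: {model}")
```

For example, check_dims("amazon.titan-embed-text-v2:0", 384) is False, which would flag a MiniLM-sized EMBED_DIMS left over from a local run.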

Notes

- The Rust embedder is Python-free.
- Keep your Postgres vector column dimension aligned with EMBED_DIMS.
- IVFFlat indexes built on very small datasets trigger a pgvector notice about low recall; that’s expected. Rebuild the index once the table has more rows.
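
When you do rebuild, the pgvector documentation suggests starting with lists = rows / 1000 for up to 1M rows and sqrt(rows) above that. The helper below just encodes that starting point; the function name is hypothetical, and nothing here claims that dbt-vectorize picks the value this way.

```python
import math


def suggested_ivfflat_lists(row_count: int) -> int:
    """Starting value for the IVFFlat `lists` parameter (pgvector guidance)."""
    if row_count <= 1_000_000:
        # rows / 1000, but never below 1 for tiny tables.
        return max(1, row_count // 1000)
    return int(math.sqrt(row_count))
```

For the 20-row example table this gives 1, which is another way of seeing why an IVFFlat index is not useful at that size.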

Release history

- 0.1.11: Mar 31, 2026
- 0.1.8: Mar 31, 2026
- 0.1.6: Mar 31, 2026
- 0.1.5: Mar 30, 2026
- 0.1.4: Mar 28, 2026 (this version)

Download files

Source Distribution

dbt_vectorize-0.1.4.tar.gz (17.1 MB view details)

Uploaded Mar 28, 2026 Source

Built Distribution

dbt_vectorize-0.1.4-py3-none-macosx_10_13_universal2.whl (15.8 MB view details)

Uploaded Mar 28, 2026 Python 3macOS 10.13+ universal2 (ARM64, x86-64)

File details

Details for the file dbt_vectorize-0.1.4.tar.gz.

File metadata

Download URL: dbt_vectorize-0.1.4.tar.gz
Upload date: Mar 28, 2026
Size: 17.1 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dbt_vectorize-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`4e30a9501303c2bd7463ee8a1d376d3e8e09511f5b98af678d710c7d4cb078a6`
MD5	`f639ba28c7e1f140e7ea25c56ca8e8fc`
BLAKE2b-256	`097680b00b10d34ae29f0f5c1fa487433da95e057434986c22b15283f4b4d6ca`

Provenance

The following attestation bundles were made for dbt_vectorize-0.1.4.tar.gz:

Publisher: release.yml on kraftaa/dbt-vector

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dbt_vectorize-0.1.4.tar.gz
- Subject digest: 4e30a9501303c2bd7463ee8a1d376d3e8e09511f5b98af678d710c7d4cb078a6
- Sigstore transparency entry: 1191453258
- Sigstore integration time: Mar 28, 2026
Source repository:
- Permalink: kraftaa/dbt-vector@99e27ee2413bcd99f0bcddc4e9c28ea1b31ba707
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/kraftaa
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@99e27ee2413bcd99f0bcddc4e9c28ea1b31ba707
- Trigger Event: push

File details

Details for the file dbt_vectorize-0.1.4-py3-none-macosx_10_13_universal2.whl.

File metadata

Download URL: dbt_vectorize-0.1.4-py3-none-macosx_10_13_universal2.whl
Upload date: Mar 28, 2026
Size: 15.8 MB
Tags: Python 3, macOS 10.13+ universal2 (ARM64, x86-64)
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dbt_vectorize-0.1.4-py3-none-macosx_10_13_universal2.whl
Algorithm	Hash digest
SHA256	`c167435a8bc461a34c5e748b02e1af598cf6fbfbab2cb7fbec0e28ab17831338`
MD5	`b051eca7b424fcf8bf151dc52be59005`
BLAKE2b-256	`bf6cbf22eadbecfa4826cf096b22fdda9632cd4e560b666f5b69a497ba6da1b7`

Provenance

The following attestation bundles were made for dbt_vectorize-0.1.4-py3-none-macosx_10_13_universal2.whl:

Publisher: release.yml on kraftaa/dbt-vector

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dbt_vectorize-0.1.4-py3-none-macosx_10_13_universal2.whl
- Subject digest: c167435a8bc461a34c5e748b02e1af598cf6fbfbab2cb7fbec0e28ab17831338
- Sigstore transparency entry: 1191453265
- Sigstore integration time: Mar 28, 2026
Source repository:
- Permalink: kraftaa/dbt-vector@99e27ee2413bcd99f0bcddc4e9c28ea1b31ba707
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/kraftaa
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@99e27ee2413bcd99f0bcddc4e9c28ea1b31ba707
- Trigger Event: push
