Skip to main content

A standalone system for making automated, incremental updates to a vector database.

Project description

SpruceUp

SpruceUp is a standalone system for making automated, incremental updates to a vector database.

Installation

Add SpruceUp to your project (e.g., with poetry, pip, or uv):

poetry add spruceup-ai # OR
pip install spruceup-ai # OR
uv add spruceup-ai

Setup

Create a file named spruceup_pipeline.py in your project directory. This is the user-authored entry point SpruceUp loads at startup. It must export a single config variable built with define_config().

# spruceup_pipeline.py
import re
import os
from dataclasses import dataclass
from spruceup import define_config, FileProps, LocalFilesSource, PgVectorTarget, OpenAIEmbedder

@dataclass
class ArticleChunk:
    title: str
    content: str
    embedding: list[float]

def split_into_paragraphs(text: str) -> list[str]:
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

async def transform(*, file_props: FileProps, embed) -> list[ArticleChunk]:
    paragraphs = split_into_paragraphs(file_props.raw_content)
    embeddings = await embed(paragraphs)
    return [
        ArticleChunk(title=file_props.display_name, content=para, embedding=emb)
        for para, emb in zip(paragraphs, embeddings)
    ]

config = define_config(
    sources=[LocalFilesSource(watched_dir="./articles")],
    target=PgVectorTarget(
        connstr=os.environ["PG_CONNSTR"],
        table="article_chunks",
        schema=ArticleChunk,
        vector_column="embedding",
    ),
    embedder=OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"]),
    transform=transform,
)

Running SpruceUp

From the directory containing your spruceup_pipeline.py file, with your virtual environment activated:

spruceup start

SpruceUp will scan your sources, sync any files not yet in the manifest, then enter a watch loop for incremental updates.


Imports

Everything you need is importable from the top-level spruceup package:

from spruceup import (
    define_config,
    FileProps,

    # Sources
    LocalFilesSource,
    GoogleDriveSource,

    # Targets
    PgVectorTarget,
    PineconeTarget,
    WeaviateTarget,

    # Embedders
    OpenAIEmbedder,
    CohereEmbedder,
    GeminiEmbedder,
    VoyageAIEmbedder,

    # Utilities
    memoize,
)

define_config()

config = define_config(
    sources=[...],
    target=...,
    embedder=...,
    transform=...,
)

All parameters are keyword-only.

Parameter Type Required Default Description
sources list[SourceConnector] Yes At least one source connector
target TargetConnector Yes Where synced chunks are written
embedder EmbedderConnector Yes Generates embeddings for your chunks
transform async callable Yes Converts a file into a list of chunks
cache_files bool No False Cache raw file bytes in the manifest

The Transform Function

The transform function is where you split, enrich, and embed your documents. SpruceUp calls it for every file that changes. This function must be async.

async def transform(*, file_props: FileProps, embed) -> list[YourSchema]:
    ...

FileProps

Field Type Description
raw_content str | bytes File content. Text formats are decoded as UTF-8; binary formats like PDF are passed through as raw bytes.
display_name str The filename
file_type str File extension (e.g. "txt", "pdf")

embed

embed is an async callable that takes a list of strings and returns a list of embedding vectors:

embeddings: list[list[float]] = await embed(["chunk one", "chunk two"])

Chunk Schema

Your transform returns a list of instances of a user-defined dataclass. SpruceUp uses this schema for diffing and for writing to the target store. Define it as a plain dataclass:

@dataclass
class MyChunk:
    title: str
    text: str
    embedding: list[float]

All target connectors support str, int, float, bool, and list[float] as field types. Use list[float] for your embedding vector. You do not need to define an id field. SpruceUp generates one from each chunk's content hash.


Source Connectors

LocalFilesSource

Watches a local directory for file changes.

LocalFilesSource(watched_dir="./data")
Parameter Type Required Default Description
watched_dir str Yes Path to the directory to watch

GoogleDriveSource

Watches a Google Drive folder for file changes. Requires the drive.readonly OAuth scope.

GoogleDriveSource(
    watched_dir="<folder-id>",
    on_token_expired=get_access_token,
    recursive=True,
)
Parameter Type Required Default Description
watched_dir str Yes Google Drive folder ID
on_token_expired Callable[[], str] Yes Called when the access token expires; must return a fresh token string
recursive bool No True Whether to watch subfolders

The on_token_expired callback is invoked whenever the connector needs a new OAuth token. It should return a valid access token or raise an exception.


Target Connectors

PgVectorTarget

Syncs chunks to a PostgreSQL table using the pgvector extension.

PgVectorTarget(
    connstr="postgresql://user:pass@localhost/mydb",
    table="my_chunks",
    schema=MyChunk,
    vector_column="embedding",
)
Parameter Type Required Default Description
connstr str Yes PostgreSQL connection string
table str Yes Table name
schema type Yes Your chunk dataclass
vector_column str Yes Field name on your schema that holds the vector

SpruceUp creates the table and its columns automatically based on your schema's type hints. The pgvector extension must be installed on your database.


PineconeTarget

Syncs chunks to a Pinecone index.

PineconeTarget(
    api_key="pc-...",
    index_name="my-index",
    schema=MyChunk,
    vector_column="embedding",
    namespace="",
    metric="cosine",
    cloud="aws",
    region="us-east-1",
)
Parameter Type Required Default Description
api_key str | None Yes Pinecone API key
index_name str Yes Name of the Pinecone index
schema type Yes Your chunk dataclass
vector_column str Yes Field name on your schema that holds the vector
namespace str No "" Namespace within the index
metric str No "cosine" Distance metric ("cosine", "euclidean", "dotproduct")
cloud str No "aws" Cloud provider
region str No "us-east-1" Cloud region

WeaviateTarget

Syncs chunks to a Weaviate collection.

# Local instance
WeaviateTarget(
    collection_name="MyChunks",
    schema=MyChunk,
    vector_column="embedding",
    url="http://localhost:8080",
)

# Weaviate Cloud
WeaviateTarget(
    collection_name="MyChunks",
    schema=MyChunk,
    vector_column="embedding",
    cluster_url="https://my-cluster.weaviate.network",
    api_key="wvp-...",
)
Parameter Type Required Default Description
collection_name str Yes Weaviate collection name
schema type Yes Your chunk dataclass
vector_column str Yes Field name on your schema that holds the vector
url str No "http://localhost:8080" URL for a local Weaviate instance
cluster_url str | None No None URL for a Weaviate Cloud cluster
api_key str | None No None API key for Weaviate Cloud authentication

Use either url for a local instance or cluster_url + api_key for a cloud deployment.


Embedder Connectors

SpruceUp runs a health check at startup that embeds a test string and reads the actual output size from the API. The embedding_dimensions parameter is optional on all embedders. If omitted, the dimension is detected automatically. If provided, SpruceUp validates it matches what the API actually returns and raises an error if not.

OpenAIEmbedder

OpenAIEmbedder(
    api_key="sk-...",
    model="text-embedding-3-small",
    max_batch_size=150,
    embedding_dimensions=None,
)
Parameter Type Required Default Description
api_key str | Callable[[], str] Yes OpenAI API key, or a callable that returns one
model str No "text-embedding-3-small" Embedding model
max_batch_size int No 150 Max texts per API call
embedding_dimensions int | None No None Override output dimensions. If omitted, SpruceUp reads the actual dimension from the API at startup.

CohereEmbedder

CohereEmbedder(
    api_key="...",
    model="embed-v4.0",
    max_batch_size=96,
    embedding_dimensions=None,
)
Parameter Type Required Default Description
api_key str | Callable[[], str] Yes Cohere API key, or a callable that returns one
model str No "embed-v4.0" Embedding model
max_batch_size int No 96 Max texts per API call
embedding_dimensions int | None No None Override output dimensions. If omitted, SpruceUp reads the actual dimension from the API at startup.

When using an embed-v4 model with a custom embedding_dimensions, the value must be one of 256, 512, 1024, or 1536.


GeminiEmbedder

GeminiEmbedder(
    api_key="...",
    model="gemini-embedding-001",
    max_batch_size=100,
)
Parameter Type Required Default Description
api_key str | Callable[[], str] Yes Google Generative AI API key, or a callable that returns one
model str No "gemini-embedding-001" Embedding model
max_batch_size int No 100 Max texts per API call (hard limit: 100)
embedding_dimensions int | None No None Override output dimensions. If omitted, SpruceUp reads the actual dimension from the API at startup.

VoyageAIEmbedder

VoyageAIEmbedder(
    api_key="...",
    model="voyage-4-large",
    max_batch_size=150,
    embedding_dimensions=None,
)
Parameter Type Required Default Description
api_key str | Callable[[], str] Yes Voyage AI API key, or a callable that returns one
model str No "voyage-4-large" Embedding model
max_batch_size int No 150 Max texts per API call
embedding_dimensions int | None No None Override output dimensions. If omitted, SpruceUp reads the actual dimension from the API at startup.

When using a voyage-4 model with a custom embedding_dimensions, the value must be one of 256, 512, 1024, or 2048.


@memoize

The memoize decorator caches the results of expensive subfunctions inside your transform. Results are stored in the SpruceUp manifest (a local SQLite database), scoped per file and invalidated automatically when the decorated function's body changes.

from spruceup import memoize
import asyncio

@memoize(return_type=str)
async def summarize(text: str) -> str:
    # expensive LLM call
    ...

async def transform(*, file_props: FileProps, embed) -> list[MyChunk]:
    chunk_strs = split_into_chunks(file_props.raw_content)
    # summarize each chunk concurrently; results are cached per file
    summaries = await asyncio.gather(*[summarize(c) for c in chunk_strs])
    embeddings = await embed(chunk_strs)
    return [
        MyChunk(content=c, summary=s, embedding=e)
        for c, s, e in zip(chunk_strs, summaries, embeddings)
    ]
Parameter Type Required Description
return_type type Yes Return type of the decorated function — used for serialization

Supported return types: str, int, float, bool, list, dict.

memoize only works on async functions. Decorating a sync function raises a TypeError. It can only be used inside a transform function. Calling a memoized function outside of a transform context will raise a RuntimeError.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spruceup_ai-0.2.0.tar.gz (139.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

spruceup_ai-0.2.0-py3-none-any.whl (46.2 kB view details)

Uploaded Python 3

File details

Details for the file spruceup_ai-0.2.0.tar.gz.

File metadata

  • Download URL: spruceup_ai-0.2.0.tar.gz
  • Upload date:
  • Size: 139.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.16 {"installer":{"name":"uv","version":"0.11.16","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Arch Linux","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for spruceup_ai-0.2.0.tar.gz
Algorithm Hash digest
SHA256 a0589b4c5f6d7188efdf77678ed1cf966f98bc8202816a5a91892abc1d4179e7
MD5 553fcb4b21db74670b8cb1290b6564ba
BLAKE2b-256 4d42c77e829a5953c3f94cef35ce7b0927afa292e4be11b1bb147ad50388ab9a

See more details on using hashes here.

File details

Details for the file spruceup_ai-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: spruceup_ai-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 46.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.16 {"installer":{"name":"uv","version":"0.11.16","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Arch Linux","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for spruceup_ai-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6270f5ed3706515e59f26d29d8adbece8c567ba516bb67c1c53f404a6efe0b82
MD5 d23c673d16f2ef747723c883cb9b56ce
BLAKE2b-256 02010d39e4821914298bb79ec000a7b7d3eb3c285f851e023a312e618334e9ab

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page