SpruceUp
SpruceUp is a standalone system for making automated, incremental updates to a vector database.
Installation
Add SpruceUp to your project (e.g., with poetry, pip, or uv):
poetry add spruceup-ai # OR
pip install spruceup-ai # OR
uv add spruceup-ai
Setup
Create a file named spruceup_pipeline.py in your project directory. This is the user-authored entry point SpruceUp loads at startup. It must export a single config variable built with define_config().
# spruceup_pipeline.py
import re
import os
from dataclasses import dataclass

from spruceup import define_config, FileProps, LocalFilesSource, PgVectorTarget, OpenAIEmbedder


@dataclass
class ArticleChunk:
    title: str
    content: str
    embedding: list[float]


def split_into_paragraphs(text: str) -> list[str]:
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]


async def transform(*, file_props: FileProps, embed) -> list[ArticleChunk]:
    paragraphs = split_into_paragraphs(file_props.raw_content)
    embeddings = await embed(paragraphs)
    return [
        ArticleChunk(title=file_props.display_name, content=para, embedding=emb)
        for para, emb in zip(paragraphs, embeddings)
    ]


config = define_config(
    sources=[LocalFilesSource(watched_dir="./articles")],
    target=PgVectorTarget(
        connstr=os.environ["PG_CONNSTR"],
        table="article_chunks",
        schema=ArticleChunk,
        vector_column="embedding",
    ),
    embedder=OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"]),
    transform=transform,
)
Running SpruceUp
From the directory containing your spruceup_pipeline.py file, with your virtual environment activated:
spruceup start
SpruceUp will scan your sources, sync any files not yet in the manifest, then enter a watch loop for incremental updates.
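The scan-then-sync step can be pictured as a comparison between the files on disk and a manifest of what has already been synced. The sketch below is only an illustration of that idea: the manifest shape (a dict of relative path to SHA-256 digest) and the names file_digest and plan_sync are assumptions, not SpruceUp's actual internals.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_sync(watched_dir: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare files on disk with a manifest of {relative_path: digest}.

    The manifest shape is hypothetical; it only illustrates incremental sync.
    """
    on_disk = {
        p.relative_to(watched_dir).as_posix(): file_digest(p)
        for p in sorted(watched_dir.rglob("*"))
        if p.is_file()
    }
    return {
        # Files never seen before need a full sync.
        "added": [k for k in on_disk if k not in manifest],
        # Files whose bytes changed need to be re-transformed.
        "changed": [k for k in on_disk if k in manifest and manifest[k] != on_disk[k]],
        # Files that disappeared should have their chunks removed.
        "removed": [k for k in manifest if k not in on_disk],
    }
```

Only the paths listed under "added" and "changed" would need to go through the transform again, which is what keeps updates incremental.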
Imports
Everything you need is importable from the top-level spruceup package:
from spruceup import (
    define_config,
    FileProps,
    # Sources
    LocalFilesSource,
    GoogleDriveSource,
    # Targets
    PgVectorTarget,
    PineconeTarget,
    WeaviateTarget,
    # Embedders
    OpenAIEmbedder,
    CohereEmbedder,
    GeminiEmbedder,
    VoyageAIEmbedder,
    # Utilities
    memoize,
)
define_config()
config = define_config(
    sources=[...],
    target=...,
    embedder=...,
    transform=...,
)
All parameters are keyword-only.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sources | list[SourceConnector] | Yes | - | At least one source connector |
| target | TargetConnector | Yes | - | Where synced chunks are written |
| embedder | EmbedderConnector | Yes | - | Generates embeddings for your chunks |
| transform | async callable | Yes | - | Converts a file into a list of chunks |
| cache_files | bool | No | False | Cache raw file bytes in the manifest |
The Transform Function
The transform function is where you split, enrich, and embed your documents. SpruceUp calls it for every file that changes. This function must be async.
async def transform(*, file_props: FileProps, embed) -> list[YourSchema]:
    ...
FileProps
| Field | Type | Description |
|---|---|---|
| raw_content | str \| bytes | File content. Text formats are decoded as UTF-8; binary formats like PDF are passed through as raw bytes. |
| display_name | str | The filename |
| file_type | str | File extension (e.g. "txt", "pdf") |
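Because raw_content can be either str or bytes, a transform that assumes text may fail on binary files. A minimal sketch of a guard follows; the helper name as_text is hypothetical, and what to do with binary content (for example, running a PDF parser) is left to your pipeline.

```python
from __future__ import annotations


def as_text(raw_content: str | bytes, file_type: str) -> str:
    """Return the content unchanged for text formats; refuse raw bytes explicitly."""
    if isinstance(raw_content, str):
        return raw_content
    # Binary formats arrive as bytes and need a format-specific parser first.
    raise ValueError(
        f"'{file_type}' content arrived as {len(raw_content)} raw bytes; "
        "extract text with a parser of your choice before chunking"
    )
```

Calling as_text(file_props.raw_content, file_props.file_type) at the top of a transform makes the failure mode explicit instead of surfacing later as a confusing error from a string operation.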
embed
embed is an async callable that takes a list of strings and returns a list of embedding vectors:
embeddings: list[list[float]] = await embed(["chunk one", "chunk two"])
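Since embed is just an async callable, a transform can be exercised offline by passing a stub in its place. The sketch below assumes nothing about SpruceUp beyond the documented call shape: FakeFileProps and fake_embed are hypothetical stand-ins, and the vectors they produce are meaningless placeholders.

```python
import asyncio
import re
from dataclasses import dataclass


@dataclass
class FakeFileProps:
    """Hypothetical stand-in exposing the three documented FileProps fields."""
    raw_content: str
    display_name: str
    file_type: str


@dataclass
class ArticleChunk:
    title: str
    content: str
    embedding: list[float]


async def fake_embed(texts: list[str]) -> list[list[float]]:
    """Deterministic stub: one 3-dimensional vector per input string."""
    return [[float(len(t)), float(t.count(" ")), 1.0] for t in texts]


async def transform(*, file_props, embed) -> list[ArticleChunk]:
    # Split on blank lines, then embed all paragraphs in one call.
    paragraphs = [p.strip() for p in re.split(r"\n{2,}", file_props.raw_content) if p.strip()]
    embeddings = await embed(paragraphs)
    return [
        ArticleChunk(title=file_props.display_name, content=p, embedding=e)
        for p, e in zip(paragraphs, embeddings)
    ]


props = FakeFileProps(
    raw_content="First para.\n\nSecond one here.",
    display_name="notes.txt",
    file_type="txt",
)
chunks = asyncio.run(transform(file_props=props, embed=fake_embed))
```

This kind of check catches chunking mistakes without calling an embedding API or touching the target store.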
Chunk Schema
Your transform returns a list of instances of a user-defined dataclass. SpruceUp uses this schema for diffing and for writing to the target store. Define it as a plain dataclass:
@dataclass
class MyChunk:
    title: str
    text: str
    embedding: list[float]
All target connectors support str, int, float, bool, and list[float] as field types. Use list[float] for your embedding vector. You do not need to define an id field. SpruceUp generates one from each chunk's content hash.
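To catch an unsupported field type before starting a sync, a schema can be checked against the list above. This is a minimal sketch using only the standard library; the helper name unsupported_fields is hypothetical, and SpruceUp may perform its own validation differently.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import get_type_hints

# The field types the documentation lists as supported by all target connectors.
SUPPORTED_TYPES = (str, int, float, bool, list[float])


def unsupported_fields(schema: type) -> list[str]:
    """Return the names of fields whose type hint is not in SUPPORTED_TYPES."""
    if not is_dataclass(schema):
        raise TypeError(f"{schema!r} is not a dataclass")
    hints = get_type_hints(schema)
    return [f.name for f in fields(schema) if hints[f.name] not in SUPPORTED_TYPES]


@dataclass
class MyChunk:
    title: str
    text: str
    embedding: list[float]


problems = unsupported_fields(MyChunk)  # empty list: every field is supported
```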
Source Connectors
LocalFilesSource
Watches a local directory for file changes.
LocalFilesSource(watched_dir="./data")
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| watched_dir | str | Yes | - | Path to the directory to watch |
GoogleDriveSource
Watches a Google Drive folder for file changes. Requires the drive.readonly OAuth scope.
GoogleDriveSource(
    watched_dir="<folder-id>",
    on_token_expired=get_access_token,
    recursive=True,
)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| watched_dir | str | Yes | - | Google Drive folder ID |
| on_token_expired | Callable[[], str] | Yes | - | Called when the access token expires; must return a fresh token string |
| recursive | bool | No | True | Whether to watch subfolders |
The on_token_expired callback is invoked whenever the connector needs a new OAuth token. It should return a valid access token or raise an exception.
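One way to build such a callback is to wrap whatever token-refresh routine your application already has. In the sketch below, fetch_token is a hypothetical function you supply (for example, one that exchanges a refresh token for an access token); the wrapper only enforces the contract described above: return a non-empty string or raise.

```python
from typing import Callable


def make_on_token_expired(fetch_token: Callable[[], str]) -> Callable[[], str]:
    """Wrap a user-supplied token fetcher so it either returns a usable token or raises."""

    def get_access_token() -> str:
        token = fetch_token()  # hypothetical: your own OAuth refresh logic
        if not isinstance(token, str) or not token.strip():
            raise RuntimeError("token refresh returned no usable access token")
        return token.strip()

    return get_access_token
```

The returned get_access_token function has the Callable[[], str] shape that the on_token_expired parameter expects.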
Target Connectors
PgVectorTarget
Syncs chunks to a PostgreSQL table using the pgvector extension.
PgVectorTarget(
    connstr="postgresql://user:pass@localhost/mydb",
    table="my_chunks",
    schema=MyChunk,
    vector_column="embedding",
)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| connstr | str | Yes | - | PostgreSQL connection string |
| table | str | Yes | - | Table name |
| schema | type | Yes | - | Your chunk dataclass |
| vector_column | str | Yes | - | Field name on your schema that holds the vector |
SpruceUp creates the table and its columns automatically based on your schema's type hints. The pgvector extension must be installed on your database.
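To make the schema-to-table step concrete, here is a rough sketch of how type hints could be mapped to column definitions. The exact column types, the id column definition, and the function name create_table_sql are assumptions for illustration; SpruceUp's real DDL is not documented here. The vector(n) column type is the one provided by pgvector.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints

# Assumed mapping from Python types to PostgreSQL column types (illustrative only).
PG_TYPES = {str: "TEXT", int: "BIGINT", float: "DOUBLE PRECISION", bool: "BOOLEAN"}


def create_table_sql(schema: type, table: str, vector_column: str, dimensions: int) -> str:
    """Build a CREATE TABLE statement from a dataclass's type hints."""
    hints = get_type_hints(schema)
    columns = ["id TEXT PRIMARY KEY"]  # assumed shape for the generated id
    for f in fields(schema):
        if f.name == vector_column:
            columns.append(f"{f.name} vector({dimensions})")
        elif hints[f.name] in PG_TYPES:
            columns.append(f"{f.name} {PG_TYPES[hints[f.name]]}")
        else:
            raise TypeError(f"unsupported type for field {f.name!r}: {hints[f.name]!r}")
    # Identifiers are interpolated directly; fine for a sketch, not for untrusted input.
    return f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)});"


@dataclass
class ArticleChunk:
    title: str
    content: str
    embedding: list[float]


sql = create_table_sql(ArticleChunk, "article_chunks", "embedding", 1536)
```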
PineconeTarget
Syncs chunks to a Pinecone index.
PineconeTarget(
    api_key="pc-...",
    index_name="my-index",
    schema=MyChunk,
    vector_column="embedding",
    namespace="",
    metric="cosine",
    cloud="aws",
    region="us-east-1",
)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| api_key | str \| None | Yes | - | Pinecone API key |
| index_name | str | Yes | - | Name of the Pinecone index |
| schema | type | Yes | - | Your chunk dataclass |
| vector_column | str | Yes | - | Field name on your schema that holds the vector |
| namespace | str | No | "" | Namespace within the index |
| metric | str | No | "cosine" | Distance metric ("cosine", "euclidean", "dotproduct") |
| cloud | str | No | "aws" | Cloud provider |
| region | str | No | "us-east-1" | Cloud region |
WeaviateTarget
Syncs chunks to a Weaviate collection.
# Local instance
WeaviateTarget(
    collection_name="MyChunks",
    schema=MyChunk,
    vector_column="embedding",
    url="http://localhost:8080",
)

# Weaviate Cloud
WeaviateTarget(
    collection_name="MyChunks",
    schema=MyChunk,
    vector_column="embedding",
    cluster_url="https://my-cluster.weaviate.network",
    api_key="wvp-...",
)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| collection_name | str | Yes | - | Weaviate collection name |
| schema | type | Yes | - | Your chunk dataclass |
| vector_column | str | Yes | - | Field name on your schema that holds the vector |
| url | str | No | "http://localhost:8080" | URL for a local Weaviate instance |
| cluster_url | str \| None | No | None | URL for a Weaviate Cloud cluster |
| api_key | str \| None | No | None | API key for Weaviate Cloud authentication |
Use either url for a local instance or cluster_url + api_key for a cloud deployment.
Embedder Connectors
SpruceUp runs a health check at startup that embeds a test string and reads the actual output size from the API. The embedding_dimensions parameter is optional on all embedders. If omitted, the dimension is detected automatically. If provided, SpruceUp validates it matches what the API actually returns and raises an error if not.
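The detect-or-validate behaviour described above can be sketched in a few lines. This is an illustration under assumptions, not SpruceUp's code: resolve_dimensions is a hypothetical name, and embed stands for any async callable that maps a list of strings to a list of vectors.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Embed = Callable[[list[str]], Awaitable[list[list[float]]]]


async def resolve_dimensions(embed: Embed, configured: Optional[int] = None) -> int:
    """Embed one probe string and return the observed vector length.

    If a dimension was configured, require it to match what came back.
    """
    [probe] = await embed(["health check"])
    actual = len(probe)
    if configured is not None and configured != actual:
        raise ValueError(
            f"embedding_dimensions={configured} but the API returned {actual}"
        )
    return actual
```

Failing at startup rather than at write time means a mismatch is reported before any vectors of the wrong size reach the target store.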
OpenAIEmbedder
OpenAIEmbedder(
    api_key="sk-...",
    model="text-embedding-3-small",
    max_batch_size=150,
    embedding_dimensions=None,
)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| api_key | str \| Callable[[], str] | Yes | - | OpenAI API key, or a callable that returns one |
| model | str | No | "text-embedding-3-small" | Embedding model |
| max_batch_size | int | No | 150 | Max texts per API call |
| embedding_dimensions | int \| None | No | None | Override output dimensions. If omitted, SpruceUp reads the actual dimension from the API at startup. |
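The effect of max_batch_size can be pictured as slicing the input list into groups no larger than the limit and embedding each group in turn. The sketch below is an assumption about the general technique, not SpruceUp's implementation; batched, embed_in_batches, and embed_batch are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Iterator


def batched(texts: list[str], max_batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of texts, each at most max_batch_size long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(texts), max_batch_size):
        yield texts[start:start + max_batch_size]


async def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], Awaitable[list[list[float]]]],
    max_batch_size: int = 150,
) -> list[list[float]]:
    """Call embed_batch once per slice and concatenate results in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, max_batch_size):
        vectors.extend(await embed_batch(batch))
    return vectors
```

With 301 texts and a limit of 150, this makes three calls of 150, 150, and 1 texts, and the output order matches the input order.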
CohereEmbedder
CohereEmbedder(
    api_key="...",
    model="embed-v4.0",
    max_batch_size=96,
    embedding_dimensions=None,
)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| api_key | str \| Callable[[], str] | Yes | - | Cohere API key, or a callable that returns one |
| model | str | No | "embed-v4.0" | Embedding model |
| max_batch_size | int | No | 96 | Max texts per API call |
| embedding_dimensions | int \| None | No | None | Override output dimensions. If omitted, SpruceUp reads the actual dimension from the API at startup. |
When using an embed-v4 model with a custom embedding_dimensions, the value must be one of 256, 512, 1024, or 1536.
GeminiEmbedder
GeminiEmbedder(
    api_key="...",
    model="gemini-embedding-001",
    max_batch_size=100,
)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| api_key | str \| Callable[[], str] | Yes | - | Google Generative AI API key, or a callable that returns one |
| model | str | No | "gemini-embedding-001" | Embedding model |
| max_batch_size | int | No | 100 | Max texts per API call (hard limit: 100) |
| embedding_dimensions | int \| None | No | None | Override output dimensions. If omitted, SpruceUp reads the actual dimension from the API at startup. |
VoyageAIEmbedder
VoyageAIEmbedder(
    api_key="...",
    model="voyage-4-large",
    max_batch_size=150,
    embedding_dimensions=None,
)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| api_key | str \| Callable[[], str] | Yes | - | Voyage AI API key, or a callable that returns one |
| model | str | No | "voyage-4-large" | Embedding model |
| max_batch_size | int | No | 150 | Max texts per API call |
| embedding_dimensions | int \| None | No | None | Override output dimensions. If omitted, SpruceUp reads the actual dimension from the API at startup. |
When using a voyage-4 model with a custom embedding_dimensions, the value must be one of 256, 512, 1024, or 2048.
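The per-model dimension limits documented for the Cohere and Voyage AI embedders can be checked before any API call is made. The sketch below encodes the two allowed sets stated in this README; matching models by name prefix and the helper name check_dimensions are assumptions made for illustration.

```python
from typing import Optional

# Allowed custom dimensions as stated in this README, keyed by model-name prefix.
ALLOWED_DIMENSIONS = {
    "embed-v4": (256, 512, 1024, 1536),
    "voyage-4": (256, 512, 1024, 2048),
}


def check_dimensions(model: str, embedding_dimensions: Optional[int]) -> None:
    """Raise ValueError if a custom dimension is not allowed for the model family."""
    if embedding_dimensions is None:
        return  # nothing to validate; the dimension is detected at startup
    for prefix, allowed in ALLOWED_DIMENSIONS.items():
        if model.startswith(prefix) and embedding_dimensions not in allowed:
            raise ValueError(
                f"{model} does not support embedding_dimensions={embedding_dimensions}; "
                f"choose one of {allowed}"
            )
```

Models outside the two listed families pass through unchecked here, since this README states no restriction for them.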
@memoize
The memoize decorator caches the results of expensive subfunctions inside your transform. Results are stored in the SpruceUp manifest (a local SQLite database), scoped per file and invalidated automatically when the decorated function's body changes.
import asyncio
from dataclasses import dataclass

from spruceup import FileProps, memoize

@dataclass
class MyChunk:
    content: str
    summary: str
    embedding: list[float]

@memoize(return_type=str)
async def summarize(text: str) -> str:
    # expensive LLM call
    ...

async def transform(*, file_props: FileProps, embed) -> list[MyChunk]:
    # split_into_chunks is your own chunking helper (not part of SpruceUp)
    chunk_strs = split_into_chunks(file_props.raw_content)
    # summarize each chunk concurrently; results are cached per file
    summaries = await asyncio.gather(*[summarize(c) for c in chunk_strs])
    embeddings = await embed(chunk_strs)
    return [
        MyChunk(content=c, summary=s, embedding=e)
        for c, s, e in zip(chunk_strs, summaries, embeddings)
    ]
| Parameter | Type | Required | Description |
|---|---|---|---|
| return_type | type | Yes | Return type of the decorated function, used for serialization |
Supported return types: str, int, float, bool, list, dict.
memoize only works on async functions; decorating a sync function raises a TypeError. A memoized function can only be called from inside a transform function: calling it outside of a transform context raises a RuntimeError.
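The invalidation rule (cached results are dropped when the function body changes) can be illustrated with a small in-memory decorator. This is a conceptual sketch only: memoize_sketch is a hypothetical name, it keeps results in a plain dict rather than the SQLite manifest, it has no per-file scoping or transform-context check, and it fingerprints the compiled bytecode and constants as a stand-in for "the function's body".

```python
import asyncio
import functools
import hashlib
import inspect
import json

_CACHE: dict[str, object] = {}  # stand-in for the manifest; lives only in memory


def memoize_sketch(func):
    """Cache results of an async function, keyed by its body and its arguments."""
    if not inspect.iscoroutinefunction(func):
        raise TypeError("memoize_sketch only supports async functions")
    # Fingerprint the compiled body so that editing the function changes every key.
    code = func.__code__
    body_hash = hashlib.sha256(code.co_code + repr(code.co_consts).encode()).hexdigest()

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        key_source = json.dumps(
            [func.__qualname__, body_hash, args, kwargs], sort_keys=True, default=repr
        )
        key = hashlib.sha256(key_source.encode()).hexdigest()
        if key not in _CACHE:
            _CACHE[key] = await func(*args, **kwargs)
        return _CACHE[key]

    return wrapper
```

Two identical calls issued concurrently may both run before the first result is stored; a production cache would need to deduplicate in-flight calls.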