Stream massive datasets, embed at scale, store as parquet in S3.
supernova
Generate massive pre-embedded datasets, then load them into vector databases.
Overview
supernova has two pipelines:
- Embedding -- stream data from HuggingFace, embed with dense and/or sparse models, write parquet to S3
- Loading -- stream pre-embedded parquet from S3/HuggingFace into vector stores (Qdrant)
Both pipelines are streaming (the full dataset is never loaded into memory), pluggable (add new sources, embedders, or stores by subclassing), and parallelizable (SkyPilot distributes both embedding and loading).
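As a rough illustration of what "streaming" means here, the sketch below groups any iterator into fixed-size batches without materializing it. The `batched` helper and the `fake_records` generator are hypothetical names used for illustration only; they are not part of supernova's API.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items, pulling lazily from `items`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fake_records(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a streamed dataset; rows are produced one at a time."""
    for i in range(n):
        yield {"source_row_id": i, "text": f"document {i}"}


if __name__ == "__main__":
    # Only one batch of rows is held in memory at any point.
    for batch in batched(fake_records(10), size=4):
        print(len(batch), batch[0]["source_row_id"])
```

Because both the source and the batching are generators, peak memory depends on the batch size rather than on the dataset size.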
Quickstart
uv sync
# 1. Embed a dataset locally
nova embed configs/embedder/nick007x_arxiv_papers.yaml
# 2. Embed across a distributed SkyPilot GPU pool
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml
# 3. Load into Qdrant
nova load configs/loader/ccnews_bge_large.yaml
# 4. Distributed loading (SkyPilot)
nova load-dist configs/loader/ccnews_bge_large.yaml
Project structure
supernova/
sources/ # Data sources (HuggingFace)
embedders/
dense/ # Dense embedding backends (OpenAI, sentence-transformers)
sparse/ # Sparse embedding backends (sentence-transformers SparseEncoder)
engine.py # EmbeddingEngine -- orchestrates dense/sparse/hybrid
hybrid.py # HybridEmbedder -- single forward pass for both
storage/ # Output backends (S3, HuggingFace Hub, local)
pipeline/ # Embedding orchestration (runner, worker, buffer)
loader/
datasource/ # Parquet readers (S3, HuggingFace)
vectorstore/ # Vector store backends (Qdrant)
runner.py # Loading orchestration
configs/
embedder/ # Embedding pipeline configs (single + distributed)
loader/ # Loading pipeline configs (single + distributed)
scripts/
run_embedder.py # nova embed CLI
run_embed_distributed.py # nova embed-dist CLI
run_loader.py # nova load CLI
run_load_distributed.py # nova load-dist CLI
Pipeline 1: Embedding
Configuration
source:
type: huggingface
dataset_name: nick007x/arxiv-papers
split: train
text_field: abstract
dense_embedder:
type: sentence_transformer # or openai
model: Alibaba-NLP/gte-multilingual-base
trust_remote_code: true
batch_size: 64
dtype: bfloat16
pipeline:
chunk_size: 100000
num_workers: 2
storage:
type: s3 # or hf, local
bucket: qdrant--vectorforge
prefix: arxiv-papers/gte-multilingual-base
output_dir: /tmp/supernova
Sparse embeddings
Add a sparse_embedder section to produce sparse vectors alongside dense:
dense_embedder:
type: sentence_transformer
model: Alibaba-NLP/gte-multilingual-base
trust_remote_code: true
batch_size: 64
dtype: bfloat16
sparse_embedder:
type: sentence_transformer
model: Alibaba-NLP/gte-multilingual-base
batch_size: 64
dtype: bfloat16
When both point to the same model, supernova automatically uses a hybrid encoder to minimize forward passes. You must specify at least one of dense_embedder or sparse_embedder.
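The selection rule described above could be sketched as follows. This is a minimal illustration, not supernova's actual code: `select_mode` and its return labels are hypothetical, and comparing only the `model` key is an assumption about how "the same model" is detected.

```python
from typing import Optional


def select_mode(dense: Optional[dict], sparse: Optional[dict]) -> str:
    """Pick an embedding mode from the two optional config sections.

    Hypothetical helper: the labels and the equality check on `model`
    are assumptions made for illustration.
    """
    if dense is None and sparse is None:
        raise ValueError("specify at least one of dense_embedder or sparse_embedder")
    if dense is None:
        return "sparse"
    if sparse is None:
        return "dense"
    # Both configured: share one encoder only when they name the same model.
    if dense.get("model") == sparse.get("model"):
        return "hybrid"
    return "dense+sparse"


if __name__ == "__main__":
    cfg = {"type": "sentence_transformer", "model": "Alibaba-NLP/gte-multilingual-base"}
    print(select_mode(cfg, dict(cfg)))
```

With the configuration shown earlier, both sections name `Alibaba-NLP/gte-multilingual-base`, so this sketch would choose the hybrid path.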
Dense embedders
| Type | Config key | Notes |
|---|---|---|
| OpenAI | openai | model, dimensions, batch_size, max_concurrent, base_url, api_key |
| Sentence Transformers | sentence_transformer | model, batch_size, dtype. Auto-detects CUDA/MPS/CPU |
The OpenAI embedder works with any OpenAI-compatible API via base_url (llama.cpp, vLLM, Ollama, etc.). Set api_key: none for local servers that don't require auth.
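One way to think about the `base_url` and `api_key` settings is as constructor arguments for an OpenAI-style client. The helper below is a hypothetical sketch (`openai_client_kwargs` is not a supernova function), and the fallback to `OPENAI_API_KEY` is an assumption about precedence. The localhost URL is only an example of a local OpenAI-compatible server.

```python
import os
from typing import Mapping, Optional


def openai_client_kwargs(
    cfg: Mapping[str, str], env: Optional[Mapping[str, str]] = None
) -> dict:
    """Build keyword arguments for an OpenAI-compatible client (illustrative)."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit config value first, then the environment.
    api_key = cfg.get("api_key") or env.get("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("set api_key in the config or OPENAI_API_KEY in the environment")
    kwargs = {"api_key": api_key}
    if cfg.get("base_url"):
        kwargs["base_url"] = cfg["base_url"]
    return kwargs


if __name__ == "__main__":
    local = {"base_url": "http://localhost:8000/v1", "api_key": "none"}
    print(openai_client_kwargs(local, env={}))
```

The official `openai` Python client accepts `api_key` and `base_url` in its constructor, so a dictionary like this could be passed as `OpenAI(**kwargs)`. In YAML, `api_key: none` parses as the string `"none"`, which serves as a placeholder for servers that ignore the key.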
Storage backends
| Type | Config key | Notes |
|---|---|---|
| S3 | s3 | bucket, prefix |
| HuggingFace Storage Buckets | hf | bucket_id, optional prefix, private. Writes to hf://buckets/{bucket_id}/... |
| Local | local | output_dir |
Running locally
nova embed configs/embedder/nick007x_arxiv_papers.yaml
Running at scale with SkyPilot
SkyPilot pools create GPU workers and distribute embedding jobs across them. Workers are reused -- setup happens once, not per-slice.
# Preview the plan
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml --dry-run
# Run (default: A10G spot, autoscaling)
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml
# Custom parallelism
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml --num-jobs 20
Override resources in your config:
resources:
accelerators: A10G:1
cloud: aws
use_spot: true
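To make the idea of distributing slices across workers concrete, here is a small sketch that splits a row range into contiguous slices, one per job. How supernova actually partitions a dataset is not documented in this README, so `plan_slices` and its even-split strategy are assumptions for illustration.

```python
def plan_slices(total_rows: int, num_jobs: int) -> list[tuple[int, int]]:
    """Split [0, total_rows) into at most `num_jobs` contiguous (start, end) ranges.

    Hypothetical helper: sizes differ by at most one row, and empty
    slices are dropped when there are more jobs than rows.
    """
    if total_rows < 0 or num_jobs < 1:
        raise ValueError("total_rows must be >= 0 and num_jobs >= 1")
    base, extra = divmod(total_rows, num_jobs)
    slices = []
    start = 0
    for job in range(num_jobs):
        # The first `extra` jobs each take one additional row.
        size = base + (1 if job < extra else 0)
        if size == 0:
            break
        slices.append((start, start + size))
        start += size
    return slices


if __name__ == "__main__":
    for start, end in plan_slices(1_000_000, 20)[:3]:
        print(start, end)
```

Each `(start, end)` pair could then be handed to one worker, which matches the "setup happens once, not per-slice" model: a worker finishes one slice and picks up the next.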
Output format
Parquet files with this schema:
| Column | Type | Description |
|---|---|---|
| row_id | int64 | Auto-incrementing record ID |
| source_row_id | int64 | Original row in the source dataset |
| chunk_id | int32 | Pipeline batch / slice ID |
| chunk_index | int32 | Position within a text split (0 if not split) |
| text | string | The embedded text |
| dense_embedding | list<float32> | Dense embedding vector (when configured) |
| sparse_embedding | struct{indices, values} | Sparse embedding (when configured) |
Query with DuckDB:
SELECT row_id, text[:80] AS preview, length(dense_embedding) AS dim
FROM 's3://qdrant--vectorforge/dataset/model/**/*.parquet'
LIMIT 10;
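For readers who prefer to see the schema as code, one output row could be modeled like this. The class and function names are hypothetical, and the element types of `indices` and `values` are assumptions, since the schema above only names the two struct fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SparseVector:
    indices: list[int]    # assumed: integer token/dimension IDs
    values: list[float]   # assumed: one weight per index


@dataclass
class EmbeddedRow:
    row_id: int
    source_row_id: int
    chunk_id: int
    chunk_index: int
    text: str
    dense_embedding: Optional[list[float]] = None      # present when configured
    sparse_embedding: Optional[SparseVector] = None    # present when configured


def to_sparse(weights: dict[int, float]) -> SparseVector:
    """Convert {index: weight} to parallel lists, dropping zeros and sorting by index."""
    items = sorted((i, w) for i, w in weights.items() if w != 0.0)
    return SparseVector([i for i, _ in items], [w for _, w in items])


if __name__ == "__main__":
    row = EmbeddedRow(
        row_id=0, source_row_id=0, chunk_id=0, chunk_index=0,
        text="example abstract",
        sparse_embedding=to_sparse({7: 0.5, 2: 1.25, 9: 0.0}),
    )
    print(row.sparse_embedding)
```

Storing a sparse vector as two parallel lists keeps only the non-zero entries, which is why the column is a struct rather than a fixed-length list.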
Pipeline 2: Loading
Configuration
vectors: # one entry per Qdrant vector name
dense:
type: dense # dense | sparse | multivector
column: dense_embedding # parquet column to read
distance: cosine # cosine | dot | euclid | manhattan
datasource:
type: s3 # s3 or huggingface
bucket: qdrant--vectorforge
prefix: stanford-oval--ccnews/baai_bge_large_en_v1.5
id_column: row_id # default
payload_fields: # what goes into the vector store payload
text: text # payload key: parquet column name
title: title
vectorstore:
type: qdrant
collection_name: ccnews-bge-large
url: ${QDRANT_URL} # env var substitution
api_key: ${QDRANT_API_KEY}
loader:
batch_size: 1000 # points per upsert
prefetch_size: 100000 # rows per DuckDB fetch (default: batch_size * 10)
concurrency: 8 # parallel upsert tasks
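The `${QDRANT_URL}` style placeholders in the config above could be resolved with a few lines of Python. This is a sketch of the general technique, not supernova's implementation: `substitute_env` is a hypothetical name, and raising on an unset variable is an assumption about the desired behavior.

```python
import os
import re
from typing import Any, Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Recursively replace ${NAME} in strings inside dicts and lists."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {k: substitute_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute_env(v, env) for v in value]
    if isinstance(value, str):
        def replace(match: "re.Match[str]") -> str:
            name = match.group(1)
            if name not in env:
                # Assumed policy: fail loudly instead of inserting an empty string.
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _VAR.sub(replace, value)
    return value  # numbers, booleans, None pass through unchanged


if __name__ == "__main__":
    cfg = {"vectorstore": {"type": "qdrant", "url": "${QDRANT_URL}"}}
    print(substitute_env(cfg, env={"QDRANT_URL": "http://localhost:6333"}))
```

Keeping credentials in the environment means the same config file can be committed to the repository and reused across machines.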
Running
nova load configs/loader/ccnews_bge_large.yaml
Datasources
| Type | Config key | Notes |
|---|---|---|
| S3 | s3 | bucket, prefix. Streams via DuckDB httpfs |
| HuggingFace | huggingface | repo_id, optional subdir. Streams via DuckDB hf:// protocol |
Vector stores
| Type | Config key | Notes |
|---|---|---|
| Qdrant | qdrant | url, api_key, collection_name. Retry with backoff on timeouts |
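The `vectors`, `id_column`, and `payload_fields` settings from the loader configuration together describe how one parquet row becomes one point in the vector store. A minimal sketch of that mapping, using plain dictionaries rather than any client library, might look like this. `row_to_point` is a hypothetical helper, and the `id` / `vector` / `payload` layout is modeled on the general shape of a Qdrant point.

```python
from typing import Any, Mapping


def row_to_point(
    row: Mapping[str, Any],
    vectors: Mapping[str, Mapping[str, str]],
    payload_fields: Mapping[str, str],
    id_column: str = "row_id",
) -> dict:
    """Map one parquet row to a point-shaped dict (illustrative only)."""
    return {
        "id": row[id_column],
        # One named vector per entry in `vectors`, read from its parquet column.
        "vector": {name: row[spec["column"]] for name, spec in vectors.items()},
        # Payload key on the left, parquet column name on the right.
        "payload": {key: row[column] for key, column in payload_fields.items()},
    }


if __name__ == "__main__":
    row = {"row_id": 42, "text": "hello", "title": "t", "dense_embedding": [0.1, 0.2]}
    vectors = {"dense": {"type": "dense", "column": "dense_embedding", "distance": "cosine"}}
    print(row_to_point(row, vectors, {"text": "text", "title": "title"}))
```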
How it works
- DuckDB streams parquet data in large prefetch chunks (minimizes S3 round trips)
- Chunks are sliced into upsert-sized batches and written concurrently via asyncio
- Deferred indexing -- HNSW construction is disabled during load, then enabled for one efficient batch build
- Failed upserts are retried with exponential backoff
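The batching, concurrency, and retry steps above can be sketched with `asyncio` alone. This is an illustration under stated assumptions rather than the loader's real code: `load_chunk`, `upsert_with_retry`, and the `upsert` callable are hypothetical, only `TimeoutError` is treated as retryable, and deferred indexing is left out because it depends on the vector store client.

```python
import asyncio
import random
from typing import Awaitable, Callable, Sequence

Upsert = Callable[[Sequence[dict]], Awaitable[None]]


async def upsert_with_retry(
    upsert: Upsert, batch: Sequence[dict], max_attempts: int = 5, base_delay: float = 0.5
) -> None:
    """Call `upsert`, retrying timeouts with exponential backoff plus small jitter."""
    for attempt in range(max_attempts):
        try:
            await upsert(batch)
            return
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def load_chunk(
    rows: Sequence[dict],
    upsert: Upsert,
    batch_size: int = 1000,
    concurrency: int = 8,
    base_delay: float = 0.5,
) -> int:
    """Slice one prefetched chunk into batches and upsert them concurrently."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(batch: Sequence[dict]) -> None:
        # At most `concurrency` batches are in flight (including their retries).
        async with semaphore:
            await upsert_with_retry(upsert, batch, base_delay=base_delay)

    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    await asyncio.gather(*(guarded(b) for b in batches))
    return len(batches)


if __name__ == "__main__":
    async def fake_upsert(batch: Sequence[dict]) -> None:
        await asyncio.sleep(0)  # stand-in for a network call

    rows = [{"row_id": i} for i in range(2500)]
    print(asyncio.run(load_chunk(rows, fake_upsert)))
```

The parameters mirror the `batch_size` and `concurrency` keys of the loader config. A large prefetch chunk amortizes storage reads, while the semaphore keeps the number of simultaneous upserts bounded.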
Distributed loading with SkyPilot
For terabyte-scale datasets, fan out across SkyPilot spot instances:
nova load-dist configs/loader/ccnews_bge_large.yaml
nova load-dist configs/loader/ccnews_bge_large.yaml --dry-run
nova load-dist configs/loader/ccnews_bge_large.yaml --num-shards 20
Environment variables
| Variable | Required for |
|---|---|
| OPENAI_API_KEY | OpenAI embedder |
| HF_TOKEN | HuggingFace Hub storage / datasource |
| AWS_ACCESS_KEY_ID | S3 storage / datasource |
| AWS_SECRET_ACCESS_KEY | S3 storage / datasource |
| AWS_SESSION_TOKEN | S3 with AWS SSO |
| QDRANT_URL | Qdrant vector store |
| QDRANT_API_KEY | Qdrant vector store |
Tests
uv run pytest tests/ -v
Documentation
- Introduction -- concepts, mental model, architecture diagrams
- Installation -- setup, environment variables, SkyPilot configuration
- Quickstart -- embed a dataset and load it into Qdrant end-to-end
- Embedding Generation -- dense/sparse embedders, SkyPilot at scale, output format
- Data Loading -- column mapping, payload composition, distributed loading
- Loader Architecture -- internal design docs
- AWS SSO Setup -- configuring AWS SSO credentials
- SkyPilot -- distributed compute setup and cost estimates