Stream massive datasets, embed at scale, store as parquet in S3.

supernova

Generate massive pre-embedded datasets, then load them into vector databases.

Overview

supernova has two pipelines:

  1. Embedding -- stream data from HuggingFace, embed with dense and/or sparse models, write parquet to S3
  2. Loading -- stream pre-embedded parquet from S3/HuggingFace into vector stores (Qdrant)

Both pipelines are streaming (they never load the full dataset into memory), pluggable (add new sources, embedders, or stores by subclassing), and parallelizable (SkyPilot distributes both embedding and loading).
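As a rough illustration of the streaming and subclassing ideas, here is a minimal sketch. The names Source, ListSource, and iter_chunks are hypothetical and are not supernova's actual API; the sketch only shows how a pluggable source can be consumed in fixed-size chunks without materializing the whole dataset.

```python
from abc import ABC, abstractmethod
from itertools import islice
from typing import Iterable, Iterator


class Source(ABC):
    """Hypothetical base class: a source yields one text record at a time."""

    @abstractmethod
    def records(self) -> Iterator[str]:
        ...


class ListSource(Source):
    """Toy source backed by any iterable, for illustration only."""

    def __init__(self, rows: Iterable[str]) -> None:
        self._rows = rows

    def records(self) -> Iterator[str]:
        yield from self._rows


def iter_chunks(source: Source, chunk_size: int) -> Iterator[list[str]]:
    """Yield lists of at most chunk_size records, pulling lazily from the source."""
    it = source.records()
    while chunk := list(islice(it, chunk_size)):
        yield chunk


if __name__ == "__main__":
    # Only one chunk is held in memory at a time.
    for chunk in iter_chunks(ListSource(f"doc {i}" for i in range(5)), chunk_size=2):
        print(chunk)
```

A new source in this sketch would only need to implement records(); the chunking logic stays unchanged.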

Quickstart

uv sync

# 1. Embed a dataset locally
nova embed configs/embedder/nick007x_arxiv_papers.yaml

# 2. Embed distributed across SkyPilot GPU pool
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml

# 3. Load into Qdrant
nova load configs/loader/ccnews_bge_large.yaml

# 4. Distributed loading (SkyPilot)
nova load-dist configs/loader/ccnews_bge_large.yaml

Project structure

supernova/
  sources/            # Data sources (HuggingFace)
  embedders/
    dense/            # Dense embedding backends (OpenAI, sentence-transformers)
    sparse/           # Sparse embedding backends (sentence-transformers SparseEncoder)
    engine.py         # EmbeddingEngine -- orchestrates dense/sparse/hybrid
    hybrid.py         # HybridEmbedder -- single forward pass for both
  storage/            # Output backends (S3, HuggingFace Hub, local)
  pipeline/           # Embedding orchestration (runner, worker, buffer)
  loader/
    datasource/       # Parquet readers (S3, HuggingFace)
    vectorstore/      # Vector store backends (Qdrant)
    runner.py         # Loading orchestration

configs/
  embedder/           # Embedding pipeline configs (single + distributed)
  loader/             # Loading pipeline configs (single + distributed)

scripts/
  run_embedder.py           # nova embed CLI
  run_embed_distributed.py  # nova embed-dist CLI
  run_loader.py             # nova load CLI
  run_load_distributed.py   # nova load-dist CLI

Pipeline 1: Embedding

Configuration

source:
  type: huggingface
  dataset_name: nick007x/arxiv-papers
  split: train
  text_field: abstract

dense_embedder:
  type: sentence_transformer    # or openai
  model: Alibaba-NLP/gte-multilingual-base
  trust_remote_code: true
  batch_size: 64
  dtype: bfloat16

pipeline:
  chunk_size: 100000
  num_workers: 2

storage:
  type: s3                      # or hf, local
  bucket: qdrant--vectorforge
  prefix: arxiv-papers/gte-multilingual-base
  output_dir: /tmp/supernova

Sparse embeddings

Add a sparse_embedder section to produce sparse vectors alongside dense:

dense_embedder:
  type: sentence_transformer
  model: Alibaba-NLP/gte-multilingual-base
  trust_remote_code: true
  batch_size: 64
  dtype: bfloat16

sparse_embedder:
  type: sentence_transformer
  model: Alibaba-NLP/gte-multilingual-base
  batch_size: 64
  dtype: bfloat16

When both point to the same model, supernova automatically uses a hybrid encoder to minimize forward passes. You must specify at least one of dense_embedder or sparse_embedder.
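The rule above can be sketched as a small decision function. This is an assumption about the logic, not supernova's implementation: select_mode is a hypothetical name, and treating "same model" as "same type and same model string" is a guess.

```python
def select_mode(config: dict) -> str:
    """Return which embedding mode a config implies (illustrative only)."""
    dense = config.get("dense_embedder")
    sparse = config.get("sparse_embedder")

    if not dense and not sparse:
        raise ValueError("specify at least one of dense_embedder or sparse_embedder")

    if dense and sparse:
        # Assumption: "same model" means identical type and model name.
        same_model = (
            dense.get("type") == sparse.get("type")
            and dense.get("model") == sparse.get("model")
        )
        return "hybrid" if same_model else "dense+sparse"

    return "dense" if dense else "sparse"
```

With the configuration shown above, where both sections name Alibaba-NLP/gte-multilingual-base, this sketch would pick the hybrid path.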

Dense embedders

Type                   Config key            Notes
OpenAI                 openai                model, dimensions, batch_size, max_concurrent, base_url, api_key
Sentence Transformers  sentence_transformer  model, batch_size, dtype. Auto-detects CUDA/MPS/CPU

The OpenAI embedder supports any OpenAI-compatible API via base_url (llama.cpp, vLLM, Ollama, etc). Set api_key: none for local servers that don't require auth.
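To make the base_url idea concrete, the sketch below builds the HTTP request an OpenAI-compatible server expects at its embeddings endpoint, using only the standard library. The function name build_embeddings_request and the example URL are hypothetical, and skipping the Authorization header when api_key is "none" is an assumption about how a client might treat that value, not documented supernova behavior.

```python
import json
from urllib.request import Request


def build_embeddings_request(
    base_url: str, api_key: str, model: str, texts: list[str]
) -> Request:
    """Build a POST request for {base_url}/embeddings (OpenAI-compatible)."""
    url = base_url.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    # Assumption: "none" means the local server needs no auth header.
    if api_key and api_key != "none":
        headers["Authorization"] = f"Bearer {api_key}"
    return Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical local server; nothing is sent until urlopen() is called.
    req = build_embeddings_request(
        "http://localhost:8000/v1", "none", "my-model", ["hello world"]
    )
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; the same request shape works against a hosted API or a local llama.cpp, vLLM, or Ollama server that exposes this endpoint.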

Storage backends

Type                         Config key  Notes
S3                           s3          bucket, prefix
HuggingFace Storage Buckets  hf          bucket_id, optional prefix, private. Writes to hf://buckets/{bucket_id}/...
Local                        local       output_dir

Running locally

nova embed configs/embedder/nick007x_arxiv_papers.yaml

Running at scale with SkyPilot

SkyPilot pools create GPU workers and distribute embedding jobs across them. Workers are reused -- setup happens once, not per-slice.

# Preview the plan
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml --dry-run

# Run (default: A10G spot, autoscaling)
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml

# Custom parallelism
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml --num-jobs 20
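One way to picture how a dataset might be split across jobs is contiguous row ranges of near-equal size. The sketch below is purely illustrative: plan_slices is a hypothetical helper, and the actual slicing strategy used by nova embed-dist is not described here.

```python
def plan_slices(total_rows: int, num_jobs: int) -> list[tuple[int, int]]:
    """Split [0, total_rows) into up to num_jobs contiguous (start, end) ranges."""
    if num_jobs <= 0:
        raise ValueError("num_jobs must be positive")
    base, extra = divmod(total_rows, num_jobs)
    slices = []
    start = 0
    for job in range(num_jobs):
        # The first `extra` jobs take one additional row each.
        size = base + (1 if job < extra else 0)
        if size == 0:
            break  # more jobs than rows: stop once rows run out
        slices.append((start, start + size))
        start += size
    return slices


if __name__ == "__main__":
    for start, end in plan_slices(total_rows=10, num_jobs=3):
        print(f"rows {start}..{end - 1}")
```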

Override resources in your config:

resources:
  accelerators: A10G:1
  cloud: aws
  use_spot: true

Output format

Parquet files with this schema:

Column            Type                     Description
row_id            int64                    Auto-incrementing record ID
source_row_id     int64                    Original row in the source dataset
chunk_id          int32                    Pipeline batch / slice ID
chunk_index       int32                    Position within a text split (0 if not split)
text              string                   The embedded text
dense_embedding   list<float32>            Dense embedding vector (when configured)
sparse_embedding  struct{indices, values}  Sparse embedding (when configured)

Query with DuckDB:

SELECT row_id, text[:80] AS preview, length(dense_embedding) AS dim
FROM 's3://qdrant--vectorforge/dataset/model/**/*.parquet'
LIMIT 10;
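The sparse_embedding column is a struct with indices and values fields. Assuming these are parallel arrays (position i of values belongs to dimension indices[i], which is the usual sparse layout but is not stated explicitly here), a row can be handled in plain Python as sketched below. The helper names are hypothetical.

```python
def sparse_to_dict(sparse: dict) -> dict[int, float]:
    """Map each dimension index to its weight."""
    return dict(zip(sparse["indices"], sparse["values"]))


def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors; only shared indices contribute."""
    da, db = sparse_to_dict(a), sparse_to_dict(b)
    if len(da) > len(db):
        da, db = db, da  # iterate over the smaller vector
    return sum(weight * db[idx] for idx, weight in da.items() if idx in db)


if __name__ == "__main__":
    doc = {"indices": [1, 5], "values": [0.5, 2.0]}
    query = {"indices": [5, 7], "values": [3.0, 1.0]}
    print(sparse_dot(doc, query))
```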

Pipeline 2: Loading

Configuration

vectors:                            # one entry per Qdrant vector name
  dense:
    type: dense                     # dense | sparse | multivector
    column: dense_embedding         # parquet column to read
    distance: cosine                # cosine | dot | euclid | manhattan

datasource:
  type: s3                          # s3 or huggingface
  bucket: qdrant--vectorforge
  prefix: stanford-oval--ccnews/baai_bge_large_en_v1.5
  id_column: row_id                 # default
  payload_fields:                   # what goes into the vector store payload
    text: text                      # payload key: parquet column name
    title: title

vectorstore:
  type: qdrant
  collection_name: ccnews-bge-large
  url: ${QDRANT_URL}                # env var substitution
  api_key: ${QDRANT_API_KEY}

loader:
  batch_size: 1000                  # points per upsert
  prefetch_size: 100000             # rows per DuckDB fetch (default: batch_size * 10)
  concurrency: 8                    # parallel upsert tasks
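Two details of this config can be sketched in a few lines: ${VAR} substitution from the environment and the prefetch_size default of batch_size * 10. The function names are hypothetical, and raising on a missing variable is an assumption; supernova's own config loader may behave differently.

```python
import os
import re
from typing import Any, Mapping, Optional

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Recursively replace ${NAME} in strings with values from env."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: substitute_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [substitute_env(item, env) for item in value]
    if isinstance(value, str):
        def replace(match: re.Match) -> str:
            name = match.group(1)
            if name not in env:
                # Assumption: fail loudly rather than insert an empty string.
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _ENV_PATTERN.sub(replace, value)
    return value


def effective_prefetch_size(loader: Mapping[str, int]) -> int:
    """Use prefetch_size if given, otherwise batch_size * 10."""
    return loader.get("prefetch_size", loader["batch_size"] * 10)
```

Applied to the vectorstore section above, substitute_env would turn url: ${QDRANT_URL} into the value of that environment variable while leaving non-string settings untouched.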

Running

nova load configs/loader/ccnews_bge_large.yaml

Datasources

Type         Config key   Notes
S3           s3           bucket, prefix. Streams via DuckDB httpfs
HuggingFace  huggingface  repo_id, optional subdir. Streams via DuckDB hf:// protocol

Vector stores

Type    Config key  Notes
Qdrant  qdrant      url, api_key, collection_name. Retry with backoff on timeouts

How it works

  1. DuckDB streams parquet data in large prefetch chunks (minimizes S3 round trips)
  2. Chunks are sliced into upsert-sized batches and written concurrently via asyncio
  3. Deferred indexing -- HNSW construction is disabled during load, then enabled for one efficient batch build
  4. Failed upserts are retried with exponential backoff
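Steps 1, 2, and 4 above can be sketched with asyncio as follows. This is a simplified illustration under stated assumptions, not the loader's real code: the upsert callable stands in for a vector store client, the function names are hypothetical, and only TimeoutError is retried. Deferred indexing (step 3) depends on the vector store and is not shown.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, Sequence

Upsert = Callable[[Sequence], Awaitable[None]]


def slice_batches(rows: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Cut one prefetched chunk into upsert-sized batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


async def upsert_with_retry(
    upsert: Upsert, batch: Sequence, max_attempts: int = 5, base_delay: float = 0.5
) -> None:
    """Retry a timed-out upsert, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            await upsert(batch)
            return
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def load_chunk(
    rows: Sequence,
    upsert: Upsert,
    batch_size: int = 1000,
    concurrency: int = 8,
    base_delay: float = 0.5,
) -> None:
    """Write all batches of a chunk with at most `concurrency` upserts in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(batch: Sequence) -> None:
        async with semaphore:
            await upsert_with_retry(upsert, batch, base_delay=base_delay)

    await asyncio.gather(*(guarded(b) for b in slice_batches(rows, batch_size)))
```

The semaphore plays the role of the concurrency setting in the loader config, and batch_size corresponds to points per upsert.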

Distributed loading with SkyPilot

For terabyte-scale datasets, fan out across SkyPilot spot instances:

# Run
nova load-dist configs/loader/ccnews_bge_large.yaml
# Preview the plan
nova load-dist configs/loader/ccnews_bge_large.yaml --dry-run
# Custom parallelism
nova load-dist configs/loader/ccnews_bge_large.yaml --num-shards 20

Environment variables

Variable               Required for
OPENAI_API_KEY         OpenAI embedder
HF_TOKEN               HuggingFace Hub storage / datasource
AWS_ACCESS_KEY_ID      S3 storage / datasource
AWS_SECRET_ACCESS_KEY  S3 storage / datasource
AWS_SESSION_TOKEN      S3 with AWS SSO
QDRANT_URL             Qdrant vector store
QDRANT_API_KEY         Qdrant vector store

Tests

uv run pytest tests/ -v
