Stream massive datasets, embed at scale, store as parquet in S3.

supernova

Generate massive pre-embedded datasets, then load them into vector databases.

Overview

supernova has two pipelines:

Embedding -- stream data from HuggingFace, embed with dense and/or sparse models, write parquet to S3
Loading -- stream pre-embedded parquet from S3/HuggingFace into vector stores (Qdrant)

Both pipelines are streaming (they never load the full dataset into memory), pluggable (add new sources, embedders, or stores by subclassing), and parallelizable (SkyPilot distributes both embedding and loading).
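To make the streaming claim concrete, here is a minimal sketch of batch-wise iteration over a source that is never fully materialized. The names stream_batches and fake_source are hypothetical and are not part of supernova's API; they only illustrate the pattern.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def stream_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield fixed-size batches from any iterable without materializing it."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        # Only batch_size records are held in memory at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_source(n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming source; yields one record at a time."""
    for i in range(n):
        yield f"doc-{i}"


batch_sizes = [len(batch) for batch in stream_batches(fake_source(10), batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

Because the generator pulls from the source lazily, peak memory depends on batch_size rather than on the dataset size.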

Quickstart

uv sync

# 1. Embed a dataset locally
nova embed configs/embedder/nick007x_arxiv_papers.yaml

# 2. Embed distributed across SkyPilot GPU pool
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml

# 3. Load into Qdrant
nova load configs/loader/ccnews_bge_large.yaml

# 4. Distributed loading (SkyPilot)
nova load-dist configs/loader/ccnews_bge_large.yaml

Project structure

supernova/
  sources/            # Data sources (HuggingFace)
  embedders/
    dense/            # Dense embedding backends (OpenAI, sentence-transformers)
    sparse/           # Sparse embedding backends (sentence-transformers SparseEncoder)
    engine.py         # EmbeddingEngine -- orchestrates dense/sparse/hybrid
    hybrid.py         # HybridEmbedder -- single forward pass for both
  storage/            # Output backends (S3, HuggingFace Hub, local)
  pipeline/           # Embedding orchestration (runner, worker, buffer)
  loader/
    datasource/       # Parquet readers (S3, HuggingFace)
    vectorstore/      # Vector store backends (Qdrant)
    runner.py         # Loading orchestration

configs/
  embedder/           # Embedding pipeline configs (single + distributed)
  loader/             # Loading pipeline configs (single + distributed)

scripts/
  run_embedder.py           # nova embed CLI
  run_embed_distributed.py  # nova embed-dist CLI
  run_loader.py             # nova load CLI
  run_load_distributed.py   # nova load-dist CLI

Pipeline 1: Embedding

Configuration

source:
  type: huggingface
  dataset_name: nick007x/arxiv-papers
  split: train
  text_field: abstract

dense_embedder:
  type: sentence_transformer    # or openai
  model: Alibaba-NLP/gte-multilingual-base
  trust_remote_code: true
  batch_size: 64
  dtype: bfloat16

pipeline:
  chunk_size: 100000
  num_workers: 2

storage:
  type: s3                      # or hf, local
  bucket: qdrant--vectorforge
  prefix: arxiv-papers/gte-multilingual-base
  output_dir: /tmp/supernova

Sparse embeddings

Add a sparse_embedder section to produce sparse vectors alongside dense:

dense_embedder:
  type: sentence_transformer
  model: Alibaba-NLP/gte-multilingual-base
  trust_remote_code: true
  batch_size: 64
  dtype: bfloat16

sparse_embedder:
  type: sentence_transformer
  model: Alibaba-NLP/gte-multilingual-base
  batch_size: 64
  dtype: bfloat16

When both point to the same model, supernova automatically uses a hybrid encoder that produces the dense and sparse vectors in a single forward pass. You must specify at least one of dense_embedder or sparse_embedder.
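The selection rule above can be sketched as a small function over the parsed config. This is an illustration only: the function name, the returned mode strings, and the choice to compare both type and model are assumptions, not supernova's actual implementation.

```python
from typing import Any, Dict, Optional


def select_embedding_mode(config: Dict[str, Any]) -> str:
    """Return a hypothetical mode name for the embedders present in the config."""
    dense: Optional[Dict[str, Any]] = config.get("dense_embedder")
    sparse: Optional[Dict[str, Any]] = config.get("sparse_embedder")

    if not dense and not sparse:
        raise ValueError("specify at least one of dense_embedder or sparse_embedder")

    if dense and sparse:
        # Assumption: "same model" means the same backend type and model name.
        same_model = (
            dense.get("type") == sparse.get("type")
            and dense.get("model") == sparse.get("model")
        )
        return "hybrid" if same_model else "dense+sparse"

    return "dense" if dense else "sparse"


config = {
    "dense_embedder": {"type": "sentence_transformer", "model": "Alibaba-NLP/gte-multilingual-base"},
    "sparse_embedder": {"type": "sentence_transformer", "model": "Alibaba-NLP/gte-multilingual-base"},
}
print(select_embedding_mode(config))  # hybrid
```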

Dense embedders

Type	Config key	Notes
OpenAI	`openai`	`model`, `dimensions`, `batch_size`, `max_concurrent`, `base_url`, `api_key`
Sentence Transformers	`sentence_transformer`	`model`, `batch_size`, `dtype`. Auto-detects CUDA/MPS/CPU

The OpenAI embedder supports any OpenAI-compatible API via base_url (llama.cpp, vLLM, Ollama, etc). Set api_key: none for local servers that don't require auth.
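For readers unfamiliar with what "OpenAI-compatible" means on the wire, the sketch below builds (but does not send) a request to the standard embeddings endpoint using only the standard library. The helper name, the example URL, and the treatment of api_key "none" as "omit the Authorization header" are assumptions for illustration; supernova's own client code may differ.

```python
import json
import urllib.request
from typing import List, Optional


def build_embeddings_request(
    base_url: str,
    model: str,
    texts: List[str],
    api_key: str = "none",
    dimensions: Optional[int] = None,
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /embeddings endpoint."""
    payload = {"model": model, "input": texts}
    if dimensions is not None:
        payload["dimensions"] = dimensions

    headers = {"Content-Type": "application/json"}
    # Assumption: "none" means the local server needs no bearer token.
    if api_key and api_key.lower() != "none":
        headers["Authorization"] = f"Bearer {api_key}"

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical local server address; nothing is sent over the network here.
request = build_embeddings_request("http://localhost:8000/v1", "my-model", ["hello world"])
print(request.full_url)  # http://localhost:8000/v1/embeddings
```

Any server that accepts this request shape and returns the matching response can stand in for the hosted API, which is why a single base_url setting is enough to switch backends.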

Storage backends

Type	Config key	Notes
S3	`s3`	`bucket`, `prefix`
HuggingFace Storage Buckets	`hf`	`bucket_id`, optional `prefix`, `private`. Writes to `hf://buckets/{bucket_id}/...`
Local	`local`	`output_dir`

Running locally

nova embed configs/embedder/nick007x_arxiv_papers.yaml

Running at scale with SkyPilot

SkyPilot pools create GPU workers and distribute embedding jobs across them. Workers are reused -- setup happens once, not per-slice.

# Preview the plan
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml --dry-run

# Run (default: A10G spot, autoscaling)
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml

# Custom parallelism
nova embed-dist configs/embedder/nick007x_arxiv_papers.yaml --num-jobs 20

Override resources in your config:

resources:
  accelerators: A10G:1
  cloud: aws
  use_spot: true
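One way to picture what --num-jobs controls is a split of the dataset into contiguous row ranges, one per job. The sketch below is an assumption about the general technique, not a description of how supernova actually assigns slices; plan_slices is a hypothetical helper.

```python
from typing import List, Tuple


def plan_slices(total_rows: int, num_jobs: int) -> List[Tuple[int, int]]:
    """Split [0, total_rows) into up to num_jobs contiguous half-open ranges."""
    if num_jobs <= 0:
        raise ValueError("num_jobs must be positive")
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")

    base, extra = divmod(total_rows, num_jobs)
    slices: List[Tuple[int, int]] = []
    start = 0
    for job in range(num_jobs):
        # The first `extra` jobs take one additional row each.
        size = base + (1 if job < extra else 0)
        if size == 0:
            continue  # more jobs than rows: skip empty slices
        slices.append((start, start + size))
        start += size
    return slices


print(plan_slices(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Because each range is independent, a worker that loses its spot instance only needs its own slice to be rerun.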

Output format

Parquet files with this schema:

Column	Type	Description
`row_id`	int64	Auto-incrementing record ID
`source_row_id`	int64	Original row in the source dataset
`chunk_id`	int32	Pipeline batch / slice ID
`chunk_index`	int32	Position within a text split (0 if not split)
`text`	string	The embedded text
`dense_embedding`	list<float32>	Dense embedding vector (when configured)
`sparse_embedding`	struct{indices, values}	Sparse embedding (when configured)

Query with DuckDB:

SELECT row_id, text[:80] AS preview, length(dense_embedding) AS dim
FROM 's3://qdrant--vectorforge/dataset/model/**/*.parquet'
LIMIT 10;
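The sparse_embedding column in the schema above stores parallel indices and values lists. As a rough sketch of that layout, the helper below converts a token-weight mapping into that shape. Sorting by index and dropping zero weights are assumptions made for the example, and to_sparse_struct is a hypothetical name.

```python
from typing import Dict, List, TypedDict


class SparseEmbedding(TypedDict):
    indices: List[int]
    values: List[float]


def to_sparse_struct(weights: Dict[int, float]) -> SparseEmbedding:
    """Convert {token_index: weight} into parallel, index-sorted lists."""
    # Assumption: zero weights carry no information and are omitted.
    items = sorted((index, value) for index, value in weights.items() if value != 0.0)
    return {
        "indices": [index for index, _ in items],
        "values": [value for _, value in items],
    }


print(to_sparse_struct({7: 0.5, 2: 1.25, 9: 0.0}))
# {'indices': [2, 7], 'values': [1.25, 0.5]}
```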

Pipeline 2: Loading

Configuration

vectors:                            # one entry per Qdrant vector name
  dense:
    type: dense                     # dense | sparse | multivector
    column: dense_embedding         # parquet column to read
    distance: cosine                # cosine | dot | euclid | manhattan

datasource:
  type: s3                          # s3 or huggingface
  bucket: qdrant--vectorforge
  prefix: stanford-oval--ccnews/baai_bge_large_en_v1.5
  id_column: row_id                 # default
  payload_fields:                   # what goes into the vector store payload
    text: text                      # payload key: parquet column name
    title: title

vectorstore:
  type: qdrant
  collection_name: ccnews-bge-large
  url: ${QDRANT_URL}                # env var substitution
  api_key: ${QDRANT_API_KEY}

loader:
  batch_size: 1000                  # points per upsert
  prefetch_size: 100000             # rows per DuckDB fetch (default: batch_size * 10)
  concurrency: 8                    # parallel upsert tasks
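The ${QDRANT_URL} style placeholders in the configuration above are resolved from the environment. A minimal sketch of such substitution over a parsed config follows. The function name and the decision to raise on an unset variable are assumptions; supernova's own resolver may behave differently.

```python
import os
import re
from typing import Any, Mapping

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: Any, env: Mapping[str, str] = os.environ) -> Any:
    """Recursively replace ${NAME} placeholders in strings, dicts, and lists."""
    if isinstance(value, str):
        def replace(match: "re.Match[str]") -> str:
            name = match.group(1)
            if name not in env:
                # Assumption: failing early beats sending "${QDRANT_URL}" to a client.
                raise KeyError(f"environment variable {name} is not set")
            return env[name]

        return _ENV_PATTERN.sub(replace, value)
    if isinstance(value, dict):
        return {key: substitute_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [substitute_env(item, env) for item in value]
    return value  # numbers, booleans, None pass through unchanged


config = {"vectorstore": {"type": "qdrant", "url": "${QDRANT_URL}", "api_key": "${QDRANT_API_KEY}"}}
fake_env = {"QDRANT_URL": "http://localhost:6333", "QDRANT_API_KEY": "secret"}
print(substitute_env(config, fake_env)["vectorstore"]["url"])  # http://localhost:6333
```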

Running

nova load configs/loader/ccnews_bge_large.yaml

Datasources

Type	Config key	Notes
S3	`s3`	`bucket`, `prefix`. Streams via DuckDB httpfs
HuggingFace	`huggingface`	`repo_id`, optional `subdir`. Streams via DuckDB `hf://` protocol
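The id_column, payload_fields, and vectors settings from the loading configuration together describe how a parquet row becomes a point. The sketch below shows one plausible mapping using plain dictionaries. The row_to_point helper and the id/vector/payload dictionary layout are illustrative assumptions and do not call any vector store client.

```python
from typing import Any, Dict, Mapping


def row_to_point(
    row: Mapping[str, Any],
    id_column: str,
    payload_fields: Mapping[str, str],
    vectors: Mapping[str, Mapping[str, str]],
) -> Dict[str, Any]:
    """Map one parquet row to a point-like dict using the loader config keys."""
    return {
        "id": row[id_column],
        # One entry per configured vector name, read from its parquet column.
        "vector": {name: row[spec["column"]] for name, spec in vectors.items()},
        # payload key -> parquet column name, as in payload_fields.
        "payload": {key: row[column] for key, column in payload_fields.items()},
    }


row = {"row_id": 42, "text": "hello", "title": "greeting", "dense_embedding": [0.1, 0.2]}
point = row_to_point(
    row,
    id_column="row_id",
    payload_fields={"text": "text", "title": "title"},
    vectors={"dense": {"type": "dense", "column": "dense_embedding", "distance": "cosine"}},
)
print(point["id"], sorted(point["payload"]))  # 42 ['text', 'title']
```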

Vector stores

Type	Config key	Notes
Qdrant	`qdrant`	`url`, `api_key`, `collection_name`. Retry with backoff on timeouts

How it works

DuckDB streams parquet data in large prefetch chunks (minimizes S3 round trips)
Chunks are sliced into upsert-sized batches and written concurrently via asyncio
Deferred indexing -- HNSW construction is disabled during load, then enabled for one efficient batch build
Failed upserts are retried with exponential backoff

Distributed loading with SkyPilot

For terabyte-scale datasets, fan out across SkyPilot spot instances:

# Run
nova load-dist configs/loader/ccnews_bge_large.yaml

# Preview the plan
nova load-dist configs/loader/ccnews_bge_large.yaml --dry-run

# Custom shard count
nova load-dist configs/loader/ccnews_bge_large.yaml --num-shards 20

Environment variables

Variable	Required for
`OPENAI_API_KEY`	OpenAI embedder
`HF_TOKEN`	HuggingFace Hub storage / datasource
`AWS_ACCESS_KEY_ID`	S3 storage / datasource
`AWS_SECRET_ACCESS_KEY`	S3 storage / datasource
`AWS_SESSION_TOKEN`	S3 with AWS SSO
`QDRANT_URL`	Qdrant vector store
`QDRANT_API_KEY`	Qdrant vector store
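A quick preflight check can catch missing variables before a long run starts. The sketch below derives the required names from the table above and a parsed config. The helper and the mapping are hypothetical simplifications: AWS_SESSION_TOKEN is left out because it only applies to SSO, and OPENAI_API_KEY is always required here even though a local server configured with api_key: none would not need it.

```python
import os
from typing import Any, Dict, List, Mapping, Tuple

# (config section, type value) -> variables taken from the table above.
REQUIRED_ENV: Dict[Tuple[str, str], List[str]] = {
    ("dense_embedder", "openai"): ["OPENAI_API_KEY"],
    ("storage", "s3"): ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    ("storage", "hf"): ["HF_TOKEN"],
    ("datasource", "s3"): ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    ("datasource", "huggingface"): ["HF_TOKEN"],
    ("vectorstore", "qdrant"): ["QDRANT_URL", "QDRANT_API_KEY"],
}


def missing_env(config: Mapping[str, Any], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty."""
    missing: List[str] = []
    for (section, kind), names in REQUIRED_ENV.items():
        section_config = config.get(section) or {}
        if section_config.get("type") != kind:
            continue
        for name in names:
            if not env.get(name) and name not in missing:
                missing.append(name)
    return missing


loader_config = {"datasource": {"type": "s3"}, "vectorstore": {"type": "qdrant"}}
print(missing_env(loader_config, {"AWS_ACCESS_KEY_ID": "x", "QDRANT_URL": "http://localhost:6333"}))
# ['AWS_SECRET_ACCESS_KEY', 'QDRANT_API_KEY']
```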

Tests

uv run pytest tests/ -v

Documentation

Introduction -- concepts, mental model, architecture diagrams
Installation -- setup, environment variables, SkyPilot configuration
Quickstart -- embed a dataset and load it into Qdrant end-to-end
Embedding Generation -- dense/sparse embedders, SkyPilot at scale, output format
Data Loading -- column mapping, payload composition, distributed loading
Loader Architecture -- internal design docs
AWS SSO Setup -- configuring AWS SSO credentials
SkyPilot -- distributed compute setup and cost estimates

Release history

0.1.5 -- Jun 17, 2026
0.1.4 -- Jun 8, 2026
0.1.3 -- Jun 5, 2026
0.1.2 -- Jun 3, 2026
0.1.1 -- Jun 3, 2026 (this version)
0.1.0 -- Jun 2, 2026
