
VespaEmbed

No-code training for embedding models. Train custom embedding models with a web UI or CLI.

Deploy on HF Spaces

Features

  • Web UI - Visual interface for configuring and monitoring training
  • CLI - Command-line interface for scripting and automation
  • Multiple Tasks - Support for pairs, triplets, similarity scoring, and unsupervised learning
  • Loss Variants - Choose from multiple loss functions per task
  • Matryoshka Embeddings - Train multi-dimensional embeddings for flexible retrieval
  • LoRA Support - Parameter-efficient fine-tuning with LoRA adapters
  • Unsloth Integration - Faster training with Unsloth optimizations
  • HuggingFace Integration - Load datasets and models from the Hugging Face Hub, and push trained models back to the Hub

Installation

pip install vespaembed

Optional Dependencies

# For Unsloth acceleration (requires NVIDIA/AMD GPU)
pip install vespaembed[unsloth]

# For ONNX export
pip install vespaembed[onnx]

Development Installation

git clone https://github.com/vespaai-playground/vespaembed.git
cd vespaembed
uv sync --extra dev

Quick Start

Web UI

Launch the web interface:

vespaembed

Open http://localhost:8000 in your browser. The UI lets you:

  • Upload training data (CSV or JSONL)
  • Select task type and base model
  • Configure hyperparameters
  • Monitor training progress
  • Download trained models
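Uploaded files need column headers that match the task schema (see Tasks below). A minimal sketch of preparing a pairs-format CSV with the standard library, ready for upload (file name and example rows are illustrative):

```python
import csv

# Build a pairs-format CSV (anchor,positive) suitable for the Web UI upload.
# Column names follow the pairs task schema described under Tasks.
rows = [
    {"anchor": "What is machine learning?",
     "positive": "Machine learning is a subset of AI."},
    {"anchor": "How does photosynthesis work?",
     "positive": "Photosynthesis converts sunlight into chemical energy."},
]

with open("pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["anchor", "positive"])
    writer.writeheader()
    writer.writerows(rows)
```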

CLI

Train a model from the command line:

vespaembed train \
  --data examples/data/pairs.csv \
  --task pairs \
  --base-model sentence-transformers/all-MiniLM-L6-v2 \
  --epochs 3

Or use a YAML config file:

vespaembed train --config config.yaml

Tasks

VespaEmbed supports four training tasks; choose the one that matches your data format:

Pairs

Text pairs for semantic search. Use when you have query-document pairs without explicit negatives.

Data format:

anchor,positive
What is machine learning?,Machine learning is a subset of AI...
How does photosynthesis work?,Photosynthesis converts sunlight...

Loss variants: mnr (default), mnr_symmetric, gist, cached_mnr, cached_gist
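The mnr variant presumably corresponds to multiple-negatives-ranking (in-batch InfoNCE) loss as popularized by sentence-transformers: each anchor's positive is the target, and the other positives in the batch serve as negatives. A toy pure-Python sketch of the idea (not vespaembed's actual implementation; embeddings are assumed unit-normalized):

```python
import math

def mnr_loss(anchors, positives, scale=20.0):
    """Toy multiple-negatives-ranking loss over one batch.

    anchors/positives: lists of unit-normalized embedding vectors, where
    positives[i] belongs to anchors[i]; every other positive in the batch
    acts as an in-batch negative.
    """
    def cos(u, v):
        return sum(a * b for a, b in zip(u, v))

    total = 0.0
    for i, a in enumerate(anchors):
        scores = [scale * cos(a, p) for p in positives]
        # cross-entropy with the matching index i as the target class
        log_softmax_i = scores[i] - math.log(sum(math.exp(s) for s in scores))
        total += -log_softmax_i
    return total / len(anchors)
```

Correctly matched pairs should yield a lower loss than shuffled ones, which is what drives the embeddings of matching texts together.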

Triplets

Text triplets with hard negatives. Use when you have explicit negative examples.

Data format:

anchor,positive,negative
What is Python?,Python is a programming language...,A python is a large snake...

Loss variants: mnr (default), mnr_symmetric, gist, cached_mnr, cached_gist
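With triplets, the explicit hard negative joins the candidate set next to the positive; conceptually the same ranking objective as for pairs, just with a harder distractor. A toy sketch for a single triplet (in-batch negatives omitted for brevity; not vespaembed's actual code):

```python
import math

def mnr_with_hard_negative(anchor, positive, negative, scale=20.0):
    """Toy ranking loss for one triplet of unit-normalized embeddings:
    the positive competes against the explicit hard negative, and a
    lower loss means the positive is ranked above the negative."""
    def cos(u, v):
        return sum(a * b for a, b in zip(u, v))

    scores = [scale * cos(anchor, positive), scale * cos(anchor, negative)]
    return -(scores[0] - math.log(sum(math.exp(s) for s in scores)))
```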

Similarity

Text pairs with similarity scores (STS-style). Use when you have continuous similarity labels.

Data format:

sentence1,sentence2,score
A man is playing guitar,A person plays music,0.85
The cat is sleeping,A dog is running,0.12

Loss variants: cosine (default), cosent, angle
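The cosine variant presumably regresses the embeddings' cosine similarity onto the gold score, e.g. via mean squared error. A toy sketch of that objective (illustrative only, not vespaembed's implementation):

```python
def cosine_loss(embs1, embs2, gold_scores):
    """Toy sketch of the cosine variant: mean squared error between each
    pair's cosine similarity and its gold similarity score."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    errs = [(cos(u, v) - s) ** 2
            for u, v, s in zip(embs1, embs2, gold_scores)]
    return sum(errs) / len(errs)
```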

TSDAE

Unsupervised learning with denoising auto-encoder. Use when you only have unlabeled text for domain adaptation.

Data format:

text
Machine learning is transforming how we analyze data.
Natural language processing enables computers to understand human language.
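TSDAE corrupts each input sentence (typically by deleting a large fraction of its tokens) and trains the model to reconstruct the original from the noisy version. A toy sketch of the deletion-noise step (the 0.6 ratio follows the TSDAE paper's common setting; function name and details are illustrative):

```python
import random

def delete_noise(text, ratio=0.6, seed=0):
    """Toy TSDAE-style corruption: randomly delete roughly `ratio` of the
    words; the denoising auto-encoder learns to reconstruct the original.
    A fixed seed makes the sketch deterministic."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > ratio]
    # never return an empty input; keep at least the first word
    return " ".join(kept) if kept else words[0]
```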

Configuration

CLI Arguments

vespaembed train \
  --data <path>              # Training data (CSV, JSONL, or HF dataset)
  --task <task>              # Task type: pairs, triplets, similarity, tsdae
  --base-model <model>       # Base model name or path
  --project <name>           # Project name (optional)
  --eval-data <path>         # Evaluation data (optional)
  --epochs <n>               # Number of epochs (default: 3)
  --batch-size <n>           # Batch size (default: 32)
  --learning-rate <lr>       # Learning rate (default: 2e-5)
  --optimizer <opt>          # Optimizer (default: adamw_torch)
  --scheduler <sched>        # LR scheduler (default: linear)
  --matryoshka               # Enable Matryoshka embeddings
  --matryoshka-dims <dims>   # Dimensions (default: 768,512,256,128,64)
  --unsloth                  # Use Unsloth for faster training
  --subset <name>            # HuggingFace dataset subset
  --split <name>             # HuggingFace dataset split

Optimizers

| Option | Description |
|---|---|
| adamw_torch | AdamW (default) |
| adamw_torch_fused | Fused AdamW (faster on CUDA) |
| adamw_8bit | 8-bit AdamW (memory efficient) |
| adafactor | Adafactor (memory efficient, no momentum) |
| sgd | SGD with momentum |

Schedulers

| Option | Description |
|---|---|
| linear | Linear decay (default) |
| cosine | Cosine annealing |
| cosine_with_restarts | Cosine with warm restarts |
| constant | Constant learning rate |
| constant_with_warmup | Constant after warmup |
| polynomial | Polynomial decay |

YAML Configuration

task: pairs
base_model: sentence-transformers/all-MiniLM-L6-v2

data:
  train: train.csv
  eval: eval.csv            # optional

training:
  epochs: 3
  batch_size: 32
  learning_rate: 2e-5
  warmup_ratio: 0.1
  weight_decay: 0.01
  fp16: true
  eval_steps: 500
  save_steps: 500
  logging_steps: 100
  optimizer: adamw_torch    # adamw_torch, adamw_8bit, adafactor, sgd
  scheduler: linear         # linear, cosine, constant, polynomial

output:
  dir: ./output
  push_to_hub: false
  hf_username: null

# Optional: LoRA configuration
lora:
  enabled: false
  r: 64
  alpha: 128
  dropout: 0.1
  target_modules: [query, key, value, dense]

# Optional: Matryoshka dimensions
matryoshka_dims: [768, 512, 256, 128, 64]

# Optional: Loss variant (uses task default if not specified)
loss_variant: mnr
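When Matryoshka training is enabled, the resulting embeddings stay useful when truncated to any of the configured dimensions. At inference you keep a prefix of the vector and re-normalize; a minimal sketch of that step (pure Python, illustrative only):

```python
def truncate_embedding(vec, dim):
    """Matryoshka-style use at inference: keep the first `dim` components
    of an embedding and re-normalize the prefix to unit length, trading a
    little accuracy for a smaller, cheaper vector."""
    head = vec[:dim]
    norm = sum(x * x for x in head) ** 0.5
    return [x / norm for x in head]
```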

HuggingFace Datasets

Load datasets directly from HuggingFace Hub:

vespaembed train \
  --data sentence-transformers/all-nli \
  --subset triplet \
  --split train \
  --task triplets \
  --base-model sentence-transformers/all-MiniLM-L6-v2

CLI Commands

| Command | Description |
|---|---|
| vespaembed | Launch web UI (default) |
| vespaembed serve | Launch web UI |
| vespaembed train | Train a model |
| vespaembed evaluate | Evaluate a model |
| vespaembed export | Export model to ONNX |
| vespaembed info | Show task information |

Output

Trained models are saved to ~/.vespaembed/projects/<project-name>/:

~/.vespaembed/projects/my-project/
├── final/              # Final trained model
├── checkpoint-500/     # Training checkpoints
├── checkpoint-1000/
└── logs/               # TensorBoard logs
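Given that layout, a small sketch for locating the model directory to load: prefer final/ if training completed, otherwise fall back to the highest-numbered checkpoint (the helper name is illustrative, not part of vespaembed's API):

```python
from pathlib import Path

def latest_model_dir(project_dir):
    """Return final/ if it exists, else the highest-step checkpoint-<n>
    directory under the project, else None."""
    root = Path(project_dir)
    final = root / "final"
    if final.is_dir():
        return final
    ckpts = sorted(
        (p for p in root.glob("checkpoint-*") if p.is_dir()),
        key=lambda p: int(p.name.split("-")[1]),
    )
    return ckpts[-1] if ckpts else None
```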

Column Aliases

VespaEmbed automatically recognizes common column name variations:

| Task | Expected column | Also accepts |
|---|---|---|
| pairs | anchor | query, question, sent1, sentence1, text1 |
| pairs | positive | document, answer, pos, sent2, sentence2, text2 |
| triplets | negative | neg, hard_negative, sent3, sentence3, text3 |
| similarity | sentence1 | sent1, text1, anchor, query |
| similarity | sentence2 | sent2, text2, positive, document |
| similarity | score | similarity, label, sim_score |
| tsdae | text | sentence, sentences, content, input |

Important: Columns are matched by name (or alias), not by position. For example, with a pairs task:

  • [anchor, positive] or [query, document] → works ✓
  • [document, query] → still works (names identify roles, not position) ✓
  • [foo, bar] → fails (no matching column names or aliases) ✗

Columns named score, scores, label, or labels (and aliases like similarity) are treated as labels/targets.
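The matching rule above can be sketched in a few lines. This is a hypothetical illustration of name-based (not positional) resolution; the alias sets come from the table, but the function and its behavior on edge cases are not vespaembed's actual API:

```python
# Alias sets for the pairs task, taken from the table above.
ALIASES = {
    "anchor": {"anchor", "query", "question", "sent1", "sentence1", "text1"},
    "positive": {"positive", "document", "answer", "pos",
                 "sent2", "sentence2", "text2"},
}

def resolve_columns(header, roles=("anchor", "positive")):
    """Map each role to the first header column whose lowercased name is in
    that role's alias set; raise if any role cannot be filled. Columns are
    identified by name, never by position."""
    mapping = {}
    for role in roles:
        for col in header:
            if col.lower() in ALIASES[role] and col not in mapping.values():
                mapping[role] = col
                break
        else:
            raise ValueError(f"no column found for role {role!r}")
    return mapping
```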

Development

# Run tests
uv run pytest tests/

# Run tests with coverage
uv run pytest tests/ --cov=vespaembed

# Format code
make format

# Lint
make lint

License

Apache 2.0

Download files

Source Distribution

vespaembed-0.0.5.tar.gz (84.0 kB)

Built Distribution

vespaembed-0.0.5-py3-none-any.whl (79.2 kB)

File details

Details for the file vespaembed-0.0.5.tar.gz.

File metadata

  • Size: 84.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | c8215e5421eec2ba8b109d3a310d77e12dbb616e273194cca97dc510b777aed2 |
| MD5 | 21676d3c20dda6fad841b1a1ec38bb35 |
| BLAKE2b-256 | 408ca461925c3dbf1e88ec0464f950eaae8cf37b1eefb914b83d882e47ca22ea |

File details

Details for the file vespaembed-0.0.5-py3-none-any.whl.

File metadata

  • Size: 79.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 20b90d44c8459d1528cf0a547478791a21b99f77ad40e1e74a01a6b57ec13f8f |
| MD5 | bd1920ed513008af5866f7d2dcf05545 |
| BLAKE2b-256 | 83ad142f9d3a496c6a2ff3776c4320fad23607cb9e72b9d9993a0503ab0ed6fc |
