vespaembed: no-code training for embedding models
VespaEmbed
No-code training for embedding models. Train custom embedding models with a web UI or CLI.
Features
- Web UI - Visual interface for configuring and monitoring training
- CLI - Command-line interface for scripting and automation
- Multiple Tasks - Support for pairs, triplets, similarity scoring, and unsupervised learning
- Loss Variants - Choose from multiple loss functions per task
- Matryoshka Embeddings - Train multi-dimensional embeddings for flexible retrieval
- LoRA Support - Parameter-efficient fine-tuning with LoRA adapters
- Unsloth Integration - Faster training with Unsloth optimizations
- HuggingFace Integration - Load datasets and models from the HuggingFace Hub, and push trained models back to the Hub
Installation
pip install vespaembed
Optional Dependencies
# For Unsloth acceleration (requires NVIDIA/AMD GPU)
pip install vespaembed[unsloth]
# For ONNX export
pip install vespaembed[onnx]
Development Installation
git clone https://github.com/vespaai-playground/vespaembed.git
cd vespaembed
uv sync --extra dev
Quick Start
Web UI
Launch the web interface:
vespaembed
Open http://localhost:8000 in your browser. The UI lets you:
- Upload training data (CSV or JSONL)
- Select task type and base model
- Configure hyperparameters
- Monitor training progress
- Download trained models
CLI
Train a model from the command line:
vespaembed train \
--data examples/data/pairs.csv \
--task pairs \
--base-model sentence-transformers/all-MiniLM-L6-v2 \
--epochs 3
Or use a YAML config file:
vespaembed train --config config.yaml
Tasks
VespaEmbed supports four training tasks, chosen based on your data format:
Pairs
Text pairs for semantic search. Use when you have query-document pairs without explicit negatives.
Data format:
anchor,positive
What is machine learning?,Machine learning is a subset of AI...
How does photosynthesis work?,Photosynthesis converts sunlight...
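A pairs file in this format can also be generated programmatically. The sketch below uses only the Python standard library; the rows are the illustrative examples from above, not a real dataset:

```python
import csv

# Illustrative query-document pairs; real training data would be domain-specific.
rows = [
    ("What is machine learning?", "Machine learning is a subset of AI..."),
    ("How does photosynthesis work?", "Photosynthesis converts sunlight..."),
]

with open("pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["anchor", "positive"])  # header expected by the pairs task
    writer.writerows(rows)
```

The resulting `pairs.csv` can be passed directly to `vespaembed train --data pairs.csv --task pairs`.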
Loss variants: mnr (default), mnr_symmetric, gist, cached_mnr, cached_gist
Triplets
Text triplets with hard negatives. Use when you have explicit negative examples.
Data format:
anchor,positive,negative
What is Python?,Python is a programming language...,A python is a large snake...
Loss variants: mnr (default), mnr_symmetric, gist, cached_mnr, cached_gist
Similarity
Text pairs with similarity scores (STS-style). Use when you have continuous similarity labels.
Data format:
sentence1,sentence2,score
A man is playing guitar,A person plays music,0.85
The cat is sleeping,A dog is running,0.12
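With a cosine-style loss, training pushes the cosine similarity of the two sentence embeddings toward the labelled score. As a minimal sketch of what is being optimized (plain Python, toy vectors standing in for real embeddings):

```python
import math

def cosine_similarity(u, v):
    """Plain-Python cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# During training, cosine_similarity(emb1, emb2) is driven toward the
# labelled score (e.g. 0.85 for near-paraphrases, 0.12 for unrelated text).
u = [1.0, 0.0, 1.0]  # toy embedding for sentence1
v = [1.0, 0.0, 0.0]  # toy embedding for sentence2
print(round(cosine_similarity(u, v), 4))
```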
Loss variants: cosine (default), cosent, angle
TSDAE
Unsupervised learning with denoising auto-encoder. Use when you only have unlabeled text for domain adaptation.
Data format:
text
Machine learning is transforming how we analyze data.
Natural language processing enables computers to understand human language.
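TSDAE works by corrupting each input sentence (typically by deleting a large fraction of its tokens) and training the model to reconstruct the original. A sketch of the deletion-style noise, assuming the common ~0.6 deletion ratio (the function name here is illustrative, not part of the VespaEmbed API):

```python
import random

def delete_noise(text, del_ratio=0.6, seed=None):
    """Randomly drop roughly `del_ratio` of the tokens, keeping at least one.

    TSDAE-style corruption: the encoder sees the corrupted sentence and the
    decoder is trained to reconstruct the original from its embedding.
    """
    rng = random.Random(seed)
    tokens = text.split()
    kept = [t for t in tokens if rng.random() > del_ratio]
    if not kept:  # never return an empty input
        kept = [rng.choice(tokens)]
    return " ".join(kept)

print(delete_noise("Machine learning is transforming how we analyze data.", seed=0))
```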
Configuration
CLI Arguments
vespaembed train \
--data <path> # Training data (CSV, JSONL, or HF dataset)
--task <task> # Task type: pairs, triplets, similarity, tsdae
--base-model <model> # Base model name or path
--project <name> # Project name (optional)
--eval-data <path> # Evaluation data (optional)
--epochs <n> # Number of epochs (default: 3)
--batch-size <n> # Batch size (default: 32)
--learning-rate <lr> # Learning rate (default: 2e-5)
--optimizer <opt> # Optimizer (default: adamw_torch)
--scheduler <sched> # LR scheduler (default: linear)
--matryoshka # Enable Matryoshka embeddings
--matryoshka-dims <dims> # Dimensions (default: 768,512,256,128,64)
--unsloth # Use Unsloth for faster training
--subset <name> # HuggingFace dataset subset
--split <name> # HuggingFace dataset split
Optimizers
| Option | Description |
|---|---|
| adamw_torch | AdamW (default) |
| adamw_torch_fused | Fused AdamW (faster on CUDA) |
| adamw_8bit | 8-bit AdamW (memory efficient) |
| adafactor | Adafactor (memory efficient, no momentum) |
| sgd | SGD with momentum |
Schedulers
| Option | Description |
|---|---|
| linear | Linear decay (default) |
| cosine | Cosine annealing |
| cosine_with_restarts | Cosine with warm restarts |
| constant | Constant learning rate |
| constant_with_warmup | Constant after warmup |
| polynomial | Polynomial decay |
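The default linear schedule ramps the learning rate up during warmup, then decays it linearly to zero. A minimal sketch of the multiplier applied to the base learning rate (the actual schedule lives in the underlying training library; this function is illustrative):

```python
def linear_schedule(step, total_steps, warmup_steps):
    """LR multiplier for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)          # ramp up during warmup
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# With 1000 total steps and 100 warmup steps:
for step in (0, 50, 100, 550, 1000):
    print(step, round(linear_schedule(step, 1000, 100), 2))
```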
YAML Configuration
task: pairs
base_model: sentence-transformers/all-MiniLM-L6-v2
data:
train: train.csv
eval: eval.csv # optional
training:
epochs: 3
batch_size: 32
learning_rate: 2e-5
warmup_ratio: 0.1
weight_decay: 0.01
fp16: true
eval_steps: 500
save_steps: 500
logging_steps: 100
optimizer: adamw_torch # adamw_torch, adamw_8bit, adafactor, sgd
scheduler: linear # linear, cosine, constant, polynomial
output:
dir: ./output
push_to_hub: false
hf_username: null
# Optional: LoRA configuration
lora:
enabled: false
r: 64
alpha: 128
dropout: 0.1
target_modules: [query, key, value, dense]
# Optional: Matryoshka dimensions
matryoshka_dims: [768, 512, 256, 128, 64]
# Optional: Loss variant (uses task default if not specified)
loss_variant: mnr
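The point of Matryoshka training is that leading prefixes of the full vector remain usable as smaller embeddings at any of the configured dimensions. A minimal consumer-side sketch of truncating and renormalizing an embedding (the function name is ours, not part of the VespaEmbed API):

```python
import math

def truncate_embedding(vec, dim):
    """Truncate a Matryoshka-style embedding to `dim` and L2-renormalize.

    Renormalizing keeps cosine similarity well-defined for the shorter vector.
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # toy 4-dim "full" embedding
small = truncate_embedding(full, 2)  # keep only the first 2 dims
print(small)
```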
HuggingFace Datasets
Load datasets directly from HuggingFace Hub:
vespaembed train \
--data sentence-transformers/all-nli \
--subset triplet \
--split train \
--task triplets \
--base-model sentence-transformers/all-MiniLM-L6-v2
CLI Commands
| Command | Description |
|---|---|
| vespaembed | Launch web UI (default) |
| vespaembed serve | Launch web UI |
| vespaembed train | Train a model |
| vespaembed evaluate | Evaluate a model |
| vespaembed export | Export model to ONNX |
| vespaembed info | Show task information |
Output
Trained models are saved to ~/.vespaembed/projects/<project-name>/:
~/.vespaembed/projects/my-project/
├── final/ # Final trained model
├── checkpoint-500/ # Training checkpoints
├── checkpoint-1000/
└── logs/ # TensorBoard logs
Column Aliases
VespaEmbed automatically recognizes common column name variations:
| Task | Expected | Also Accepts |
|---|---|---|
| pairs | anchor | query, question, sent1, sentence1, text1 |
| pairs | positive | document, answer, pos, sent2, sentence2, text2 |
| triplets | negative | neg, hard_negative, sent3, sentence3, text3 |
| similarity | sentence1 | sent1, text1, anchor, query |
| similarity | sentence2 | sent2, text2, positive, document |
| similarity | score | similarity, label, sim_score |
| tsdae | text | sentence, sentences, content, input |
Important: Columns are matched by name (or alias), not by position. For example, with a pairs task:
- [anchor, positive] or [query, document] → works ✓
- [document, query] → still works (names identify roles, not position) ✓
- [foo, bar] → fails (no matching column names or aliases) ✗
Columns named score, scores, label, or labels (and aliases like similarity) are treated as labels/targets.
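The name-based matching described above can be pictured as a simple lookup per role. This sketch covers only the pairs task; the alias sets are copied from the table, but the function itself is illustrative, not VespaEmbed's implementation:

```python
# Alias table for the pairs task, taken from the Column Aliases table above.
ALIASES = {
    "anchor": {"anchor", "query", "question", "sent1", "sentence1", "text1"},
    "positive": {"positive", "document", "answer", "pos", "sent2", "sentence2", "text2"},
}

def resolve_pairs_columns(columns):
    """Map CSV column names to the roles the pairs task expects.

    Matching is by name/alias, not position; raises if a role is missing.
    """
    mapping = {}
    for role, names in ALIASES.items():
        match = next((c for c in columns if c.lower() in names), None)
        if match is None:
            raise ValueError(f"no column found for role '{role}'")
        mapping[role] = match
    return mapping

print(resolve_pairs_columns(["query", "document"]))
print(resolve_pairs_columns(["document", "query"]))  # column order doesn't matter
```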
Development
# Run tests
uv run pytest tests/
# Run tests with coverage
uv run pytest tests/ --cov=vespaembed
# Format code
make format
# Lint
make lint
License
Apache 2.0