AnglE-optimized Text Embeddings
EN | 简体中文
AnglE 📐
Sponsored by Mixedbread
For more detailed usage, please read the 📘 document: https://angle.readthedocs.io/en/latest/index.html
📢 Train/Infer Powerful Sentence Embeddings with AnglE. This library accompanies the paper AnglE: Angle-optimized Text Embeddings. It lets you train state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework, allowing inference with a variety of transformer-based sentence embedding models.
✨ Features
Loss:
- 📐 AnglE loss (ACL24)
- ⚖ Contrastive loss
- 📏 CoSENT loss
- ☕️ Espresso loss (ICLR 2025, a.k.a. 2DMSE; details: README_ESE)
Backbones:
- BERT-based models (BERT, RoBERTa, ModernBERT, etc.)
- LLM-based models (LLaMA, Mistral, Qwen, etc.)
- Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELMo, etc.; refer to: https://github.com/WhereIsAI/BiLLM)
Training:
- Single-GPU training
- Multi-GPU training
🏆 Achievements
📅 May 16, 2024 | Paper "AnglE: Angle-optimized Text Embeddings" is accepted by ACL 2024 Main Conference.
📅 Mar 13, 2024 | Paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" is accepted by NAACL 2024 Main Conference.
📅 Mar 8, 2024 | 🍞 mixedbread's embedding (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model is trained using AnglE. Congrats mixedbread!
📅 Dec 4, 2023 | Our universal sentence embedding WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model is trained using AnglE.
📅 Dec, 2023 | AnglE achieves SOTA performance on the STS Benchmark for semantic textual similarity!
🤗 Official Pretrained Models
BERT-based models:
| 🤗 HF | Max Tokens | Pooling Strategy | Scenario |
|---|---|---|---|
| WhereIsAI/UAE-Large-V1 | 512 | cls | English, General-purpose |
| WhereIsAI/UAE-Code-Large-V1 | 512 | cls | Code Similarity |
| WhereIsAI/pubmed-angle-base-en | 512 | cls | Medical Similarity |
| WhereIsAI/pubmed-angle-large-en | 512 | cls | Medical Similarity |
LLM-based models:
| 🤗 HF (lora weight) | Backbone | Max Tokens | Prompts | Pooling Strategy | Scenario |
|---|---|---|---|---|---|
| SeanLee97/angle-llama-13b-nli | NousResearch/Llama-2-13b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
| SeanLee97/angle-llama-7b-nli-v2 | NousResearch/Llama-2-7b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
💡 You can find more third-party embeddings trained with AnglE in the HuggingFace Collection.
🚀 Quick Start
⬇️ Installation
Using uv:
uv pip install -U angle-emb
or using pip:
pip install -U angle-emb
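A quick way to confirm the install is to import the entry points used throughout this README; a minimal sanity check:
# Sanity check: the main entry points should import cleanly after installation.
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity
print(AnglE, Prompts, cosine_similarity)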
🔍 Inference
1️⃣ BERT-based Models
Option A: With Prompts (for Retrieval Tasks)
Use a prompt containing the {text} placeholder. Check the available prompts via Prompts.list_prompts().
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity
# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# Encode query with prompt, documents without prompt
qv = angle.encode(['what is the weather?'], to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
'The weather is great!',
'it is rainy today.',
'i am going to bed'
], to_numpy=True)
# Calculate similarity
for dv in doc_vecs:
print(cosine_similarity(qv[0], dv))
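If you are unsure which prompt to use, the Prompts.list_prompts() helper mentioned above enumerates the built-in templates; a minimal sketch:
from angle_emb import Prompts
# Inspect the built-in prompt templates, and the one used above for retrieval.
print(Prompts.list_prompts())
print(Prompts.C)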
Option B: Without Prompts (for Similarity Tasks)
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity
# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# Encode documents
doc_vecs = angle.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
])
# Calculate pairwise similarity
for i, dv1 in enumerate(doc_vecs):
for dv2 in doc_vecs[i+1:]:
print(cosine_similarity(dv1, dv2))
2️⃣ LLM-based Models
For LoRA-based models, specify both the backbone model and LoRA weights. Always set is_llm=True for LLM models.
import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity
# Load LLM with LoRA weights
angle = AnglE.from_pretrained(
'NousResearch/Llama-2-7b-hf',
pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
pooling_strategy='last',
is_llm=True,
torch_dtype=torch.float16
).cuda()
# Encode with prompt
doc_vecs = angle.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
], prompt=Prompts.A)
# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
for dv2 in doc_vecs[i+1:]:
print(cosine_similarity(dv1, dv2))
3️⃣ BiLLM-based Models
Enable bidirectional LLMs with apply_billm=True and specify the model class.
import os
import torch
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity
# Set BiLLM environment variable
os.environ['BiLLM_START_INDEX'] = '31'
# Load BiLLM model
angle = AnglE.from_pretrained(
'NousResearch/Llama-2-7b-hf',
pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
pooling_strategy='last',
is_llm=True,
apply_billm=True,
billm_model_class='LlamaForCausalLM',
torch_dtype=torch.float16
).cuda()
# Encode with custom prompt
doc_vecs = angle.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
], prompt='The representative word for sentence {text} is:"')
# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
for dv2 in doc_vecs[i+1:]:
print(cosine_similarity(dv1, dv2))
4️⃣ Espresso/Matryoshka Models
Truncate layers and embedding dimensions for flexible model compression.
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity
# Load model
angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()
# Truncate to specific layer
angle = angle.truncate_layer(layer_index=22)
# Encode with truncated embedding size
doc_vecs = angle.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
], embedding_size=768)
# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
for dv2 in doc_vecs[i+1:]:
print(cosine_similarity(dv1, dv2))
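To see the trade-off, you can encode the same sentences at different embedding sizes and compare the resulting similarities. A small sketch reusing the embedding_size argument shown above (the sizes below are illustrative, not recommendations):
sentences = ['The weather is great!', 'The weather is very good!']
# Encode at two truncated dimensions and compare the cosine similarity
# each size yields for the same pair of sentences.
vecs_512 = angle.encode(sentences, embedding_size=512)
vecs_1024 = angle.encode(sentences, embedding_size=1024)
print('512-d similarity: ', cosine_similarity(vecs_512[0], vecs_512[1]))
print('1024-d similarity:', cosine_similarity(vecs_1024[0], vecs_1024[1]))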
5️⃣ Third-party Models
Load any transformer-based embedding model (e.g., sentence-transformers models, BAAI/bge, etc.) using AnglE.
from angle_emb import AnglE
# Load third-party model
model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()
# Encode text
vec = model.encode('hello world', to_numpy=True)
print(vec)
⚡ Batch Inference
Speed up inference with the batched library (recommended for large-scale processing).
uv pip install batched
import batched
from angle_emb import AnglE
# Load model
model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()
# Enable dynamic batching
model.encode = batched.dynamically(model.encode, batch_size=64)
# Encode large batch
vecs = model.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
] * 50)
🕸️ Custom Training
💡 For complete details, see the official training documentation.
🗂️ Step 1: Prepare Your Dataset
AnglE supports three dataset formats. Choose based on your task:
| Format | Columns | Description | Use Case |
|---|---|---|---|
| Format A | text1, text2, label | Paired texts with similarity scores (0-1) | Similarity scoring |
| Format B | query, positive | Query-document pairs | Retrieval without hard negatives |
| Format C | query, positive, negative | Query with positive and negative samples | Contrastive learning |
Notes:
- All formats use HuggingFace datasets.Dataset
- text1, text2, query, positive, and negative can be str or List[str] (random sampling for lists)
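For concreteness, here is a minimal sketch that builds tiny toy datasets in each format with datasets.Dataset.from_dict (the texts and labels are made up for illustration):
from datasets import Dataset
# Format A: paired texts with a similarity score in [0, 1]
format_a = Dataset.from_dict({
    'text1': ['The weather is great!', 'i am going to bed'],
    'text2': ['The weather is very good!', 'the sky is blue'],
    'label': [0.9, 0.1],
})
# Format B: query-document pairs (no hard negatives)
format_b = Dataset.from_dict({
    'query': ['what is the weather?'],
    'positive': ['The weather is great!'],
})
# Format C: query with positive and negative samples; list values are
# randomly sampled during training
format_c = Dataset.from_dict({
    'query': ['what is the weather?'],
    'positive': [['The weather is great!', 'it is sunny today.']],
    'negative': [['i am going to bed']],
})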
🚂 Step 2: Training Methods
Option A: CLI Training (Recommended)
Single GPU:
CUDA_VISIBLE_DEVICES=0 angle-trainer --help
Multi-GPU with FSDP:
CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
--multi_gpu \
--num_processes 4 \
--main_process_port 2345 \
--config_file examples/FSDP/fsdp_config.yaml \
-m angle_emb.angle_trainer \
--gradient_checkpointing 1 \
--use_reentrant 0 \
...
Multi-GPU (Standard):
CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
--multi_gpu \
--num_processes 4 \
--main_process_port 2345 \
-m angle_emb.angle_trainer \
--model_name_or_path YOUR_MODEL \
--train_name_or_path YOUR_DATASET \
...
📁 More examples: examples/Training
Option B: Python API Training
from datasets import load_dataset
from angle_emb import AnglE
# Step 1: Load pretrained model
angle = AnglE.from_pretrained(
'SeanLee97/angle-bert-base-uncased-nli-en-v1',
max_length=128,
pooling_strategy='cls'
).cuda()
# Step 2: Prepare dataset (Format A example)
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {
"text1": str(obj["sentence1"]),
"text2": str(obj['sentence2']),
"label": obj['score']
})
ds = ds.select_columns(["text1", "text2", "label"])
# Step 3: Train the model
angle.fit(
train_ds=ds['train'].shuffle(),
valid_ds=ds['validation'],
output_dir='ckpts/sts-b',
batch_size=32,
epochs=5,
learning_rate=2e-5,
save_steps=100,
eval_steps=1000,
warmup_steps=0,
gradient_accumulation_steps=1,
loss_kwargs={
'cosine_w': 1.0,
'ibn_w': 1.0,
'angle_w': 0.02,
'cosine_tau': 20,
'ibn_tau': 20,
'angle_tau': 20
},
fp16=True,
logging_steps=100
)
# Step 4: Evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)
⚙️ Advanced Configuration
Training Special Models
| Model Type | CLI Flags | Description |
|---|---|---|
| LLM | --is_llm 1 + LoRA params | Must manually enable LLM mode |
| BiLLM | --apply_billm 1 --billm_model_class LlamaForCausalLM | Bidirectional LLMs (guide) |
| Espresso (ESE) | --apply_ese 1 --ese_kl_temperature 1.0 --ese_compression_size 256 | Matryoshka-style embeddings |
Applying Prompts
| Format | Flag | Applies To |
|---|---|---|
| Format A | --text_prompt "text: {text}" | Both text1 and text2 |
| Format B/C | --query_prompt "query: {text}" | query field |
| Format B/C | --doc_prompt "document: {text}" | positive and negative fields |
Column Mapping (Legacy Compatibility)
Adapt old datasets without modification:
# CLI
--column_rename_mapping "text:query"
# Python
column_rename_mapping={"text": "query"}
Model Conversion
Convert trained models to sentence-transformers format:
python scripts/convert_to_sentence_transformers.py --help
💡 Fine-tuning Tips
| Format | Recommendation |
|---|---|
| Format A | Increase cosine_w or decrease ibn_w |
| Format B | Only tune ibn_w and ibn_tau |
| Format C | Set cosine_w=0, angle_w=0.02, and configure cln_w + ibn_w |
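As an illustration of the Format C tip above, here is a hedged sketch of loss_kwargs passed to angle.fit(); the weights are placeholders rather than tuned values, format_c_train_ds is a hypothetical dataset name, and cln_w is assumed to be accepted by loss_kwargs based on the table above:
# Hedged sketch (placeholder values): Format C training disables the cosine
# objective, keeps a small angle weight, and relies on the contrastive terms.
angle.fit(
    train_ds=format_c_train_ds,   # a Format C dataset (query/positive/negative); hypothetical name
    output_dir='ckpts/format-c',
    batch_size=32,
    epochs=1,
    loss_kwargs={
        'cosine_w': 0.0,
        'angle_w': 0.02,
        'ibn_w': 1.0,
        'cln_w': 1.0,   # assumption: exposed via loss_kwargs, per the tips table
    },
)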
Prevent Catastrophic Forgetting:
- Set teacher_name_or_path for knowledge distillation
- Use the same model path for self-distillation
- ⚠️ Ensure the teacher and student use the same tokenizer
🔄 Integration with sentence-transformers
| Task | Status | Notes |
|---|---|---|
| Training | ⚠️ Partial | SentenceTransformers has AnglE loss, but use official angle_emb for best results |
| Inference | ✅ Full | Convert trained models: examples/convert_to_sentence_transformers.py |
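For inference through sentence-transformers, a converted model (or one already published in that format, such as WhereIsAI/UAE-Large-V1) can be loaded directly; a minimal sketch:
from sentence_transformers import SentenceTransformer
# Load an AnglE-trained model exported to (or published in) the
# sentence-transformers format and encode a few sentences.
model = SentenceTransformer('WhereIsAI/UAE-Large-V1')
vecs = model.encode(['The weather is great!', 'it is rainy today.'])
print(vecs.shape)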
🫡 Citation
If you use our code and pre-trained models, please support us by citing our work as follows:
@article{li2023angle,
title={AnglE-optimized Text Embeddings},
author={Li, Xianming and Li, Jing},
journal={arXiv preprint arXiv:2309.12871},
year={2023}
}
📜 ChangeLogs
| 📅 | Description |
|---|---|
| 2025 Jan | v0.6.0 - Major refactoring 🎉: • Removed AngleDataTokenizer - no need to pre-tokenize datasets! • Removed DatasetFormats class - use string literals ('A', 'B', 'C') • Removed auto-detection of LLM models - set is_llm manually • Renamed --prompt_template to --text_prompt (Format A only) • Added --query_prompt and --doc_prompt for Format B/C • Added --column_rename_mapping to adapt old datasets without modification • Updated data formats: Format B/C now use query, positive, negative fields • Support list-based sampling in Format B/C • Updated examples to use accelerate launch • See MIGRATION_GUIDE.md for upgrade instructions |
| 2024 May 21 | support Espresso Sentence Embeddings |
| 2024 Feb 7 | support training with only positive pairs (Format C: query, positive) |
| 2023 Dec 4 | Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1 |
| 2023 Nov 2 | Release an English pretrained model: SeanLee97/angle-llama-13b-nli |
| 2023 Oct 28 | Release two Chinese pretrained models: SeanLee97/angle-roberta-wwm-base-zhnli-v1 and SeanLee97/angle-llama-7b-zhnli-v1; add Chinese README.md |
📧 Contact
If you have any questions or suggestions, please feel free to contact us via email: xmlee97@gmail.com
© License
This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.