
AnglE-optimized Text Embeddings


EN | 简体中文

AnglE 📐

Sponsored by Mixedbread

For more detailed usage, please read the 📘 documentation: https://angle.readthedocs.io/en/latest/index.html

📄 Paper: https://arxiv.org/abs/2309.12871

📢 Train/Infer Powerful Sentence Embeddings with AnglE. This library accompanies the paper AnglE: Angle-optimized Text Embeddings. It lets you train state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework that can run inference for a wide range of transformer-based sentence embedding models.

✨ Features

Loss:

  • 📐 AnglE loss (ACL24)
  • ⚖ Contrastive loss
  • 📏 CoSENT loss
  • ☕️ Espresso loss (ICLR 2025, a.k.a. 2DMSE; details: README_ESE)

Backbones:

  • BERT-based models (BERT, RoBERTa, ModernBERT, etc.)
  • LLM-based models (LLaMA, Mistral, Qwen, etc.)
  • Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELMo, etc.; refer to https://github.com/WhereIsAI/BiLLM)

Training:

  • Single-GPU training
  • Multi-GPU training

More features will be added in the future; pull requests are welcome.

🏆 Achievements

📅 May 16, 2024 | Paper "AnglE: Angle-optimized Text Embeddings" is accepted by ACL 2024 Main Conference.

📅 Mar 13, 2024 | Paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" is accepted by NAACL 2024 Main Conference.

📅 Mar 8, 2024 | 🍞 mixedbread's embedding (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model is trained using AnglE. Congrats mixedbread!

📅 Dec 4, 2023 | Our universal sentence embedding WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model is trained using AnglE.

📅 Dec, 2023 | AnglE achieves SOTA performance on the STS Benchmark for Semantic Textual Similarity!

🤗 Official Pretrained Models

BERT-based models:

🤗 HF | Max Tokens | Pooling Strategy | Scenario
WhereIsAI/UAE-Large-V1 | 512 | cls | English, General-purpose
WhereIsAI/UAE-Code-Large-V1 | 512 | cls | Code Similarity
WhereIsAI/pubmed-angle-base-en | 512 | cls | Medical Similarity
WhereIsAI/pubmed-angle-large-en | 512 | cls | Medical Similarity

LLM-based models:

🤗 HF (LoRA weight) | Backbone | Max Tokens | Prompt | Pooling Strategy | Scenario
SeanLee97/angle-llama-13b-nli | NousResearch/Llama-2-13b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement
SeanLee97/angle-llama-7b-nli-v2 | NousResearch/Llama-2-7b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement

💡 You can find more third-party embeddings trained with AnglE in the HuggingFace Collection.

🚀 Quick Start

⬇️ Installation

Using uv:

uv pip install -U angle-emb

or using pip:

pip install -U angle-emb

🔍 Inference

1️⃣ BERT-based Models


Option A: With Prompts (for Retrieval Tasks)

Use prompts with {text} as a placeholder. Check the available prompts via Prompts.list_prompts().
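
For example, a quick way to inspect the built-in templates before choosing one (a minimal sketch using the Prompts class shown below):

from angle_emb import Prompts

# Inspect the available prompt templates (e.g., Prompts.A, Prompts.C)
Prompts.list_prompts()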

from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode query with prompt, documents without prompt
qv = angle.encode(['what is the weather?'], to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed'
], to_numpy=True)

# Calculate similarity
for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))

Option B: Without Prompts (for Similarity Tasks)

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode documents
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
])

# Calculate pairwise similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

2️⃣ LLM-based Models


For LoRA-based models, specify both the backbone model and LoRA weights. Always set is_llm=True for LLM models.

import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load LLM with LoRA weights
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
    pooling_strategy='last',
    is_llm=True,
    torch_dtype=torch.float16
).cuda()

# Encode with prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt=Prompts.A)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

3️⃣ BiLLM-based Models


Enable bidirectional LLMs with apply_billm=True and specify the model class.

import os
import torch
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Set BiLLM environment variable
os.environ['BiLLM_START_INDEX'] = '31'

# Load BiLLM model
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
    pooling_strategy='last',
    is_llm=True,
    apply_billm=True,
    billm_model_class='LlamaForCausalLM',
    torch_dtype=torch.float16
).cuda()

# Encode with custom prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt='The representative word for sentence {text} is:"')

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

4️⃣ Espresso/Matryoshka Models


Truncate layers and embedding dimensions for flexible model compression.

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()

# Truncate to specific layer
angle = angle.truncate_layer(layer_index=22)

# Encode with truncated embedding size
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], embedding_size=768)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

5️⃣ Third-party Models

Load any transformer-based model (e.g., sentence-transformers models, BAAI/bge, etc.) using AnglE.

from angle_emb import AnglE

# Load third-party model
model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()

# Encode text
vec = model.encode('hello world', to_numpy=True)
print(vec)

⚡ Batch Inference

Speed up inference with the batched library (recommended for large-scale processing).

uv pip install batched

import batched
from angle_emb import AnglE

# Load model
model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()

# Enable dynamic batching
model.encode = batched.dynamically(model.encode, batch_size=64)

# Encode large batch
vecs = model.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
] * 50)

🕸️ Custom Training

💡 For complete details, see the official training documentation.


🗂️ Step 1: Prepare Your Dataset

AnglE supports three dataset formats. Choose based on your task:

Format | Columns | Description | Use Case
Format A | text1, text2, label | Paired texts with similarity scores (0-1) | Similarity scoring
Format B | query, positive | Query-document pairs | Retrieval without hard negatives
Format C | query, positive, negative | Query with positive and negative samples | Contrastive learning

Notes:

  • All formats use HuggingFace datasets.Dataset
  • text1, text2, query, positive, and negative can be str or List[str] (when a list is given, one element is randomly sampled)
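
For illustration, a minimal Format A dataset can be built directly with datasets.Dataset.from_dict; the column names and label range follow the table above (a hedged sketch, not taken from the official docs):

from datasets import Dataset

# Format A: paired texts with a similarity label in [0, 1]
train_ds = Dataset.from_dict({
    'text1': ['The weather is great!', 'i am going to bed'],
    'text2': ['The weather is very good!', 'it is rainy today.'],
    'label': [0.95, 0.1],
})
print(train_ds)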

🚂 Step 2: Training Methods

Option A: CLI Training (Recommended)

Single GPU:

CUDA_VISIBLE_DEVICES=0 angle-trainer --help

Multi-GPU with FSDP:

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --main_process_port 2345 \
  --config_file examples/FSDP/fsdp_config.yaml \
  -m angle_emb.angle_trainer \
  --gradient_checkpointing 1 \
  --use_reentrant 0 \
  ...

Multi-GPU (Standard):

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --main_process_port 2345 \
  -m angle_emb.angle_trainer \
  --model_name_or_path YOUR_MODEL \
  --train_name_or_path YOUR_DATASET \
  ...

📁 More examples: examples/Training


Option B: Python API Training


from datasets import load_dataset
from angle_emb import AnglE

# Step 1: Load pretrained model
angle = AnglE.from_pretrained(
    'SeanLee97/angle-bert-base-uncased-nli-en-v1',
    max_length=128,
    pooling_strategy='cls'
).cuda()

# Step 2: Prepare dataset (Format A example)
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {
    "text1": str(obj["sentence1"]),
    "text2": str(obj['sentence2']),
    "label": obj['score']
})
ds = ds.select_columns(["text1", "text2", "label"])

# Step 3: Train the model
angle.fit(
    train_ds=ds['train'].shuffle(),
    valid_ds=ds['validation'],
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'cosine_w': 1.0,
        'ibn_w': 1.0,
        'angle_w': 0.02,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 20
    },
    fp16=True,
    logging_steps=100
)

# Step 4: Evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)

⚙️ Advanced Configuration

Training Special Models

Model Type | CLI Flags | Description
LLM | --is_llm 1 plus LoRA parameters | LLM mode must be enabled manually
BiLLM | --apply_billm 1 --billm_model_class LlamaForCausalLM | Bidirectional LLMs (guide)
Espresso (ESE) | --apply_ese 1 --ese_kl_temperature 1.0 --ese_compression_size 256 | Matryoshka-style embeddings

Applying Prompts

Format | Flag | Applies To
Format A | --text_prompt "text: {text}" | Both text1 and text2
Format B/C | --query_prompt "query: {text}" | query field
Format B/C | --doc_prompt "document: {text}" | positive and negative fields

Column Mapping (Legacy Compatibility)

Adapt old datasets without modification:

# CLI
--column_rename_mapping "text:query"

# Python
column_rename_mapping={"text": "query"}

Model Conversion

Convert trained models to sentence-transformers format:

python scripts/convert_to_sentence_transformers.py --help

💡 Fine-tuning Tips

📖 Full documentation

Format | Recommendation
Format A | Increase cosine_w or decrease ibn_w
Format B | Only tune ibn_w and ibn_tau
Format C | Set cosine_w=0, angle_w=0.02, and configure cln_w + ibn_w
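
As a concrete (but hypothetical) reading of the Format A advice, the weights can be adjusted through the same loss_kwargs used in the Python training example above; the specific values here are illustrative, not recommendations:

# Hypothetical Format A tuning: raise cosine_w and lower ibn_w relative to the defaults used earlier
loss_kwargs = {
    'cosine_w': 2.0,   # increased
    'ibn_w': 0.5,      # decreased
    'angle_w': 0.02,
    'cosine_tau': 20,
    'ibn_tau': 20,
    'angle_tau': 20,
}
# Pass as angle.fit(..., loss_kwargs=loss_kwargs)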

Prevent Catastrophic Forgetting:

  • Set teacher_name_or_path for knowledge distillation
  • Use the same model path for self-distillation
  • ⚠️ Ensure teacher and student use the same tokenizer

🔄 Integration with sentence-transformers

Task | Status | Notes
Training | ⚠️ Partial | SentenceTransformers includes an AnglE loss, but use the official angle_emb for best results
Inference | ✅ Full | Convert trained models with examples/convert_to_sentence_transformers.py
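
For inference through sentence-transformers, a converted model (or an official checkpoint that already ships with a sentence-transformers configuration, such as WhereIsAI/UAE-Large-V1 per its model card) can be loaded in the usual way; a minimal sketch:

from sentence_transformers import SentenceTransformer, util

# Load an AnglE-trained model via sentence-transformers
model = SentenceTransformer('WhereIsAI/UAE-Large-V1')

# Encode and compare two sentences
emb = model.encode(['The weather is great!', 'The weather is very good!'])
print(util.cos_sim(emb[0], emb[1]))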

🫡 Citation

If you use our code and pre-trained models, please support us by citing our work as follows:

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}

📜 ChangeLogs

📅 Date | Description
2025 Jan | v0.6.0 - Major refactoring 🎉:
  • Removed AngleDataTokenizer - no need to pre-tokenize datasets!
  • Removed DatasetFormats class - use string literals ('A', 'B', 'C')
  • Removed auto-detection of LLM models - set is_llm manually
  • Renamed --prompt_template to --text_prompt (Format A only)
  • Added --query_prompt and --doc_prompt for Format B/C
  • Added --column_rename_mapping to adapt old datasets without modification
  • Updated data formats: Format B/C now use query, positive, negative fields
  • Support for list-based sampling in Format B/C
  • Updated examples to use accelerate launch
  • See MIGRATION_GUIDE.md for upgrade instructions
2024 May 21 | Support Espresso Sentence Embeddings
2024 Feb 7 | Support training with only positive pairs (Format C: query, positive)
2023 Dec 4 | Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1
2023 Nov 2 | Release an English pretrained model: SeanLee97/angle-llama-13b-nli
2023 Oct 28 | Release two Chinese pretrained models: SeanLee97/angle-roberta-wwm-base-zhnli-v1 and SeanLee97/angle-llama-7b-zhnli-v1; add Chinese README.md

📧 Contact

If you have any questions or suggestions, please feel free to contact us via email: xmlee97@gmail.com

© License

This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.


