AnglE-optimized Text Embeddings
EN | 简体中文
AnglE 📐
Sponsored by Mixedbread
For more detailed usage, please read the 📘 document: https://angle.readthedocs.io/en/latest/index.html
📢 Train/Infer Powerful Sentence Embeddings with AnglE. This library is from the paper AnglE: Angle-optimized Text Embeddings. It allows training state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework, capable of inferring a variety of transformer-based sentence embeddings.
✨ Features
Loss:
- 📐 AnglE loss
- ⚖ Contrastive loss
- 📏 CoSENT loss
- ☕️ Espresso loss (previously known as 2DMSE, detail: README_ESE)
Backbones:
- BERT-based models (BERT, RoBERTa, ELECTRA, ALBERT, etc.)
- LLM-based models (LLaMA, Mistral, Qwen, etc.)
- Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELMo, etc.; refer to: https://github.com/WhereIsAI/BiLLM)
Training:
- Single-GPU training
- Multi-GPU training
🏆 Achievements
📅 May 16, 2024 | Paper "AnglE: Angle-optimized Text Embeddings" is accepted by ACL 2024 Main Conference.
📅 Mar 13, 2024 | Paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" is accepted by NAACL 2024 Main Conference.
📅 Mar 8, 2024 | 🍞 mixedbread's embedding (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model is trained using AnglE. Congrats mixedbread!
📅 Dec 4, 2023 | Our universal sentence embedding WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model is trained using AnglE.
📅 Dec, 2023 | AnglE achieves SOTA performance on the STS Benchmark semantic textual similarity task!
🤗 Official Pretrained Models
BERT-based models:
🤗 HF | Max Tokens | Pooling Strategy | Scenario |
---|---|---|---|
WhereIsAI/UAE-Large-V1 | 512 | cls | English, General-purpose |
WhereIsAI/UAE-Code-Large-V1 | 512 | cls | Code Similarity |
WhereIsAI/pubmed-angle-base-en | 512 | cls | Medical Similarity |
WhereIsAI/pubmed-angle-large-en | 512 | cls | Medical Similarity |
LLM-based models:
🤗 HF (lora weight) | Backbone | Max Tokens | Prompts | Pooling Strategy | Scenario |
---|---|---|---|---|---|
SeanLee97/angle-llama-13b-nli | NousResearch/Llama-2-13b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
SeanLee97/angle-llama-7b-nli-v2 | NousResearch/Llama-2-7b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
💡 You can find more third-party embeddings trained with AnglE in the HuggingFace Collection.
🚀 Quick Start
⬇️ Installation
python -m pip install -U angle-emb
⌛ Infer BERT-based Model
- With Prompts: You can specify a prompt with `prompt=YOUR_PROMPT` in the `encode` method. If a prompt is set, the inputs should be a list of dicts or a single dict with the key `text`, where `text` is the placeholder in the prompt for the input text. You can use other placeholder names. We provide a set of predefined prompts in the `Prompts` class; you can check them via `Prompts.list_prompts()`.
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# For retrieval tasks, we use `Prompts.C` as the prompt for the query when using UAE-Large-V1 (no need to specify a prompt for documents).
# When specifying a prompt, the inputs should be a list of dicts with the key 'text'.
qv = angle.encode({'text': 'what is the weather?'}, to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed'
], to_numpy=True)
for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
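The predefined prompts all use {text} as the placeholder, but you can also pass your own prompt string with a custom placeholder name, as long as the input dict uses the same key. A minimal sketch (the prompt wording and the 'sentence' key below are made up for illustration):
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Hypothetical custom prompt; the dict key ('sentence') matches the placeholder name in it.
custom_prompt = 'Represent this sentence for retrieval: {sentence}'
qv = angle.encode({'sentence': 'what is the weather?'}, to_numpy=True, prompt=custom_prompt)

doc_vecs = angle.encode(['The weather is great!', 'it is rainy today.'], to_numpy=True)
for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))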
- Without Prompts: There is no need to specify a prompt; just input a list of strings or a single string.
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# For non-retrieval tasks, there is no need to specify a prompt when using UAE-Large-V1.
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
])
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
⌛ Infer LLM-based Models
If the pretrained weights are LoRA-based, you need to specify the backbone via `model_name_or_path` and the LoRA path via `pretrained_lora_path` in the `from_pretrained` method.
import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity
angle = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf',
                              pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
                              pooling_strategy='last',
                              is_llm=True,
                              torch_dtype=torch.float16).cuda()
print('All predefined prompts:', Prompts.list_prompts())
doc_vecs = angle.encode([
    {'text': 'The weather is great!'},
    {'text': 'The weather is very good!'},
    {'text': 'i am going to bed'}
], prompt=Prompts.A)
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
⌛ Infer BiLLM-based Models
Specify `apply_billm` and `billm_model_class` to load and infer BiLLM models.
import os
# set an environment variable for billm start index
os.environ['BiLLM_START_INDEX'] = '31'
import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity
# specify `apply_billm` and `billm_model_class` to load billm models
angle = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf',
                              pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
                              pooling_strategy='last',
                              is_llm=True,
                              apply_billm=True,
                              billm_model_class='LlamaForCausalLM',
                              torch_dtype=torch.float16).cuda()
doc_vecs = angle.encode([
    {'text': 'The weather is great!'},
    {'text': 'The weather is very good!'},
    {'text': 'i am going to bed'}
], prompt='The representative word for sentence {text} is:"')
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
⌛ Infer Espresso/Matryoshka Models
Specify `layer_index` and `embedding_size` to truncate embeddings.
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity
angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()
# truncate layer
angle = angle.truncate_layer(layer_index=22)
# specify embedding size to truncate embeddings
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], embedding_size=768)
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
⌛ Infer Third-party Models
You can load any transformer-based third-party models such as `mixedbread-ai/mxbai-embed-large-v1`, `sentence-transformers/all-MiniLM-L6-v2`, and `BAAI/bge-large-en-v1.5` using `angle_emb`.
Here is an example:
from angle_emb import AnglE
model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()
vec = model.encode('hello world', to_numpy=True)
print(vec)
Batch Inference
It is recommended to use Mixedbread's `batched` library to speed up the inference process.
python -m pip install batched
import batched
from angle_emb import AnglE
model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()
model.encode = batched.dynamically(model.encode, batch_size=64)
vecs = model.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
] * 50)
🕸️ Custom Train
💡 For more details, please refer to the training and fine-tuning documentation.
🗂️ 1. Data Preparation
We currently support three dataset formats:
- `DatasetFormats.A`: a pair format with three columns: `text1`, `text2`, and `label` (0/1).
- `DatasetFormats.B`: a triple format with three columns: `text`, `positive`, and `negative`. `positive` and `negative` store the positive and negative samples of `text`.
- `DatasetFormats.C`: a pair format with two columns: `text` and `positive`. `positive` stores the positive sample of `text`.
You need to prepare your data as a huggingface `datasets.Dataset` in one of the above formats, according to your supervised data.
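For example, a `DatasetFormats.A` dataset can be built from an in-memory list of dicts. A minimal sketch (the sentences and labels below are made up for illustration):
from datasets import Dataset

# Each row follows DatasetFormats.A: `text1`, `text2`, and a 0/1 `label`.
pairs = [
    {'text1': 'The weather is great!', 'text2': 'The weather is very good!', 'label': 1},
    {'text1': 'The weather is great!', 'text2': 'i am going to bed', 'label': 0},
]
train_ds = Dataset.from_list(pairs)
print(train_ds)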
🚂 2. Train with CLI [Recommended]
Use `angle-trainer` to train your AnglE model in CLI mode.
- Single gpu training:
Usage:
CUDA_VISIBLE_DEVICES=0 angle-trainer --help
- Multi-gpu training:
Usage:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 -m angle_emb.angle_trainer --help
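For reference, a single-GPU run might look roughly like the sketch below. The flag names are assumptions mirroring the Python `fit()`/`from_pretrained()` arguments used elsewhere in this README, and the training dataset name is a placeholder; check `angle-trainer --help` for the authoritative argument list.
# Sketch only: verify flag names with `angle-trainer --help`; the dataset name is a placeholder.
CUDA_VISIBLE_DEVICES=0 angle-trainer \
    --model_name_or_path SeanLee97/angle-bert-base-uncased-nli-en-v1 \
    --train_name_or_path your-username/your-dataset-in-format-A \
    --save_dir ckpts/custom-angle \
    --pooling_strategy cls \
    --batch_size 32 \
    --epochs 5 \
    --learning_rate 2e-5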
🚂 3. Custom Train
from datasets import load_dataset
from angle_emb import AnglE, AngleDataTokenizer
# 1. load pretrained model
angle = AnglE.from_pretrained('SeanLee97/angle-bert-base-uncased-nli-en-v1', max_length=128, pooling_strategy='cls').cuda()
# 2. load dataset
# `text1`, `text2`, and `label` are three required columns.
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {"text1": str(obj["sentence1"]), "text2": str(obj['sentence2']), "label": obj['score']})
ds = ds.select_columns(["text1", "text2", "label"])
# 3. transform data
train_ds = ds['train'].shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
valid_ds = ds['validation'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
# 4. fit
angle.fit(
    train_ds=train_ds,
    valid_ds=valid_ds,
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'cosine_w': 1.0,
        'ibn_w': 20.0,
        'angle_w': 1.0,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 20
    },
    fp16=True,
    logging_steps=100
)
# 5. evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)
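After training, the saved weights can be loaded back with `AnglE.from_pretrained` for inference. A sketch, assuming the fine-tuned model was written under ckpts/sts-b as configured above; depending on your save settings the best weights may live in a checkpoint sub-directory, so adjust the path accordingly:
from angle_emb import AnglE

# Path assumption: fine-tuned weights saved by `fit` under output_dir (or one of its
# checkpoint sub-directories); adjust to the actual checkpoint location.
angle = AnglE.from_pretrained('ckpts/sts-b', pooling_strategy='cls').cuda()
vec = angle.encode('The weather is great!', to_numpy=True)
print(vec.shape)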
💡 Others
- To enable `llm` training, please specify `--is_llm 1` and configure appropriate LoRA hyperparameters.
- To enable `billm` training, please specify `--apply_billm 1` and configure an appropriate `billm_model_class` such as `LlamaForCausalLM` (refer to: https://github.com/WhereIsAI/BiLLM?tab=readme-ov-file#usage).
- To enable Espresso sentence embeddings (ESE), please specify `--apply_ese 1` and configure appropriate ESE hyperparameters via `--ese_kl_temperature float` and `--ese_compression_size integer`.
- To convert the trained AnglE models to `sentence-transformers`, please run `python scripts/convert_to_sentence_transformers.py --help` for more details.
💡 4. Fine-tuning Tips
1️⃣ If your dataset format is `DatasetFormats.A`, it is recommended to slightly increase the weight for `cosine_w` or slightly decrease the weight for `ibn_w`.
2️⃣ If your dataset format is `DatasetFormats.B`, it is recommended to set `cosine_w` to 0 and increase the weight for `ibn_w`, e.g., to 10 or 20. It is recommended to set `angle_tau` to 20.0.
3️⃣ If your dataset format is `DatasetFormats.C`, only `ibn_w` and `ibn_tau` are effective. You don't need to tune other parameters.
4️⃣ To alleviate information forgetting during fine-tuning, it is better to specify `teacher_name_or_path`. If `teacher_name_or_path` equals `model_name_or_path`, it will conduct self-distillation. It is worth noting that `teacher_name_or_path` has to use the same tokenizer as `model_name_or_path`; otherwise, it will lead to unexpected results.
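For example, following tip 2️⃣, a run on `DatasetFormats.B` data might adjust the loss weights roughly as below. A sketch: it assumes `train_ds` and `valid_ds` have already been prepared in `DatasetFormats.B` and tokenized with `AngleDataTokenizer` as in step 3; only the `loss_kwargs` differ materially from the earlier `fit` example.
# Sketch: loss weights following tip 2 for DatasetFormats.B (text / positive / negative).
angle.fit(
    train_ds=train_ds,
    valid_ds=valid_ds,
    output_dir='ckpts/format-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    loss_kwargs={
        'cosine_w': 0.0,    # tip 2: disable the cosine objective
        'ibn_w': 20.0,      # tip 2: increase the in-batch negative weight, e.g. 10 or 20
        'angle_w': 1.0,
        'angle_tau': 20.0   # tip 2: angle_tau recommended at 20.0
    },
    fp16=True,
    logging_steps=100
)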
5. Fine-tuning and Inferring AnglE with sentence-transformers
- Training: SentenceTransformers also provides an implementation of the AnglE loss, but it is only partially implemented and may not work as well as the official code. We recommend using the official `angle_emb` for fine-tuning AnglE models.
- Inferring: If your model is trained with `angle_emb` and you want to use it with `sentence-transformers`, you can convert it to a `sentence-transformers` model using the script `examples/convert_to_sentence_transformers.py`.
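After conversion, the exported model can be loaded with the standard sentence-transformers API. A sketch; 'path/to/converted-model' is a placeholder for wherever the conversion script saved the model:
from sentence_transformers import SentenceTransformer

# Load the converted model from a local path (placeholder) or from the Hugging Face Hub.
model = SentenceTransformer('path/to/converted-model')
vecs = model.encode(['The weather is great!', 'i am going to bed'])
print(vecs.shape)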
🫡 Citation
You are welcome to use our code and pre-trained models. If you do, please support us by citing our work as follows:
@article{li2023angle,
title={AnglE-optimized Text Embeddings},
author={Li, Xianming and Li, Jing},
journal={arXiv preprint arXiv:2309.12871},
year={2023}
}
📜 ChangeLogs
📅 | Description |
---|---|
2024 May 21 | support Espresso Sentence Embeddings |
2024 Feb 7 | support training with only positive pairs (DatasetFormats.C) |
2023 Dec 4 | Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1 |
2023 Nov 2 | Release an English pretrained model: SeanLee97/angle-llama-13b-nli |
2023 Oct 28 | Release two Chinese pretrained models: SeanLee97/angle-roberta-wwm-base-zhnli-v1 and SeanLee97/angle-llama-7b-zhnli-v1; Add Chinese README.md |
📧 Contact
If you have any questions or suggestions, please feel free to contact us via email: xmlee97@gmail.com
© License
This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.