AnglE-optimized Text Embeddings

AnglE📐: Angle-optimized Text Embeddings

It is Angle 📐, not Angel 👼.

🔥 A New SOTA for Semantic Textual Similarity!

🔥 Our universal sentence embedding WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64!

🤗 Pretrained Models

| 🤗 HF | LoRA Weight | Dependent Backbone | LLM | Language | Prompt | Pooling Strategy | Examples |
|---|---|---|---|---|---|---|---|
| WhereIsAI/UAE-Large-V1 | N | N | N | EN | `Prompts.C` for retrieval purposes, `None` for others | `cls` | |
| SeanLee97/angle-llama-13b-nli | Y | NousResearch/Llama-2-13b-hf | Y | EN | `Prompts.A` | last token | / |
| SeanLee97/angle-llama-7b-nli-v2 | Y | NousResearch/Llama-2-7b-hf | Y | EN | `Prompts.A` | last token | / |
| SeanLee97/angle-llama-7b-nli-20231027 | Y | NousResearch/Llama-2-7b-hf | Y | EN | `Prompts.A` | last token | / |
| SeanLee97/angle-bert-base-uncased-nli-en-v1 | N | N | N | EN | N | `cls_avg` | / |
| SeanLee97/angle-roberta-wwm-base-zhnli-v1 | N | N | N | ZH-CN | N | `cls` | / |
| SeanLee97/angle-llama-7b-zhnli-v1 | Y | NousResearch/Llama-2-7b-hf | Y | ZH-CN | `Prompts.B` | last token | / |

💡 If the selected model is a LoRA weight, you must also specify its dependent backbone.

For our STS Experiment, please refer to https://github.com/SeanLee97/AnglE/tree/main/examples/NLI

Results

English STS Results

| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg. |
|---|---|---|---|---|---|---|---|---|
| SeanLee97/angle-llama-7b-nli-20231027 | 78.68 | 90.58 | 85.49 | 89.56 | 86.91 | 88.92 | 81.18 | 85.90 |
| SeanLee97/angle-llama-7b-nli-v2 | 79.00 | 90.56 | 85.79 | 89.43 | 87.00 | 88.97 | 80.94 | 85.96 |
| SeanLee97/angle-llama-13b-nli | 79.33 | 90.65 | 86.89 | 90.45 | 87.32 | 89.69 | 81.32 | 86.52 |
| SeanLee97/angle-bert-base-uncased-nli-en-v1 | 75.09 | 85.56 | 80.66 | 86.44 | 82.47 | 85.16 | 81.23 | 82.37 |

Chinese STS Results

| Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg. |
|---|---|---|---|---|---|---|---|---|
| ^shibing624/text2vec-bge-large-chinese | 38.41 | 61.34 | 71.72 | 35.15 | 76.44 | 71.81 | 63.15 | 59.72 |
| ^shibing624/text2vec-base-chinese-paraphrase | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | 63.08 |
| SeanLee97/angle-roberta-wwm-base-zhnli-v1 | 49.49 | 72.47 | 78.33 | 59.13 | 77.14 | 72.36 | 60.53 | 67.06 |
| SeanLee97/angle-llama-7b-zhnli-v1 | 50.44 | 71.95 | 78.90 | 56.57 | 81.11 | 68.11 | 52.02 | 65.59 |

^ denotes baselines; their results are taken from https://github.com/shibing624/text2vec

Usage

AnglE supports two APIs: the transformers API and the AnglE API. To use the AnglE API, install AnglE first:

python -m pip install -U angle-emb

UAE

For Retrieval Purposes

For retrieval purposes, please use the prompt `Prompts.C`.

from angle_emb import AnglE, Prompts

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
angle.set_prompt(prompt=Prompts.C)
vec = angle.encode({'text': 'hello world'}, to_numpy=True)
print(vec)
vecs = angle.encode([{'text': 'hello world1'}, {'text': 'hello world2'}], to_numpy=True)
print(vecs)

For Non-Retrieval Purposes

from angle_emb import AnglE

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
vec = angle.encode('hello world', to_numpy=True)
print(vec)
vecs = angle.encode(['hello world1', 'hello world2'], to_numpy=True)
print(vecs)
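
Once you have embeddings, a common next step is to compare them with cosine similarity. The sketch below assumes that `encode(..., to_numpy=True)` returns a 2-D NumPy array with one row per input. The helper `cosine_similarity_matrix` and the stand-in vectors are illustrative only and are not part of `angle_emb`.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = np.atleast_2d(a).astype(np.float64)
    b = np.atleast_2d(b).astype(np.float64)
    # Normalize each row to unit length, then take dot products.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Stand-in vectors; in practice these would come from angle.encode(..., to_numpy=True).
vecs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
sims = cosine_similarity_matrix(vecs, vecs)
print(np.round(sims, 3))
```

Entry `sims[i, j]` is the similarity between input `i` and input `j`, so the diagonal is always 1.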

Difference between retrieval and non-retrieval sentence embeddings

In UAE, we use different approaches for retrieval and non-retrieval tasks, each serving a different purpose.

Retrieval tasks aim to find relevant documents, and as a result, the related documents may not have strict semantic similarities to each other.

For instance, for the query "How about ChatGPT?", the related documents are those that contain information about ChatGPT, such as "ChatGPT is amazing..." or "ChatGPT is bad...".

Conversely, non-retrieval tasks, such as semantic textual similarity, require sentences that are semantically similar.

For example, a sentence semantically similar to "How about ChatGPT?" could be "What is your opinion about ChatGPT?".

To distinguish between these two types of tasks, we use different prompts.

For retrieval tasks, we use the prompt "Represent this sentence for searching relevant passages: {text}" (Prompts.C in angle_emb).

For non-retrieval tasks, we set the prompt to empty, i.e., just input your text without specifying a prompt.

So, if your scenario is retrieval-related, it is highly recommended to set the prompt with `angle.set_prompt(prompt=Prompts.C)`. If not, leave the prompt empty or use `angle.set_prompt(prompt=None)`.
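
To make the effect of the prompt concrete, here is a minimal sketch of the string the model would see in each case. It assumes the prompt is a plain `str.format` template with a `{text}` placeholder, as the template quoted above suggests. `build_model_input` is a hypothetical helper for illustration, not an `angle_emb` function.

```python
# Template quoted from the text above; angle_emb exposes it as Prompts.C.
RETRIEVAL_PROMPT = "Represent this sentence for searching relevant passages: {text}"


def build_model_input(text: str, for_retrieval: bool) -> str:
    """Hypothetical helper: wrap the text in the retrieval prompt, or pass it through."""
    if for_retrieval:
        return RETRIEVAL_PROMPT.format(text=text)
    return text


query = "How about ChatGPT?"
print(build_model_input(query, for_retrieval=True))
print(build_model_input(query, for_retrieval=False))
```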

Angle-LLaMA

AnglE

from angle_emb import AnglE, Prompts

angle = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf', pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2')

print('All predefined prompts:', Prompts.list_prompts())
angle.set_prompt(prompt=Prompts.A)
print('prompt:', angle.prompt)
vec = angle.encode({'text': 'hello world'}, to_numpy=True)
print(vec)
vecs = angle.encode([{'text': 'hello world1'}, {'text': 'hello world2'}], to_numpy=True)
print(vecs)

transformers

from angle_emb import Prompts  # Prompts.A is used in decorate_text below
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

peft_model_id = 'SeanLee97/angle-llama-7b-nli-v2'
config = PeftConfig.from_pretrained(peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path).bfloat16().cuda()
model = PeftModel.from_pretrained(model, peft_model_id).cuda()

def decorate_text(text: str):
    return Prompts.A.format(text=text)

inputs = 'hello world!'
tok = tokenizer([decorate_text(inputs)], return_tensors='pt')
for k, v in tok.items():
    tok[k] = v.cuda()
vec = model(output_hidden_states=True, **tok).hidden_states[-1][:, -1].float().detach().cpu().numpy()
print(vec)
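
The snippet above takes the hidden state at position `-1`, which is the last token when a single unpadded sentence is encoded. With a right-padded batch, position `-1` may be a padding token instead. The NumPy sketch below shows one way to select the last real token per sequence from the attention mask. It assumes right padding; with left padding, indexing `[:, -1]` already gives the last real token. `last_token_pool` is an illustrative name and not an `angle_emb` or `transformers` function.

```python
import numpy as np


def last_token_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last non-padding token in each sequence.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    Assumes right padding, i.e. real tokens come first in each row.
    """
    last_idx = attention_mask.sum(axis=1) - 1  # index of the last real token
    return hidden[np.arange(hidden.shape[0]), last_idx]


# Toy batch: 2 sequences, 3 positions, hidden size 2. The second row has one pad.
hidden = np.arange(2 * 3 * 2, dtype=np.float32).reshape(2, 3, 2)
mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = last_token_pool(hidden, mask)
print(pooled)
```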

Angle-BERT

AnglE

from angle_emb import AnglE

angle = AnglE.from_pretrained('SeanLee97/angle-bert-base-uncased-nli-en-v1', pooling_strategy='cls_avg').cuda()
vec = angle.encode('hello world', to_numpy=True)
print(vec)
vecs = angle.encode(['hello world1', 'hello world2'], to_numpy=True)
print(vecs)

transformers

import torch
from transformers import AutoModel, AutoTokenizer

model_id = 'SeanLee97/angle-bert-base-uncased-nli-en-v1'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).cuda()

inputs = 'hello world!'
tok = tokenizer([inputs], return_tensors='pt')
for k, v in tok.items():
    tok[k] = v.cuda()
hidden_state = model(**tok).last_hidden_state
vec = (hidden_state[:, 0] + torch.mean(hidden_state, dim=1)) / 2.0
print(vec)
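
The last two lines above implement `cls_avg` pooling for a single sentence: the average of the `[CLS]` vector and the mean of all token vectors. For a padded batch, the mean should usually ignore padding positions. The NumPy sketch below shows one masked variant. It is an illustration under that assumption and is not confirmed to match the pooling code inside `angle_emb`; `cls_avg_pool` is a hypothetical name.

```python
import numpy as np


def cls_avg_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average of the first-token vector and the masked mean over real tokens.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    mean = (hidden * mask).sum(axis=1) / mask.sum(axis=1)  # padding is excluded
    cls = hidden[:, 0]
    return (cls + mean) / 2.0


# One sequence with two real tokens and one padding position.
hidden = np.array([[[1.0, 1.0], [3.0, 5.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = cls_avg_pool(hidden, mask)
print(pooled)
```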

Custom Train

Use `angle-trainer` to train your AnglE model from the command line. To list the available options, run:

CUDA_VISIBLE_DEVICES=0 angle-trainer --help

Example

from datasets import load_dataset
from angle_emb import AnglE, AngleDataTokenizer


# 1. load pretrained model
angle = AnglE.from_pretrained('SeanLee97/angle-bert-base-uncased-nli-en-v1', max_length=128, pooling_strategy='cls').cuda()

# 2. load dataset
# `text1`, `text2`, and `label` are three required columns.
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {"text1": str(obj["sentence1"]), "text2": str(obj['sentence2']), "label": obj['score']})
ds = ds.select_columns(["text1", "text2", "label"])

# 3. transform data
train_ds = ds['train'].shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
valid_ds = ds['validation'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
test_ds = ds['test'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)

# 4. fit
angle.fit(
    train_ds=train_ds,
    valid_ds=valid_ds,
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'w1': 1.0,
        'w2': 1.0,
        'w3': 1.0,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 1.0
    },
    fp16=True,
    logging_steps=100
)

# 5. evaluate
corrcoef, accuracy = angle.evaluate(test_ds, device=angle.device)
print('corrcoef:', corrcoef)
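
STS results such as those in the tables above are commonly reported as Spearman correlation between predicted similarities and gold labels, scaled by 100. Assuming `corrcoef` is a coefficient of this kind (this README does not state it), the sketch below shows how such a score can be computed with NumPy alone. The `rank` and `spearman` helpers and the sample numbers are illustrative, and ties are not handled.

```python
import numpy as np


def rank(values: np.ndarray) -> np.ndarray:
    """Rank values from 0 to n-1; assumes no ties for simplicity."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(len(values))
    return ranks


def spearman(x, y) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = rank(np.asarray(x)), rank(np.asarray(y))
    return float(np.corrcoef(rx, ry)[0, 1])


gold = [0.0, 1.5, 3.2, 4.8, 5.0]       # made-up gold similarity labels
pred = [0.10, 0.35, 0.30, 0.80, 0.95]  # made-up predicted cosine similarities
score = spearman(gold, pred)
print(round(score, 4))
```

Because only the ordering matters, the predicted similarities do not need to be on the same scale as the labels.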

Citation

You are welcome to use our code and pre-trained models. If you do, please support us by citing our work as follows:

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}

ChangeLogs

| 📅 | Description |
|---|---|
| 2024 Jan 11 | Refactor to support `angle-trainer` and BeLLM |
| 2023 Dec 4 | Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1 |
| 2023 Nov 2 | Release an English pretrained model: `SeanLee97/angle-llama-13b-nli` |
| 2023 Oct 28 | Release two Chinese pretrained models: `SeanLee97/angle-roberta-wwm-base-zhnli-v1` and `SeanLee97/angle-llama-7b-zhnli-v1`; add Chinese README.md |
