The official llm2vec library

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

LLM2Vec is a simple recipe to convert decoder-only LLMs into text encoders. It consists of 3 simple steps: 1) enabling bidirectional attention, 2) training with masked next token prediction, and 3) unsupervised contrastive learning. The model can be further fine-tuned to achieve state-of-the-art performance.
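
To make the first step concrete, the sketch below contrasts a causal attention mask with a bidirectional one. It is a plain-Python illustration only, not the library's implementation; the function names `causal_mask`, `bidirectional_mask`, and `show` are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    A decoder-only LLM normally lets each token see itself and earlier tokens.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """With bidirectional attention, every position may attend to every other."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join(
        " ".join("x" if allowed else "." for allowed in row) for row in mask
    )


print("causal:")
print(show(causal_mask(4)))
print("bidirectional:")
print(show(bidirectional_mask(4)))
```

In the causal case the upper triangle is blocked, so a token's representation cannot depend on later tokens. Removing that restriction lets every token representation use the full input.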

Installation

To use LLM2Vec, first install the llm2vec package from PyPI, followed by installing flash-attention:

pip install llm2vec
pip install flash-attn --no-build-isolation

You can also install the latest version of llm2vec from source by cloning the repository and running the following from its root directory:

pip install -e .
pip install flash-attn --no-build-isolation

Getting Started

The LLM2Vec class is a wrapper around HuggingFace models that adds support for bidirectional attention in decoder-only LLMs, sequence encoding, and pooling operations. The steps below show how to use the library.

Preparing the model

Initializing an LLM2Vec model from a pretrained LLM is straightforward. The from_pretrained method of LLM2Vec takes a base model identifier or path and an optional PEFT model identifier or path. All HuggingFace model loading arguments can also be passed to from_pretrained. By default, models are loaded with bidirectional connections enabled; pass enable_bidirectional=False to from_pretrained to turn this off.

Here, we first initialize the Mistral MNTP base model and then load the unsupervised LoRA weights (trained with the SimCSE objective on a wiki corpus).

import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

We can also load the model with supervised-trained LoRA weights (trained with contrastive learning and public E5 data) by changing the peft_model_name_or_path.

import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

By default, the LLM2Vec model uses the mean pooling strategy. You can change the pooling strategy by passing the pooling_mode argument to from_pretrained. Similarly, you can change the maximum sequence length by passing the max_length argument (the default is 512).
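
As a rough illustration of what mean pooling does, the sketch below averages per-token vectors while skipping padding positions. It is a plain-Python sketch under simplifying assumptions: the function name `mean_pool` and the toy vectors are hypothetical, and the library's own pooling works on batched tensors and may differ in details.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of the tokens whose mask value is 1.

    token_vectors: list of equal-length lists of floats, one per token.
    attention_mask: list of 0/1 flags, 0 marking padding tokens.
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The result is a single fixed-size vector per input, regardless of how many tokens the input has.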

Inference

The model now returns a text embedding for any input of the form [[instruction1, text1], [instruction2, text2]] or [text1, text2]. During training, we provide instructions for both sentences in symmetric tasks, and only for queries in asymmetric tasks.

# Encoding queries using instructions
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Encoding documents. Instructions are not required for documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v.encode(documents)

# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

print(cos_sim)
"""
tensor([[0.5485, 0.0551],
        [0.0565, 0.5425]])
"""

More examples covering classification, clustering, sentence similarity, etc. are available in the examples directory.
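
The similarity matrix printed in the inference example can be turned into a per-query ranking of documents. The sketch below does this in plain Python, using the values from that printed output copied into a nested list; the helper name `rank_documents` is hypothetical and not part of the library.

```python
# Rows are queries, columns are documents (values copied from the output above).
cos_sim = [
    [0.5485, 0.0551],
    [0.0565, 0.5425],
]


def rank_documents(scores):
    """Return document indices sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


rankings = [rank_documents(row) for row in cos_sim]
print(rankings)  # [[0, 1], [1, 0]]
```

Each query ranks its matching document first: the protein question retrieves the protein passage, and "summit define" retrieves the dictionary definition.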

Training

MNTP training

To train a model with Masked Next Token Prediction (MNTP), use the experiments/run_mntp.py script, which is adapted from the HuggingFace Masked Language Modeling (MLM) script. To train the Mistral-7B model with MNTP, run the following command:

python experiments/run_mntp.py train_configs/mntp/Mistral.json

The Mistral training configuration file contains all the training hyperparameters and configurations used in our paper.

{
    "model_name_or_path": "mistralai/Mistral-7B-Instruct-v0.2",
    "dataset_name": "wikitext",
    "dataset_config_name": "wikitext-103-raw-v1",
    "mask_token_type": "blank",
    "data_collator_type": "all_mask",
    "mlm_probability": 0.8,
    "lora_r": 16,
    "gradient_checkpointing": true,
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2"
    // ....
}

Similar configurations are also available for Llama-2-7B and Sheared-Llama-1.3B models.
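
The sketch below illustrates the idea behind MNTP on a toy token list: tokens are masked at random, and the target for a masked token at position i is attached to position i - 1, so the prediction is read from the previous position as in next token prediction. This is a simplified illustration and not the code in experiments/run_mntp.py; the function name `mntp_example`, the `MASK` placeholder, and the `IGNORE` marker are assumptions made for this sketch.

```python
import random

MASK = "_"     # placeholder mask token, chosen for this sketch
IGNORE = None  # marks positions that do not contribute to the loss


def mntp_example(tokens, mask_probability, rng):
    """Mask tokens at random and build shifted labels.

    labels[i - 1] holds the original token at masked position i, so the
    prediction for a masked token is taken from the previous position.
    """
    inputs = list(tokens)
    labels = [IGNORE] * len(tokens)
    for i in range(1, len(tokens)):  # position 0 has no previous position
        if rng.random() < mask_probability:
            inputs[i] = MASK
            labels[i - 1] = tokens[i]
    return inputs, labels


rng = random.Random(0)  # seeded so repeated runs mask the same positions
inputs, labels = mntp_example(["the", "cat", "sat", "down"], 0.8, rng)
print(inputs)
print(labels)
```

The `mask_probability` argument plays the role that "mlm_probability" plays in the configuration file, though the real script applies masking through a data collator rather than a helper like this.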

Citation

If you find our work helpful, please cite us:

@article{llm2vec,
      title={{LLM2Vec}: {L}arge Language Models Are Secretly Powerful Text Encoders}, 
      author={Parishad BehnamGhader and Vaibhav Adlakha and Marius Mosbach and Dzmitry Bahdanau and Nicolas Chapados and Siva Reddy},
      year={2024},
      journal={arXiv preprint},
      url={https://arxiv.org/abs/2404.05961}
}

Bugs or questions?

If you have any questions about the code, feel free to open an issue on the GitHub repository.
