
llemb: Unified Embedding Extraction from Decoder-only LLMs

llemb is a lightweight framework for extracting high-quality sentence embeddings from decoder-only Large Language Models (LLMs) such as Llama and Mistral. It unifies various state-of-the-art pooling strategies and efficiency optimizations behind a simple, coherent interface.

With llemb, you can easily leverage powerful LLMs for embedding tasks using advanced techniques like PromptEOL and PCoTEOL, with built-in support for quantization to run on consumer hardware.

Features

  • Flexible Backends: Seamless support for Hugging Face Transformers.
  • Advanced Pooling Strategies:
    • Standard: mean, last_token, eos_token
    • Research-grade: prompt_eol, pcoteol (Pretended Chain of Thought), ke (Knowledge Enhancement)
  • Efficient Inference: Native support for 4-bit and 8-bit quantization via bitsandbytes.
  • Granular Control: Extract embeddings from any layer (defaults to recommended layers based on research).

Installation

Install via PyPI using pip or uv.

Basic Installation

pip install llemb
# or
uv add llemb

With Quantization Support

To enable 4-bit/8-bit quantization (recommended for large models):

pip install "llemb[quantization]"
# or
uv add "llemb[quantization]"

Getting Started

Initialize the encoder and start extracting embeddings in just a few lines of code.

Basic Usage

import llemb

# 1. Initialize the encoder (defaults to auto-device detection)
enc = llemb.Encoder("meta-llama/Llama-3.1-8B")

# 2. Extract embeddings using mean pooling
embeddings = enc.encode("Hello world", pooling="mean")

print(embeddings.shape)
# => (1, 4096)
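
The returned embeddings can be fed straight into similarity search. A minimal sketch computing cosine similarity between two sentences; it assumes encode returns a NumPy-compatible array of shape (1, hidden_size), as the shape printed above suggests:

import numpy as np
import llemb

enc = llemb.Encoder("meta-llama/Llama-3.1-8B")

# Encode two sentences separately (assumes the result converts cleanly to a NumPy array)
a = np.asarray(enc.encode("A cat sits on the mat", pooling="mean"))[0]
b = np.asarray(enc.encode("A kitten rests on the rug", pooling="mean"))[0]

# Cosine similarity between the two embedding vectors
score = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {score:.3f}")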

Advanced Usage (Quantization & Research Strategies)

Use quantization to reduce memory usage, and apply advanced pooling strategies like pcoteol for stronger representations.

import llemb

# Initialize with 4-bit quantization and force CUDA
enc = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    backend="transformers",
    device="cuda",
    quantization="4bit"
)

# Extract using "Pretended Chain of Thought" strategy
# Note: Automatically uses the second-to-last layer (layer -2) as recommended
embeddings = enc.encode("Hello world", pooling="pcoteol")

Configuration & Optimization

llemb passes arguments directly to the backend, allowing for deep customization.

Using Flash Attention 2

import torch

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)

Custom Quantization Config

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config
)
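
Other Hugging Face loading options can be forwarded the same way. The sketch below assumes llemb passes unrecognized keyword arguments straight through to the Transformers backend, as described above; device_map and torch_dtype are standard from_pretrained options, not llemb-specific parameters, and whether device_map interacts cleanly with llemb's own device handling is an assumption:

import torch
import llemb

# Spread the model across available devices and load weights in bfloat16
# (device_map is a standard Transformers option, assumed here to be forwarded unchanged)
encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)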

Supported Pooling Strategies

Strategy    | Description                                                             | Recommended Layer
mean        | Mean of all token hidden states (excluding padding).                   | -1 (last)
last_token  | Hidden state of the last token in the input.                           | -1 (last)
eos_token   | Hidden state at the EOS token position.                                | -1 (last)
prompt_eol  | Wraps the input in a prompt template and takes the last-token state.   | -1 (last)
pcoteol     | "Pretended Chain of Thought": wraps the input in a reasoning template. | -2
ke          | "Knowledge Enhancement": wraps the input in a context-aware template.  | -2

Development

Clone the repository and sync dependencies using uv:

git clone https://github.com/j341nono/llemb.git
cd llemb
uv sync --all-extras --dev

Run Tests

uv run pytest

Static Analysis

uv run ruff check src
uv run mypy src

Citations

If you use the advanced pooling strategies implemented in this library, please cite the respective original papers:

PromptEOL:

@inproceedings{jiang-etal-2024-scaling,
    title = "Scaling Sentence Embeddings with Large Language Models",
    author = "Jiang, Ting and Huang, Shaohan and Luan, Zhongzhi and Wang, Deqing and Zhuang, Fuzhen",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    year = "2024"
}

PCoTEOL and KE:

@article{zhang2024simple,
    title={Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models},
    author={Zhang, Bowen and Chang, Kehua and Li, Chunping},
    journal={arXiv preprint arXiv:2404.03921},
    year={2024}
}

License

This project is open source and available under the Apache-2.0 license.
