Embedding extractor for decoder-only LLMs


llemb: Unified Embedding Extraction from Decoder-only LLMs

llemb is a lightweight framework for extracting high-quality sentence embeddings from decoder-only large language models (LLMs) such as Llama and Mistral. It unifies various state-of-the-art pooling strategies and efficiency optimizations into a simple, coherent interface.

With llemb, you can easily leverage powerful LLMs for embedding tasks using advanced techniques like PromptEOL and PCoTEOL, with built-in support for quantization to run on consumer hardware.

Features

  • Flexible Backends: Seamless support for Hugging Face Transformers.
  • Advanced Pooling Strategies:
    • Standard: mean, last_token, eos_token
    • Research-grade: prompt_eol, pcoteol (Pretended Chain of Thought), ke (Knowledge Enhancement)
  • Efficient Inference: Native support for 4-bit and 8-bit quantization via bitsandbytes.
  • Granular Control: Extract embeddings from any layer (defaults to recommended layers based on research).

Installation

Install via PyPI using pip or uv.

Basic Installation

pip install llemb
# or
uv add llemb

With Quantization Support

To enable 4-bit/8-bit quantization (recommended for large models):

pip install "llemb[quantization]"
# or
uv add "llemb[quantization]"
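
As a rough rule of thumb, weight memory scales with the bit width, which is why quantization is recommended for large models. The sketch below is a back-of-envelope estimate only; it ignores activations, the KV cache, and quantization overhead such as scales and zero-points.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate memory for model weights alone, in GB.

    Ignores activations, KV cache, and quantization metadata.
    """
    return n_params * bits / 8 / 1e9

# An 8B-parameter model, by this rough estimate:
print(weight_memory_gb(8e9, 16))  # fp16/bf16 -> 16.0 GB
print(weight_memory_gb(8e9, 4))   # 4-bit     -> 4.0 GB
```

In practice, expect a few extra GB on top of the weight estimate for activations and the KV cache.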

Getting Started

Initialize the encoder and start extracting embeddings in just a few lines of code.

Basic Usage

import llemb

# 1. Initialize the encoder (defaults to auto-device detection)
enc = llemb.Encoder("meta-llama/Llama-3.1-8B")

# 2. Extract embeddings using mean pooling
embeddings = enc.encode("Hello world", pooling="mean")

print(embeddings.shape)
# => (1, 4096)
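
Sentence embeddings are typically compared with cosine similarity. A minimal sketch, assuming `encode` returns a NumPy array of shape `(n, dim)` as the output above suggests; the stand-in vectors here are placeholders for real `enc.encode(...)` outputs.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; in practice use rows of enc.encode(...).
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

print(cosine_similarity(a, b))  # identical vectors  -> 1.0
print(cosine_similarity(a, c))  # orthogonal vectors -> 0.0
```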

Advanced Usage (Quantization & Research Strategies)

Use quantization to reduce memory usage, and apply advanced pooling strategies such as pcoteol for better representations.

import llemb

# Initialize with 4-bit quantization and force CUDA
enc = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    backend="transformers",
    device="cuda",
    quantization="4bit"
)

# Extract using "Pretended Chain of Thought" strategy
# Note: Automatically uses the second-to-last layer (layer -2) as recommended
embeddings = enc.encode("Hello world", pooling="pcoteol")
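
Prompt-based strategies like `prompt_eol` and `pcoteol` work by wrapping the input in a template and reading out the hidden state at the final token position. The sketch below illustrates the idea with a PromptEOL-style template; the exact wording is an approximation of Jiang et al. (2024), not necessarily what llemb uses internally.

```python
def prompt_eol(text: str) -> str:
    """Build a PromptEOL-style template (approximation of Jiang et al., 2024).

    The model's hidden state at the final position, right after the
    closing quote, is taken as the sentence embedding.
    """
    return f'This sentence : "{text}" means in one word:"'

print(prompt_eol("Hello world"))
```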

Configuration & Optimization

llemb passes arguments directly to the backend, allowing for deep customization.

Using Flash Attention 2

import llemb
import torch

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)

Custom Quantization Config

import llemb
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config
)

Supported Pooling Strategies

| Strategy     | Description                                                            | Recommended Layer |
|--------------|------------------------------------------------------------------------|-------------------|
| `mean`       | Average pooling over all tokens (excluding padding).                   | -1 (last)         |
| `last_token` | Hidden state of the final input token.                                 | -1 (last)         |
| `eos_token`  | Hidden state at the EOS token position.                                | -1 (last)         |
| `prompt_eol` | Extraction via a prompt template targeting the last token (PromptEOL). | -1 (last)         |
| `pcoteol`    | "Pretended Chain of Thought": wraps input in a reasoning template.     | -2                |
| `ke`         | "Knowledge Enhancement": wraps input in a context-aware template.      | -2                |
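
The `mean` row above excludes padding, which matters for batched inputs. A minimal NumPy sketch of masked mean pooling over last-layer hidden states, with made-up toy shapes (not llemb internals):

```python
import numpy as np

def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean-pool token vectors, excluding padding positions.

    hidden: (batch, seq_len, dim) last-layer hidden states
    mask:   (batch, seq_len) attention mask, 1 = real token, 0 = padding
    """
    m = mask[:, :, None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * m).sum(axis=1)          # (batch, dim)
    counts = m.sum(axis=1)                     # (batch, 1)
    return summed / counts

# Toy example: batch of 1, three positions, last one is padding.
hidden = np.array([[[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(masked_mean_pool(hidden, mask))  # -> [[2. 4.]]
```

Without the mask, the padding row would dominate the average; with it, only the two real tokens contribute.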

Development

Clone the repository and sync dependencies using uv:

git clone https://github.com/j341nono/llemb.git
cd llemb
uv sync --all-extras --dev

Run Tests

uv run pytest

Static Analysis

uv run ruff check src
uv run mypy src

Citations

If you use the advanced pooling strategies implemented in this library, please cite the respective original papers:

PromptEOL:

@inproceedings{jiang-etal-2024-scaling,
    title = "Scaling Sentence Embeddings with Large Language Models",
    author = "Jiang, Ting and Huang, Shaohan and Luan, Zhongzhi and Wang, Deqing and Zhuang, Fuzhen",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    year = "2024"
}

PCoTEOL and KE:

@article{zhang2024simple,
    title={Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models},
    author={Zhang, Bowen and Chang, Kehua and Li, Chunping},
    journal={arXiv preprint arXiv:2404.03921},
    year={2024}
}

License

This project is open source and available under the Apache-2.0 license.
