# local-llm-embed

Sentence embeddings from any local causal LLM, no fine-tuning.
Get sentence embeddings out of any local causal LLM you already have running. No fine-tuning, no separate encoder model.
If you're running Llama / Qwen / Phi / Mistral / TinyLlama via transformers, Ollama, vLLM, or llama.cpp and want embeddings for RAG / retrieval / classification — this library extracts them from the model you already have, in a few lines of code.
## Why

| | dedicated encoder (BGE-M3, MiniLM) | this library |
|---|---|---|
| Separate ~500 MB model load | yes | no (reuses your LLM) |
| Fine-tuning needed | already trained | none |
| Works on any HF causal LM | n/a | yes |
| STS Spearman vs MiniLM-L6 | 0.867 (baseline) | 0.806 (Phi-3.5) |
| Banking77 accuracy vs MiniLM-L6 | 0.5500 | 0.5540 (wins) |
The trade-off is honest: dedicated encoders still beat raw LLM probes on pure semantic similarity (STS) by ~6 points. But on classification-style tasks like Banking77, this library matches or slightly beats the baseline — using a model you already have in memory.
## Install

```bash
pip install local-llm-embed
```

Or with the Hugging Face Hub helpers (for downloading pre-fit whiteners):

```bash
pip install "local-llm-embed[hub]"
```
## Quick start

```python
from local_llm_embed import LocalLLMEmbedder

embedder = LocalLLMEmbedder("Qwen/Qwen2.5-0.5B-Instruct")
emb = embedder.encode(["The cat sat on the mat.", "A feline rested."])
print(emb.shape)   # (2, 896)
print(emb @ emb.T) # cosine similarity matrix
```
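Since `emb @ emb.T` above is already a cosine matrix, the rows come back unit-normalized and retrieval is just a dot product. A minimal search sketch reusing the `embedder` from above; the corpus and query texts are placeholders:

```python
import numpy as np

corpus = ["The cat sat on the mat.", "Stocks fell sharply on Monday."]
query = ["Where is the cat?"]

doc_emb = embedder.encode(corpus)  # (n_docs, hidden_dim), unit-norm rows
q_emb = embedder.encode(query)     # (1, hidden_dim)

scores = (q_emb @ doc_emb.T)[0]    # cosine similarity of query vs each doc
best = int(np.argmax(scores))
print(corpus[best], float(scores[best]))
```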
By default this uses `prefix+whiten` (the most universally strong variation in our benchmarks). The whitener is fit lazily on the first batch you encode.
### Use a calibration set for better whitening

```python
calibration = [...]  # ~1000 representative texts from your domain
embedder.fit_whitener(calibration)
embedder.save_whitener("./domain_whiten.npz")

# later, in another session:
embedder = LocalLLMEmbedder(
    "Qwen/Qwen2.5-0.5B-Instruct",
    whitener_path="./domain_whiten.npz",
)
```
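For reference, the transform being saved is standard BERT-whitening: center the pooled vectors, then apply a ZCA transform built from the eigendecomposition of their covariance. A standalone numpy sketch of that math (illustrative only, not the library's internal code; `eps` guards near-zero eigenvalues):

```python
import numpy as np

def fit_zca_whitener(X: np.ndarray, eps: float = 1e-5):
    """X: (n, d) pooled vectors from the calibration set."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)  # covariance is symmetric, so eigh
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # ZCA transform
    return mu, W

def whiten(X: np.ndarray, mu: np.ndarray, W: np.ndarray) -> np.ndarray:
    Z = (X - mu) @ W
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)  # re-unit-normalize
```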
### Pick a different variation

```python
embedder = LocalLLMEmbedder(
    "microsoft/Phi-3.5-mini-instruct",
    variation="echo+whiten",  # best for STS
    layer="final",
    pooling="weighted_mean",
)
```
## Variations

Three train-free recipes are bundled. They're combinations of well-known techniques, ranked by how well they performed in our internal benchmark (STSB validation, Banking77; see BENCHMARKS.md):
- `prefix+whiten` (default) — feed the text in, take the chosen layer + pooling, then center & ZCA-whiten the resulting matrix. Whitening removes the anisotropy ("everything-looks-similar") problem that causal LMs suffer from: a universal +0.12 to +0.22 STS Spearman over no whitening.
- `echo+whiten` — duplicate the text and pool only over the second copy, so each pooled token has seen the full sentence (works around the causal mask). Best STS combination in our tests; see the sketch after this list.
- `prefix` — no transformation. For comparison / debugging.
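Roughly what the echo trick does, in plain `transformers`. This is an illustrative sketch, not this library's internals: the token boundary between the two copies is approximated, and the original Echo Embeddings paper wraps the text in a rewriting prompt, which is omitted here:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model.eval()

def echo_embed(text: str) -> torch.Tensor:
    # Approximate where the second copy starts by tokenizing the first alone.
    start = tok(text, return_tensors="pt")["input_ids"].shape[1]
    both = tok(text + " " + text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**both).last_hidden_state[0]  # (seq_len, hidden_dim)
    # Mean-pool only over the echoed copy: those positions have attended
    # to the entire sentence despite the causal mask.
    emb = hidden[start:].mean(dim=0)
    return emb / emb.norm()
```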
## Hardware notes

Causal LMs at fp32 are heavy on RAM. We default to bf16 if your CPU reports the `avx512_bf16` flag (recent AMD Zen 4+ desktop CPUs do; on Intel it is mostly server parts); on GPU, just pass `device="cuda"` and the same bf16 default applies.
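If you're not sure whether your CPU advertises that flag, a quick Linux-only check (outside this library):

```python
# Linux-only: CPU feature flags are listed in /proc/cpuinfo.
with open("/proc/cpuinfo") as f:
    print("avx512_bf16" in f.read())
```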
## Limitations

- For pure semantic textual similarity, dedicated contrastive encoders (BGE-M3, all-MiniLM-L6-v2) still win by ~6 STS points. This library is for the case where you already have a causal LM loaded.
- Whitening works best with a calibration set of at least a few hundred texts. Without one, self-whitening is used at encode time (fit on the batch you're encoding); that's worse than a good calibration set but still better than raw probes.
- Bidirectional inference (LLM2Vec-style attention-mask removal) is not bundled. We benchmarked it; it consistently hurt without fine-tuning, and we don't want to ship a footgun.
## Acknowledgements

The technique is a combination of BERT-whitening (Su et al., 2021), Echo Embeddings (Springer et al., 2024), and PromptEOL (Jiang et al., 2023). This library packages the train-free subset.
## License

MIT.