
local-llm-embed

Get sentence embeddings out of any local causal LLM you already have running. No fine-tuning, no separate encoder model.

If you're running Llama / Qwen / Phi / Mistral / TinyLlama via transformers, Ollama, vLLM, or llama.cpp and want embeddings for RAG / retrieval / classification — this library extracts them from the model you already have, in a few lines of code.

Why

                                     dedicated encoder (BGE-M3, MiniLM)   this library
Separate ~500 MB model load needed   yes                                  no (reuses your LLM)
Fine-tuning needed                   already trained                      none
Works on any HF causal LM            n/a                                  yes
STS Spearman vs MiniLM-L6            0.867 (baseline)                     0.806 (Phi-3.5)
Banking77 accuracy vs MiniLM-L6      0.5500                               0.5540 (wins)

The trade-off is honest: dedicated encoders still beat raw LLM probes on pure semantic similarity (STS) by ~6 points. But on classification-style tasks like Banking77, this library matches or slightly beats the baseline — using a model you already have in memory.

Install

pip install local-llm-embed

Or with the Hugging Face Hub helpers (for downloading pre-fit whiteners):

pip install "local-llm-embed[hub]"

Quick start

from local_llm_embed import LocalLLMEmbedder

embedder = LocalLLMEmbedder("Qwen/Qwen2.5-0.5B-Instruct")
emb = embedder.encode(["The cat sat on the mat.", "A feline rested."])
print(emb.shape)  # (2, 896)
print(emb @ emb.T)  # cosine similarity matrix

By default this uses prefix+whiten, the variation that performed most consistently across our benchmarks. The whitener is fit lazily on the first batch you encode.

Use a calibration set for better whitening

calibration = [...]  # ~1000 representative texts from your domain
embedder.fit_whitener(calibration)
embedder.save_whitener("./domain_whiten.npz")

# later, in another session:
embedder = LocalLLMEmbedder("Qwen/Qwen2.5-0.5B-Instruct",
                             whitener_path="./domain_whiten.npz")

Pick a different variation

embedder = LocalLLMEmbedder(
    "microsoft/Phi-3.5-mini-instruct",
    variation="echo+whiten",   # best for STS
    layer="final",
    pooling="weighted_mean",
)

Variations

Three train-free recipes are bundled. They're combinations of well-known techniques, ranked by how well they performed in our internal benchmark (STSB validation, Banking77; see BENCHMARKS.md):

  • prefix+whiten (default) — feed the text in, take the chosen layer + pooling, then center & ZCA-whiten the resulting matrix (see the sketch after this list). Whitening removes the anisotropy ("everything-looks-similar") problem that causal LMs suffer from, giving a consistent +0.12 to +0.22 STS Spearman gain over no whitening.
  • echo+whiten — duplicate the text and pool only over the second copy, so each pooled token has seen the full sentence (works around the causal mask). Best STS combination in our tests.
  • prefix — no transformation. For comparison / debugging.
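
For reference, the center-and-ZCA-whiten step that both whitened variations share comes down to a few lines of numpy. This is a minimal sketch of the math, not the library's internal API (fit_zca and apply_zca are hypothetical names):

import numpy as np

def fit_zca(X, eps=1e-5):
    """Fit center + ZCA-whitening params on calibration embeddings X of shape (n, d)."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)          # (d, d) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric eigendecomposition
    # ZCA: rotate into the eigenbasis, rescale each axis to unit variance, rotate back
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mu, W

def apply_zca(X, mu, W):
    Z = (X - mu) @ W
    # unit-normalize so dot products behave as cosine similarity (as in the quick start)
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)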

Hardware notes

Causal LMs at fp32 are heavy on RAM. We default to bf16 when your CPU reports the avx512_bf16 flag (recent AMD Zen 4 / Zen 5 desktop CPUs and current Intel server parts do); on GPU, just pass device="cuda" and the same bf16 default applies.
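
A minimal example, assuming the constructor also accepts device="cpu" (only device="cuda" is shown above):

import torch
from local_llm_embed import LocalLLMEmbedder

embedder = LocalLLMEmbedder(
    "Qwen/Qwen2.5-0.5B-Instruct",
    device="cuda" if torch.cuda.is_available() else "cpu",  # "cpu" fallback is an assumption
)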

Limitations

  • For pure semantic textual similarity, dedicated contrastive encoders (BGE-M3, all-MiniLM-L6-v2) still win by ~6 STS points. This library is for the case where you already have a causal LM loaded.
  • Whitening works best with a calibration set of at least a few hundred texts. Without one, the whitener is fit on the batch you're encoding (self-whitening), which is worse than a good calibration set but still better than raw probes.
  • Bidirectional inference (LLM2Vec-style attention-mask removal) is not bundled. We benchmarked it; it consistently hurt without fine-tuning and we don't want to ship a footgun.

Acknowledgements

The technique is a combination of BERT-whitening (Su et al. 2021), Echo Embeddings (Springer et al. 2024), and PromptEOL (Jiang et al. 2023). This library packages the train-free subset.

License

MIT.
