Lightweight Python library for sentence/semantic embeddings — SentenceTransformers & OpenAI in one unified API.
sentence-embedder
A lightweight internal Python library for generating sentence/semantic embeddings with a clean, unified API.
Features
| Feature | Details |
|---|---|
| Backends | SentenceTransformers (local, offline) · OpenAI API |
| Single & batch | embed() and embed_batch() |
| Similarity | Cosine similarity between two sentences |
| Semantic search | most_similar() over an in-memory corpus |
| Disk cache | EmbeddingCache — skip re-embedding seen texts |
| Typed | Full type hints, compatible with mypy |
Installation
# Clone / copy this package into your project, then:
# SentenceTransformers backend (local, recommended)
pip install -e ".[st]"
# OpenAI backend
pip install -e ".[openai]"
# Both
pip install -e ".[all]"
# With dev tools (pytest, ruff, mypy)
pip install -e ".[all,dev]"
Quick Start
1 — Basic embedding
from sentence_embedder import SentenceEmbedder
embedder = SentenceEmbedder() # defaults: SentenceTransformers, all-MiniLM-L6-v2
vec = embedder.embed("The cat sat on the mat.")
print(vec.shape) # (384,)
2 — Batch embedding
texts = ["Hello world", "Machine learning is fun", "I love Python"]
vecs = embedder.embed_batch(texts)
print(vecs.shape) # (3, 384)
3 — Cosine similarity
score = embedder.similarity("fast car", "quick automobile")
print(score) # ~0.85
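For readers who want to see what the score measures, here is a minimal sketch of cosine similarity in plain Python. It is not the package's implementation: `cosine_similarity` is a hypothetical helper, the vectors are toy values rather than real embeddings, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, a value in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-D vectors standing in for real embeddings.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # -1
```

Only the direction of the vectors matters, not their length, which is why scaling a vector leaves the score unchanged.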
4 — Semantic search
corpus = [
    "The stock market crashed today.",
    "Scientists discover a new planet.",
    "Football team wins the championship.",
    "A new AI model beats human performance.",
]
results = embedder.most_similar("breakthrough in artificial intelligence", corpus, top_k=2)
for sentence, score in results:
    print(f"{score:.3f} {sentence}")
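The ranking step behind a search like this can be sketched as follows: score every corpus vector against the query vector, sort by score, and keep the top `top_k`. This is an illustration under assumptions, not the package's code. `rank_by_similarity` and `_cosine` are hypothetical names, and the 2-D vectors are made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    corpus: Sequence[str],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (sentence, score) pairs, best match first."""
    scored = [
        (sentence, _cosine(query_vec, vec))
        for sentence, vec in zip(corpus, corpus_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 2-D "embeddings" for three short documents.
docs = ["doc a", "doc b", "doc c"]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
top = rank_by_similarity([1.0, 0.0], doc_vecs, docs, top_k=2)
```

A full sort is fine for a small in-memory corpus. For large corpora, a partial selection or an approximate nearest-neighbour index is the usual next step.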
5 — Disk cache (avoid re-embedding)
from sentence_embedder import SentenceEmbedder, EmbeddingCache
base = SentenceEmbedder()
embedder = EmbeddingCache(base, cache_dir=".cache/embeddings")
vec = embedder.embed("Hello world") # computed → stored on disk
vec = embedder.embed("Hello world") # loaded from cache instantly
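One way a disk cache of this kind can work is to key each vector by a hash of its text and store it in a small SQLite file. The sketch below assumes that design; the real `EmbeddingCache` may store data differently. `TinyDiskCache` and `fake_embed` are hypothetical names used only for illustration.

```python
import hashlib
import json
import os
import sqlite3
import tempfile
from typing import Callable, List


class TinyDiskCache:
    """Illustrative cache wrapper; not the package's EmbeddingCache."""

    def __init__(self, embed_fn: Callable[[str], List[float]], db_path: str) -> None:
        self._embed_fn = embed_fn
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings "
            "(key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )
        self._conn.commit()

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so keys have a fixed length regardless of input size.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, text: str) -> List[float]:
        key = self._key(text)
        row = self._conn.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no embedding call
        vector = self._embed_fn(text)  # cache miss: compute and store
        self._conn.execute(
            "INSERT INTO embeddings (key, vector) VALUES (?, ?)",
            (key, json.dumps(vector)),
        )
        self._conn.commit()
        return vector

    def clear(self) -> None:
        self._conn.execute("DELETE FROM embeddings")
        self._conn.commit()


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real model; records each call it receives."""
    calls.append(text)
    return [float(len(text)), 1.0]


db_path = os.path.join(tempfile.mkdtemp(), "embeddings.sqlite")
cache = TinyDiskCache(fake_embed, db_path)
first = cache.embed("Hello world")   # computed and stored
second = cache.embed("Hello world")  # read back from SQLite
```

If several models share one cache file, the key would also need to include the model name, since the same text maps to different vectors under different models.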
OpenAI Backend
from sentence_embedder import SentenceEmbedder
embedder = SentenceEmbedder(
    backend="openai",
    model_name="text-embedding-3-small",
    openai_api_key="sk-...",  # or set OPENAI_API_KEY env var
)
vec = embedder.embed("Hello from OpenAI!")
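The comment above says the key may come from the `openai_api_key` argument or from the `OPENAI_API_KEY` environment variable. A minimal sketch of that lookup order might look like the following. `resolve_api_key` is a hypothetical helper, and giving the explicit argument precedence is an assumption rather than documented behaviour.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None,
                    env_var: str = "OPENAI_API_KEY") -> str:
    """Prefer an explicitly passed key, then fall back to the environment."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: pass one explicitly or set {env_var}."
        )
    return key


# An explicit key is used as-is (placeholder value, not a real key).
key = resolve_api_key("sk-example-placeholder")
```

Reading the key from the environment keeps it out of source control, which is generally the safer default.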
API Reference
SentenceEmbedder
| Method | Signature | Description |
|---|---|---|
| embed | (text: str) → ndarray | Embed one sentence → 1-D vector |
| embed_batch | (texts: List[str]) → ndarray | Embed many → 2-D array (N, dim) |
| similarity | (a: str, b: str) → float | Cosine similarity in [-1, 1] |
| most_similar | (query, corpus, top_k=5) → List[tuple] | Ranked (sentence, score) pairs |
EmbeddingCache
| Method | Signature | Description |
|---|---|---|
| embed | (text: str) → ndarray | Cache-aware single embed |
| embed_batch | (texts: List[str]) → ndarray | Cache-aware batch embed |
| clear | () → None | Wipe the cache database |
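A cache-aware batch embed typically sends only the texts it has not seen to the backend, then reassembles results in the original order. The sketch below shows that idea with a plain dict as the cache. `embed_batch_with_cache` and `fake_embed_batch` are hypothetical names, and the package's own logic may differ.

```python
from typing import Callable, Dict, List

Vector = List[float]


def embed_batch_with_cache(
    texts: List[str],
    cache: Dict[str, Vector],
    embed_batch_fn: Callable[[List[str]], List[Vector]],
) -> List[Vector]:
    """Embed only cache misses, then return vectors in input order."""
    # dict.fromkeys de-duplicates while preserving first-seen order.
    misses = list(dict.fromkeys(t for t in texts if t not in cache))
    if misses:
        for text, vector in zip(misses, embed_batch_fn(misses)):
            cache[text] = vector
    return [cache[t] for t in texts]


batches_seen: List[List[str]] = []


def fake_embed_batch(batch: List[str]) -> List[Vector]:
    """Stand-in backend that records which texts it was asked to embed."""
    batches_seen.append(list(batch))
    return [[float(len(t))] for t in batch]


cache: Dict[str, Vector] = {"Hello world": [11.0]}  # one text already cached
vectors = embed_batch_with_cache(
    ["Hello world", "I love Python", "Hello world", "I love Python"],
    cache,
    fake_embed_batch,
)
```

Here the backend receives a single batch containing only "I love Python", even though four texts were requested.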
Running Tests
pytest
# With coverage:
pytest --cov=sentence_embedder --cov-report=term-missing
Package Structure
sentence_embedder/
├── sentence_embedder/
│ ├── __init__.py # Public exports
│ ├── embedder.py # SentenceEmbedder (core)
│ └── cache.py # EmbeddingCache (disk cache)
├── tests/
│ └── test_embedder.py # Unit tests (no model/network needed)
├── pyproject.toml # Build config & dependencies
└── README.md
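The tree above describes the unit tests as needing no model or network. One common way to achieve that is to substitute a deterministic fake embedding function for the real backend. The sketch below assumes that approach; `fake_embed` and the test names are hypothetical and are not taken from `test_embedder.py`.

```python
import hashlib
from typing import List


def fake_embed(text: str, dim: int = 8) -> List[float]:
    """Deterministic stand-in for a model: derive values from a text hash.

    SHA-256 yields 32 bytes, so this sketch supports dim <= 32.
    """
    if not 0 < dim <= 32:
        raise ValueError("dim must be between 1 and 32")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def test_fake_embed_is_deterministic() -> None:
    assert fake_embed("Hello world") == fake_embed("Hello world")


def test_fake_embed_has_requested_dimension() -> None:
    assert len(fake_embed("Hello world", dim=8)) == 8
```

Such vectors carry no semantic meaning, so they suit tests of shapes, caching, and ranking mechanics rather than tests of embedding quality.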
Choosing a Model
| Model | Dim | Speed | Quality | Use case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ⚡⚡⚡ | ★★★ | Default, general purpose |
| all-mpnet-base-v2 | 768 | ⚡⚡ | ★★★★ | Higher quality |
| multi-qa-MiniLM-L6-cos-v1 | 384 | ⚡⚡⚡ | ★★★★ | Q&A / search |
| text-embedding-3-small (OpenAI) | 1536 | API | ★★★★★ | Best quality, needs key |
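The dimension column also determines how much memory a stored corpus needs. As a rough estimate, assuming vectors are kept as 32-bit floats (4 bytes per value, an assumption about storage rather than something this README states), the size is `num_texts × dim × 4` bytes. `index_size_mb` is a hypothetical helper for that arithmetic.

```python
def index_size_mb(num_texts: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate size in megabytes (10**6 bytes) of num_texts vectors."""
    return num_texts * dim * bytes_per_value / 1_000_000


# Dimensions taken from the model table; 100,000 texts is an arbitrary example.
model_dims = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "text-embedding-3-small": 1536,
}
sizes = {name: index_size_mb(100_000, dim) for name, dim in model_dims.items()}
```

For 100,000 texts this gives about 154 MB at 384 dimensions, 307 MB at 768, and 614 MB at 1536, so doubling the dimension doubles the storage cost.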
File details
Details for the file sentence_embedder-1.0.0.tar.gz.
File metadata
- Download URL: sentence_embedder-1.0.0.tar.gz
- Upload date:
- Size: 9.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 22c040f1811e9546a147e29fc4e9c38161ffe182a382252c7c0ec65796d2cbd7 |
| MD5 | 60686560780e8da2296c173386a43b4b |
| BLAKE2b-256 | bfa84b82b42a18d65f681d9d5aaadad288b1d4d530c1b60ac143c2508d19d9ac |
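A downloaded file can be checked against a published SHA256 digest with the standard library. The sketch below is illustrative: `sha256_of_file` is a hypothetical helper, and the demo hashes a temporary file rather than a real download.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hex SHA-256 digest of a file, read in chunks to limit memory use."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demo on a throwaway file; for a real check, pass the path of the
# downloaded archive and compare against the digest listed in the table.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as handle:
    handle.write(b"hello")
demo_digest = sha256_of_file(demo_path)
```

A mismatch between the computed and published digests means the file is corrupted or is not the one that was published.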
Provenance
The following attestation bundles were made for sentence_embedder-1.0.0.tar.gz:
Publisher: publish.yml on Vennilaganesan/document-embedding
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sentence_embedder-1.0.0.tar.gz
- Subject digest: 22c040f1811e9546a147e29fc4e9c38161ffe182a382252c7c0ec65796d2cbd7
- Sigstore transparency entry: 1157518495
- Sigstore integration time:
- Permalink: Vennilaganesan/document-embedding@b9172eb18b6127305961472385c544151689f11b
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/Vennilaganesan
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b9172eb18b6127305961472385c544151689f11b
- Trigger Event: push
File details
Details for the file sentence_embedder-1.0.0-py3-none-any.whl.
File metadata
- Download URL: sentence_embedder-1.0.0-py3-none-any.whl
- Upload date:
- Size: 7.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9587087c90be0e819d938f20a8e1a09ca6eb242d77fab21c0f30dbe5b339964c |
| MD5 | 81760e911afeff1961d9bc39a9e0e672 |
| BLAKE2b-256 | 460d299e0628157c288ca2c9dfb69ed52fe9cf977688ffa4f2a0a041b60b3179 |
Provenance
The following attestation bundles were made for sentence_embedder-1.0.0-py3-none-any.whl:
Publisher: publish.yml on Vennilaganesan/document-embedding
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sentence_embedder-1.0.0-py3-none-any.whl
- Subject digest: 9587087c90be0e819d938f20a8e1a09ca6eb242d77fab21c0f30dbe5b339964c
- Sigstore transparency entry: 1157518563
- Sigstore integration time:
- Permalink: Vennilaganesan/document-embedding@b9172eb18b6127305961472385c544151689f11b
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/Vennilaganesan
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b9172eb18b6127305961472385c544151689f11b
- Trigger Event: push