Lightweight Python library for sentence/semantic embeddings — SentenceTransformers & OpenAI in one unified API.

sentence-embedder

A lightweight internal Python library for generating sentence/semantic embeddings with a clean, unified API.


Features

Feature          Details
Backends         SentenceTransformers (local, offline) · OpenAI API
Single & batch   embed() and embed_batch()
Similarity       Cosine similarity between two sentences
Semantic search  most_similar() over an in-memory corpus
Disk cache       EmbeddingCache: skip re-embedding seen texts
Typed            Full type hints, compatible with mypy

Installation

# Clone / copy this package into your project, then:

# SentenceTransformers backend (local, recommended)
pip install -e ".[st]"

# OpenAI backend
pip install -e ".[openai]"

# Both
pip install -e ".[all]"

# With dev tools (pytest, ruff, mypy)
pip install -e ".[all,dev]"

Quick Start

1 — Basic embedding

from sentence_embedder import SentenceEmbedder

embedder = SentenceEmbedder()                 # defaults: SentenceTransformers, all-MiniLM-L6-v2

vec = embedder.embed("The cat sat on the mat.")
print(vec.shape)   # (384,)

2 — Batch embedding

texts = ["Hello world", "Machine learning is fun", "I love Python"]
vecs = embedder.embed_batch(texts)
print(vecs.shape)  # (3, 384)

3 — Cosine similarity

score = embedder.similarity("fast car", "quick automobile")
print(score)  # e.g. ~0.85; the exact value depends on the model
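
For readers who want to see what the score means, here is a minimal, dependency-free sketch of cosine similarity. It is not the library's implementation: the function name cosine_similarity and the toy 3-D vectors are assumptions chosen for illustration, and real embeddings would be NumPy arrays with hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-D vectors standing in for real embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1, which matches the [-1, 1] range given in the API reference.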

4 — Semantic search

corpus = [
    "The stock market crashed today.",
    "Scientists discover a new planet.",
    "Football team wins the championship.",
    "A new AI model beats human performance.",
]

results = embedder.most_similar("breakthrough in artificial intelligence", corpus, top_k=2)
for sentence, score in results:
    print(f"{score:.3f}  {sentence}")
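
A search like the one above can be thought of as "score every corpus vector against the query, sort, keep the best top_k". The sketch below illustrates that idea under stated assumptions: rank_by_similarity is a hypothetical helper, the 2-D vectors are hand-made rather than model output, and the library's own most_similar() may be implemented differently.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity for non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_by_similarity(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    corpus: Sequence[str],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (sentence, score) pairs, best match first."""
    scored = [(sentence, _cosine(query_vec, vec)) for sentence, vec in zip(corpus, corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hand-made 2-D vectors (hypothetical, not real embeddings).
toy_corpus = ["finance", "astronomy", "sports"]
toy_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_results = rank_by_similarity([0.0, 1.0], toy_vecs, toy_corpus, top_k=2)
```

With these toy inputs the query lines up exactly with "sports", partially with "astronomy", and not at all with "finance", so only the first two are returned.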

5 — Disk cache (avoid re-embedding)

from sentence_embedder import SentenceEmbedder, EmbeddingCache

base = SentenceEmbedder()
embedder = EmbeddingCache(base, cache_dir=".cache/embeddings")

vec = embedder.embed("Hello world")   # computed → stored on disk
vec = embedder.embed("Hello world")   # loaded from cache instantly
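
The compute-once, load-afterwards behaviour can be sketched with the standard library alone. This is a hypothetical stand-in, not the library's EmbeddingCache: the class name TinyDiskCache, the SQLite table layout, the JSON encoding, and the SHA-256 key scheme are all assumptions made for illustration.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


class TinyDiskCache:
    """Hypothetical cache wrapper; not the library's EmbeddingCache."""

    def __init__(self, embed_fn: Callable[[str], List[float]], db_path: str) -> None:
        self._embed_fn = embed_fn
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )
        self._db.commit()

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so keys have a fixed length regardless of input size.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, text: str) -> List[float]:
        key = self._key(text)
        row = self._db.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no call to embed_fn
        vector = list(self._embed_fn(text))  # cache miss: compute, then store
        self._db.execute(
            "INSERT INTO embeddings (key, vector) VALUES (?, ?)", (key, json.dumps(vector))
        )
        self._db.commit()
        return vector

    def clear(self) -> None:
        self._db.execute("DELETE FROM embeddings")
        self._db.commit()


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real model; records each call so hits can be observed."""
    calls.append(text)
    return [float(len(text)), 1.0]


# ":memory:" keeps the demo self-contained; pass a file path to persist to disk.
cache = TinyDiskCache(fake_embed, ":memory:")
first = cache.embed("Hello world")
second = cache.embed("Hello world")
```

After the two calls, fake_embed has run only once; the second result comes from the stored row.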

OpenAI Backend

from sentence_embedder import SentenceEmbedder

embedder = SentenceEmbedder(
    backend="openai",
    model_name="text-embedding-3-small",
    openai_api_key="sk-...",   # or set OPENAI_API_KEY env var
)

vec = embedder.embed("Hello from OpenAI!")
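
To avoid hard-coding the key, it can be resolved from the environment before constructing the embedder. The helper below is a hypothetical sketch: the name resolve_api_key and the "explicit argument wins over OPENAI_API_KEY" precedence are assumptions, not documented behaviour of this library.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer an explicit key, else fall back to the OPENAI_API_KEY variable."""
    env = os.environ if env is None else env
    key = explicit_key or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("No OpenAI API key: pass one explicitly or set OPENAI_API_KEY")
    return key


# Demo with plain dicts in place of the real environment.
from_env = resolve_api_key(env={"OPENAI_API_KEY": "key-from-env"})
explicit = resolve_api_key("key-from-arg", env={"OPENAI_API_KEY": "key-from-env"})
```

Failing early with a clear message is usually easier to debug than an authentication error from the first API call.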

API Reference

SentenceEmbedder

Method        Signature                                 Description
embed         (text: str) → ndarray                     Embed one sentence → 1-D vector
embed_batch   (texts: List[str]) → ndarray              Embed many → 2-D array (N, dim)
similarity    (a: str, b: str) → float                  Cosine similarity in [-1, 1]
most_similar  (query, corpus, top_k=5) → List[tuple]    Ranked (sentence, score) pairs

EmbeddingCache

Method        Signature                       Description
embed         (text: str) → ndarray           Cache-aware single embed
embed_batch   (texts: List[str]) → ndarray    Cache-aware batch embed
clear         () → None                       Wipe the cache database

Running Tests

pytest
# With coverage:
pytest --cov=sentence_embedder --cov-report=term-missing

Package Structure

sentence_embedder/
├── sentence_embedder/
│   ├── __init__.py       # Public exports
│   ├── embedder.py       # SentenceEmbedder (core)
│   └── cache.py          # EmbeddingCache (disk cache)
├── tests/
│   └── test_embedder.py  # Unit tests (no model/network needed)
├── pyproject.toml        # Build config & dependencies
└── README.md
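
The tree notes that the unit tests need no model or network. One common way to achieve that is a deterministic test double; the sketch below shows the idea. FakeEmbedder is hypothetical and is not taken from this package's tests/test_embedder.py. It also returns plain lists rather than NumPy arrays to stay dependency-free.

```python
import hashlib
from typing import List


class FakeEmbedder:
    """Hypothetical test double: deterministic vectors, no model or network."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def embed(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map the first `dim` bytes of the digest to floats in [0, 1].
        return [byte / 255.0 for byte in digest[: self.dim]]

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        return [self.embed(text) for text in texts]


fake = FakeEmbedder(dim=8)
vec = fake.embed("Hello world")
batch = fake.embed_batch(["Hello world", "I love Python"])
```

Because the same text always maps to the same vector, code that wraps an embedder (caching, ranking) can be tested quickly and repeatably.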

Choosing a Model

Model                            Dim   Speed  Quality  Use case
all-MiniLM-L6-v2                 384   ⚡⚡⚡    ★★★      Default, general purpose
all-mpnet-base-v2                768   ⚡⚡     ★★★★     Higher quality
multi-qa-MiniLM-L6-cos-v1        384   ⚡⚡⚡    ★★★★     Q&A / search
text-embedding-3-small (OpenAI)  1536  API    ★★★★★    Best quality, needs key
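
The Dim column also drives storage cost, which may matter when choosing a model for a large corpus. The sketch below is a rough estimate only: it assumes vectors are stored as 4-byte float32 values with no index overhead, and index_size_bytes is a hypothetical helper.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for num_vectors embeddings (assumes float32 by default)."""
    return num_vectors * dim * bytes_per_value


# Dimensions taken from the table above.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "multi-qa-MiniLM-L6-cos-v1": 384,
    "text-embedding-3-small": 1536,
}

# Estimated size in MiB for a corpus of 100,000 sentences.
sizes_mib = {
    name: index_size_bytes(100_000, dim) / 2**20 for name, dim in MODEL_DIMS.items()
}
```

Under these assumptions a 384-dimensional model needs about 146 MiB for 100,000 sentences, and a 1536-dimensional model needs four times that.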
