A Python library for computing similarity scores from text embeddings
embedsim
Measure semantic similarity and detect outliers in text collections using embeddings.
embedsim is a lightweight Python library that helps you understand how well texts relate to each other. It provides two core functions: pairwise similarity for comparing two texts, and group coherence for analyzing collections.
Use cases:
- Content moderation: Find off-topic comments or reviews
- Document clustering: Identify outliers before grouping
- Quality assurance: Verify generated content stays on topic
- Search relevance: Score how well results match a query theme
- Duplicate detection: Compare documents for similarity
Installation
For OpenAI models:
uv add "embedsim[openai]"
export OPENAI_API_KEY=your-key-here
For local models:
uv add "embedsim[sentence-transformers]"
Quick Start
Pairwise Similarity
Compare two texts directly:
import embedsim
# Similar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "A feline rested on the rug",
)
print(score) # 0.89
# Dissimilar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "Python is a programming language",
)
print(score) # 0.21
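For the duplicate-detection use case, a pairwise scorer can be applied to every pair in a small collection. The following is a minimal sketch, not part of embedsim: `near_duplicates`, `word_overlap`, and the 0.85 threshold are hypothetical names and values chosen for illustration. A word-overlap scorer stands in for the embedding call so the example runs offline.

```python
from itertools import combinations

def near_duplicates(texts, similarity, threshold=0.85):
    """Return (i, j, score) for every pair scoring at or above threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

def word_overlap(a, b):
    """Stand-in scorer (Jaccard overlap of words); not an embedding model."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "python is a language",
]
matches = near_duplicates(docs, word_overlap)
```

Since `pairsim(text_a, text_b)` takes two texts and returns a float, it could presumably be passed as `similarity` in place of `word_overlap`. Note that this makes one scorer call per pair, so the cost grows quadratically with the number of texts.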
Group Coherence
Analyze a collection and find outliers:
import embedsim
texts = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning uses neural networks",
    "Pizza is a popular food",  # This doesn't belong
]
scores = embedsim.groupsim(texts)
# [0.76, 0.73, 0.71, 0.28]
#                    ~~~~ Outlier detected!
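Turning the scores into a list of outliers takes one more step: comparing each score to a cutoff. A minimal sketch follows, assuming a fixed threshold of 0.5. Both `flag_outliers` and the threshold are illustrative choices, not something embedsim provides, and a sensible cutoff will likely depend on the model and the data.

```python
def flag_outliers(texts, scores, threshold=0.5):
    """Return (text, score) pairs whose coherence score is below threshold."""
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    return [(text, score) for text, score in zip(texts, scores) if score < threshold]

texts = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning uses neural networks",
    "Pizza is a popular food",
]
# Scores copied from the example output above, so no model call is needed here
scores = [0.76, 0.73, 0.71, 0.28]

outliers = flag_outliers(texts, scores)
```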
API Reference
pairsim(text_a, text_b, model_id=None, **config) → float
Compute similarity between two texts.
- Converts both texts to embeddings
- Computes cosine similarity
- Returns a single similarity score (0-1, higher = more similar)
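The cosine step can be sketched in plain Python. This uses hand-written toy vectors in place of real embeddings. `cosine_similarity` is an illustrative helper, not embedsim's implementation, and the sketch does not show how the library arrives at the 0-1 range stated above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (assumed values, for illustration only)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing the same way score close to 1 regardless of their length, and perpendicular vectors score 0. This is why cosine similarity is commonly used to compare embeddings of texts with different lengths.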
groupsim(texts, model_id=None, **config) → list[float]
Compute coherence scores for a collection of texts.
- Converts all texts to embeddings
- Calculates the centroid (average) of all embeddings
- Measures how close each text is to the centroid
- Returns coherence scores (0-1, higher = more coherent)
This centroid-based approach gives you a score per text showing how well it fits with the group's semantic theme.
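The centroid steps listed above can be sketched as follows. This is an illustration under stated assumptions, not embedsim's actual code: `centroid_coherence` is a hypothetical name, the 2-dimensional vectors are made up, and cosine similarity is assumed as the measure of closeness to the centroid.

```python
import math

def centroid_coherence(embeddings):
    """Cosine similarity of each vector to the mean of all vectors."""
    if not embeddings:
        return []
    count = len(embeddings)
    dim = len(embeddings[0])
    # Centroid: component-wise average of all vectors
    centroid = [sum(vec[i] for vec in embeddings) / count for i in range(dim)]
    centroid_norm = math.sqrt(sum(c * c for c in centroid))
    if centroid_norm == 0.0:
        raise ValueError("centroid is the zero vector")
    scores = []
    for vec in embeddings:
        norm = math.sqrt(sum(v * v for v in vec))
        if norm == 0.0:
            raise ValueError("zero vector has no direction")
        dot = sum(v * c for v, c in zip(vec, centroid))
        scores.append(dot / (norm * centroid_norm))
    return scores

# Toy 2-dimensional "embeddings": three similar directions and one odd one out
toy = [
    [1.0, 0.1],
    [0.9, 0.2],
    [1.0, 0.0],
    [0.0, 1.0],  # points in a different direction
]
scores = centroid_coherence(toy)
lowest = scores.index(min(scores))
```

The outlier itself contributes to the centroid, so in very small groups a single odd text pulls the average toward itself and narrows the gap between its score and the others.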
Configuration
Runtime Configuration
Modify the config object directly in your code:
import embedsim
# Change default model at runtime
embedsim.config.model = "jinaai/jina-embeddings-v2-base-en"
# Now all calls use the new default
score = embedsim.pairsim("hello", "hi")
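Because the default is changed by plain attribute assignment, it stays changed until it is assigned again. If a temporary change is wanted, one option is a small context manager that restores the previous value. This is a sketch: `temporary_setting` is a hypothetical helper, and it assumes `embedsim.config.model` is an ordinary readable and writable attribute, as the assignment above suggests. A `SimpleNamespace` stands in for `embedsim.config` so the example runs without the library installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def temporary_setting(config, name, value):
    """Set config.<name> to value inside a with-block, then restore the old value."""
    previous = getattr(config, name)
    setattr(config, name, value)
    try:
        yield config
    finally:
        # Restore even if the body of the with-block raises
        setattr(config, name, previous)

# Stand-in for embedsim.config (assumption: it behaves like a plain object)
config = SimpleNamespace(model="openai/text-embedding-3-small")

with temporary_setting(config, "model", "jinaai/jina-embeddings-v2-base-en"):
    inside = config.model
after = config.model
```

For a single call, passing `model_id` directly to `pairsim` or `groupsim` is the simpler route.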
Environment Variables
Alternatively, set configuration via environment variables:
# Set default model
export EMBEDSIM_MODEL=jinaai/jina-embeddings-v2-base-en
# Use custom OpenAI key
export EMBEDSIM_OPENAI_API_KEY=sk-...
Models
embedsim supports both OpenAI's API and local sentence-transformer models.
See MODELS.md for detailed model comparison and selection guide.
OpenAI (default, requires API key):
# Best for production - fast, accurate, no model downloads
score = embedsim.pairsim(text_a, text_b) # uses openai/text-embedding-3-small
scores = embedsim.groupsim(texts, model_id="openai/text-embedding-3-large")
Local models (privacy, offline):
# Run entirely on your machine
score = embedsim.pairsim(text_a, text_b, model_id="jinaai/jina-embeddings-v2-base-en")
scores = embedsim.groupsim(texts, model_id="sentence-transformers/all-MiniLM-L6-v2")
Development
# Install with dev dependencies
uv sync --all-extras
# Run tests and benchmarks
make test
License
MIT
Links
- Model comparison - Detailed guide to choosing the right embedding model
Download files
File details
Details for the file embedsim-0.1.1.tar.gz.
File metadata
- Download URL: embedsim-0.1.1.tar.gz
- Upload date:
- Size: 9.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e6849065fad3e644c4d1962b71f725a562a7cec1395582e317caad1e9138e1b4 |
| MD5 | 19231cf255946c93776a1c11e2c0cba1 |
| BLAKE2b-256 | 058493bf9c6986891bf76bfaf9739c8cb6108f7e81ca97d959385e3ebf227acd |
File details
Details for the file embedsim-0.1.1-py3-none-any.whl.
File metadata
- Download URL: embedsim-0.1.1-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7eb429d7a0663d998481d30f790c260b1c54ccc725d41ebc997a2d9aca6f9a12 |
| MD5 | e7f4d981a3710c238a7e7a696179a3a7 |
| BLAKE2b-256 | de67157c07050d59ededf42d5027309282e240a8d30e22acac86cb50a9cdf3d3 |