A Python library for computing semantic similarity scores using text embeddings

embedsim

Measure semantic similarity and detect outliers in text collections using embeddings.

embedsim is a lightweight Python library that helps you understand how well texts relate to each other. It provides two core functions: pairwise similarity for comparing two texts, and group coherence for analyzing collections.

Use cases:

  • Content moderation: Find off-topic comments or reviews
  • Document clustering: Identify outliers before grouping
  • Quality assurance: Verify generated content stays on topic
  • Search relevance: Score how well results match a query theme
  • Duplicate detection: Compare documents for similarity

Installation

For OpenAI models:

uv add embedsim[openai]
export OPENAI_API_KEY=your-key-here

For local models:

uv add embedsim[sentence-transformers]

Quick Start

Pairwise Similarity

Compare two texts directly:

import embedsim

# Similar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "A feline rested on the rug"
)
print(score)  # 0.89

# Dissimilar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "Python is a programming language"
)
print(score)  # 0.21

Group Coherence

Analyze a collection and find outliers:

import embedsim

texts = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning uses neural networks",
    "Pizza is a popular food"  # This doesn't belong
]

scores = embedsim.groupsim(texts)
# [0.76, 0.73, 0.71, 0.28]
#                    ~~~~ Outlier detected!

API Reference

pairsim(text_a, text_b, model_id=None, **config) → float

Compute similarity between two texts.

  • Converts both texts to embeddings
  • Computes cosine similarity
  • Returns a single similarity score (0-1, higher = more similar)
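The steps above reduce to a cosine similarity between two embedding vectors. A minimal sketch in plain Python (an illustrative reimplementation, not embedsim's actual code; the vectors stand in for whatever the embedding model returns):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the vectors divided by
    the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

In practice pairsim handles the embedding step for you; this sketch only shows the scoring math applied once the vectors exist.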

groupsim(texts, model_id=None, **config) → list[float]

Compute coherence scores for a collection of texts.

  • Converts all texts to embeddings
  • Calculates the centroid (average) of all embeddings
  • Measures how close each text is to the centroid
  • Returns coherence scores (0-1, higher = more coherent)

This centroid-based approach gives you a score per text showing how well it fits with the group's semantic theme.
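The centroid-based scoring can be sketched in plain Python (an illustrative reimplementation, not embedsim's internals; the toy 2-D vectors stand in for real embeddings):

```python
import math

def centroid_coherence(embeddings: list[list[float]]) -> list[float]:
    """Score each vector by its cosine similarity to the group centroid."""
    n, dim = len(embeddings), len(embeddings[0])
    # Centroid: per-dimension average of all vectors
    centroid = [sum(vec[i] for vec in embeddings) / n for i in range(dim)]

    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))

    return [cos(vec, centroid) for vec in embeddings]

# Two vectors point one way, one points another: the odd one out
# scores lowest against the centroid.
scores = centroid_coherence([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(scores)  # last score is the smallest
```

An outlier pulls the centroid only slightly (it is one vector among many), so it lands far from the centroid and receives a low score, which is what makes this approach useful for outlier detection.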

Configuration

Runtime Configuration

Modify the config object directly in your code:

import embedsim

# Change default model at runtime
embedsim.config.model = "jinaai/jina-embeddings-v2-base-en"

# Now all calls use the new default
score = embedsim.pairsim("hello", "hi")

Environment Variables

Alternatively, set configuration via environment variables:

# Set default model
export EMBEDSIM_MODEL=jinaai/jina-embeddings-v2-base-en

# Use custom OpenAI key
export EMBEDSIM_OPENAI_API_KEY=sk-...

Models

embedsim supports both OpenAI's API and local sentence-transformer models.

See MODELS.md for detailed model comparison and selection guide.

OpenAI (default, requires API key):

# Best for production - fast, accurate, no model downloads
score = embedsim.pairsim(text_a, text_b)  # uses openai/text-embedding-3-small
scores = embedsim.groupsim(texts, model_id="openai/text-embedding-3-large")

Local models (privacy, offline):

# Run entirely on your machine
score = embedsim.pairsim(text_a, text_b, model_id="jinaai/jina-embeddings-v2-base-en")
scores = embedsim.groupsim(texts, model_id="sentence-transformers/all-MiniLM-L6-v2")

Development

# Install with dev dependencies
uv sync --all-extras

# Run tests and benchmarks
make test

License

MIT
