A Python library for computing semantic similarity scores using text embeddings

embedsim

Measure semantic similarity and detect outliers in text collections using embeddings.

embedsim is a lightweight Python library that helps you understand how well texts relate to each other. It provides two core functions: pairwise similarity for comparing two texts, and group coherence for analyzing collections.

Use cases:

  • Content moderation: Find off-topic comments or reviews
  • Document clustering: Identify outliers before grouping
  • Quality assurance: Verify generated content stays on topic
  • Search relevance: Score how well results match a query theme
  • Duplicate detection: Compare documents for similarity

Installation

For OpenAI models:

uv add embedsim[openai]
export OPENAI_API_KEY=your-key-here

For local models:

uv add embedsim[sentence-transformers]

Quick Start

Pairwise Similarity

Compare two texts directly:

import embedsim

# Similar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "A feline rested on the rug"
)
print(score)  # 0.89

# Dissimilar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "Python is a programming language"
)
print(score)  # 0.21

Group Coherence

Analyze a collection and find outliers:

import embedsim

texts = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning uses neural networks",
    "Pizza is a popular food"  # This doesn't belong
]

scores = embedsim.groupsim(texts)
# [0.76, 0.73, 0.71, 0.28]
#                    ~~~~ Outlier detected!

API Reference

pairsim(text_a, text_b, model_id=None, **config) → float

Compute similarity between two texts.

  • Converts both texts to embeddings
  • Computes cosine similarity
  • Returns a single similarity score (0-1, higher = more similar)
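The steps above reduce to a cosine similarity between two embedding vectors. A minimal sketch in plain Python (an illustrative reimplementation, not embedsim's actual code; the vectors stand in for whatever the embedding model returns):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the vectors divided by
    the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

In practice pairsim handles the embedding step for you; this sketch only shows the scoring math applied once the vectors exist.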

groupsim(texts, model_id=None, **config) → list[float]

Compute coherence scores for a collection of texts.

  • Converts all texts to embeddings
  • Calculates the centroid (average) of all embeddings
  • Measures how close each text is to the centroid
  • Returns coherence scores (0-1, higher = more coherent)

This centroid-based approach gives you a score per text showing how well it fits with the group's semantic theme.
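The centroid-based scoring can be sketched in plain Python (an illustrative reimplementation, not embedsim's internals; the toy 2-D vectors stand in for real embeddings):

```python
import math

def centroid_coherence(embeddings: list[list[float]]) -> list[float]:
    """Score each vector by its cosine similarity to the group centroid."""
    n, dim = len(embeddings), len(embeddings[0])
    # Centroid: per-dimension average of all vectors
    centroid = [sum(vec[i] for vec in embeddings) / n for i in range(dim)]

    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))

    return [cos(vec, centroid) for vec in embeddings]

# Two vectors point one way, one points another: the odd one out
# scores lowest against the centroid.
scores = centroid_coherence([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(scores)  # last score is the smallest
```

An outlier pulls the centroid only slightly (it is one vector among many), so it lands far from the centroid and receives a low score, which is what makes this approach useful for outlier detection.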

Configuration

Runtime Configuration

Modify the config object directly in your code:

import embedsim

# Change default model at runtime
embedsim.config.model = "jinaai/jina-embeddings-v2-base-en"

# Now all calls use the new default
score = embedsim.pairsim("hello", "hi")

Environment Variables

Alternatively, set configuration via environment variables:

# Set default model
export EMBEDSIM_MODEL=jinaai/jina-embeddings-v2-base-en

# Use custom OpenAI key
export EMBEDSIM_OPENAI_API_KEY=sk-...

Models

embedsim supports both OpenAI's API and local sentence-transformer models.

See MODELS.md for detailed model comparison and selection guide.

OpenAI (default, requires API key):

# Best for production - fast, accurate, no model downloads
score = embedsim.pairsim(text_a, text_b)  # uses openai/text-embedding-3-small
scores = embedsim.groupsim(texts, model_id="openai/text-embedding-3-large")

Local models (privacy, offline):

# Run entirely on your machine
score = embedsim.pairsim(text_a, text_b, model_id="jinaai/jina-embeddings-v2-base-en")
scores = embedsim.groupsim(texts, model_id="sentence-transformers/all-MiniLM-L6-v2")

Development

# Install with dev dependencies
uv sync --all-extras

# Run tests and benchmarks
make test

License

MIT
