simmetry

Similarity scores for strings, vectors, points, and sets with a small, NumPy-first API.

Install

pip install simmetry
pip install "simmetry[fast]"
  • simmetry[fast]: enables optional Numba acceleration for pairwise(..., metric="euclidean_sim") and pairwise(..., metric="manhattan_sim")
  • ANN extras:
    • pip install "simmetry[ann-hnsw]"
    • pip install "simmetry[ann-faiss]"

Project Status

  • Current package: simmetry on PyPI
  • Current version in this repo: 1.0.3
  • Maturity: Alpha (API may change; pin exact/minor versions in production)
  • Versioning: semantic versioning is the goal, but breaking changes may still land in minor releases until the 1.x API stabilizes

Quickstart

One function

from simmetry import similarity

similarity("kitten", "sitting", metric="levenshtein")
similarity([1, 2, 3], [1, 2, 4], metric="cosine")
similarity((41.1, 29.0), (41.2, 29.1), metric="haversine_km")
similarity({1, 2, 3}, {2, 3, 4}, metric="jaccard")

Unlike the other metrics shown, haversine_km returns a geographic distance in kilometers, so smaller values mean closer points.
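For reference, the great-circle distance behind a haversine metric can be sketched in plain Python. This is an illustration, not simmetry's implementation: the function name haversine_km_sketch, the (latitude, longitude) ordering in degrees, and the Earth radius constant are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km_sketch(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


d = haversine_km_sketch((41.1, 29.0), (41.2, 29.1))
```

Because the result is a distance, identical points give 0.0 and the value grows as the points move apart.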

Pairwise matrices (vectors)

import numpy as np
from simmetry import pairwise

X = np.random.randn(1000, 128)
S = pairwise(X, metric="cosine")

Top-k search (exact)

import numpy as np
from simmetry import topk

X = np.random.randn(5000, 64)
q = np.random.randn(64)
idx, scores = topk(q, X, k=10, metric="cosine")
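To make "exact" concrete: an exact top-k search scores the query against every row and keeps the k best. The sketch below does this in plain Python with cosine similarity. The names cosine and topk_sketch are hypothetical, and the library's NumPy-based implementation is likely quite different.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm (an assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def topk_sketch(q, X, k):
    """Score q against every row of X and return (indices, scores), best first."""
    scored = ((cosine(q, row), i) for i, row in enumerate(X))
    best = heapq.nlargest(k, scored)  # full scan, so the result is exact
    return [i for _, i in best], [s for s, _ in best]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
idx, scores = topk_sketch([1.0, 0.1], X, k=2)
```

The cost grows linearly with the number of rows, which is why approximate indexes become attractive for very large corpora.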

Available Metrics

from simmetry import available

available()
available("vector")
available("string")
available("point")
available("set")

Vectors

  • cosine, dot, euclidean_sim, manhattan_sim, pearson
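As a reminder of how two of these relate, Pearson correlation equals the cosine similarity of the mean-centered vectors. The sketch below shows that identity in plain Python. The function names are hypothetical, and returning 0.0 for a zero-norm input is an assumed convention, not documented simmetry behavior.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector has zero norm (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pearson_sketch(u, v):
    """Pearson correlation computed as the cosine of the mean-centered vectors."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])


r_up = pearson_sketch([1, 2, 3], [2, 4, 6])    # perfectly correlated
r_down = pearson_sketch([1, 2, 3], [3, 2, 1])  # perfectly anti-correlated
```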

Strings

  • levenshtein (normalized similarity)
  • jaro_winkler
  • ngram_jaccard (character n-gram set Jaccard)
  • token_jaccard (whitespace token set Jaccard)
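The sketch below illustrates what these string scores typically mean. It is not simmetry's code: normalizing Levenshtein by the longer string's length, the default n=2 for character n-grams, and scoring two empty inputs as 1.0 are all assumptions, and the function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Edit distance (insert, delete, substitute) via a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_sim(a, b):
    """Similarity in [0, 1]: 1 - distance / length of the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def jaccard(x, y):
    """Jaccard index of two sets; two empty sets score 1.0 (assumed convention)."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def ngram_jaccard(a, b, n=2):
    """Jaccard over the sets of character n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    return jaccard(grams_a, grams_b)


def token_jaccard(a, b):
    """Jaccard over the sets of whitespace-separated tokens."""
    return jaccard(set(a.split()), set(b.split()))


sim = levenshtein_sim("kitten", "sitting")
```

Under these assumptions, "kitten" and "sitting" are three edits apart and the longer string has seven characters, so the similarity is 1 - 3/7.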

Points / Geo

  • euclidean_2d
  • haversine_km
  • Batch helpers (functions, not metric names): pairwise_points, topk_points

Sets

  • jaccard, dice, overlap
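The three set scores differ only in how they normalize the intersection size. A minimal sketch using the standard textbook definitions follows. The handling of empty sets is an assumption and may not match what simmetry returns.

```python
def jaccard(a, b):
    """|A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0  # empty-set convention assumed


def dice(a, b):
    """2 * |A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0  # empty-set convention assumed


def overlap(a, b):
    """|A & B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0  # empty-set convention assumed


a, b = {1, 2, 3}, {2, 3, 4}
scores = (jaccard(a, b), dice(a, b), overlap(a, b))
```

For these two sets the intersection has two elements and the union has four, so Jaccard is 0.5 while Dice and overlap are both 2/3.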

Auto Metric Selection (Deterministic)

Auto mode is not random and not learned. It applies fixed type-based rules.

from simmetry import infer_metric, similarity

infer_metric("samplecorp", "sample corp")     # "jaro_winkler"
infer_metric((41.0, 29.0), (41.1, 29.1))      # "haversine_km"
infer_metric({1, 2, 3}, {2, 3, 4})            # "jaccard"

similarity("samplecorp", "sample corp")       # uses inferred metric

Selection order:

  1. list[str] / tuple[str] (including empty lists) -> batch strings (jaro_winkler)
  2. str + str -> jaro_winkler
  3. 2-number tuples/lists -> haversine_km
  4. set / frozenset -> jaccard
  5. numeric vectors -> cosine
  6. fallback -> cosine
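The selection order above can be read as a chain of type checks. The following sketch mirrors that order in plain Python. It is a hypothetical re-implementation for illustration (infer_metric_sketch and its helpers are not part of simmetry), and details such as excluding booleans from "numbers" are assumptions.

```python
from numbers import Real


def _is_str_batch(x):
    # Rule 1: list/tuple whose items are all strings; all() is True for empty input.
    return isinstance(x, (list, tuple)) and all(isinstance(s, str) for s in x)


def _is_point(x):
    # Rule 3: exactly two real numbers (bools excluded, an assumption).
    return (
        isinstance(x, (list, tuple))
        and len(x) == 2
        and all(isinstance(v, Real) and not isinstance(v, bool) for v in x)
    )


def infer_metric_sketch(a, b):
    """Return a metric name by applying the fixed rules in order."""
    if _is_str_batch(a) or _is_str_batch(b):
        return "jaro_winkler"  # rule 1: batch strings
    if isinstance(a, str) and isinstance(b, str):
        return "jaro_winkler"  # rule 2
    if _is_point(a) and _is_point(b):
        return "haversine_km"  # rule 3
    if isinstance(a, (set, frozenset)) and isinstance(b, (set, frozenset)):
        return "jaccard"       # rule 4
    return "cosine"            # rules 5 and 6: numeric vectors and fallback


chosen = infer_metric_sketch("samplecorp", "sample corp")
```

One consequence of rule-based ordering is that a pair of 2-element numeric vectors matches the point rule before the vector rule, so passing metric explicitly is the safer choice for such inputs.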

Batch String APIs

from simmetry.strings import pairwise_strings, topk_strings

S = pairwise_strings(
    ["item_one", "item_two"],
    ["item_one", "item_alt"],
    metric="jaro_winkler",
)
idx, scores = topk_strings(
    "samplecorp",
    ["samplecorp", "examplefinance", "testgroup"],
    k=2,
    metric="levenshtein",
)

Batch Point APIs (Geo / 2D)

from simmetry.points import pairwise_points, topk_points

pts = [(41.0, 29.0), (41.01, 29.01), (40.9, 28.9)]
S = pairwise_points(pts, metric="haversine_km")
idx, scores = topk_points((41.0, 29.0), pts, k=2, metric="haversine_km")

ANN Top-k (Optional)

For very large vector corpora (100k+ vectors), exact topk() can be slow. The optional ANN backends trade exact results for faster approximate search.

hnswlib

import numpy as np
from simmetry.ann import build_hnsw

X = np.random.randn(200_000, 128).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)

index = build_hnsw(X, space="cosine")
labels, distances = index.query(X[0], k=10)

faiss

import numpy as np
from simmetry.ann import build_faiss

X = np.random.randn(200_000, 128).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)

index = build_faiss(X, metric="ip")
labels, scores = index.query(X[0], k=10)
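Both ANN examples scale each row to unit length before building the index. The reason is that for unit vectors the inner product equals cosine similarity, which is what lets an inner-product index rank by cosine. A small plain-Python check of that identity (helper names are hypothetical):

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit Euclidean length (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


a, b = [3.0, 4.0], [4.0, 3.0]
ip_of_unit_vectors = dot(normalize(a), normalize(b))
cos_of_raw_vectors = cosine(a, b)
```

Both values agree up to floating-point error, so normalizing once at index time avoids recomputing norms for every query.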

SimIndex (Exact or ANN)

import numpy as np
from simmetry import SimIndex

X = np.random.randn(50_000, 128).astype("float32")
index = SimIndex(metric="cosine", backend="exact").add(X)
idx, scores = index.query(X[0], k=10)

Composite Records

from simmetry import similarity

a = {"name": "Entity One", "city": "CityAlpha", "loc": (41.0, 29.0)}
b = {"name": "Entity One Extended", "city": "CityAlpha", "loc": (41.01, 28.99)}

score = similarity(
    a,
    b,
    metric={"name": "jaro_winkler", "loc": "haversine_km"},
    weights={"name": 0.7, "loc": 0.3},
)
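One plausible reading of the weights argument is a weighted average of per-field scores. The sketch below shows only that combination step, under stated assumptions: per-field scores are already computed and lie in [0, 1], fields without a weight (such as "city" above) are ignored, and weights are renormalized by their sum. How simmetry turns a distance like haversine_km into a field score is not described here, so the input numbers are placeholders rather than real output.

```python
def combine_scores(field_scores, weights):
    """Weighted average of per-field scores, using only fields that have a weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * field_scores[name] for name, w in weights.items())
    return weighted / total_weight


# Placeholder per-field scores, chosen for illustration only.
field_scores = {"name": 0.9, "loc": 0.5}
score = combine_scores(field_scores, {"name": 0.7, "loc": 0.3})
```

Renormalizing by the weight sum keeps the combined score on the same scale as the inputs even when the weights do not add up to 1.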

Benchmarks

The project includes a benchmark harness in bench/run.py. Comparative benchmarks against rapidfuzz, scikit-learn, and ANN libraries are not published yet.

Run locally:

python bench/run.py

Scope and Roadmap

Current focus is a compact core with predictable APIs and optional ANN.

Planned additions (not implemented yet):

  • String metrics: Hamming, BM25-style text ranking helpers, string-level Sorensen-Dice variants
  • Published comparative benchmarks (RapidFuzz / sklearn / faiss baselines)
  • Hosted docs site

License

MIT

