Blazing-fast similarity scores for strings, vectors, points, and sets.
simmetry
Similarity scores for strings, vectors, points, and sets with a small, NumPy-first API.
Install
pip install simmetry
pip install "simmetry[fast]"
- simmetry[fast]: enables optional Numba acceleration for pairwise(..., metric="euclidean_sim") and pairwise(..., metric="manhattan_sim")
- ANN extras:
pip install "simmetry[ann-hnsw]"
pip install "simmetry[ann-faiss]"
Project Status
- Current package: simmetry on PyPI
- Current version in this repo: 1.0.3
- Maturity: Alpha (API may change; pin exact/minor versions in production)
- Versioning: semantic versioning target, but pre-hardening changes may still occur in minor releases until 1.x stabilizes
Quickstart
One function
from simmetry import similarity
similarity("kitten", "sitting", metric="levenshtein")
similarity([1, 2, 3], [1, 2, 4], metric="cosine")
similarity((41.1, 29.0), (41.2, 29.1), metric="haversine_km")
similarity({1, 2, 3}, {2, 3, 4}, metric="jaccard")
haversine_km returns geographic distance in kilometers.
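For reference, the standard haversine formula can be sketched in plain Python as below. This is not simmetry's implementation: the function name, the Earth radius constant, and the assumption that points are (latitude, longitude) pairs in degrees are all illustrative choices not confirmed by this README.

```python
import math

# Assumed constant: mean Earth radius in kilometers.
EARTH_RADIUS_KM = 6371.0088


def haversine_km_sketch(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term: squared half-chord length between the two points.
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


d = haversine_km_sketch((41.1, 29.0), (41.2, 29.1))
```

Note that the result is a distance, so smaller values mean closer points.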
Pairwise matrices (vectors)
import numpy as np
from simmetry import pairwise
X = np.random.randn(1000, 128)
S = pairwise(X, metric="cosine")
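As a rough mental model of what a cosine pairwise matrix contains, the textbook computation in NumPy looks like the sketch below. The helper name cosine_matrix_sketch and the eps guard for zero-length rows are assumptions; simmetry's own code path may differ.

```python
import numpy as np


def cosine_matrix_sketch(X, eps=1e-12):
    """Return the (n, n) matrix of cosine similarities between rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows (assumed convention).
    Xn = X / np.maximum(norms, eps)
    return Xn @ Xn.T


rng = np.random.default_rng(0)
X_demo = rng.standard_normal((5, 8))
S_demo = cosine_matrix_sketch(X_demo)
```

The result is symmetric with ones on the diagonal, and its memory cost grows as n squared, which is worth keeping in mind for large inputs.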
Top-k search (exact)
import numpy as np
from simmetry import topk
X = np.random.randn(5000, 64)
q = np.random.randn(64)
idx, scores = topk(q, X, k=10, metric="cosine")
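Exact top-k search amounts to scoring the query against every row and keeping the k best. A minimal NumPy sketch follows, assuming cosine scoring; the function name topk_cosine_sketch is hypothetical and this is not necessarily how topk() is implemented.

```python
import numpy as np


def topk_cosine_sketch(q, X, k):
    """Return (indices, scores) of the k rows of X most similar to q."""
    X = np.asarray(X, dtype=float)
    q = np.asarray(q, dtype=float)
    # Cosine similarity of q against every row; small constant avoids 0/0.
    scores = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    k = min(k, scores.shape[0])
    # argpartition finds the k largest in O(n); only those k are then sorted.
    part = np.argpartition(-scores, k - 1)[:k]
    order = part[np.argsort(-scores[part])]
    return order, scores[order]


rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 16))
idx_demo, scores_demo = topk_cosine_sketch(X_demo[7], X_demo, k=5)
```

Because the query here is a row of the corpus, that row comes back first with a score of approximately 1.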
Available Metrics
from simmetry import available
available()
available("vector")
available("string")
available("point")
available("set")
Vectors
cosine, dot, euclidean_sim, manhattan_sim, pearson
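To illustrate how two of these metrics relate, Pearson correlation is the cosine similarity of mean-centered vectors. The sketch below uses the textbook definitions with hypothetical helper names; this README does not spell out simmetry's exact formulas (in particular, how euclidean_sim and manhattan_sim map a distance to a similarity), so those are left out.

```python
import numpy as np


def cosine_sketch(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def pearson_sketch(u, v):
    """Pearson correlation, computed as cosine of the centered vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return cosine_sketch(u - u.mean(), v - v.mean())


u_demo, v_demo = [1, 2, 3], [1, 2, 4]
r_demo = pearson_sketch(u_demo, v_demo)
```

Cosine is sensitive to a constant offset added to a vector, while Pearson is not, which is the usual reason to choose one over the other.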
Strings
- levenshtein (normalized similarity)
- jaro_winkler
- ngram_jaccard (character n-gram set Jaccard)
- token_jaccard (whitespace token set Jaccard)
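The sketch below shows one common way to turn these ideas into scores in [0, 1]. It rests on assumptions: the normalization 1 - distance / max(len) and the default n-gram size n=2 are illustrative choices, and the function names are hypothetical. The README does not state which normalization or n simmetry uses.

```python
def levenshtein_distance(a, b):
    """Edit distance (insert, delete, substitute) via a two-row DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def ngram_jaccard_sketch(a, b, n=2):
    """Jaccard similarity of the sets of character n-grams (n is assumed)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    if not union:
        return 1.0  # assumed convention when neither string has an n-gram
    return len(grams_a & grams_b) / len(union)


sim_demo = levenshtein_similarity("kitten", "sitting")
```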
Points / Geo
- euclidean_2d
- haversine_km
- pairwise_points
- topk_points
Sets
jaccard, dice, overlap
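The three set metrics differ only in how the intersection size is normalized. A minimal sketch using the standard definitions follows; the function names and the handling of empty sets are assumed conventions, not taken from simmetry.

```python
def jaccard_sketch(a, b):
    """|A & B| / |A | B|"""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice_sketch(a, b):
    """2 * |A & B| / (|A| + |B|)"""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def overlap_sketch(a, b):
    """|A & B| / min(|A|, |B|)"""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        # Assumed convention: identical when both are empty, else no overlap.
        return 1.0 if not a and not b else 0.0
    return len(a & b) / smaller


A, B = {1, 2, 3}, {2, 3, 4}
scores_demo = (jaccard_sketch(A, B), dice_sketch(A, B), overlap_sketch(A, B))
```

For these two sets the intersection has 2 elements and the union has 4, so Jaccard is 0.5 while Dice and overlap both come to 2/3.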
Auto Metric Selection (Deterministic)
Auto mode is not random and not learned. It applies fixed type-based rules.
from simmetry import infer_metric, similarity
infer_metric("samplecorp", "sample corp") # "jaro_winkler"
infer_metric((41.0, 29.0), (41.1, 29.1)) # "haversine_km"
infer_metric({1, 2, 3}, {2, 3, 4}) # "jaccard"
similarity("samplecorp", "sample corp") # uses inferred metric
Selection order:
- list[str]/tuple[str] (including empty lists) -> batch strings (jaro_winkler)
- str + str -> jaro_winkler
- 2-number tuples/lists -> haversine_km
- set/frozenset -> jaccard
- numeric vectors -> cosine
- fallback -> cosine
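The selection order above can be read as a chain of type checks evaluated top to bottom. The sketch below mirrors that order; infer_metric_sketch and its helpers are hypothetical names, and details such as whether one or both arguments must match a rule are guesses rather than simmetry's actual logic.

```python
from numbers import Real


def _is_number(x):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(x, Real) and not isinstance(x, bool)


def _is_string_batch(x):
    # all() is True for an empty sequence, so [] and () count as batches.
    return isinstance(x, (list, tuple)) and all(isinstance(s, str) for s in x)


def _is_point(x):
    return (
        isinstance(x, (list, tuple))
        and len(x) == 2
        and all(_is_number(v) for v in x)
    )


def infer_metric_sketch(a, b):
    """Return a metric name by applying the fixed rules in listed order."""
    if _is_string_batch(a) or _is_string_batch(b):  # assumed: either side
        return "jaro_winkler"
    if isinstance(a, str) and isinstance(b, str):
        return "jaro_winkler"
    if _is_point(a) and _is_point(b):
        return "haversine_km"
    if isinstance(a, (set, frozenset)) and isinstance(b, (set, frozenset)):
        return "jaccard"
    # Numeric vectors and the fallback both map to cosine.
    return "cosine"
```

One consequence of the order as listed: a two-element numeric vector matches the point rule before the vector rule, so passing metric explicitly is the safer choice for 2D embeddings.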
Batch String APIs
from simmetry.strings import pairwise_strings, topk_strings
S = pairwise_strings(
    ["item_one", "item_two"],
    ["item_one", "item_alt"],
    metric="jaro_winkler",
)
idx, scores = topk_strings(
    "samplecorp",
    ["samplecorp", "examplefinance", "testgroup"],
    k=2,
    metric="levenshtein",
)
Batch Point APIs (Geo / 2D)
from simmetry.points import pairwise_points, topk_points
pts = [(41.0, 29.0), (41.01, 29.01), (40.9, 28.9)]
S = pairwise_points(pts, metric="haversine_km")
idx, scores = topk_points((41.0, 29.0), pts, k=2, metric="haversine_km")
ANN Top-k (Optional)
For very large vector corpora (100k+), exact topk() can be slow.
hnswlib
import numpy as np
from simmetry.ann import build_hnsw
X = np.random.randn(200_000, 128).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)
index = build_hnsw(X, space="cosine")
labels, distances = index.query(X[0], k=10)
faiss
import numpy as np
from simmetry.ann import build_faiss
X = np.random.randn(200_000, 128).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)
index = build_faiss(X, metric="ip")
labels, scores = index.query(X[0], k=10)
SimIndex (Exact or ANN)
import numpy as np
from simmetry import SimIndex
X = np.random.randn(50_000, 128).astype("float32")
index = SimIndex(metric="cosine", backend="exact").add(X)
idx, scores = index.query(X[0], k=10)
Composite Records
from simmetry import similarity
a = {"name": "Entity One", "city": "CityAlpha", "loc": (41.0, 29.0)}
b = {"name": "Entity One Extended", "city": "CityAlpha", "loc": (41.01, 28.99)}
score = similarity(
    a,
    b,
    metric={"name": "jaro_winkler", "loc": "haversine_km"},
    weights={"name": 0.7, "loc": 0.3},
)
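One plausible way to combine per-field scores is a weighted average, sketched below. This is an assumption: the README does not say how simmetry aggregates fields or whether a distance such as haversine_km is first converted to a similarity. The function name and the per-field scores are made up for illustration and are assumed to already lie in [0, 1].

```python
def combine_scores_sketch(field_scores, weights):
    """Weighted average of per-field scores, normalized by total weight."""
    total = sum(weights[field] for field in field_scores)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(field_scores[field] * weights[field] for field in field_scores)
    return weighted / total


# Made-up per-field similarities, not output from simmetry.
field_scores_demo = {"name": 0.9, "loc": 0.8}
weights_demo = {"name": 0.7, "loc": 0.3}
combined_demo = combine_scores_sketch(field_scores_demo, weights_demo)
```

Dividing by the total weight keeps the result in the same range as the inputs even when the weights do not sum to 1.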
Benchmarks
The project includes a benchmark harness in bench/run.py. Comparative benchmarks against rapidfuzz, scikit-learn, and ANN libraries are not published yet.
Run locally:
python bench/run.py
Scope and Roadmap
Current focus is a compact core with predictable APIs and optional ANN.
Planned additions (not implemented yet):
- String metrics: Hamming, BM25-style text ranking helpers, string-level Sorensen-Dice variants
- Published comparative benchmarks (RapidFuzz / sklearn / faiss baselines)
- Hosted docs site
License
MIT