SIMI — a similarity and text-analysis engine: 8 algorithms plus intent-aware routing for matching, dedup, spam, and bot protection
SIMI — a Similarity & Text-Analysis Engine for Python
Python bindings for SIMI, a production-grade similarity and text-analysis toolkit powered by PyO3 — a Rust core with the ergonomics of a plain Python module. Use it to build and integrate reliable similarity checks across real workloads: bot/abuse protection, spam & content moderation, record matching, deduplication, search ranking, and fuzzy input handling.
- 8 battle-tested algorithms behind one clean API (edit distance, name matching, set overlap, document fingerprinting, probabilistic retrieval), with every score normalized to [0.0, 1.0].
- SimiFlow routing: tell it your intent (names, typos, codes, documents, dedup, auto) and it picks the right algorithm for you.
- Confidence cascade: resolve clear matches/mismatches with a cheap fast pass and escalate only the ambiguous middle to a heavier algorithm.
- Native speed — algorithm calls run at Rust speed with tiny FFI overhead.
import simi
sf = simi.SimiFlow()
# Declare what you're comparing; SIMI routes "names" to Jaro-Winkler and runs it natively.
sf.compare_with_intent("names", "MARTHA", "MARHTA")
# {'score': 0.961, 'tier': 0, 'algorithm': 'jaro_winkler', ...}
A note on origin. SIMI grew out of a need to cut the cost, latency, and unpredictability of using an LLM for every "are these the same?" decision. Most of those checks are deterministic and belong in fast, testable local code — which is exactly what SIMI provides.
Installation
pip install simi-flow
Requires Python 3.8 or later.
Algorithms
SIMI exposes every algorithm as a standalone function. All similarity
functions return a normalized score in [0.0, 1.0] where 1.0 = identical.
Levenshtein (edit distance)
import simi
# Raw distance
simi.levenshtein_distance("kitten", "sitting") # 3
# Normalized similarity
simi.levenshtein_similarity("kitten", "sitting") # 0.571
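For readers who want to see what sits behind these two numbers, here is a minimal pure-Python sketch of edit distance and a normalized score. It is not SIMI's Rust implementation; the function names are illustrative, and dividing by the longer string's length is an assumption that happens to reproduce the 0.571 shown above.

```python
def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))              # 3
print(round(edit_similarity("kitten", "sitting"), 3))  # 0.571
```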
Jaro-Winkler (names and short strings)
simi.jaro_winkler_similarity("MARTHA", "MARHTA") # 0.961
simi.jaro_winkler_similarity("DWAYNE", "DUANE") # 0.840
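The scores above follow from the textbook Jaro-Winkler definition. The sketch below is a plain-Python rendering of that definition, not SIMI's code; the function names and the conventional prefix scale of 0.1 with a 4-character prefix cap are assumptions.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity: shared characters within a sliding window, minus transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as half a transposition each.
    a_seq = [ch for ch, m in zip(a, a_matched) if m]
    b_seq = [ch for ch, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
print(round(jaro_winkler("DWAYNE", "DUANE"), 3))   # 0.84
```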
Hamming (equal-length codes)
Raises ValueError if the strings have different lengths.
simi.hamming_distance("karolin", "kathrin") # 3
simi.hamming_similarity("karolin", "kathrin") # 0.571
simi.hamming_similarity("hello", "hello") # 1.0
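As a rough illustration of the equal-length rule, a pure-Python version could look like the following. The function names are made up for this sketch, and normalizing by string length is an assumption consistent with the 0.571 above.

```python
def positional_distance(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def positional_similarity(a: str, b: str) -> float:
    """Assumed normalization: share of positions that agree."""
    if not a:
        return 1.0  # two empty strings are identical
    return 1.0 - positional_distance(a, b) / len(a)


print(positional_distance("karolin", "kathrin"))              # 3
print(round(positional_similarity("karolin", "kathrin"), 3))  # 0.571
```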
Jaccard (n-grams and word sets)
# Configurable n-gram size
simi.jaccard_similarity("hello", "hallo", n=2)
# Convenience functions
simi.jaccard_bigram_similarity("hello", "hallo")
simi.jaccard_trigram_similarity("hello", "hallo")
simi.jaccard_word_similarity("the quick brown fox", "the quick lazy dog")
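To make the set-overlap idea concrete, here is a small sketch of Jaccard over character n-grams and over words. It follows the textbook definition (intersection size divided by union size); SIMI's tokenization and edge-case handling may differ, and the helper names are illustrative only.

```python
def char_ngrams(text: str, n: int) -> set:
    """Set of overlapping character n-grams; inputs shorter than n yield one item."""
    if not text:
        return set()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(left: set, right: set) -> float:
    """Intersection over union; two empty sets are treated as identical."""
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


# Bigrams: {he, el, ll, lo} vs {ha, al, ll, lo} share 2 of 6 distinct items.
print(round(jaccard(char_ngrams("hello", 2), char_ngrams("hallo", 2)), 3))  # 0.333

# Word sets: "the" and "quick" are shared out of 6 distinct words.
a_words = set("the quick brown fox".split())
b_words = set("the quick lazy dog".split())
print(round(jaccard(a_words, b_words), 3))  # 0.333
```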
MinHash (document fingerprinting)
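# Get a 128-hash signature
sig = simi.minhash_signature("large document text...", shingle_size=3, num_hashes=128)
# Two documents to compare (placeholder texts)
a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
# Compare with custom parameters
simi.minhash_similarity(a, b, shingle_size=3, num_hashes=128)
# Compare with defaults (shingle=3, hashes=128)
simi.minhash_similarity_default(a, b)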
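If the shingle_size and num_hashes parameters are unfamiliar, the sketch below shows the general MinHash technique in pure Python. It is not SIMI's implementation: the choice of salted BLAKE2b as the hash family, character shingles, and all function names are assumptions made for illustration.

```python
import hashlib


def char_shingles(text: str, size: int) -> set:
    """Set of overlapping character windows; short inputs yield one shingle."""
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def minhash_sketch(text: str, shingle_size: int = 3, num_hashes: int = 128) -> list:
    """For each salted hash function, keep the smallest hash over all shingles."""
    items = [s.encode("utf-8") for s in char_shingles(text, shingle_size)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # each salt acts as a separate hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "little"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The share of agreeing slots estimates the Jaccard overlap of the shingle sets."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
print(estimated_jaccard(minhash_sketch(doc_a), minhash_sketch(doc_b)))
```

More hashes give a tighter estimate at the cost of a longer signature, which is the trade-off num_hashes controls.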
SimHash (64-bit LSH fingerprints)
# Get a 64-bit fingerprint
fp = simi.simhash_fingerprint("document text", shingle_size=4)
fp = simi.simhash_fingerprint_default("document text") # shingle_size=4
# Compare
simi.simhash_similarity(a, b, shingle_size=4)
simi.simhash_similarity_default(a, b)
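The following is a minimal sketch of how a 64-bit SimHash fingerprint is usually built and compared. The hash function (BLAKE2b), unweighted character shingles, and the function names are assumptions; SIMI's Rust core may make different choices.

```python
import hashlib


def simhash_sketch(text: str, shingle_size: int = 4) -> int:
    """Each shingle votes +1/-1 on every bit; the sign of each tally becomes that bit."""
    tallies = [0] * 64
    count = max(len(text) - shingle_size + 1, 1)
    for i in range(count):
        shingle = text[i:i + shingle_size].encode("utf-8")
        h = int.from_bytes(hashlib.blake2b(shingle, digest_size=8).digest(), "big")
        for bit in range(64):
            tallies[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, tally in enumerate(tallies):
        if tally > 0:
            fingerprint |= 1 << bit
    return fingerprint


def fingerprint_similarity(a: str, b: str, shingle_size: int = 4) -> float:
    """1 minus the fraction of differing bits between the two fingerprints."""
    differing = bin(simhash_sketch(a, shingle_size) ^ simhash_sketch(b, shingle_size)).count("1")
    return 1.0 - differing / 64


print(fingerprint_similarity("document text", "document text"))  # 1.0
```

Because similar inputs flip only a few bits, fingerprints can be compared with a single XOR and a bit count, which is what makes SimHash cheap for near-duplicate checks.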
BM25 (probabilistic retrieval)
simi.bm25_similarity("the quick brown fox", "the quick blue fox") # 0.5..0.8
simi.bm25_similarity("the quick brown fox", "the quick brown fox") # 1.0
TF-IDF + Cosine (term-weighted vectors)
simi.tfidf_similarity("the quick brown fox", "the quick blue fox") # 0.5..0.7
simi.tfidf_similarity("abc", "xyz") # 0.0
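For intuition about the TF-IDF scores, here is a small sketch that treats the two inputs as a two-document corpus and takes the cosine of their weighted term vectors. The whitespace tokenizer and the smoothed IDF formula (the one popularized by scikit-learn) are assumptions; SIMI's exact weighting is not documented here.

```python
import math
from collections import Counter


def tfidf_cosine(a: str, b: str) -> float:
    """Cosine similarity of TF-IDF vectors built from just these two texts."""
    docs = [a.lower().split(), b.lower().split()]
    vocabulary = set(docs[0]) | set(docs[1])
    n_docs = len(docs)
    # Smoothed IDF (assumed): terms present in both texts still keep a weight of 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in docs))) + 1.0
        for term in vocabulary
    }
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({term: counts[term] * idf[term] for term in counts})
    dot = sum(weight * vectors[1].get(term, 0.0) for term, weight in vectors[0].items())
    norm = (math.sqrt(sum(w * w for w in vectors[0].values()))
            * math.sqrt(sum(w * w for w in vectors[1].values())))
    return dot / norm if norm else 0.0


print(tfidf_cosine("the quick brown fox", "the quick blue fox"))
print(tfidf_cosine("abc", "xyz"))  # 0.0
```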
Preprocessing
Normalize text before comparison to reduce noise:
# Quick one-liner
simi.clean_text(" Hello World! ") # "hello world!"
simi.clean_text_stopwords("the quick brown fox") # "quick brown fox"
# Builder pattern
from simi import Preprocessor
pre = Preprocessor() \
.with_lowercase(True) \
.with_collapse_whitespace(True) \
.with_trim(True) \
.with_normalize_unicode(True) \
.with_remove_stopwords(True)
cleaned = pre.process("The Quick Brown Fox")
# "quick brown fox"
Available builder options:
- with_lowercase(bool)
- with_collapse_whitespace(bool)
- with_trim(bool)
- with_normalize_unicode(bool)
- with_remove_stopwords(bool)
- with_stopwords(list[str]) -- custom stopword list
- with_max_length(int)
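To show what these options plausibly do, here is a standalone sketch of the same pipeline in plain Python. It is not the Preprocessor class: the order of steps, the NFKC normalization form, and the tiny stopword list are all assumptions, and the function name is hypothetical.

```python
import re
import unicodedata

# Hypothetical stopword list; SIMI's built-in list is not shown in this README.
DEFAULT_STOPWORDS = frozenset({"a", "an", "and", "in", "of", "or", "the", "to"})


def preprocess(text, *, lowercase=True, collapse_whitespace=True, trim=True,
               normalize_unicode=True, remove_stopwords=False,
               stopwords=DEFAULT_STOPWORDS, max_length=None):
    """Apply each enabled normalization step in a fixed (assumed) order."""
    if normalize_unicode:
        text = unicodedata.normalize("NFKC", text)  # NFKC is an assumption
    if lowercase:
        text = text.lower()
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text)
    if trim:
        text = text.strip()
    if remove_stopwords:
        text = " ".join(w for w in text.split(" ") if w.lower() not in stopwords)
    if max_length is not None:
        text = text[:max_length]
    return text


print(preprocess(" Hello World! "))                               # hello world!
print(preprocess("The Quick Brown Fox", remove_stopwords=True))   # quick brown fox
```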
SimiFlow Router
The headline feature. Two ways to use it:
- Intent routing (compare_with_intent): say what you're comparing, get the right algorithm.
- Cascade (tier_1 → tier_2): answer confident cases with a cheap algorithm, escalate only the ambiguous middle to a heavier local pass. Inspect the result's tier field to see how often the expensive path runs, and route those few gray-zone cases to your own LLM call.
The router cascades through algorithms based on confidence thresholds, avoiding expensive computation until it is actually needed:
from simi import SimiFlow
sf = SimiFlow() \
.preprocess(True) \
.tier_1("jaro_winkler", "gt", 0.95, "lt", 0.10) \
.tier_2("bm25", "between", 0.60, 0.94)
result = sf.compare("MARTHA", "MARHTA")
# {
# "score": 0.961,
# "tier": 1,
# "algorithm": "jaro_winkler",
# "fallback_called": False,
# "fallback_data": None,
# }
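The control flow behind a cascade like the one configured above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not SimiFlow's implementation: the function, its parameters, and the needs_review key are hypothetical, and treating the Tier 2 "between" band as the gray zone to hand off is an assumption.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def cascade(a: str, b: str, tier_1: Scorer, tier_2: Scorer,
            accept_above: float = 0.95, reject_below: float = 0.10,
            gray_low: float = 0.60, gray_high: float = 0.94) -> dict:
    """Run the cheap scorer first; call the heavier one only for unclear scores."""
    score = tier_1(a, b)
    if score > accept_above or score < reject_below:
        # Clear match or clear mismatch: the expensive scorer never runs.
        return {"score": score, "tier": 1, "needs_review": False}
    score = tier_2(a, b)
    # Assumed semantics: a Tier 2 score inside the band is still ambiguous.
    return {"score": score, "tier": 2, "needs_review": gray_low <= score <= gray_high}


# Stand-in scorers; in practice these would be two similarity functions.
print(cascade("MARTHA", "MARHTA", lambda a, b: 0.961, lambda a, b: 0.0))
# {'score': 0.961, 'tier': 1, 'needs_review': False}
```

Counting how many results come back with tier 2, and how many of those are flagged, gives a direct measure of how often the expensive path and any downstream LLM call would run.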
Algorithm names for the router: "levenshtein", "jaro_winkler",
"hamming", "jaccard_bigram", "jaccard_trigram", "jaccard_word",
"minhash_default", "simhash_default", "bm25", "tfidf".
Threshold operators: "gt" (greater than), "lt" (less than),
"between" (inclusive range, for Tier 2).
compare_with_intent
Bypass tier configuration and run a specific algorithm by intent:
sf = simi.SimiFlow()
# Intent-based: Names -> Jaro-Winkler
result = sf.compare_with_intent("names", "MARTHA", "MARHTA")
# {'score': 0.961, 'tier': 0, 'algorithm': 'jaro_winkler', ...}
# Auto: inspects input lengths and picks automatically
result = sf.compare_with_intent("auto", a, b)
# All intents: names, typos, codes, documents, dedup/duplication, auto
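Conceptually, intent routing is a lookup from an intent label to an algorithm name. The sketch below illustrates that idea only. The names to Jaro-Winkler mapping comes from this README; every other mapping, the length cutoffs used for "auto", and the function name are guesses for illustration and may not match what SimiFlow does.

```python
# Only "names" -> "jaro_winkler" is stated in this README.
# All other entries are assumptions chosen to match the algorithm descriptions.
INTENT_TABLE = {
    "names": "jaro_winkler",
    "typos": "levenshtein",            # assumption
    "codes": "hamming",                # assumption
    "documents": "bm25",               # assumption
    "dedup": "minhash_default",        # assumption
    "duplication": "minhash_default",  # assumed alias of "dedup"
}


def pick_algorithm(intent: str, a: str, b: str) -> str:
    """Return a router algorithm name for the given intent (hypothetical helper)."""
    if intent != "auto":
        if intent not in INTENT_TABLE:
            raise ValueError(f"unknown intent: {intent!r}")
        return INTENT_TABLE[intent]
    # "auto" is described as inspecting input lengths; these cutoffs are invented.
    longest = max(len(a), len(b))
    if longest <= 32:
        return "jaro_winkler"
    if longest <= 256:
        return "levenshtein"
    return "minhash_default"


print(pick_algorithm("names", "MARTHA", "MARHTA"))  # jaro_winkler
```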
Performance
SIMI is built in Rust with PyO3, so algorithm calls run at native speed:
| Algorithm | Input | Time |
|---|---|---|
| Levenshtein | "kitten"/"sitting" | ~80 ns |
| Jaro-Winkler | "MARTHA"/"MARHTA" | ~200 ns |
| Hamming | 7-char equal | ~150 ns |
| Jaccard bigram | Short texts | ~1.7 us |
| MinHash (128) | Short doc | ~17 us |
| SimHash | Short doc | ~5 us |
| BM25 | Short docs | ~2.9 us |
| TF-IDF | Short texts | ~2.7 us |
These timings are from the Rust core. The Python binding adds a small FFI overhead per call (~50-200 ns).
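To check per-call cost on your own machine, a small timeit harness is usually enough. The helper below is a generic sketch with a made-up name; it uses the built-in len as a stand-in so it runs without SIMI installed, and you would pass a SIMI function and its arguments instead.

```python
import timeit


def per_call_ns(func, *args, number: int = 100_000, repeat: int = 5) -> float:
    """Best-of-N average time per call, in nanoseconds."""
    timer = timeit.Timer(lambda: func(*args))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


# Stand-in callable; swap in the function you want to measure.
print(f"{per_call_ns(len, 'kitten'):.0f} ns per call")
```

The lambda wrapper adds its own call overhead, so very small numbers measured this way are an upper bound rather than an exact figure.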
License
MIT -- see the LICENSE file.