SIMI — a similarity and text-analysis engine: 8 algorithms plus intent-aware routing for matching, dedup, spam, and bot protection

SIMI — a Similarity & Text-Analysis Engine for Python

Python bindings for SIMI, a production-grade similarity and text-analysis toolkit: a Rust core exposed through PyO3, with the ergonomics of a plain Python module. Use it to build reliable similarity checks into real workloads: bot/abuse protection, spam and content moderation, record matching, deduplication, search ranking, and fuzzy input handling.

  • 8 battle-tested algorithms behind one clean API (edit distance, name matching, set overlap, document fingerprinting, probabilistic retrieval), with every score normalized to [0.0, 1.0].
  • SimiFlow routing: tell it your intent (names, typos, codes, documents, dedup, auto) and it picks the right algorithm for you.
  • Confidence cascade: resolve clear matches and mismatches with a cheap fast pass and escalate only the ambiguous middle to a heavier algorithm.
  • Native speed: algorithm calls run in the Rust core, with a small per-call FFI overhead (about 50-200 ns, as noted under Performance).

import simi

sf = simi.SimiFlow()
# Declare what you're comparing; SIMI routes "names" to Jaro-Winkler and runs it natively.
sf.compare_with_intent("names", "MARTHA", "MARHTA")
# {'score': 0.961, 'tier': 0, 'algorithm': 'jaro_winkler', ...}

A note on origin. SIMI grew out of a need to cut the cost, latency, and unpredictability of using an LLM for every "are these the same?" decision. Most of those checks are deterministic and belong in fast, testable local code — which is exactly what SIMI provides.

Installation

pip install simi-flow

Requires Python 3.8 or later.

Algorithms

SIMI exposes every algorithm as a standalone function. All similarity functions return a normalized score in [0.0, 1.0] where 1.0 = identical.

Levenshtein (edit distance)

import simi

# Raw distance
simi.levenshtein_distance("kitten", "sitting")  # 3

# Normalized similarity
simi.levenshtein_similarity("kitten", "sitting")  # 0.571
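
To make the two numbers above concrete, here is a minimal pure-Python sketch of edit distance and a normalized score. It is not SIMI's Rust implementation; the normalization (1 minus distance over the longer length) is an assumption that happens to reproduce the 0.571 shown above.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

With "kitten" and "sitting" the distance is 3 and the longer string has 7 characters, so the score is 1 - 3/7, about 0.571.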

Jaro-Winkler (names and short strings)

simi.jaro_winkler_similarity("MARTHA", "MARHTA")  # 0.961
simi.jaro_winkler_similarity("DWAYNE", "DUANE")   # 0.840
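
For readers who want to see where those scores come from, the following is a textbook Jaro-Winkler sketch in pure Python. It is an illustration only, not SIMI's code; the prefix scale of 0.1 and the 4-character prefix cap are the conventional defaults and are assumed here.

```python
def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity: matched characters within a window, minus transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as half-transpositions.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix of up to 4 characters."""
    jaro = jaro_similarity(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1.0 - jaro)
```

The prefix boost is why Jaro-Winkler suits names: "MARTHA" and "MARHTA" share "MAR", which lifts a Jaro score of about 0.944 to about 0.961.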

Hamming (equal-length codes)

Raises ValueError if the strings have different lengths.

simi.hamming_distance("karolin", "kathrin")        # 3
simi.hamming_similarity("karolin", "kathrin")      # 0.571
simi.hamming_similarity("hello", "hello")          # 1.0
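
The behavior described above, including the ValueError on mismatched lengths, can be sketched in a few lines of pure Python. This is an illustration rather than SIMI's implementation, and the normalization by string length is an assumption consistent with the 0.571 shown.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def hamming_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length."""
    if not a and not b:
        return 1.0  # two empty strings are identical
    return 1.0 - hamming_distance(a, b) / len(a)
```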

Jaccard (n-grams and word sets)

# Configurable n-gram size
simi.jaccard_similarity("hello", "hallo", n=2)

# Convenience functions
simi.jaccard_bigram_similarity("hello", "hallo")
simi.jaccard_trigram_similarity("hello", "hallo")
simi.jaccard_word_similarity("the quick brown fox", "the quick lazy dog")
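
As a rough picture of what these calls compute, the sketch below builds character n-gram sets and word sets and takes intersection over union. It assumes unpadded n-grams and whitespace tokenization; SIMI may tokenize differently, so its scores need not match these.

```python
def char_ngrams(text: str, n: int = 2) -> set:
    """All contiguous substrings of length n (no padding; an assumption)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(set_a: set, set_b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not set_a and not set_b:
        return 1.0  # treat two empty sets as identical
    return len(set_a & set_b) / len(set_a | set_b)


def jaccard_ngram_similarity(a: str, b: str, n: int = 2) -> float:
    return jaccard(char_ngrams(a, n), char_ngrams(b, n))


def jaccard_word_similarity(a: str, b: str) -> float:
    return jaccard(set(a.split()), set(b.split()))
```

Under these assumptions "hello" and "hallo" share the bigrams "ll" and "lo" out of six distinct bigrams, giving 2/6.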

MinHash (document fingerprinting)

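# Two short example documents to compare
a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"

# Get a signature made of 128 hash values
sig = simi.minhash_signature("large document text...", shingle_size=3, num_hashes=128)

# Compare with custom parameters
simi.minhash_similarity(a, b, shingle_size=3, num_hashes=128)

# Compare with defaults (shingle=3, hashes=128)
simi.minhash_similarity_default(a, b)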
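
The idea behind a MinHash signature can be sketched in pure Python: hash every shingle under many seeded hash functions, keep the minimum per seed, and estimate Jaccard similarity as the share of matching positions. The character shingles and the salted BLAKE2b hash below are assumptions chosen for a self-contained example; SIMI's hashing and shingling are not documented here and will differ.

```python
import hashlib


def shingles(text: str, size: int = 3) -> set:
    """Character shingles; a text shorter than `size` yields itself as one shingle."""
    return {text[i:i + size] for i in range(max(len(text) - size + 1, 1))}


def _seeded_hash(shingle: str, seed: int) -> int:
    # The salt turns one hash function into a family of independent-looking ones.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, shingle_size: int = 3, num_hashes: int = 128) -> list:
    """One minimum hash value per seed."""
    grams = shingles(text, shingle_size)
    return [min(_seeded_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def minhash_similarity(a: str, b: str, shingle_size: int = 3, num_hashes: int = 128) -> float:
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    sig_a = minhash_signature(a, shingle_size, num_hashes)
    sig_b = minhash_signature(b, shingle_size, num_hashes)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_hashes
```

More hash functions give a tighter estimate at a proportional cost, which is the trade-off behind the num_hashes parameter.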

SimHash (64-bit LSH fingerprints)

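# Two short example documents to compare
a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"

# Get a 64-bit fingerprint
fp = simi.simhash_fingerprint("document text", shingle_size=4)
fp = simi.simhash_fingerprint_default("document text")  # shingle_size=4

# Compare
simi.simhash_similarity(a, b, shingle_size=4)
simi.simhash_similarity_default(a, b)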
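
A 64-bit SimHash can be sketched as a per-bit vote across shingle hashes, with similarity taken from the Hamming distance between fingerprints. The sketch below is a generic illustration with assumed choices (character shingles, BLAKE2b as the feature hash, equal feature weights); it is not SIMI's implementation and will not produce the same fingerprints.

```python
import hashlib


def simhash_fingerprint(text: str, shingle_size: int = 4) -> int:
    """Build a 64-bit fingerprint by majority vote over shingle hash bits."""
    features = [
        text[i:i + shingle_size]
        for i in range(max(len(text) - shingle_size + 1, 1))
    ]
    votes = [0] * 64
    for feature in features:
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def simhash_similarity(a: str, b: str, shingle_size: int = 4) -> float:
    """1 - (number of differing bits / 64)."""
    differing = bin(
        simhash_fingerprint(a, shingle_size) ^ simhash_fingerprint(b, shingle_size)
    ).count("1")
    return 1.0 - differing / 64
```

Because similar documents share most shingles, most bit votes come out the same and the fingerprints differ in only a few bits.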

BM25 (probabilistic retrieval)

simi.bm25_similarity("the quick brown fox", "the quick blue fox")  # 0.5..0.8
simi.bm25_similarity("the quick brown fox", "the quick brown fox")  # 1.0

TF-IDF + Cosine (term-weighted vectors)

simi.tfidf_similarity("the quick brown fox", "the quick blue fox")  # 0.5..0.7
simi.tfidf_similarity("abc", "xyz")                                  # 0.0
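
To show how a term-weighted cosine arrives at 1.0 for identical texts and 0.0 for texts with no shared terms, here is a small pure-Python sketch. It treats the two inputs as the whole corpus and uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1; both choices are assumptions for illustration, and SIMI's weighting may differ.

```python
import math
from collections import Counter


def tfidf_cosine(a: str, b: str) -> float:
    """Cosine similarity of TF-IDF vectors built from a two-document corpus."""
    docs = [Counter(a.lower().split()), Counter(b.lower().split())]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)

    def weight(doc: Counter, term: str) -> float:
        df = sum(1 for d in docs if term in d)          # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0   # smoothed IDF (assumed form)
        return doc[term] * idf                          # Counter returns 0 if absent

    vec_a = [weight(docs[0], t) for t in vocab]
    vec_b = [weight(docs[1], t) for t in vocab]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0
```

Terms that appear in only one of the two texts get a higher weight, so a single differing word ("brown" versus "blue") pulls the score down more than a shared common word pulls it up.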

Preprocessing

Normalize text before comparison to reduce noise:

# Quick one-liner
simi.clean_text("  Hello   World!  ")          # "hello world!"
simi.clean_text_stopwords("the quick brown fox")  # "quick brown fox"

# Builder pattern
from simi import Preprocessor

pre = (
    Preprocessor()
    .with_lowercase(True)
    .with_collapse_whitespace(True)
    .with_trim(True)
    .with_normalize_unicode(True)
    .with_remove_stopwords(True)
)

cleaned = pre.process("The Quick Brown Fox")
# "quick brown fox"

Available builder options:

  • with_lowercase(bool)
  • with_collapse_whitespace(bool)
  • with_trim(bool)
  • with_normalize_unicode(bool)
  • with_remove_stopwords(bool)
  • with_stopwords(list[str]) -- custom stopword list
  • with_max_length(int)
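
The options above map naturally onto a small pipeline of string transforms. The function below is a pure-Python sketch of that idea, not the Preprocessor class itself: the step order, the NFKC normalization form, and the tiny stopword set are all assumptions made for the example.

```python
import re
import unicodedata

# Illustrative stopword set only; SIMI's built-in list is not shown in this document.
EXAMPLE_STOPWORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def preprocess(
    text: str,
    *,
    lowercase: bool = True,
    collapse_whitespace: bool = True,
    trim: bool = True,
    normalize_unicode: bool = True,
    stopwords=None,
    max_length=None,
) -> str:
    """Apply each enabled cleaning step in a fixed (assumed) order."""
    if normalize_unicode:
        text = unicodedata.normalize("NFKC", text)  # assumed normalization form
    if lowercase:
        text = text.lower()
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text)
    if trim:
        text = text.strip()
    if stopwords:
        text = " ".join(word for word in text.split(" ") if word not in stopwords)
    if max_length is not None:
        text = text[:max_length]
    return text
```

Running the same cleaning on both inputs before comparison keeps case, spacing, and filler words from lowering a score for reasons unrelated to content.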

SimiFlow Router

The headline feature. Two ways to use it:

  1. Intent routing (compare_with_intent): say what you're comparing, get the right algorithm.
  2. Cascade (tier_1 -> tier_2): answer confident cases with a cheap algorithm and escalate only the ambiguous middle to a heavier local pass. Inspect the result's tier to see how often the expensive path runs, and route those few gray-zone cases to your own LLM call.

The router cascades through algorithms based on confidence thresholds, avoiding expensive computation until it is actually needed:

from simi import SimiFlow

sf = (
    SimiFlow()
    .preprocess(True)
    .tier_1("jaro_winkler", "gt", 0.95, "lt", 0.10)
    .tier_2("bm25", "between", 0.60, 0.94)
)

result = sf.compare("MARTHA", "MARHTA")
# {
#   "score": 0.961,
#   "tier": 1,
#   "algorithm": "jaro_winkler",
#   "fallback_called": False,
#   "fallback_data": None,
# }
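
Since each result carries a tier, a batch of results can be summarized to see how often the heavier path runs. The helper below is a sketch that works on plain dictionaries shaped like the result above; the function names are hypothetical, and apart from the first entry the sample values are placeholders rather than real SIMI output.

```python
from collections import Counter


def tier_usage(results: list) -> dict:
    """Share of comparisons resolved at each tier, keyed by tier number."""
    counts = Counter(r["tier"] for r in results)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: n / total for tier, n in sorted(counts.items())}


def escalated(results: list, tier: int = 2) -> list:
    """Results that reached the given tier, e.g. candidates for a manual or LLM check."""
    return [r for r in results if r["tier"] == tier]


sample = [
    {"score": 0.961, "tier": 1, "algorithm": "jaro_winkler"},  # from the example above
    {"score": 0.99, "tier": 1, "algorithm": "jaro_winkler"},   # placeholder value
    {"score": 0.05, "tier": 1, "algorithm": "jaro_winkler"},   # placeholder value
    {"score": 0.72, "tier": 2, "algorithm": "bm25"},           # placeholder value
]
print(tier_usage(sample))
```

A rising tier-2 share is a signal that the tier-1 thresholds are too strict for the data being compared.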

Algorithm names for the router: "levenshtein", "jaro_winkler", "hamming", "jaccard_bigram", "jaccard_trigram", "jaccard_word", "minhash_default", "simhash_default", "bm25", "tfidf".

Threshold operators: "gt" (greater than), "lt" (less than), "between" (inclusive range, for Tier 2).
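
The cascade idea itself is independent of SIMI and can be sketched with stand-in scorers. In the example below, cheap_score (difflib's SequenceMatcher ratio) and token_overlap (word-set Jaccard) are substitutes chosen so the code runs with the standard library alone; they are not the algorithms SIMI uses. As a simplification, every pair that tier 1 cannot decide goes to tier 2, which may not match how SimiFlow treats scores outside a "between" range.

```python
from difflib import SequenceMatcher


def cheap_score(a: str, b: str) -> float:
    """Tier-1 stand-in: fast character-level ratio from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def token_overlap(a: str, b: str) -> float:
    """Tier-2 stand-in: Jaccard overlap of lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cascade(a: str, b: str, accept_above: float = 0.95, reject_below: float = 0.10) -> dict:
    """Decide at tier 1 when the cheap score is clear-cut; otherwise escalate."""
    score = cheap_score(a, b)
    if score > accept_above or score < reject_below:
        return {"score": score, "tier": 1, "algorithm": "cheap_score"}
    return {"score": token_overlap(a, b), "tier": 2, "algorithm": "token_overlap"}
```

Identical or wholly unrelated inputs stop at tier 1, while a pair such as "the quick brown fox" and "the quick blue fox" lands in the middle band and is rescored by the second function.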

compare_with_intent

Bypass tier configuration and run a specific algorithm by intent:

sf = simi.SimiFlow()

# Intent-based: Names -> Jaro-Winkler
result = sf.compare_with_intent("names", "MARTHA", "MARHTA")
# {'score': 0.961, 'tier': 0, 'algorithm': 'jaro_winkler', ...}

# Auto: inspects input lengths and picks automatically
a, b = "MARTHA", "MARHTA"
result = sf.compare_with_intent("auto", a, b)

# All intents: names, typos, codes, documents, dedup/duplication, auto
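
Intent routing amounts to a lookup from an intent name to an algorithm name. The table below is a sketch of that pattern only: the "names" to "jaro_winkler" route is confirmed by the example above, while every other route, the length cutoff for "auto", and the function name pick_algorithm are guesses made for illustration and should not be read as SIMI's actual mapping.

```python
# Only the "names" entry is confirmed by this document; the rest are assumed.
ASSUMED_ROUTES = {
    "names": "jaro_winkler",           # confirmed by the example above
    "typos": "levenshtein",            # assumption
    "codes": "hamming",                # assumption
    "documents": "bm25",               # assumption
    "dedup": "minhash_default",        # assumption
    "duplication": "minhash_default",  # assumption (alias of dedup)
}


def pick_algorithm(intent: str, a: str, b: str, short_limit: int = 32) -> str:
    """Return the router algorithm name for an intent (hypothetical rules)."""
    intent = intent.lower()
    if intent == "auto":
        # Hypothetical rule: short inputs are treated like names, long ones like documents.
        longest = max(len(a), len(b))
        return "jaro_winkler" if longest <= short_limit else "minhash_default"
    try:
        return ASSUMED_ROUTES[intent]
    except KeyError:
        raise ValueError(f"unknown intent: {intent!r}") from None
```

Check the algorithm field of the returned result to see which algorithm SIMI actually chose for a given intent.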

Performance

SIMI is built in Rust with PyO3, so algorithm calls run at native speed:

Algorithm        Input                  Time
Levenshtein      "kitten"/"sitting"     ~80 ns
Jaro-Winkler     "MARTHA"/"MARHTA"      ~200 ns
Hamming          7-char, equal length   ~150 ns
Jaccard bigram   Short texts            ~1.7 us
MinHash (128)    Short doc              ~17 us
SimHash          Short doc              ~5 us
BM25             Short docs             ~2.9 us
TF-IDF           Short texts            ~2.7 us

These timings are from the Rust core. The Python binding adds a small FFI overhead per call (~50-200 ns).
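
To check per-call cost from the Python side on your own machine, a small timeit helper is enough. The sketch below is generic: per_call_ns is a hypothetical helper name, and the built-in len is timed only so the example runs without SIMI installed. Pass any callable, such as a SIMI function, in its place.

```python
import timeit


def per_call_ns(func, *args, number: int = 100_000, repeat: int = 5) -> float:
    """Best-of-`repeat` average time per call, in nanoseconds."""
    timer = timeit.Timer(lambda: func(*args))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


# Stand-in measurement; replace `len` with the function you want to time.
print(f"{per_call_ns(len, 'kitten'):.0f} ns per call")
```

The lambda wrapper adds its own call overhead, so treat the result as an upper bound and compare functions against each other rather than against the Rust-side numbers in the table.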

License

MIT -- see the LICENSE file.

Download files

Source Distribution

simi_flow-0.1.1.tar.gz (45.9 kB): source

Built Distributions

simi_flow-0.1.1-cp312-cp312-win_amd64.whl (242.3 kB): CPython 3.12, Windows x86-64

simi_flow-0.1.1-cp312-cp312-manylinux_2_34_x86_64.whl (361.2 kB): CPython 3.12, manylinux (glibc 2.34+), x86-64

simi_flow-0.1.1-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (678.0 kB): CPython 3.12, macOS 10.12+ universal2 (ARM64, x86-64)

File details

Details for the file simi_flow-0.1.1.tar.gz.

File metadata

  • Download URL: simi_flow-0.1.1.tar.gz
  • Size: 45.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for simi_flow-0.1.1.tar.gz
Algorithm Hash digest
SHA256 56f1bb95184bb42897acd0d3544cd774afec0645b24e4732b88d49f679a2796d
MD5 1aa90aaf8e9135ef33d6909f0d62be05
BLAKE2b-256 7bd8daeb1cda1479838210f154037c9fc2c22876ee455e6ca091cf03623bf6d6

Provenance

The following attestation bundles were made for simi_flow-0.1.1.tar.gz:

Publisher: release.yml on siktec-lab/simi-flow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
