SIMI — a similarity and text-analysis engine: 8 algorithms plus intent-aware routing for matching, dedup, spam, and bot protection

SIMI — a Similarity & Text-Analysis Engine for Python

Python bindings for SIMI, a production-grade similarity and text-analysis toolkit powered by PyO3 — a Rust core with the ergonomics of a plain Python module. Use it to build and integrate reliable similarity checks across real workloads: bot/abuse protection, spam & content moderation, record matching, deduplication, search ranking, and fuzzy input handling.

  • 8 battle-tested algorithms behind one clean API (edit distance, name matching, set overlap, document fingerprinting, probabilistic retrieval) — every score normalized to [0.0, 1.0].
  • SimiFlow routing — tell it your intent (names, typos, codes, documents, dedup, auto) and it picks the right algorithm for you.
  • Confidence cascade — resolve clear matches/mismatches with a cheap fast pass and escalate only the ambiguous middle to a heavier algorithm.
  • Native speed — algorithm calls run at Rust speed with tiny FFI overhead.
import simi

sf = simi.SimiFlow()
# Declare what you're comparing; SIMI routes "names" to Jaro-Winkler and runs it natively.
sf.compare_with_intent("names", "MARTHA", "MARHTA")
# {'score': 0.961, 'tier': 0, 'algorithm': 'jaro_winkler', ...}

A note on origin. SIMI grew out of a need to cut the cost, latency, and unpredictability of using an LLM for every "are these the same?" decision. Most of those checks are deterministic and belong in fast, testable local code — which is exactly what SIMI provides.

Installation

pip install simi-flow

Requires Python 3.8 or later.

Algorithms

SIMI exposes every algorithm as a standalone function. All similarity functions return a normalized score in [0.0, 1.0] where 1.0 = identical.

Levenshtein (edit distance)

import simi

# Raw distance
simi.levenshtein_distance("kitten", "sitting")  # 3

# Normalized similarity
simi.levenshtein_similarity("kitten", "sitting")  # 0.571
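
For reference, the two numbers above can be reproduced in pure Python. This is a minimal sketch, not SIMI's Rust implementation: the names edit_distance and edit_similarity are hypothetical, and normalizing by the length of the longer string is an assumption that happens to agree with the 0.571 shown above (1 - 3/7).

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))
print(round(edit_similarity("kitten", "sitting"), 3))
```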

Jaro-Winkler (names and short strings)

simi.jaro_winkler_similarity("MARTHA", "MARHTA")  # 0.961
simi.jaro_winkler_similarity("DWAYNE", "DUANE")   # 0.840
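
The scores above follow the textbook Jaro-Winkler definition, which can be sketched in a few lines. This is an illustrative pure-Python version, not SIMI's code: the function names are hypothetical, and the prefix scale of 0.1 with a maximum prefix of 4 are the conventional defaults, assumed here rather than taken from SIMI.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters, in order; out-of-order pairs count as half a transposition each.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(x != y for x, y in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, scale: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(s, t)
    prefix = 0
    for x, y in zip(s[:max_prefix], t[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * scale * (1.0 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))
print(round(jaro_winkler("DWAYNE", "DUANE"), 3))
```

The prefix boost is why Jaro-Winkler suits names: people rarely mistype the first few letters, so agreement there is weighted more heavily.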

Hamming (equal-length codes)

Raises ValueError if the strings have different lengths.

simi.hamming_distance("karolin", "kathrin")        # 3
simi.hamming_similarity("karolin", "kathrin")      # 0.571
simi.hamming_similarity("hello", "hello")          # 1.0
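
Hamming distance only counts position-by-position mismatches, which is why the inputs must have equal length. A minimal pure-Python sketch follows; the names are hypothetical, and dividing by the string length is an assumed normalization that matches the 0.571 above (1 - 3/7).

```python
def hamming_dist(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_sim(a: str, b: str) -> float:
    """Assumed normalization: share of positions that agree."""
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - hamming_dist(a, b) / len(a)


print(hamming_dist("karolin", "kathrin"))
print(round(hamming_sim("karolin", "kathrin"), 3))
```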

Jaccard (n-grams and word sets)

# Configurable n-gram size
simi.jaccard_similarity("hello", "hallo", n=2)

# Convenience functions
simi.jaccard_bigram_similarity("hello", "hallo")
simi.jaccard_trigram_similarity("hello", "hallo")
simi.jaccard_word_similarity("the quick brown fox", "the quick lazy dog")
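
Jaccard similarity is the size of the intersection divided by the size of the union of two sets; the variants above differ only in how the text is turned into a set. The sketch below is a plain-Python illustration with hypothetical helper names. SIMI's own tokenization (padding, case handling, treatment of very short strings) may differ, so its scores need not match these.

```python
def char_ngrams(text: str, n: int) -> set:
    """Set of overlapping character n-grams; short inputs become one gram."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(set_a: set, set_b: set) -> float:
    """|A intersect B| / |A union B|, with two empty sets treated as identical."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


bigram_score = jaccard(char_ngrams("hello", 2), char_ngrams("hallo", 2))
word_score = jaccard(set("the quick brown fox".split()),
                     set("the quick lazy dog".split()))
print(round(bigram_score, 3), round(word_score, 3))
```

In both cases two items are shared out of six distinct ones, so each score in this sketch is 2/6.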

MinHash (document fingerprinting)

# Get a 128-hash signature
sig = simi.minhash_signature("large document text...", shingle_size=3, num_hashes=128)

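# Compare with custom parameters
a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumped over the lazy dog"
simi.minhash_similarity(a, b, shingle_size=3, num_hashes=128)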

# Compare with defaults (shingle=3, hashes=128)
simi.minhash_similarity_default(a, b)
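
To show what a MinHash signature is, here is a small standard-library sketch. It is not SIMI's implementation: the names minhash_sketch and sketch_similarity are hypothetical, and the choice of salted BLAKE2b as the hash family and character shingles as the set elements are assumptions made for illustration.

```python
import hashlib


def shingle_set(text: str, k: int = 3) -> set:
    """Overlapping character shingles; inputs shorter than k become one shingle."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_sketch(text: str, shingle_size: int = 3, num_hashes: int = 128) -> list:
    """For each of num_hashes salted hash functions, keep the smallest hash value."""
    grams = [g.encode("utf-8") for g in shingle_set(text, shingle_size)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return signature


def sketch_similarity(a: str, b: str, shingle_size: int = 3, num_hashes: int = 128) -> float:
    """The share of agreeing signature slots estimates the Jaccard similarity."""
    sig_a = minhash_sketch(a, shingle_size, num_hashes)
    sig_b = minhash_sketch(b, shingle_size, num_hashes)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_hashes


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
print(sketch_similarity(doc_a, doc_b))
```

The practical benefit is that signatures have a fixed size, so they can be stored and compared without keeping the original documents around.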

SimHash (64-bit LSH fingerprints)

# Get a 64-bit fingerprint
fp = simi.simhash_fingerprint("document text", shingle_size=4)
fp = simi.simhash_fingerprint_default("document text")  # shingle_size=4

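# Compare two documents
a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumped over the lazy dog"
simi.simhash_similarity(a, b, shingle_size=4)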
simi.simhash_similarity_default(a, b)
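
The idea behind a 64-bit SimHash fingerprint can be sketched with the standard library. This is an illustration under stated assumptions, not SIMI's code: the function names are hypothetical, BLAKE2b stands in for whatever hash SIMI uses, every shingle gets equal weight, and the similarity is assumed to be 1 minus the fraction of differing bits.

```python
import hashlib


def simhash64(text: str, shingle_size: int = 4) -> int:
    """Each shingle votes +1 or -1 on every bit; the sign of the tally sets the bit."""
    tallies = [0] * 64
    for i in range(max(len(text) - shingle_size + 1, 1)):
        gram = text[i:i + shingle_size].encode("utf-8")
        h = int.from_bytes(hashlib.blake2b(gram, digest_size=8).digest(), "big")
        for bit in range(64):
            tallies[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, tally in enumerate(tallies):
        if tally > 0:
            fingerprint |= 1 << bit
    return fingerprint


def simhash_sim(a: str, b: str, shingle_size: int = 4) -> float:
    """Assumed normalization: 1 - (differing bits / 64)."""
    differing = bin(simhash64(a, shingle_size) ^ simhash64(b, shingle_size)).count("1")
    return 1.0 - differing / 64


print(hex(simhash64("document text")))
print(simhash_sim("document text", "document texts"))
```

Because similar documents share most shingles, most bit tallies keep their sign and the fingerprints differ in only a few bits.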

BM25 (probabilistic retrieval)

simi.bm25_similarity("the quick brown fox", "the quick blue fox")  # 0.5..0.8
simi.bm25_similarity("the quick brown fox", "the quick brown fox")  # 1.0

TF-IDF + Cosine (term-weighted vectors)

simi.tfidf_similarity("the quick brown fox", "the quick blue fox")  # 0.5..0.7
simi.tfidf_similarity("abc", "xyz")                                  # 0.0
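
As a rough illustration of term-weighted cosine similarity, the sketch below builds TF-IDF vectors from just the two input strings. The function name is hypothetical, and both the two-document corpus and the smoothed IDF formula are assumptions; SIMI may tokenize and weight terms differently, so treat the numbers as indicative only.

```python
import math
from collections import Counter


def tfidf_cosine(a: str, b: str) -> float:
    """Cosine similarity of TF-IDF vectors built from a two-document corpus."""
    docs = [Counter(a.lower().split()), Counter(b.lower().split())]
    vocab = set(docs[0]) | set(docs[1])
    if not vocab:
        return 1.0  # two empty texts are treated as identical
    # Smoothed IDF (assumed): terms found in both texts weigh less than unique ones.
    idf = {
        term: math.log((1 + len(docs)) / (1 + sum(term in d for d in docs))) + 1.0
        for term in vocab
    }
    vec_a = [docs[0][term] * idf[term] for term in vocab]  # Counter yields 0 if absent
    vec_b = [docs[1][term] * idf[term] for term in vocab]
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(x * x for x in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(vec_a, vec_b)) / (norm_a * norm_b)


print(round(tfidf_cosine("the quick brown fox", "the quick blue fox"), 3))
print(tfidf_cosine("abc", "xyz"))
```

Texts with no terms in common have orthogonal vectors, which is why the second call yields 0.0.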

Preprocessing

Normalize text before comparison to reduce noise:

# Quick one-liner
simi.clean_text("  Hello   World!  ")          # "hello world!"
simi.clean_text_stopwords("the quick brown fox")  # "quick brown fox"

# Builder pattern
from simi import Preprocessor

pre = Preprocessor() \
    .with_lowercase(True) \
    .with_collapse_whitespace(True) \
    .with_trim(True) \
    .with_normalize_unicode(True) \
    .with_remove_stopwords(True)

cleaned = pre.process("The Quick Brown Fox")
# "quick brown fox"

Available builder options:

  • with_lowercase(bool)
  • with_collapse_whitespace(bool)
  • with_trim(bool)
  • with_normalize_unicode(bool)
  • with_remove_stopwords(bool)
  • with_stopwords(list[str]) -- custom stopword list
  • with_max_length(int)
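
To make the options concrete, here is a rough pure-Python equivalent of such a pipeline. It is a sketch, not the Preprocessor's actual behavior: the function name, the tiny stopword list, the NFKC normalization form, the order of the steps, and truncation by character count are all assumptions.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list for the example.
DEMO_STOPWORDS = frozenset({"a", "an", "the", "of", "and", "or", "to", "in"})


def preprocess(text, *, lowercase=True, collapse_whitespace=True, trim=True,
               normalize_unicode=True, remove_stopwords=False,
               stopwords=DEMO_STOPWORDS, max_length=None):
    """Apply the enabled cleaning steps in a fixed (assumed) order."""
    if normalize_unicode:
        text = unicodedata.normalize("NFKC", text)  # e.g. full-width to ASCII letters
    if lowercase:
        text = text.lower()
    if remove_stopwords:
        # Split but keep the whitespace runs, so only the stopwords are dropped here.
        parts = re.split(r"(\s+)", text)
        text = "".join(p for p in parts if p.lower() not in stopwords)
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text)
    if trim:
        text = text.strip()
    if max_length is not None:
        text = text[:max_length]  # truncates by characters, not bytes
    return text


print(preprocess("  Hello   World!  "))
print(preprocess("The Quick Brown Fox", remove_stopwords=True))
```

Running stopword removal before whitespace collapsing means the gaps left by removed words are cleaned up by the later steps.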

SimiFlow Router

The headline feature. Two ways to use it:

  1. Intent routing (compare_with_intent) — say what you're comparing, get the right algorithm.
  2. Cascade (tier_1, then tier_2): answer confident cases with a cheap algorithm and escalate only the ambiguous middle to a heavier local pass. Inspect the tier field of the result to see how often the expensive path runs, and route the few remaining gray-zone cases to your own LLM call.

The router cascades through algorithms based on confidence thresholds, avoiding expensive computation until it is actually needed:

from simi import SimiFlow

sf = SimiFlow() \
    .preprocess(True) \
    .tier_1("jaro_winkler", "gt", 0.95, "lt", 0.10) \
    .tier_2("bm25", "between", 0.60, 0.94)

result = sf.compare("MARTHA", "MARHTA")
# {
#   "score": 0.961,
#   "tier": 1,
#   "algorithm": "jaro_winkler",
#   "fallback_called": False,
#   "fallback_data": None,
# }

Algorithm names for the router: "levenshtein", "jaro_winkler", "hamming", "jaccard_bigram", "jaccard_trigram", "jaccard_word", "minhash_default", "simhash_default", "bm25", "tfidf".

Threshold operators: "gt" (greater than), "lt" (less than), "between" (inclusive range, for Tier 2).
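
The control flow of a two-tier cascade can be sketched independently of SIMI. Everything in this example is hypothetical: the cascade function, its parameters, and the two stand-in scorers are illustrations only, and it does not model how SIMI interprets the Tier 2 "between" band or its fallback fields.

```python
from collections import Counter
from typing import Callable, Dict, Union

Scorer = Callable[[str, str], float]


def cascade(a: str, b: str, cheap: Scorer, heavy: Scorer,
            accept_above: float = 0.95,
            reject_below: float = 0.10) -> Dict[str, Union[float, int]]:
    """Answer with the cheap scorer when it is decisive, otherwise escalate."""
    score = cheap(a, b)
    if score > accept_above or score < reject_below:
        return {"score": score, "tier": 1}      # clear match or clear mismatch
    return {"score": heavy(a, b), "tier": 2}    # ambiguous middle


def positional_agreement(a: str, b: str) -> float:
    """Stand-in cheap scorer: share of positions holding the same character."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def word_overlap(a: str, b: str) -> float:
    """Stand-in heavier scorer: Jaccard similarity over word sets."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


pairs = [("MARTHA", "MARTHA"), ("MARTHA", "MARHTA"), ("abc", "xyz")]
tier_counts = Counter(
    cascade(a, b, positional_agreement, word_overlap)["tier"] for a, b in pairs
)
print(dict(tier_counts))
```

Counting tiers over a sample of real traffic is a quick way to see how much work the cheap pass absorbs before tuning the thresholds.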

compare_with_intent

Bypass tier configuration and run a specific algorithm by intent:

import simi

sf = simi.SimiFlow()

# Intent-based: Names -> Jaro-Winkler
result = sf.compare_with_intent("names", "MARTHA", "MARHTA")
# {'score': 0.961, 'tier': 0, 'algorithm': 'jaro_winkler', ...}

# Auto: inspects input lengths and picks automatically
a, b = "MARTHA", "MARHTA"
result = sf.compare_with_intent("auto", a, b)

# All intents: names, typos, codes, documents, dedup/duplication, auto
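
Conceptually, intent routing is a lookup from an intent label to an algorithm name. The sketch below is a guess at the shape of such a table, not SIMI's routing logic. Only the names to Jaro-Winkler mapping is stated in this README; every other mapping, the 32-character cutoff for "auto", and the function name resolve_algorithm are assumptions for illustration. The algorithm strings are taken from the router's list of names.

```python
# Only "names" -> "jaro_winkler" comes from the README; the rest is assumed.
INTENT_TO_ALGORITHM = {
    "names": "jaro_winkler",
    "typos": "levenshtein",          # assumption
    "codes": "hamming",              # assumption
    "documents": "minhash_default",  # assumption
    "dedup": "simhash_default",      # assumption
}
INTENT_ALIASES = {"duplication": "dedup"}
AUTO_SHORT_TEXT_LIMIT = 32  # hypothetical cutoff for the "auto" heuristic


def resolve_algorithm(intent: str, a: str, b: str) -> str:
    """Map an intent label (or 'auto') to an algorithm name."""
    intent = INTENT_ALIASES.get(intent, intent)
    if intent == "auto":
        # Assumed heuristic: short inputs look like names, long ones like documents.
        longest = max(len(a), len(b))
        return "jaro_winkler" if longest <= AUTO_SHORT_TEXT_LIMIT else "minhash_default"
    try:
        return INTENT_TO_ALGORITHM[intent]
    except KeyError:
        raise ValueError(f"unknown intent: {intent!r}") from None


print(resolve_algorithm("names", "MARTHA", "MARHTA"))
print(resolve_algorithm("auto", "x" * 500, "y" * 500))
```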

Performance

SIMI is built in Rust with PyO3, so algorithm calls run at native speed:

| Algorithm      | Input                | Time    |
|----------------|----------------------|---------|
| Levenshtein    | "kitten"/"sitting"   | ~80 ns  |
| Jaro-Winkler   | "MARTHA"/"MARHTA"    | ~200 ns |
| Hamming        | 7-char equal-length  | ~150 ns |
| Jaccard bigram | Short texts          | ~1.7 µs |
| MinHash (128)  | Short doc            | ~17 µs  |
| SimHash        | Short doc            | ~5 µs   |
| BM25           | Short docs           | ~2.9 µs |
| TF-IDF         | Short texts          | ~2.7 µs |

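These timings measure the Rust core only. The Python binding adds roughly 50-200 ns of FFI overhead per call, which is comparable to the core time of the fastest algorithms (Levenshtein, Hamming, Jaro-Winkler) and small next to the microsecond-scale ones.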
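
To see what a call costs from Python on your own machine, a small timeit harness is enough. The helper name per_call_ns is hypothetical, and the pure-Python Hamming function is only a stand-in so the example runs without SIMI installed.

```python
import timeit


def per_call_ns(func, *args, repeat: int = 5, number: int = 10_000) -> float:
    """Best-of-N average latency of func(*args), in nanoseconds per call."""
    timer = timeit.Timer(lambda: func(*args))
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number * 1e9


def py_hamming(a: str, b: str) -> int:
    """Stand-in workload: pure-Python Hamming distance."""
    return sum(x != y for x, y in zip(a, b))


latency = per_call_ns(py_hamming, "karolin", "kathrin")
print(f"{latency:.0f} ns per call")
```

With the package installed, passing simi.hamming_distance in place of py_hamming gives the end-to-end figure including FFI overhead. The lambda wrapper adds a little overhead of its own, so read the result as an upper bound.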

License

MIT -- see the LICENSE file.
