idri — deterministic approximate vectorization for short identifiers and labels

idri is a tiny low-level Python vectorization engine for short, noisy identifiers and labels.

It is designed for reproducible approximate matching over short strings, and for composing multiple identifier signals into a single vector before retrieval.

What It Is Good At

  • Short business identifiers
  • Names and tags
  • Prefix/suffix-heavy codes
  • OCR-truncated fields
  • Partial multiword labels

Examples:

  • ACME INDUSTRIAL SA DE CV vs ACME INDUSTRIAL
  • INV-2024-001-03 vs INV-2024-001
  • TOPER PULIDORA 4 1/2 vs TOPER PULIDORA 4 1 2

What It Is Not For

  • Semantic similarity
  • Long natural-language embeddings
  • Full entity resolution workflows
  • Business-rule matching engines

Determinism and Distributed Use

idri is deterministic by construction: given the same normalized input, profile, family, dimension, seed, and library version, it should produce the same vector.

That makes it distributed-friendly for systems that need reproducible encoding across workers, services, or edge nodes. This is a reproducibility property, not yet a fully validated distributed-systems guarantee.
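One common way to get this kind of cross-process reproducibility is to derive feature buckets from a keyed hash in hashlib rather than from Python's built-in hash(), which is salted per process for strings. The sketch below illustrates that idea only; it is not idri's implementation, and stable_bucket, the seed-to-key scheme, and the dimension are hypothetical.

```python
import hashlib

def stable_bucket(feature: str, dim: int = 2048, seed: int = 0) -> int:
    """Map a feature string to a bucket index that does not depend on the process."""
    # Hypothetical scheme: the seed becomes the blake2b key.
    key = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little") % dim

a = stable_bucket("inv-2024-001")
b = stable_bucket("inv-2024-001")
print(a == b)  # the same feature and seed always land in the same bucket
```

Because nothing here depends on process state, two workers that agree on the seed and dimension agree on every bucket.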

Use Cases

idri is strongest when the problem is short-string matching, candidate generation, clustering, or blocking rather than semantic understanding.

  • E-commerce and retail catalog deduplication for messy product titles, vendor labels, and SKU-like strings
  • Procurement and ERP item crosswalks across supplier catalogs and internal part masters
  • Invoice and payment-reference matching for noisy parent/child identifiers
  • Marketplace or PIM listing clustering before rules or human review
  • Media tag and label search over large collections
  • CRM and analytics normalization for campaign names, UTM values, and source labels
  • Fraud and payments clustering for noisy merchant descriptors
  • IoT and edge-device matching where local deterministic encoding matters more than a cloud model

Why This Exists Instead of Just Fuzzy Matching or TF-IDF

Compared with classic fuzzy matching:

  • vectors can be precomputed and indexed once instead of rescoring every candidate pair
  • weighted composition lets you match multi-field records in one ANN search instead of several rule stages
  • word plus character n-grams preserve robustness to OCR truncation, token reordering, and small textual variation
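To see why character n-grams help with truncated fields, the following sketch compares feature overlap with and without them. The feature scheme (trigrams, "^" and "$" padding, "w:" and "c:" tags) is an assumption for illustration, not idri's actual feature set.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams over the whole string, with boundary markers."""
    padded = f"^{text}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def features(text: str, use_ngrams: bool = True) -> set:
    """Word tokens, optionally joined by character trigrams."""
    norm = " ".join(text.lower().split())
    feats = {f"w:{word}" for word in norm.split(" ")}
    if use_ngrams:
        feats |= {f"c:{gram}" for gram in char_ngrams(norm)}
    return feats

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

full, cut = "ACME INDUSTRIAL SA DE CV", "ACME INDUSTR"  # second value is truncated
words_only = jaccard(features(full, False), features(cut, False))
with_ngrams = jaccard(features(full), features(cut))
print(words_only < with_ngrams)
```

With words alone, the truncated token "industr" shares nothing with "industrial"; the trigrams recover most of that overlap.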

Compared with TF-IDF:

  • no vocabulary fitting at runtime
  • no corpus dependency
  • same output dimension always
  • easier ANN integration
  • easier deployment
  • simpler persistence
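The points above follow from hashing features straight into a fixed number of buckets instead of looking them up in a fitted vocabulary. A minimal sketch of signed feature hashing, assuming a blake2b-based bucket and sign choice (hashed_vector and the 64-dimension default are illustrative, not idri's internals):

```python
import hashlib
import math

def hashed_vector(features: list, dim: int = 64) -> list:
    """Signed feature hashing into a fixed-size, L2-normalized vector."""
    vec = [0.0] * dim
    for feat in features:
        digest = hashlib.blake2b(feat.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "little")
        index = (value >> 1) % dim           # bucket from all but the lowest bit
        sign = 1.0 if value & 1 else -1.0    # lowest bit picks the sign
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

seen = hashed_vector(["w:acme", "w:industrial"])
unseen = hashed_vector(["w:zzz-never-seen-before"])
print(len(seen), len(unseen))
```

A token that no corpus has ever contained still maps to a valid bucket, so there is nothing to fit, persist, or refit.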

Additional practical advantages:

  • deterministic output for the same input under the same encoder settings
  • can be used online with zero fitting
  • easy to port to Rust later
  • compact fixed-size vectors
  • supports lightweight family-aware encoding
  • supports weighted composition of multiple identifier signals into one query vector

Complex Matching in One Pass

Weighted composition is a practical advantage, not just an API feature.

Example:

  • specific invoice id: x2437-1
  • parent invoice family: x2437

You can encode both and compose them into one retrieval vector:

from idri import IdentifierEncoder, compose_vectors

enc = IdentifierEncoder()

specific = enc.encode("x2437-1", family="invoice")
parent = enc.encode("x2437", family="invoice")

query = compose_vectors([
    (specific, 0.7),
    (parent, 0.3),
])

That lets a downstream ANN index perform one search over a vector that captures both the specific invoice and its family context.
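The README does not spell out the arithmetic behind compose_vectors. A natural reading is a weighted sum followed by L2 normalization, and the sketch below works under that assumption with plain lists and toy three-dimensional vectors standing in for real encodings.

```python
import math

def unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def compose(weighted):
    """Assumed behavior: weighted sum of equal-length vectors, then renormalize."""
    total = [0.0] * len(weighted[0][0])
    for vec, weight in weighted:
        for i, x in enumerate(vec):
            total[i] += weight * x
    return unit(total)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

specific = unit([1.0, 1.0, 0.0])  # stands in for the "x2437-1" vector
parent = unit([1.0, 0.0, 0.0])    # stands in for the "x2437" vector
query = compose([(specific, 0.7), (parent, 0.3)])

print(dot(query, specific), dot(query, parent), dot(specific, parent))
```

Under this reading the composed query sits between its inputs: it is closer to each of them than they are to each other, and the heavier weight pulls it toward the specific id.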

The same pattern works for provider or merchant names. Word features keep strong token identity, while character n-grams add protection against OCR truncation, spacing variation, and partial suffix loss.

Install

From a checkout of the repository:

uv sync

Quick Start

from idri import IdentifierEncoder, normalize_text

enc = IdentifierEncoder(profile="word_ngram")

vec = enc.encode("ACME Industrial SA de CV")
score = enc.similarity("ACME Industrial", "ACME Industrial SA de CV")
explanation = enc.explain("ACME Industrial", "ACME Industrial SA de CV")

print(normalize_text("  ACME   Industrial SA de CV  "))
print(vec.shape)              # (2048,)
print(type(vec).__name__)     # ndarray
print(score)
print(explanation)

Family-Aware Encoding

from idri import IdentifierEncoder

enc = IdentifierEncoder()
vec = enc.encode("INV-2024-001-03", family="invoice")
score = enc.similarity("INV-2024-001", "INV-2024-001-03", family="invoice")

If family=None, idri uses generic.
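How the family enters the vector is not documented here. One plausible mechanism, shown purely as an assumption, is to tag each feature with the family name so that the same token in different families hashes to different buckets. family_features and the "|" separator are hypothetical.

```python
from typing import Optional

def family_features(text: str, family: Optional[str] = None) -> set:
    """Hypothetical scheme: prefix every token with its family tag."""
    fam = family or "generic"  # mirrors the documented fallback for family=None
    tokens = text.lower().replace("-", " ").split()
    return {f"{fam}|{tok}" for tok in tokens}

default = family_features("INV-2024-001")
invoice = family_features("INV-2024-001", family="invoice")
print(default.isdisjoint(invoice))
```

In this scheme an invoice number and an identical-looking string from another family share no features, which is one way to keep unrelated identifier spaces apart.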

Composition Is First-Class

from idri import IdentifierEncoder, compose_vectors, cosine_similarity, l2_distance

enc = IdentifierEncoder()

v1 = enc.encode("F0032-3", family="invoice")
v2 = enc.encode("F0032", family="invoice")

composed = compose_vectors([(v1, 0.6), (v2, 0.4)])

composer = enc.start_composer()
composer.add_text("F0032-3", weight=0.6, family="invoice")
composer.add_text("F0032", weight=0.4, family="invoice")
incremental = composer.build()

cos_score = cosine_similarity(composed, incremental)
euclidean_gap = l2_distance(composed, incremental)

Profiles

  • word: family + words only
  • word_ngram (default): family + words + character n-grams
  • word_ngram_position: same as word_ngram with weak positional decay

API Summary

  • IdentifierEncoder.encode(...) -> numpy.ndarray
  • IdentifierEncoder.batch_encode(...) -> numpy.ndarray
  • IdentifierEncoder.similarity(...) -> float
  • IdentifierEncoder.start_composer(...) -> VectorComposer
  • IdentifierEncoder.explain(...) -> dict
  • compose_vectors(...) -> numpy.ndarray
  • cosine_similarity(...) -> float
  • l2_distance(...) -> float
  • l2_normalize(...) -> numpy.ndarray
  • available_profiles() -> tuple[str, ...]
  • normalize_text(...) -> str

All vectors are dense numpy.ndarray outputs (default float32) and are L2-normalized at output boundaries.
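A practical consequence of unit-length outputs is that cosine similarity reduces to a dot product, and squared L2 distance equals 2 - 2·cosine, so both metrics rank candidates identically. The check below uses plain lists and toy vectors rather than idri's own functions.

```python
import math

def unit(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

a = unit([3.0, 4.0, 0.0])
b = unit([4.0, 3.0, 0.0])

cosine = sum(x * y for x, y in zip(a, b))  # dot product of unit vectors
l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(math.isclose(l2 ** 2, 2 - 2 * cosine))
```

This is why an index configured for either inner product or Euclidean distance can serve the same normalized vectors.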

Exact Cosine vs ANN Search

Exact cosine similarity computed by this library over normalized vectors is the reference behavior.

ANN/vector database search runs in the same vector space but may not return identical rankings. Approximate indexes can reorder near-ties and, depending on index settings, sometimes change lower-ranked results.
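One way to quantify that gap is to compute the exact top-k by brute force and measure how much of it an approximate index returns. The sketch below uses a toy corpus of unit vectors; the approx list is a made-up stand-in for an ANN result, and exact_top_k and recall_at_k are illustrative helpers, not part of idri.

```python
def exact_top_k(query, corpus, k):
    """Brute-force ranking by dot product (equal to cosine for unit vectors)."""
    scored = sorted(
        ((sum(q * x for q, x in zip(query, vec)), key) for key, vec in corpus.items()),
        key=lambda pair: (-pair[0], pair[1]),  # ties broken by key for stability
    )
    return [key for _, key in scored[:k]]

def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k that the approximate result recovered."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

corpus = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.6, 0.8],
    "d": [0.0, 1.0],
}
query = [1.0, 0.0]

exact = exact_top_k(query, corpus, k=3)
approx = ["a", "c", "d"]  # hypothetical ANN output that missed "b"
recall = recall_at_k(exact, approx)
print(exact, recall)
```

Tracking recall against the exact reference on a sample of queries shows whether an index setting is trading away too much accuracy.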

Development

uv sync
uv run pytest
uv run pytest --cov=idri --cov-report=term-missing
uv run python examples/basic_usage.py
uv run python examples/composition_usage.py
uv build

PyPI Release

This repository is configured to publish idri to PyPI through GitHub Actions Trusted Publishing.

Workflow details:

  • repository owner: gocova
  • repository name: idri_py
  • workflow filename: publish.yml
  • GitHub environment: pypi
  • PyPI project name: idri

PyPI setup:

  1. If idri does not exist yet on PyPI, create a pending publisher for project name idri.
  2. If idri already exists on PyPI, add a Trusted Publisher to that project with the workflow details above.
  3. Push a version tag such as v0.1.0 to trigger the release workflow.

The workflow builds the package with uv build, runs the test suite, and publishes with pypa/gh-action-pypi-publish using GitHub OIDC. No long-lived PyPI API token is required.

License

This project is open source under the Mozilla Public License 2.0 (MPL-2.0).
See LICENSE.

Consulting / Commercial Terms

For custom integration, proprietary licensing without attribution, or high-performance optimization, contact the maintainer for consulting.

Contributing

Contributions are welcome, but they are subject to the contributor terms in CLA.md. See CONTRIBUTING.md for workflow and test requirements.
