idri — deterministic approximate vectorization for short identifiers and labels

idri is a tiny low-level Python vectorization engine for short, noisy identifiers and labels.

It is designed for reproducible approximate matching over short strings, and for composing multiple identifier signals into a single vector before retrieval.

What It Is Good At

  • Short business identifiers
  • Names and tags
  • Prefix/suffix-heavy codes
  • OCR-truncated fields
  • Partial multiword labels

Examples:

  • ACME INDUSTRIAL SA DE CV vs ACME INDUSTRIAL
  • INV-2024-001-03 vs INV-2024-001
  • TOPER PULIDORA 4 1/2 vs TOPER PULIDORA 4 1 2

What It Is Not For

  • Semantic similarity
  • Long natural-language embeddings
  • Full entity resolution workflows
  • Business-rule matching engines

Determinism and Distributed Use

idri is deterministic by construction: the same normalized input, profile, family, dimension, seed, and library behavior should produce the same vector.

That makes it distributed-friendly for systems that need reproducible encoding across workers, services, or edge nodes. This is a reproducibility property, not yet a fully validated distributed-systems guarantee.
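
To make the reproducibility point concrete, here is a minimal sketch of a seeded feature hash that gives the same bucket in every process. It is not idri's implementation: the function name `stable_bucket`, the use of BLAKE2b, and the way the seed is mixed in are all assumptions made for illustration. The motivation is that Python's built-in `hash()` on strings is salted per process by default, so it cannot be relied on to agree across workers.

```python
import hashlib


def stable_bucket(feature: str, dim: int = 2048, seed: int = 0) -> int:
    """Map a feature string to a bucket in [0, dim), identically in every process."""
    # Hypothetical choice: mix the (non-negative) seed in through BLAKE2b's key.
    key = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(
        feature.encode("utf-8"), digest_size=8, key=key
    ).digest()
    return int.from_bytes(digest, "little") % dim


# Two calls with the same input and settings always agree,
# regardless of which process or machine runs them.
a = stable_bucket("inv-2024-001")
b = stable_bucket("inv-2024-001")
print(a == b)
```

Under this kind of scheme, changing the seed or the dimension changes the mapping, which is why those settings have to match on every node that encodes or queries.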

Use Cases

idri is strongest when the problem is short-string matching, candidate generation, clustering, or blocking rather than semantic understanding.

  • E-commerce and retail catalog deduplication for messy product titles, vendor labels, and SKU-like strings
  • Procurement and ERP item crosswalks across supplier catalogs and internal part masters
  • Invoice and payment-reference matching for noisy parent/child identifiers
  • Marketplace or PIM listing clustering before rules or human review
  • Media tag and label search over large collections
  • CRM and analytics normalization for campaign names, UTM values, and source labels
  • Fraud and payments clustering for noisy merchant descriptors
  • IoT and edge-device matching where local deterministic encoding matters more than a cloud model

Why This Exists Instead of Just Fuzzy Matching or TF-IDF

Compared with classic fuzzy matching:

  • vectors can be precomputed and indexed once instead of rescoring every candidate pair
  • weighted composition lets you match multi-field records in one ANN search instead of several rule stages
  • word plus character n-grams preserve robustness to OCR truncation, token reordering, and small textual variation

Compared with TF-IDF:

  • no vocabulary fitting at runtime
  • no corpus dependency
  • same output dimension always
  • easier ANN integration
  • easier deployment
  • simpler persistence

Additional practical advantages:

  • deterministic output for the same input under the same encoder settings
  • can be used online with zero fitting
  • easy to port to Rust later
  • compact fixed-size vectors
  • supports lightweight family-aware encoding
  • supports weighted composition of multiple identifier signals into one query vector

Complex Matching in One Pass

Weighted composition is a practical advantage, not just an API feature.

Example:

  • specific invoice id: x2437-1
  • parent invoice family: x2437

You can encode both and compose them into one retrieval vector:

from idri import IdentifierEncoder, compose_texts, compose_vectors

enc = IdentifierEncoder()

specific = enc.encode("x2437-1", family="invoice")
parent = enc.encode("x2437", family="invoice")

query = compose_vectors([
    (specific, 0.7),
    (parent, 0.3),
])

query_fast = compose_texts(
    enc,
    [("x2437-1", 0.7), ("x2437", 0.3)],
    family="invoice",
)

That lets a downstream ANN index perform one search over a vector that captures both the specific invoice and its family context.
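
As a rough illustration of what weighted composition does numerically, the sketch below combines two toy vectors with a weighted sum and then rescales the result to unit length. This is an assumption about the general technique, not a description of how `compose_vectors` is implemented; the helper names `to_unit_length` and `compose_sketch` are hypothetical.

```python
import math


def to_unit_length(vec):
    """Scale a vector to L2 norm 1 (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def compose_sketch(weighted_vectors):
    """Weighted sum of equal-length vectors, followed by L2 normalization."""
    dim = len(weighted_vectors[0][0])
    total = [0.0] * dim
    for vec, weight in weighted_vectors:
        for i, x in enumerate(vec):
            total[i] += weight * x
    return to_unit_length(total)


# Toy stand-ins for the encoded "specific" and "parent" identifiers.
specific = [1.0, 0.0, 0.0]
parent = [0.0, 1.0, 0.0]

query = compose_sketch([(specific, 0.7), (parent, 0.3)])
print(query)
```

With these weights the composed query stays closer to the specific vector than to the parent, while still having a non-zero component along the parent direction.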

The same pattern works for provider or merchant names. Word features keep strong token identity, while character n-grams add protection against OCR truncation, spacing variation, and partial suffix loss.
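
The interplay of word features and character n-grams can be sketched with a small hashed bag-of-features encoder. Everything here is illustrative and assumed rather than taken from idri: the `w:`/`c:` feature prefixes, the `^`/`$` padding, trigram size 3, the 1024-dimensional vector, and the BLAKE2b hash are hypothetical choices.

```python
import hashlib
import math


def features(text, n=3):
    """Word tokens plus character n-grams of each token (illustrative scheme)."""
    words = text.lower().split()
    feats = [f"w:{w}" for w in words]
    for w in words:
        padded = f"^{w}$"  # mark token boundaries
        feats += [f"c:{padded[i:i + n]}" for i in range(len(padded) - n + 1)]
    return feats


def hashed_vector(text, dim=1024):
    """Count features into hashed buckets, then scale to unit length."""
    vec = [0.0] * dim
    for feat in features(text):
        digest = hashlib.blake2b(feat.encode("utf-8"), digest_size=8).digest()
        vec[int.from_bytes(digest, "little") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


full = hashed_vector("acme industrial")
truncated = hashed_vector("acme industri")  # OCR-style truncation
unrelated = hashed_vector("hammer")

print(cosine(full, truncated), cosine(full, unrelated))
```

The truncated label loses the exact word feature for "industrial" but keeps most of its character trigrams, so it still scores well above an unrelated label.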

Install

From a checkout of the repository, install the project and its dependencies with uv:

uv sync

Quick Start

from idri import IdentifierEncoder, normalize_text

enc = IdentifierEncoder(profile="word_ngram")

vec = enc.encode("ACME Industrial SA de CV")
score = enc.similarity("ACME Industrial", "ACME Industrial SA de CV")
explanation = enc.explain("ACME Industrial", "ACME Industrial SA de CV")

print(normalize_text("  ACME   Industrial SA de CV  "))
print(vec.shape)              # (2048,)
print(type(vec).__name__)     # ndarray
print(score)
print(explanation)

Family-Aware Encoding

from idri import IdentifierEncoder

enc = IdentifierEncoder()
vec = enc.encode("INV-2024-001-03", family="invoice")
score = enc.similarity("INV-2024-001", "INV-2024-001-03", family="invoice")

If family=None, idri falls back to the generic family.

Composition Is First-Class

from idri import (
    IdentifierEncoder,
    compose_texts,
    compose_vectors,
    cosine_similarity,
    l2_distance,
)

enc = IdentifierEncoder()

v1 = enc.encode("F0032-3", family="invoice")
v2 = enc.encode("F0032", family="invoice")

composed = compose_vectors([(v1, 0.6), (v2, 0.4)])
composed_from_text = compose_texts(enc, [("F0032-3", 0.6), ("F0032", 0.4)], family="invoice")

composer = enc.start_composer()
composer.add_text("F0032-3", weight=0.6, family="invoice")
composer.add_text("F0032", weight=0.4, family="invoice")
incremental = composer.build()

cos_score = cosine_similarity(composed, incremental)
euclidean_gap = l2_distance(composed, incremental)

compose_texts(...) is the short path for "encode these weighted texts and combine them once". VectorComposer remains the better fit when you need incremental building or explainability.

Retrieval Helpers

from idri import IdentifierEncoder, cosine_similarity_matrix, topk_by_similarity

enc = IdentifierEncoder()
query = enc.encode("invoice 001")
candidates = enc.batch_encode(["invoice 001", "invoice 002", "hammer"])

scores = cosine_similarity_matrix(query, candidates)
indices, top_scores = topk_by_similarity(query, candidates, k=2)
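
For readers who want to see what exact top-k retrieval computes, here is a dependency-free sketch. It mirrors the (indices, scores) return shape described for `topk_by_similarity`, but the functions `cosine` and `topk_sketch` are hypothetical stand-ins written for this example, not the library's code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def topk_sketch(query, candidates, k):
    """Return (indices, scores) of the k most similar candidates, best first."""
    scores = [cosine(query, c) for c in candidates]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return order, [scores[i] for i in order]


query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

indices, top_scores = topk_sketch(query, candidates, k=2)
print(indices, top_scores)
```

This brute-force scan scores every candidate, which is fine for small sets and serves as a reference when evaluating an approximate index.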

Profiles

  • word: family + words only
  • word_ngram (default): family + words + character n-grams
  • word_ngram_position: same as word_ngram with weak positional decay

API Summary

  • IdentifierEncoder.encode(...) -> numpy.ndarray
  • IdentifierEncoder.batch_encode(...) -> numpy.ndarray
  • IdentifierEncoder.similarity(...) -> float
  • IdentifierEncoder.start_composer(...) -> VectorComposer
  • IdentifierEncoder.explain(...) -> dict
  • compose_texts(...) -> numpy.ndarray
  • compose_vectors(...) -> numpy.ndarray
  • cosine_similarity(...) -> float
  • cosine_similarity_matrix(...) -> numpy.ndarray
  • l2_distance(...) -> float
  • l2_normalize(...) -> numpy.ndarray
  • available_profiles() -> tuple[str, ...]
  • normalize_text(...) -> str
  • topk_by_similarity(...) -> tuple[numpy.ndarray, numpy.ndarray]

Most helpers accept numpy.typing.ArrayLike inputs and return dense numpy.ndarray outputs. encode(...), batch_encode(...), l2_normalize(...), and the similarity-matrix helpers return float32 arrays; compose_vectors(...) and compose_texts(...) also default to float32 but still allow an explicit output dtype.

Exact Cosine vs ANN Search

Exact cosine similarity computed by this library over normalized vectors is the reference behavior.

ANN/vector database search runs in the same vector space but may not return rankings identical to exact search. Approximate indexes can reorder near-ties and sometimes change lower-ranked results depending on index settings.
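
One common way to quantify that gap is recall@k: the fraction of the exact top-k that the approximate index also returns. The sketch below uses made-up result ids and a hypothetical helper `recall_at_k`; it does not call any particular ANN library.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / max(len(exact_top), 1)


# Made-up rankings: the approximate list swaps a near-tie (9 and 1)
# and replaces one lower-ranked result (3 becomes 8).
exact_ids = [4, 9, 1, 7, 3]
approx_ids = [4, 1, 9, 7, 8]

r3 = recall_at_k(exact_ids, approx_ids, 3)
r5 = recall_at_k(exact_ids, approx_ids, 5)
print(r3, r5)
```

Because recall@k compares sets, a reordered near-tie does not lower it, while a changed lower-ranked result does. Tracking it against the exact cosine reference is a simple way to choose index settings.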

Development

uv sync
uv run pytest
uv run pytest --cov=idri --cov-report=term-missing
uv run python examples/basic_usage.py
uv run python examples/composition_usage.py
uv build

PyPI Release

This repository is configured to publish idri to PyPI through GitHub Actions Trusted Publishing.

Workflow details:

  • repository owner: gocova
  • repository name: idri_py
  • workflow filename: publish.yml
  • GitHub environment: pypi
  • PyPI project name: idri

PyPI setup:

  1. If idri does not exist yet on PyPI, create a pending publisher for project name idri.
  2. If idri already exists on PyPI, add a Trusted Publisher to that project with the workflow details above.
  3. Push a version tag such as v0.1.0 to trigger the release workflow.

The workflow builds the package with uv build, runs the test suite, and publishes with pypa/gh-action-pypi-publish using GitHub OIDC. No long-lived PyPI API token is required.

License

This project is open source under the Mozilla Public License 2.0 (MPL-2.0).
See LICENSE.

Consulting / Commercial Terms

For custom integration, proprietary licensing without attribution, or high-performance optimization, contact the maintainer for consulting.

Contributing

Contributions are welcome, but they are subject to the contributor terms in CLA.md. See CONTRIBUTING.md for workflow and test requirements.
