idri — deterministic approximate vectorization for short identifiers and labels
idri is a tiny low-level Python vectorization engine for short, noisy identifiers and labels.
It is designed for reproducible approximate matching over short strings, and for composing multiple identifier signals into a single vector before retrieval.
What It Is Good At
- Short business identifiers
- Names and tags
- Prefix/suffix-heavy codes
- OCR-truncated fields
- Partial multiword labels
Examples:
- ACME INDUSTRIAL SA DE CV vs ACME INDUSTRIAL
- INV-2024-001-03 vs INV-2024-001
- TOPER PULIDORA 4 1/2 vs TOPER PULIDORA 4 1 2
What It Is Not For
- Semantic similarity
- Long natural-language embeddings
- Full entity resolution workflows
- Business-rule matching engines
Determinism and Distributed Use
idri is deterministic by construction: the same normalized input, profile, family, dimension, seed, and library behavior should produce the same vector.
That makes it distributed-friendly for systems that need reproducible encoding across workers, services, or edge nodes. This is a reproducibility property, not yet a fully validated distributed-systems guarantee.
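The reproducibility property can be illustrated with a minimal sketch of seeded feature hashing. This is not idri's actual implementation: the helper name `hashed_ngram_vector`, the choice of BLAKE2b, the trigram size, and the 64-dimension default are all assumptions made for illustration. What it shows is that a keyed hash from `hashlib` maps a feature to the same bucket on every worker, whereas Python's built-in `hash()` for strings is randomized per process by default.

```python
import hashlib
import math


def hashed_ngram_vector(text, dim=64, seed=0, n=3):
    """Hypothetical helper: hash character n-grams into a fixed-size unit vector."""
    # Normalize case and whitespace so equivalent inputs share features.
    normalized = " ".join(text.upper().split())
    padded = f" {normalized} "
    vec = [0.0] * dim
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # A keyed BLAKE2b digest is stable across processes and machines.
        digest = hashlib.blake2b(
            gram.encode("utf-8"),
            digest_size=8,
            key=seed.to_bytes(8, "little"),
        ).digest()
        value = int.from_bytes(digest, "little")
        sign = 1.0 if value & 1 else -1.0   # lowest bit picks the sign
        bucket = (value >> 1) % dim         # remaining bits pick the bucket
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


a = hashed_ngram_vector("INV-2024-001")
b = hashed_ngram_vector("  inv-2024-001 ")
print(a == b)   # same normalized input, seed, and dimension
print(len(a))
```

Because nothing is fitted and no process-local state is involved, two nodes that agree on the settings can encode independently and still compare vectors.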
Use Cases
idri is strongest when the problem is short-string matching, candidate generation, clustering, or blocking rather than semantic understanding.
- E-commerce and retail catalog deduplication for messy product titles, vendor labels, and SKU-like strings
- Procurement and ERP item crosswalks across supplier catalogs and internal part masters
- Invoice and payment-reference matching for noisy parent/child identifiers
- Marketplace or PIM listing clustering before rules or human review
- Media tag and label search over large collections
- CRM and analytics normalization for campaign names, UTM values, and source labels
- Fraud and payments clustering for noisy merchant descriptors
- IoT and edge-device matching where local deterministic encoding matters more than a cloud model
Why This Exists Instead of Just Fuzzy Matching or TF-IDF
Compared with classic fuzzy matching:
- vectors can be precomputed and indexed once instead of rescoring every candidate pair
- weighted composition lets you match multi-field records in one ANN search instead of several rule stages
- word plus character n-grams preserve robustness to OCR truncation, token reordering, and small textual variation
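To see why character n-grams tolerate truncation, consider the overlap between trigram sets of a full and a truncated identifier. The sketch below is only an illustration: `char_ngrams` and `jaccard` are hypothetical helpers, and idri itself compares hashed vectors with cosine similarity rather than raw set overlap.

```python
def char_ngrams(text, n=3):
    """Hypothetical helper: set of character n-grams of a normalized string."""
    s = " ".join(text.upper().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


full = char_ngrams("INV-2024-001-03")
truncated = char_ngrams("INV-2024-001")
unrelated = char_ngrams("HAMMER")

# Every trigram of the truncated id also appears in the full id.
print(jaccard(full, truncated))
# No trigram is shared with an unrelated label.
print(jaccard(full, unrelated))
```

Losing a suffix removes only the few n-grams that touched it, so most of the signal survives.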
Compared with TF-IDF:
- no vocabulary fitting at runtime
- no corpus dependency
- same output dimension always
- easier ANN integration
- easier deployment
- simpler persistence
Additional practical advantages:
- deterministic output for the same input under the same encoder settings
- can be used online with zero fitting
- easy to port to Rust later
- compact fixed-size vectors
- supports lightweight family-aware encoding
- supports weighted composition of multiple identifier signals into one query vector
Complex Matching in One Pass
Weighted composition is a practical advantage, not just an API feature.
Example:
- specific invoice id: x2437-1
- parent invoice family: x2437
You can encode both and compose them into one retrieval vector:
from idri import IdentifierEncoder, compose_texts, compose_vectors
enc = IdentifierEncoder()
specific = enc.encode("x2437-1", family="invoice")
parent = enc.encode("x2437", family="invoice")
query = compose_vectors([
    (specific, 0.7),
    (parent, 0.3),
])
query_fast = compose_texts(
    enc,
    [("x2437-1", 0.7), ("x2437", 0.3)],
    family="invoice",
)
That lets a downstream ANN index perform one search over a vector that captures both the specific invoice and its family context.
The same pattern works for provider or merchant names. Word features keep strong token identity, while character n-grams add protection against OCR truncation, spacing variation, and partial suffix loss.
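One plausible reading of weighted composition is a weighted sum of unit vectors followed by re-normalization. The sketch below assumes that reading. `to_unit`, `compose`, and `cosine` are hypothetical stand-ins, not the library's `compose_vectors`, whose exact normalization is not described here, and the toy 3-dimensional vectors stand in for real encodings.

```python
import math


def to_unit(vec):
    """Scale a vector to unit L2 norm; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def compose(weighted_vectors):
    """Hypothetical stand-in: weighted sum of unit vectors, re-normalized."""
    dim = len(weighted_vectors[0][0])
    total = [0.0] * dim
    for vec, weight in weighted_vectors:
        unit = to_unit(vec)
        for i in range(dim):
            total[i] += weight * unit[i]
    return to_unit(total)


def cosine(a, b):
    a, b = to_unit(a), to_unit(b)
    return sum(x * y for x, y in zip(a, b))


# Toy orthogonal vectors standing in for the two encoded signals.
specific = [1.0, 0.0, 0.0]
parent = [0.0, 1.0, 0.0]
query = compose([(specific, 0.7), (parent, 0.3)])

print(cosine(query, specific))  # larger: the heavier-weighted signal dominates
print(cosine(query, parent))    # smaller but still positive
```

Under this reading, the weights control how strongly each signal pulls the query toward its own neighborhood in the index.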
Install
uv sync
Quick Start
from idri import IdentifierEncoder, normalize_text
enc = IdentifierEncoder(profile="word_ngram")
vec = enc.encode("ACME Industrial SA de CV")
score = enc.similarity("ACME Industrial", "ACME Industrial SA de CV")
explanation = enc.explain("ACME Industrial", "ACME Industrial SA de CV")
print(normalize_text(" ACME Industrial SA de CV "))
print(vec.shape) # (2048,)
print(type(vec).__name__) # ndarray
print(score)
print(explanation)
Family-Aware Encoding
from idri import IdentifierEncoder
enc = IdentifierEncoder()
vec = enc.encode("INV-2024-001-03", family="invoice")
score = enc.similarity("INV-2024-001", "INV-2024-001-03", family="invoice")
If family is None, idri falls back to the generic family.
Composition Is First-Class
from idri import (
    IdentifierEncoder,
    compose_texts,
    compose_vectors,
    cosine_similarity,
    l2_distance,
)
enc = IdentifierEncoder()
v1 = enc.encode("F0032-3", family="invoice")
v2 = enc.encode("F0032", family="invoice")
composed = compose_vectors([(v1, 0.6), (v2, 0.4)])
composed_from_text = compose_texts(enc, [("F0032-3", 0.6), ("F0032", 0.4)], family="invoice")
composer = enc.start_composer()
composer.add_text("F0032-3", weight=0.6, family="invoice")
composer.add_text("F0032", weight=0.4, family="invoice")
incremental = composer.build()
cos_score = cosine_similarity(composed, incremental)
euclidean_gap = l2_distance(composed, incremental)
compose_texts(...) is the short path for "encode these weighted texts and combine them once". VectorComposer remains the better fit when you need incremental building or explainability.
Retrieval Helpers
from idri import IdentifierEncoder, cosine_similarity_matrix, topk_by_similarity
enc = IdentifierEncoder()
query = enc.encode("invoice 001")
candidates = enc.batch_encode(["invoice 001", "invoice 002", "hammer"])
scores = cosine_similarity_matrix(query, candidates)
indices, top_scores = topk_by_similarity(query, candidates, k=2)
Profiles
- word: family + words only
- word_ngram (default): family + words + character n-grams
- word_ngram_position: same as word_ngram with weak positional decay
API Summary
- IdentifierEncoder.encode(...) -> numpy.ndarray
- IdentifierEncoder.batch_encode(...) -> numpy.ndarray
- IdentifierEncoder.similarity(...) -> float
- IdentifierEncoder.start_composer(...) -> VectorComposer
- IdentifierEncoder.explain(...) -> dict
- compose_texts(...) -> numpy.ndarray
- compose_vectors(...) -> numpy.ndarray
- cosine_similarity(...) -> float
- cosine_similarity_matrix(...) -> numpy.ndarray
- l2_distance(...) -> float
- l2_normalize(...) -> numpy.ndarray
- available_profiles() -> tuple[str, ...]
- normalize_text(...) -> str
- topk_by_similarity(...) -> tuple[numpy.ndarray, numpy.ndarray]
Most helpers accept numpy.typing.ArrayLike inputs and return dense numpy.ndarray outputs. encode(...), batch_encode(...), l2_normalize(...), and the similarity-matrix helpers return float32 arrays; compose_vectors(...) and compose_texts(...) also default to float32 but still allow an explicit output dtype.
Exact Cosine vs ANN Search
Exact cosine similarity computed by this library over normalized vectors is the reference behavior.
ANN/vector database search runs in the same vector space but may not return bit-identical rankings. Approximate indexes can reorder near-ties and sometimes change lower-ranked results depending on index settings.
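One way to quantify how far an approximate index drifts from the exact reference is recall@k: the fraction of the exact top-k that the approximate result also returns. The sketch below is index-agnostic. `exact_topk`, `recall_at_k`, and the hard-coded `ann_result` list are illustrative assumptions, not output from any real index or part of idri's API.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def exact_topk(query, candidates, k):
    """Brute-force reference ranking: indices of the k most similar candidates."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: -cosine(query, candidates[i]),
    )
    return order[:k]


def recall_at_k(exact, approx):
    """Fraction of the exact top-k that the approximate result recovered."""
    return len(set(exact) & set(approx)) / len(exact) if exact else 1.0


candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.05]

reference = exact_topk(query, candidates, k=2)
ann_result = [1, 3]  # made-up approximate answer that misses one true neighbor

print(reference, recall_at_k(reference, ann_result))
```

Measuring recall against the exact ranking on a sample of queries is a common way to choose index settings before relying on ANN results downstream.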
Development
uv sync
uv run pytest
uv run pytest --cov=idri --cov-report=term-missing
uv run python examples/basic_usage.py
uv run python examples/composition_usage.py
uv build
PyPI Release
This repository is configured to publish idri to PyPI through GitHub Actions Trusted Publishing.
Workflow details:
- repository owner: gocova
- repository name: idri_py
- workflow filename: publish.yml
- GitHub environment: pypi
- PyPI project name: idri
PyPI setup:
- If idri does not exist yet on PyPI, create a pending publisher for project name idri.
- If idri already exists on PyPI, add a Trusted Publisher to that project with the workflow details above.
- Push a version tag such as v0.1.0 to trigger the release workflow.
The workflow builds the package with uv build, runs the test suite, and publishes with pypa/gh-action-pypi-publish using GitHub OIDC. No long-lived PyPI API token is required.
License
This project is open source under the Mozilla Public License 2.0 (MPL-2.0).
See LICENSE.
Consulting / Commercial Terms
For custom integration, proprietary licensing without attribution, or high-performance optimization, contact the maintainer for consulting.
Contributing
Contributions are welcome, but they are subject to the contributor terms in CLA.md. See CONTRIBUTING.md for workflow and test requirements.