Deterministic approximate vectorization for short identifiers and labels

Maintainers

gocova

Project description

idri — deterministic approximate vectorization for short identifiers and labels

idri is a tiny low-level Python vectorization engine for short, noisy identifiers and labels.

It is designed for reproducible approximate matching over short strings, and for composing multiple identifier signals into a single vector before retrieval.

What It Is Good At

Short business identifiers
Names and tags
Prefix/suffix-heavy codes
OCR-truncated fields
Partial multiword labels

Examples:

ACME INDUSTRIAL SA DE CV vs ACME INDUSTRIAL
INV-2024-001-03 vs INV-2024-001
TOPER PULIDORA 4 1/2 vs TOPER PULIDORA 4 1 2

What It Is Not For

Semantic similarity
Long natural-language embeddings
Full entity resolution workflows
Business-rule matching engines

Determinism and Distributed Use

idri is deterministic by construction: the same normalized input, profile, family, dimension, seed, and library behavior should produce the same vector.

That makes it distributed-friendly for systems that need reproducible encoding across workers, services, or edge nodes. This is a reproducibility property, not yet a fully validated distributed-systems guarantee.
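
One practical consequence: an encoder with this property cannot depend on anything process-local, such as Python's built-in hash(), whose string hashing is randomized per process unless PYTHONHASHSEED is fixed. The sketch below is not idri's implementation. It only illustrates, under that assumption, how a seeded stable hash could map a token to the same bucket on every worker. The name stable_bucket, the use of BLAKE2b, and the way the seed is applied are all hypothetical.

```python
import hashlib


def stable_bucket(token: str, dim: int = 2048, seed: int = 0) -> int:
    """Map a token to a bucket index in [0, dim), identically in every process."""
    # Hypothetical scheme: keyed BLAKE2b, with the seed used as the key.
    key = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little") % dim


a = stable_bucket("inv-2024-001")
b = stable_bucket("inv-2024-001")
print(a == b)  # True
```

Because the digest depends only on the token bytes and the seed, two services that agree on the seed and dimension would agree on the bucket without sharing any state.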

Use Cases

idri is strongest when the problem is short-string matching, candidate generation, clustering, or blocking rather than semantic understanding.

E-commerce and retail catalog deduplication for messy product titles, vendor labels, and SKU-like strings
Procurement and ERP item crosswalks across supplier catalogs and internal part masters
Invoice and payment-reference matching for noisy parent/child identifiers
Marketplace or PIM listing clustering before rules or human review
Media tag and label search over large collections
CRM and analytics normalization for campaign names, UTM values, and source labels
Fraud and payments clustering for noisy merchant descriptors
IoT and edge-device matching where local deterministic encoding matters more than a cloud model

Why This Exists Instead of Just Fuzzy Matching or TF-IDF

Compared with classic fuzzy matching:

vectors can be precomputed and indexed once instead of rescoring every candidate pair
weighted composition lets you match multi-field records in one ANN search instead of several rule stages
word plus character n-grams preserve robustness to OCR truncation, token reordering, and small textual variation

Compared with TF-IDF:

no vocabulary fitting at runtime
no corpus dependency
same output dimension always
easier ANN integration
easier deployment
simpler persistence

Additional practical advantages:

deterministic output for the same input under the same encoder settings
can be used online with zero fitting
easy to port to Rust later
compact fixed-size vectors
supports lightweight family-aware encoding
supports weighted composition of multiple identifier signals into one query vector
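
To make the "no vocabulary fitting, same output dimension" points concrete, here is a minimal sketch of the general hashing-trick idea using only the standard library. It is an assumption-laden illustration, not idri's actual feature scheme: the feature prefixes, the trigram size, the normalization, and the helper names (features, encode_sketch, cosine) are all hypothetical.

```python
import hashlib
import math


def _bucket(feature: str, dim: int) -> int:
    # Stable hash so the mapping does not change between processes.
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % dim


def features(text: str, n: int = 3) -> list:
    """Word features plus character n-grams over lowercased, space-collapsed text."""
    norm = " ".join(text.lower().split())
    words = ["w:" + w for w in norm.split(" ") if w]
    grams = ["c:" + norm[i:i + n] for i in range(max(len(norm) - n + 1, 0))]
    return words + grams


def encode_sketch(text: str, dim: int = 2048) -> list:
    """Count features into a fixed-size vector, then scale to unit length."""
    vec = [0.0] * dim
    for f in features(text):
        vec[_bucket(f, dim)] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    # Inputs are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


full = encode_sketch("ACME INDUSTRIAL SA DE CV")
related = cosine(full, encode_sketch("ACME INDUSTRIAL"))
unrelated = cosine(full, encode_sketch("HAMMER"))
print(related > unrelated)  # True
```

No corpus is consulted at any point: the output length is fixed by dim alone, which is what makes this style of vector easy to persist and index.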

Complex Matching in One Pass

Weighted composition is a practical advantage, not just an API feature.

Example:

specific invoice id: x2437-1
parent invoice family: x2437

You can encode both and compose them into one retrieval vector:

from idri import IdentifierEncoder, compose_texts, compose_vectors

enc = IdentifierEncoder()

specific = enc.encode("x2437-1", family="invoice")
parent = enc.encode("x2437", family="invoice")

query = compose_vectors([
    (specific, 0.7),
    (parent, 0.3),
])

query_fast = compose_texts(
    enc,
    [("x2437-1", 0.7), ("x2437", 0.3)],
    family="invoice",
)

That lets a downstream ANN index perform one search over a vector that captures both the specific invoice and its family context.

The same pattern works for provider or merchant names. Word features keep strong token identity, while character n-grams add protection against OCR truncation, spacing variation, and partial suffix loss.
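
The effect of the weights can be seen on toy vectors. The sketch below assumes composition is a weighted sum followed by L2 normalization; the source does not spell out what compose_vectors does internally, so treat this as one plausible reading. The helper names unit, compose_sketch, and cosine are hypothetical and use plain lists rather than NumPy arrays.

```python
import math


def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def compose_sketch(weighted):
    """Weighted sum of equal-length vectors, rescaled to unit length."""
    dim = len(weighted[0][0])
    out = [0.0] * dim
    for vec, weight in weighted:
        for i, x in enumerate(vec):
            out[i] += weight * x
    return unit(out)


def cosine(a, b):
    a, b = unit(a), unit(b)
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for encoded identifiers; real vectors would come from an encoder.
specific = [1.0, 0.0, 0.0]
parent = [0.0, 1.0, 0.0]
other = [0.0, 0.0, 1.0]

query = compose_sketch([(specific, 0.7), (parent, 0.3)])
print(round(cosine(query, specific), 3))  # 0.919
print(round(cosine(query, parent), 3))    # 0.394
print(cosine(query, other))               # 0.0
```

The composed query stays closest to the higher-weighted signal while still scoring above zero against the lower-weighted one, which is the behavior a single ANN lookup would exploit.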

Install

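From a checkout of the repository:

uv sync

From PyPI (the published releases are currently all pre-releases, so --pre is passed to select them explicitly):

pip install --pre idri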

Quick Start

from idri import IdentifierEncoder, normalize_text

enc = IdentifierEncoder(profile="word_ngram")

vec = enc.encode("ACME Industrial SA de CV")
score = enc.similarity("ACME Industrial", "ACME Industrial SA de CV")
explanation = enc.explain("ACME Industrial", "ACME Industrial SA de CV")

print(normalize_text("  ACME   Industrial SA de CV  "))
print(vec.shape)              # (2048,)
print(type(vec).__name__)     # ndarray
print(score)
print(explanation)

Family-Aware Encoding

from idri import IdentifierEncoder

enc = IdentifierEncoder()
vec = enc.encode("INV-2024-001-03", family="invoice")
score = enc.similarity("INV-2024-001", "INV-2024-001-03", family="invoice")

If family=None, idri falls back to the generic family.

Composition Is First-Class

from idri import (
    IdentifierEncoder,
    compose_texts,
    compose_vectors,
    cosine_similarity,
    l2_distance,
)

enc = IdentifierEncoder()

v1 = enc.encode("F0032-3", family="invoice")
v2 = enc.encode("F0032", family="invoice")

composed = compose_vectors([(v1, 0.6), (v2, 0.4)])
composed_from_text = compose_texts(enc, [("F0032-3", 0.6), ("F0032", 0.4)], family="invoice")

composer = enc.start_composer()
composer.add_text("F0032-3", weight=0.6, family="invoice")
composer.add_text("F0032", weight=0.4, family="invoice")
incremental = composer.build()

cos_score = cosine_similarity(composed, incremental)
euclidean_gap = l2_distance(composed, incremental)

compose_texts(...) is the short path for "encode these weighted texts and combine them once". VectorComposer remains the better fit when you need incremental building or explainability.

Retrieval Helpers

from idri import IdentifierEncoder, cosine_similarity_matrix, topk_by_similarity

enc = IdentifierEncoder()
query = enc.encode("invoice 001")
candidates = enc.batch_encode(["invoice 001", "invoice 002", "hammer"])

scores = cosine_similarity_matrix(query, candidates)
indices, top_scores = topk_by_similarity(query, candidates, k=2)
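
For readers who want to see what an exact top-k lookup amounts to, here is a small standard-library sketch. It is not the library's code: topk_sketch and cosine are hypothetical helpers over plain lists, and the only detail taken from the example above is the (indices, scores) return order.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def topk_sketch(query, candidates, k):
    """Return (indices, scores) of the k most similar candidates, best first."""
    scored = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [i for _, i in best], [s for s, _ in best]


toy_candidates = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
top_indices, top_values = topk_sketch([1.0, 0.0], toy_candidates, k=2)
print(top_indices)  # [0, 1]
```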

Profiles

word: family + words only
word_ngram (default): family + words + character n-grams
word_ngram_position: same as word_ngram with weak positional decay

API Summary

IdentifierEncoder.encode(...) -> numpy.ndarray
IdentifierEncoder.batch_encode(...) -> numpy.ndarray
IdentifierEncoder.similarity(...) -> float
IdentifierEncoder.start_composer(...) -> VectorComposer
IdentifierEncoder.explain(...) -> dict
compose_texts(...) -> numpy.ndarray
compose_vectors(...) -> numpy.ndarray
cosine_similarity(...) -> float
cosine_similarity_matrix(...) -> numpy.ndarray
l2_distance(...) -> float
l2_normalize(...) -> numpy.ndarray
available_profiles() -> tuple[str, ...]
normalize_text(...) -> str
topk_by_similarity(...) -> tuple[numpy.ndarray, numpy.ndarray]

Most helpers accept numpy.typing.ArrayLike inputs and return dense numpy.ndarray outputs. encode(...), batch_encode(...), l2_normalize(...), and the similarity-matrix helpers return float32 arrays; compose_vectors(...) and compose_texts(...) also default to float32 but still allow an explicit output dtype.

Exact Cosine vs ANN Search

Exact cosine similarity computed by this library over normalized vectors is the reference behavior.

ANN/vector database search runs in the same vector space but may not return bit-identical rankings. Approximate indexes can reorder near-ties and sometimes change lower-ranked results depending on index settings.
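
One common way to quantify that gap is recall@k: the fraction of the exact top-k results that the approximate index also returns. The helper below is a generic sketch and not part of idri. The name recall_at_k and the ID lists are illustrative, with exact_ids standing for a ranking from exact cosine search and approx_ids for the ranking returned by whichever ANN index is in use.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 1.0  # nothing to recover, so nothing was missed
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


exact = [4, 9, 1, 7, 3]    # illustrative ranking from exact cosine search
approx = [4, 1, 9, 3, 8]   # illustrative ranking from an ANN index

print(recall_at_k(exact, approx, k=3))  # 1.0
print(recall_at_k(exact, approx, k=5))  # 0.8
```

In this made-up example the top three are the same set in a different order (a reordered near-tie), while a lower-ranked result is swapped out at k=5.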

Development

uv sync
uv run pytest
uv run pytest --cov=idri --cov-report=term-missing
uv run python examples/basic_usage.py
uv run python examples/composition_usage.py
uv build

PyPI Release

This repository is configured to publish idri to PyPI through GitHub Actions Trusted Publishing.

Workflow details:

repository owner: gocova
repository name: idri_py
workflow filename: publish.yml
GitHub environment: pypi
PyPI project name: idri

PyPI setup:

If idri does not exist yet on PyPI, create a pending publisher for project name idri.
If idri already exists on PyPI, add a Trusted Publisher to that project with the workflow details above.
Push a version tag such as v0.1.0 to trigger the release workflow.

The workflow builds the package with uv build, runs the test suite, and publishes with pypa/gh-action-pypi-publish using GitHub OIDC. No long-lived PyPI API token is required.

License

This project is open source under the Mozilla Public License 2.0 (MPL-2.0).
See LICENSE.

Consulting / Commercial Terms

For custom integration, proprietary licensing without attribution, or high-performance optimization, contact the maintainer for consulting.

Contributing

Contributions are welcome, but they are subject to the contributor terms in CLA.md. See CONTRIBUTING.md for workflow and test requirements.

Release history

2.0.0rc2603130931.dev0 (pre-release, this version): Mar 13, 2026
1.1.0rc2603092137.dev0 (pre-release): Mar 10, 2026
1.0.0rc2603092116.dev0 (pre-release): Mar 10, 2026
0.1.0a603081758.dev0 (pre-release): Mar 8, 2026

Download files

Source Distribution

idri-2.0.0rc2603130931.dev0.tar.gz (63.5 kB)
Uploaded Mar 13, 2026 Source

Built Distribution

idri-2.0.0rc2603130931.dev0-py3-none-any.whl (16.6 kB)
Uploaded Mar 13, 2026 Python 3

File details

Details for the file idri-2.0.0rc2603130931.dev0.tar.gz.

File metadata

Download URL: idri-2.0.0rc2603130931.dev0.tar.gz
Upload date: Mar 13, 2026
Size: 63.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for idri-2.0.0rc2603130931.dev0.tar.gz
Algorithm	Hash digest
SHA256	`c503fa3d93ef8f5cce22f67862320f29dc84b4166eb7115347e431a9e6832551`
MD5	`c7f5bea0c893da72fa1d66754e68954b`
BLAKE2b-256	`a64ffe07261bc941c5273ec29f1787d56712080fe20849fe284b33884188574b`

Provenance

The following attestation bundles were made for idri-2.0.0rc2603130931.dev0.tar.gz:

Publisher: publish.yml on gocova/idri_py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: idri-2.0.0rc2603130931.dev0.tar.gz
- Subject digest: c503fa3d93ef8f5cce22f67862320f29dc84b4166eb7115347e431a9e6832551
- Sigstore transparency entry: 1096872233
- Sigstore integration time: Mar 13, 2026
Source repository:
- Permalink: gocova/idri_py@88d07e632bbcadf481616252fc9cfbf961000b5d
- Branch / Tag: refs/tags/v2.0.0rc2603130930
- Owner: https://github.com/gocova
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@88d07e632bbcadf481616252fc9cfbf961000b5d
- Trigger Event: push

File details

Details for the file idri-2.0.0rc2603130931.dev0-py3-none-any.whl.

File metadata

Download URL: idri-2.0.0rc2603130931.dev0-py3-none-any.whl
Upload date: Mar 13, 2026
Size: 16.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for idri-2.0.0rc2603130931.dev0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`92f140c8a3f5bb16f58d6d84e46cb3c0354a266bde6849add05933d26c7ba86e`
MD5	`029fae2548f4cf25344685b53bab28c9`
BLAKE2b-256	`057955a0bdcd901cdd5695f2274c400a291e2fd5bde33fd283e714b1ab86866a`

Provenance

The following attestation bundles were made for idri-2.0.0rc2603130931.dev0-py3-none-any.whl:

Publisher: publish.yml on gocova/idri_py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: idri-2.0.0rc2603130931.dev0-py3-none-any.whl
- Subject digest: 92f140c8a3f5bb16f58d6d84e46cb3c0354a266bde6849add05933d26c7ba86e
- Sigstore transparency entry: 1096872235
- Sigstore integration time: Mar 13, 2026
Source repository:
- Permalink: gocova/idri_py@88d07e632bbcadf481616252fc9cfbf961000b5d
- Branch / Tag: refs/tags/v2.0.0rc2603130930
- Owner: https://github.com/gocova
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@88d07e632bbcadf481616252fc9cfbf961000b5d
- Trigger Event: push
