Embedding infrastructure — pluggable encoders, ANN indexing, cross-domain similarity

sandx-embed

Shared embedding and vector similarity infrastructure for the SandX platform.

Requires Python 3.10+. License: Apache 2.0.

Part of the SandX Lab computational infrastructure ecosystem.


What It Does

sandx-embed is the shared latent representation layer used by all SandX engines. It provides:

  • Pluggable encoders — sentence-transformers models out of the box; register any custom encoder
  • High-performance ANN indexing — HNSW (production) and exact search (baseline), with save/load
  • Cross-domain similarity — cosine, L2, inner product; normalized and unnormalized vectors

Not a standalone product — consumed by sandx-er, sandx-graph, and sandx-compute as a shared dependency.
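
The three metrics listed above are closely related once vectors are L2-normalized. The following is a minimal NumPy sketch of that relationship, not sandx-embed code; the helper names (l2_normalize, cosine_distance, squared_l2) are made up for illustration.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row of x to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cos(angle between a and b); also valid for unnormalized vectors."""
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

def squared_l2(a: np.ndarray, b: np.ndarray) -> float:
    """Squared Euclidean distance between a and b."""
    diff = a - b
    return float(diff @ diff)

rng = np.random.default_rng(0)
raw = rng.normal(size=(2, 8))   # two arbitrary unnormalized vectors
a, b = l2_normalize(raw)        # unit-length copies

inner = float(a @ b)
# On unit vectors the metrics are monotone transforms of each other:
#   cosine_distance = 1 - inner
#   squared_l2      = 2 - 2 * inner
assert np.isclose(cosine_distance(a, b), 1.0 - inner)
assert np.isclose(squared_l2(a, b), 2.0 - 2.0 * inner)
```

For normalized vectors, ranking neighbors by any of the three therefore gives the same order. For unnormalized vectors the choice matters: inner product rewards vector length, while cosine ignores it.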

Status

v0.1 — Phase 2 active development

Component                                     Status
Encoder (pluggable model registry)            Working
SentenceTransformerEncoder (SBERT, E5, BGE)   Working
VectorIndex (HNSW and exact search)           Working
Save / load index                             Working
PyPI package                                  Planned

Installation

pip install sandx-embed

Or from source:

git clone https://github.com/sandxlab/sandx-embed
cd sandx-embed
pip install -e ".[dev]"

Quick Start

from sandx_embed import Encoder, VectorIndex

# Encode records into dense vectors
enc = Encoder(model="sentence-bert")   # downloads all-MiniLM-L6-v2 on first use
vectors = enc.encode(["John Smith, Boston", "Jon Smyth, Boston", "Alice Brown, NYC"])
# → np.ndarray shape (3, 384), L2-normalized

# Build an ANN index
idx = VectorIndex(method="hnsw", metric="cosine")
idx.build(vectors, ids=["r0", "r1", "r2"])

# Query nearest neighbors
result = idx.query(vectors[0], k=2)
print(result.ids)        # ["r0", "r1"]
print(result.distances)  # e.g. [0.0, 0.12]  (cosine distance; exact values depend on the model)

# Persist and reload
idx.save("/tmp/my_index")
idx2 = VectorIndex.load("/tmp/my_index")

Built-in Models

Name             HuggingFace model      Dim    Notes
"sentence-bert"  all-MiniLM-L6-v2       384    Fast, English, recommended default
"e5-small"       intfloat/e5-small-v2   384    Higher quality, English
"bge-m3"         BAAI/bge-m3            1024   Multilingual, large

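The Dim column largely determines how much memory the raw vectors need. A rough sizing sketch, assuming vectors are stored as float32 (4 bytes per value) and ignoring any index overhead; raw_vector_bytes is a hypothetical helper, not part of sandx-embed.

```python
def raw_vector_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold n_vectors dense vectors of the given dimension."""
    return n_vectors * dim * bytes_per_value

# Dimensions taken from the table above; float32 storage is an assumption.
for name, dim in [("sentence-bert", 384), ("e5-small", 384), ("bge-m3", 1024)]:
    mib = raw_vector_bytes(1_000_000, dim) / 2**20
    print(f"{name}: {mib:.0f} MiB per million vectors")
```

Under these assumptions a 1024-dimensional model needs about 2.7 times the memory of a 384-dimensional one for the same number of records.
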
Custom Encoders

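from sandx_embed.encoder import BaseEncoder, Encoder
import numpy as np

class MyEncoder(BaseEncoder):
    """Placeholder encoder that returns random vectors; swap in a real model."""

    def encode(self, inputs, *, batch_size=64, normalize=True):
        # Your model here. Random output stands in for real embeddings.
        vectors = np.random.rand(len(inputs), self.dim).astype(np.float32)
        if normalize:
            # Scale each row to unit L2 norm, as the built-in encoders do
            vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors

    @property
    def dim(self):
        return 128

# Register a factory under a name, then construct it like a built-in model
Encoder.register("my-model", lambda: MyEncoder())
enc = Encoder("my-model")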
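
A custom encoder does not need a neural model. As a hedged illustration, the standalone sketch below hashes character trigrams into a fixed-size vector and exposes the same encode and dim shape as the example above. HashingNgramEncoder is a hypothetical class, not part of sandx-embed, and it is not wired to BaseEncoder here so that it runs with NumPy alone. It uses hashlib rather than the built-in hash(), which varies between Python processes, so repeated runs give identical vectors.

```python
import hashlib
import numpy as np

class HashingNgramEncoder:
    """Toy encoder: hashed character n-gram counts (illustrative only)."""

    def __init__(self, dim: int = 128, n: int = 3):
        self._dim = dim
        self._n = n

    @property
    def dim(self) -> int:
        return self._dim

    def _bucket(self, gram: str) -> int:
        # blake2b gives the same digest in every process and on every machine
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "little") % self._dim

    def encode(self, inputs, *, batch_size=64, normalize=True):
        out = np.zeros((len(inputs), self._dim), dtype=np.float32)
        # Batching is only for API parity here; a real model would batch its forward pass
        for start in range(0, len(inputs), batch_size):
            batch = inputs[start:start + batch_size]
            for row, text in enumerate(batch, start=start):
                padded = f" {text.lower()} "
                for i in range(len(padded) - self._n + 1):
                    out[row, self._bucket(padded[i:i + self._n])] += 1.0
        if normalize:
            norms = np.linalg.norm(out, axis=1, keepdims=True)
            out /= np.clip(norms, 1e-12, None)  # guard against empty inputs
        return out

enc = HashingNgramEncoder()
vectors = enc.encode(["John Smith, Boston", "Jon Smyth, Boston", "Alice Brown, NYC"])
print(vectors.shape)  # (3, 128)
```

To plug something like this into sandx-embed, it would presumably subclass BaseEncoder and be registered with Encoder.register as shown above.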

Index Methods

Method    Backend   When to use
"hnsw"    usearch   N > 10,000; production; fast queries
"exact"   numpy     Small datasets; correctness baseline

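To make the "correctness baseline" role concrete, here is a minimal brute-force cosine search in NumPy, together with a recall helper for comparing an approximate result against it. This is a sketch of the general technique, not the sandx-embed implementation; exact_cosine_search and recall_at_k are hypothetical names.

```python
import numpy as np

def exact_cosine_search(vectors, ids, query, k):
    """Return the k nearest ids and their cosine distances, scanning every vector."""
    v = vectors / np.clip(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12, None)
    q = query / max(float(np.linalg.norm(query)), 1e-12)
    distances = 1.0 - v @ q                      # one distance per stored vector
    k = min(k, len(ids))
    top = np.argpartition(distances, k - 1)[:k]  # k smallest, in arbitrary order
    top = top[np.argsort(distances[top])]        # sort those k by distance
    return [ids[i] for i in top], distances[top]

def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the exact top-k that the approximate result also returned."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact) if exact else 1.0

vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
ids = ["r0", "r1", "r2"]
found_ids, found_dist = exact_cosine_search(vectors, ids, vectors[0], k=2)
print(found_ids)  # ['r0', 'r1']
```

Each query costs time proportional to N times the dimension, which is why the exact method suits small datasets and an ANN index suits large ones. Running the same queries through both and averaging recall_at_k is one common way to check an ANN configuration.
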
Design Principles

  • Pluggable — any encoder model or index backend can be registered
  • Portable — indexes serialize to disk and reload without rebuilding
  • Deterministic — same model version + input → same output
  • No vendor lock-in — no hard dependency on any hosted vector service
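
The "Portable" principle can be illustrated with a small round trip. The sketch below persists vectors and ids with NumPy's .npz format and reloads them. It only shows the idea for a brute-force index, where the stored state is just the vectors and their ids; it makes no claim about the on-disk format sandx-embed uses, and save_index and load_index are hypothetical helpers.

```python
import os
import tempfile
import numpy as np

def save_index(path: str, vectors: np.ndarray, ids: list) -> None:
    """Write vectors and ids to a single .npz file."""
    np.savez(path, vectors=vectors, ids=np.array(ids))

def load_index(path: str):
    """Read back the arrays written by save_index."""
    with np.load(path) as data:
        return data["vectors"], data["ids"].tolist()

vectors = np.array([[1.0, 0.0], [0.6, 0.8]], dtype=np.float32)
ids = ["r0", "r1"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_index.npz")
    save_index(path, vectors, ids)
    loaded_vectors, loaded_ids = load_index(path)

# The reloaded state matches what was saved, so queries give the same results
assert np.array_equal(loaded_vectors, vectors)
assert loaded_ids == ids
```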

Related

  • sandx-er — entity resolution engine (uses sandx-embed for blocking + matching)
  • sandx-graph — graph intelligence over resolved entities
  • sandx.io — project home

License

Apache 2.0 — see LICENSE
