Embedding infrastructure — pluggable encoders, ANN indexing, cross-domain similarity
sandx-embed
Shared embedding and vector similarity infrastructure for the SandX platform.
Part of the SandX Lab computational infrastructure ecosystem.
What It Does
sandx-embed is the shared latent representation layer used by all SandX engines. It provides:
- Pluggable encoders — sentence-transformers models out of the box; register any custom encoder
- High-performance ANN indexing — HNSW (production) and exact search (baseline), with save/load
- Cross-domain similarity — cosine, L2, inner product; normalized and unnormalized vectors
Not a standalone product — consumed by sandx-er, sandx-graph, and sandx-compute as a shared dependency.
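The three metrics listed above are closely related once vectors are L2-normalized. The following is a minimal NumPy sketch of that relationship, not the sandx-embed implementation; the helper names (`l2_normalize`, `cosine_similarity`, `inner_product`, `l2_distance`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean length (hypothetical helper)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against division by zero


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def inner_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))


def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity ignores vector length: 24 / (5 * 5) = 0.96
cos_ab = cosine_similarity(a, b)

# On unit vectors, inner product equals cosine similarity,
# and squared L2 distance equals 2 - 2 * cosine similarity.
a_hat, b_hat = l2_normalize(np.stack([a, b]))
ip_unit = inner_product(a_hat, b_hat)
l2_unit_sq = l2_distance(a_hat, b_hat) ** 2
```

On normalized vectors all three metrics therefore produce the same neighbor ranking. On unnormalized vectors the inner product also rewards magnitude (here `inner_product(a, b)` is 24.0), so the choice of metric matters.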
Status
v0.1 — Phase 2 active development
| Component | Status |
|---|---|
| Encoder — pluggable model registry | Working |
| SentenceTransformerEncoder — SBERT, E5, BGE | Working |
| VectorIndex — HNSW and exact search | Working |
| Save / load index | Working |
| PyPI package | Planned |
Installation
pip install sandx-embed
Or from source:
git clone https://github.com/sandxlab/sandx-embed
cd sandx-embed
pip install -e ".[dev]"
Quick Start
from sandx_embed import Encoder, VectorIndex
# Encode records into dense vectors
enc = Encoder(model="sentence-bert") # downloads all-MiniLM-L6-v2 on first use
vectors = enc.encode(["John Smith, Boston", "Jon Smyth, Boston", "Alice Brown, NYC"])
# → np.ndarray shape (3, 384), L2-normalized
# Build an ANN index
idx = VectorIndex(method="hnsw", metric="cosine")
idx.build(vectors, ids=["r0", "r1", "r2"])
# Query nearest neighbors
result = idx.query(vectors[0], k=2)
print(result.ids) # ["r0", "r1"]
print(result.distances) # [0.0, 0.12] (cosine distance)
# Persist and reload
idx.save("/tmp/my_index")
idx2 = VectorIndex.load("/tmp/my_index")
Built-in Models
| Name | HuggingFace model | Dim | Notes |
|---|---|---|---|
| "sentence-bert" | all-MiniLM-L6-v2 | 384 | Fast, English, recommended default |
| "e5-small" | intfloat/e5-small-v2 | 384 | Higher quality, English |
| "bge-m3" | BAAI/bge-m3 | 1024 | Multilingual, large |
Custom Encoders
from sandx_embed.encoder import BaseEncoder, Encoder
import numpy as np
class MyEncoder(BaseEncoder):
    def encode(self, inputs, *, batch_size=64, normalize=True):
        # your model here
        return np.random.rand(len(inputs), 128).astype(np.float32)

    @property
    def dim(self):
        return 128
Encoder.register("my-model", lambda: MyEncoder())
enc = Encoder("my-model")
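The register-then-construct flow above is an instance of the name-to-factory registry pattern. The sketch below shows one way such a registry could work using only the standard library. It is an illustration under assumptions, not the sandx-embed source: `EncoderRegistry`, `create`, and `ConstantEncoder` are hypothetical names.

```python
from typing import Callable, Dict, List


class EncoderRegistry:
    """Hypothetical name -> factory registry (not the sandx-embed implementation)."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}

    def register(self, name: str, factory: Callable[[], object]) -> None:
        # Refuse silent overwrites so a typo cannot shadow an existing model.
        if name in self._factories:
            raise ValueError(f"encoder {name!r} is already registered")
        self._factories[name] = factory

    def create(self, name: str) -> object:
        # The factory is called lazily, so a heavy model loads only when requested.
        try:
            factory = self._factories[name]
        except KeyError:
            known = sorted(self._factories)
            raise KeyError(f"unknown encoder {name!r}; registered: {known}") from None
        return factory()


class ConstantEncoder:
    """Toy encoder that maps every input to a zero vector of length `dim`."""

    dim = 4

    def encode(self, inputs: List[str]) -> List[List[float]]:
        return [[0.0] * self.dim for _ in inputs]


registry = EncoderRegistry()
registry.register("constant", ConstantEncoder)  # a class is itself a zero-argument factory
enc = registry.create("constant")
rows = enc.encode(["a", "b"])
```

Registering a factory rather than an instance defers model construction until the encoder is first requested, which would be consistent with the "downloads on first use" behavior noted in the Quick Start.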
Index Methods
| Method | Backend | When to use |
|---|---|---|
| "hnsw" | usearch | N > 10,000; production; fast queries |
| "exact" | numpy | Small datasets; correctness baseline |
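To make the "exact" baseline concrete, here is a minimal brute-force cosine search in NumPy. It sketches the general technique only and makes no claim about how sandx-embed implements it. The function name `exact_cosine_search` is hypothetical, and the sketch assumes no input vector has zero length.

```python
import numpy as np


def exact_cosine_search(vectors, ids, query, k):
    """Return the k nearest ids and cosine distances, closest first (hypothetical helper)."""
    # Normalize rows and the query so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    distances = 1.0 - unit @ q  # cosine distance to every stored vector
    order = np.argsort(distances, kind="stable")[:k]
    return [ids[i] for i in order], distances[order]


vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]], dtype=np.float32)
ids = ["r0", "r1", "r2"]

# Querying with a stored vector returns that vector first, at distance 0.
top_ids, top_dist = exact_cosine_search(vectors, ids, vectors[0], k=2)
```

Every query scans all N vectors, so cost grows linearly with the dataset. That is acceptable for small collections and useful as a correctness reference against which an approximate index can be measured.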
Design Principles
- Pluggable — any encoder model or index backend can be registered
- Portable — indexes serialize to disk and reload without rebuilding
- Deterministic — same model version + input → same output
- No vendor lock-in — no hard dependency on any hosted vector service
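The "Portable" principle can be illustrated with a save and reload round trip. The sketch below stores vectors and ids in a single `.npz` file with NumPy and checks that they come back unchanged. The on-disk format sandx-embed actually uses is not documented here, so the file layout and the helper names `save_index` and `load_index` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np


def save_index(path, vectors, ids):
    """Write vectors and string ids to one .npz file (hypothetical layout)."""
    np.savez(path, vectors=vectors, ids=np.array(ids))


def load_index(path):
    """Read back the arrays written by save_index."""
    with np.load(path) as data:
        return data["vectors"], data["ids"].tolist()


vectors = np.arange(6, dtype=np.float32).reshape(3, 2)
ids = ["r0", "r1", "r2"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.npz")
    save_index(path, vectors, ids)
    loaded_vectors, loaded_ids = load_index(path)

# A portable index should reload bit-for-bit without rebuilding.
roundtrip_ok = np.array_equal(vectors, loaded_vectors) and loaded_ids == ids
```

A check of this kind can serve as a regression test: if a reloaded index differs from the one that was saved, query results would no longer be reproducible across processes.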
Related
- sandx-er — entity resolution engine (uses sandx-embed for blocking + matching)
- sandx-graph — graph intelligence over resolved entities
- sandx.io — project home
License
Apache 2.0 — see LICENSE
File details
Details for the file sandx_embed-0.1.0.tar.gz.
File metadata
- Download URL: sandx_embed-0.1.0.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b2462a2d62c5692fa028caf57dbdeb020c850ee3399ac6fd56e1dc24d5dc7ead |
| MD5 | 0ac07db38b0bd22a907f6f7bce803c32 |
| BLAKE2b-256 | 20db3f0cd51b3524c9bb964bc04a1d345efd712e178bfbbcf5ea9a93e40a550f |
File details
Details for the file sandx_embed-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sandx_embed-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 81c403a1c2245ce719005ab267717a03464eec4999502b5e83f368bb9fb7fe90 |
| MD5 | 9f6b4cad3e406814be47599a102b8441 |
| BLAKE2b-256 | 48bab966c1f8dcf714099736fdadce68f6b94b94e4555d5b0cee5eddb1489b16 |