Embedding Locality IDentifier - encode embeddings into sortable string IDs for vector search, plus fast string similarity algorithms

ELID - Embedding Locality IDentifier

ELID enables vector search without a vector store by encoding high-dimensional embeddings into sortable string IDs that preserve locality. Similar vectors produce similar IDs, allowing you to use standard database indexes for similarity search.

ELID also includes a complete suite of fast string similarity algorithms.

Features

Embedding Encoding (Vector Search Without Vector Stores)

Convert embeddings from any ML model into compact, sortable identifiers:

| Profile | Output | Best For |
| --- | --- | --- |
| Mini128 | 26-char base32hex | Fast similarity via Hamming distance |
| Morton10x10 | 20-char base32hex | Database range queries (Z-order) |
| Hilbert10x10 | 20-char base32hex | Maximum locality preservation |

Key benefits:

  • Similar vectors produce similar IDs (locality preservation)
  • IDs are lexicographically sortable for database indexing
  • No vector store required - use any database with string indexes
  • Deterministic: same embedding always produces the same ID
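
To illustrate how an embedding can turn into a deterministic, locality-preserving ID, here is a minimal Python sketch of a 128-bit SimHash over random hyperplanes, encoded as 26 base32hex characters. It is not ELID's implementation: the hyperplane generation, the seed handling, and the helper names (`simhash128`, `to_base32hex`, `hamming`) are assumptions made for illustration only.

```python
import random

ALPHABET = "0123456789abcdefghijklmnopqrstuv"  # base32hex, ascending in ASCII


def simhash128(embedding, seed=0):
    """Return a 128-bit integer: one sign bit per random hyperplane."""
    rng = random.Random(seed)  # fixed seed -> same hyperplanes -> same ID
    bits = 0
    for _ in range(128):
        plane = [rng.gauss(0.0, 1.0) for _ in embedding]
        dot = sum(p * x for p, x in zip(plane, embedding))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def to_base32hex(value, n_bits=128):
    """Encode an n_bits-wide integer as a fixed-length base32hex string."""
    n_chars = -(-n_bits // 5)        # ceil(n_bits / 5): 26 chars for 128 bits
    value <<= n_chars * 5 - n_bits   # right-pad with zero bits
    return "".join(
        ALPHABET[(value >> (5 * i)) & 31] for i in range(n_chars - 1, -1, -1)
    )


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


base = [0.12, -0.53, 0.98, 0.04, -0.77, 0.31, 0.45, -0.09]
near = [x + 0.001 for x in base]  # small perturbation of the same vector
far = [-x for x in base]          # vector pointing the opposite way

h_base, h_near, h_far = (simhash128(v) for v in (base, near, far))
print(to_base32hex(h_base))
print(hamming(h_base, h_near), hamming(h_base, h_far))
```

Because each bit records which side of a hyperplane the vector falls on, vectors separated by a small angle disagree on few bits, while opposite vectors disagree on all of them.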

String Similarity Algorithms

| Algorithm | Type | Best For |
| --- | --- | --- |
| Levenshtein | Edit distance | General-purpose comparison, spell checking |
| Normalized Levenshtein | Similarity (0-1) | When you need a percentage match |
| Jaro | Similarity (0-1) | Short strings |
| Jaro-Winkler | Similarity (0-1) | Names and record linkage |
| Hamming | Distance | Fixed-length strings, DNA, error codes |
| OSA | Edit distance | Typo detection (counts transpositions) |
| SimHash | LSH fingerprint | Database-queryable similarity, near-duplicate detection |
| Best Match | Composite (0-1) | When unsure which algorithm fits |
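
As a reference for why Jaro-Winkler suits names, the following Python sketch implements the textbook definition with the conventional Winkler parameters (prefix scale 0.1, prefix length capped at 4). It is an independent illustration, not the code ELID ships, and the function names are chosen here for the example.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro score for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro_winkler("Martha", "Marhta"), 3))  # 0.961
```

The prefix boost rewards agreement at the start of the string, which is where names tend to be most reliable.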

Installation

Rust

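# Use one of the following [dependencies] entries, not all three.

# String similarity only (zero dependencies).
# If your version enables `embeddings` by default (see Feature Flags),
# add `default-features = false` to keep the build dependency-free.
[dependencies]
elid = "0.1"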

# Embedding encoding
[dependencies]
elid = { version = "0.1", features = ["embeddings"] }

# Both features
[dependencies]
elid = { version = "0.1", features = ["strings", "embeddings"] }

Python

pip install elid

JavaScript (WASM)

npm install elid-wasm

C/C++

Build with cargo build --release --features ffi to get libelid.so and elid.h.

Quick Start

Embedding Encoding (Rust)

use elid::embeddings::{encode, Profile, Elid};

// Get an embedding from your ML model (e.g., OpenAI, Cohere, sentence-transformers)
let embedding: Vec<f32> = model.embed("Hello, world!")?;

// Encode to a sortable ELID
let profile = Profile::default(); // Mini128
let elid: Elid = encode(&embedding, &profile)?;

println!("ELID: {}", elid); // prints a 26-character base32hex string for Mini128

// Similar texts produce similar ELIDs
let elid2 = encode(&model.embed("Hello, universe!")?, &profile)?;

// Compare similarity via Hamming distance
use elid::embeddings::hamming_distance;
let distance = hamming_distance(&elid, &elid2)?; // Lower = more similar

Encoding Profiles

use elid::embeddings::Profile;

// Mini128: 128-bit SimHash (default)
// Best for: Fast similarity search via Hamming distance
let mini = Profile::Mini128 {
    seed: 0x454c4944_53494d48, // Deterministic seed
};

// Morton10x10: Z-order curve encoding
// Best for: Database range queries
let morton = Profile::Morton10x10 {
    dims: 10,
    bits_per_dim: 10,
    transform_id: None,
};

// Hilbert10x10: Hilbert curve encoding
// Best for: Maximum locality preservation
let hilbert = Profile::Hilbert10x10 {
    dims: 10,
    bits_per_dim: 10,
    transform_id: None,
};
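
For intuition about the Morton10x10 settings (10 dimensions, 10 bits each, 100 bits total, 20 base32hex characters), here is a hedged Python sketch of Z-order encoding. It assumes the input has already been reduced to 10 values in [-1, 1]. How ELID actually reduces and quantizes embeddings is not described here, so `quantize`, `morton_encode`, and `morton_id` are hypothetical helpers.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuv"  # base32hex


def quantize(x, bits=10, lo=-1.0, hi=1.0):
    """Map a float in [lo, hi] to an integer in [0, 2**bits - 1], clamping."""
    levels = (1 << bits) - 1
    clamped = min(max(x, lo), hi)
    return round((clamped - lo) / (hi - lo) * levels)


def morton_encode(coords, bits=10):
    """Interleave coordinate bits, most significant bit of every dim first."""
    code = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            code = (code << 1) | ((c >> bit) & 1)
    return code


def to_base32hex(value, n_bits):
    """Encode an n_bits-wide integer as a fixed-length base32hex string."""
    n_chars = -(-n_bits // 5)
    value <<= n_chars * 5 - n_bits
    return "".join(
        ALPHABET[(value >> (5 * i)) & 31] for i in range(n_chars - 1, -1, -1)
    )


def morton_id(vector, bits=10):
    coords = [quantize(x, bits) for x in vector]
    return to_base32hex(morton_encode(coords, bits), len(coords) * bits)


a_id = morton_id([0.2] * 10)
b_id = morton_id([0.2] * 9 + [0.21])  # differs slightly in the last dimension
print(a_id)
print(b_id)
```

Because the high-order bits of every dimension come first, vectors that agree on those bits share a string prefix, which is what makes prefix and range queries meaningful. Points that straddle a quantization boundary can still land far apart in the ordering, a known limitation of Z-order curves.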

String Similarity (Rust)

use elid::*;

// Edit distance
let distance = levenshtein("kitten", "sitting"); // 3

// Normalized similarity (0.0 to 1.0)
let similarity = normalized_levenshtein("hello", "hallo"); // 0.8

// Name matching
let similarity = jaro_winkler("Martha", "Marhta"); // 0.961

// SimHash for database queries
let hash = simhash("iPhone 14");
let sim = simhash_similarity("iPhone 14", "iPhone 15"); // ~0.92

// Find best match in a list
let candidates = vec!["apple", "application", "apply"];
let (idx, score) = find_best_match("app", &candidates);

Python

import elid

# String similarity
elid.levenshtein("kitten", "sitting")  # 3
elid.jaro_winkler("martha", "marhta")  # 0.961
elid.simhash_similarity("iPhone 14", "iPhone 15")  # 0.922

# Embedding encoding (with embeddings feature)
embedding = model.embed("Hello, world!")
elid_str = elid.encode_embedding(embedding)

JavaScript

import init, { levenshtein, jaroWinkler, simhashSimilarity } from 'elid-wasm';

await init();
levenshtein("kitten", "sitting");  // 3
jaroWinkler("martha", "marhta");   // 0.961
simhashSimilarity("iPhone 14", "iPhone 15");  // 0.922

Configuration

Use SimilarityOpts for case-insensitive or whitespace-trimmed comparisons:

use elid::{levenshtein_with_opts, SimilarityOpts};

let opts = SimilarityOpts {
    case_sensitive: false,
    trim_whitespace: true,
    ..Default::default()
};
let distance = levenshtein_with_opts("  HELLO  ", "hello", &opts); // 0

Feature Flags

| Feature | Description | Dependencies |
| --- | --- | --- |
| strings | String similarity algorithms (default) | None |
| embeddings | Embedding encoding (default) | rand, blake3, etc. |
| models | Base ONNX model support | tract-onnx |
| models-text | Text embedding (Model2Vec, 256-dim) | models |
| models-image | Image embedding (MobileNetV3, 1024-dim) | models, image |
| wasm | WebAssembly bindings (includes embeddings) | wasm-bindgen, js-sys, getrandom |
| python | Python bindings via PyO3 (includes embeddings) | pyo3, numpy, rayon |
| ffi | C FFI bindings | None (enables unsafe) |

Performance

  • Zero external dependencies for string-only use
  • O(min(m,n)) space-optimized Levenshtein
  • 1.4M+ string comparisons per second (Python benchmarks)
  • ~96KB WASM binary (strings only)
  • Embedding encoding: <1ms per vector
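
The O(min(m,n)) space bound comes from keeping only one previous row of the edit-distance table and iterating the inner loop over the shorter string. The Python sketch below shows that standard technique. It is a reference illustration, not ELID's Rust code.

```python
def levenshtein(a, b):
    """Edit distance using rows of length min(len(a), len(b)) + 1."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so rows stay small
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Similarity in [0, 1]: 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```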

Built-in Embedding Models

ELID includes optional ONNX models for generating embeddings directly, without external API calls. Models are bundled via separate packages:

| Package | Model | Dimensions | Size |
| --- | --- | --- | --- |
| elid-text | Model2Vec potion-base-8M | 256 | ~8MB |
| elid-image | MobileNetV3-Small | 1024 | ~5MB |

Text embeddings:

use elid::models::embed_text;

let embedding = embed_text("Hello, world!")?;
assert_eq!(embedding.len(), 256);

Image embeddings:

use elid::models::embed_image;

let bytes = std::fs::read("photo.jpg")?;
let embedding = embed_image(&bytes)?;
assert_eq!(embedding.len(), 1024);

LSH Bands for Database Querying

Convert embeddings to LSH bands for efficient database similarity search:

import { embeddingToBands } from 'elid-wasm';

// Split embedding into 4 bands (32 bits each)
const bands = embeddingToBands(embedding, 4);

// Store bands in database columns
// Query with OR across bands for approximate nearest neighbors:
// SELECT * FROM embeddings WHERE band0 = ? OR band1 = ? OR band2 = ? OR band3 = ?

Rust:

use elid::embeddings::embedding_to_bands;

let bands = embedding_to_bands(&embedding, 4, 0x454c4944_53494d48);
// bands: Vec<String> with 4 base32hex-encoded band strings
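
The banding idea can be sketched in a few lines of Python. The example assumes a 128-bit hash is already available as an integer and splits it into four 32-bit bands, each encoded as a short base32hex string. Two items become candidates when at least one band matches exactly. The helper names `to_bands` and `shares_band` are hypothetical, and ELID's own band layout may differ.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuv"  # base32hex


def encode_band(value, n_bits):
    """Encode an n_bits-wide integer as a fixed-length base32hex string."""
    n_chars = -(-n_bits // 5)
    value <<= n_chars * 5 - n_bits
    return "".join(
        ALPHABET[(value >> (5 * i)) & 31] for i in range(n_chars - 1, -1, -1)
    )


def to_bands(hash128, n_bands=4):
    """Split a 128-bit hash into n_bands equal slices, highest bits first."""
    band_bits = 128 // n_bands
    mask = (1 << band_bits) - 1
    return [
        encode_band((hash128 >> (i * band_bits)) & mask, band_bits)
        for i in range(n_bands - 1, -1, -1)
    ]


def shares_band(bands_a, bands_b):
    """True if any band at the same position is identical."""
    return any(x == y for x, y in zip(bands_a, bands_b))


h1 = 0x0123456789ABCDEF_FEDCBA9876543210
h2 = h1 ^ 1                  # one flipped bit: only the last band changes
h3 = h1 ^ ((1 << 128) - 1)   # every bit flipped: no band survives

print(to_bands(h1))
print(shares_band(to_bands(h1), to_bands(h2)))  # True
print(shares_band(to_bands(h1), to_bands(h3)))  # False
```

Exact-match lookups on each band column can use ordinary indexes. The trade-off is that candidates still need a final Hamming or distance check, since a single shared band does not guarantee overall similarity.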

Use Cases

Vector Search Without Vector Stores

Store ELIDs directly in PostgreSQL, SQLite, or any database:

-- Create index on ELID column
CREATE INDEX idx_documents_elid ON documents(elid);

-- Find similar documents using string prefix matching
SELECT * FROM documents
WHERE elid LIKE 'abc%'  -- Prefix match for locality
ORDER BY elid;
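
The same prefix lookup can be tried end to end with Python's built-in `sqlite3` module. This sketch uses made-up ELID values and writes the prefix match as a range comparison, which is one way to make sure a plain index on the column applies regardless of how the engine treats LIKE. The `find_by_prefix` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, elid TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_documents_elid ON documents(elid)")
conn.executemany(
    "INSERT INTO documents (id, elid) VALUES (?, ?)",
    [(1, "abc01"), (2, "abcvv"), (3, "abd00"), (4, "ab000"), (5, "0v3k1")],
)


def find_by_prefix(connection, prefix):
    """Return rows whose ELID starts with prefix, in sorted order."""
    # 'w' sorts after every base32hex character (0-9, a-v), so
    # prefix + 'w' is an exclusive upper bound for the prefix range.
    upper = prefix + "w"
    cursor = connection.execute(
        "SELECT id, elid FROM documents "
        "WHERE elid >= ? AND elid < ? ORDER BY elid",
        (prefix, upper),
    )
    return cursor.fetchall()


print(find_by_prefix(conn, "abc"))  # [(1, 'abc01'), (2, 'abcvv')]
```

A shorter prefix widens the search to a larger neighborhood, and a longer one narrows it.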

Deduplication

Use SimHash to find near-duplicate content:

let hash1 = simhash("The quick brown fox");
let hash2 = simhash("The quick brown dog");
let similarity = simhash_similarity_from_hashes(hash1, hash2);
if similarity > 0.9 {
    println!("Likely duplicates!");
}

Fuzzy Search

Find matches with typo tolerance:

let candidates = vec!["apple", "application", "apply", "banana"];
let matches = find_matches_above_threshold("aple", &candidates, 0.7);
// Returns: [("apple", 0.8), ...]

Building

git clone https://github.com/ZachHandley/ELID.git
cd ELID

cargo build --release
cargo test
cargo bench
cargo run --example basic_usage

License

Dual-licensed under MIT or Apache-2.0 at your option.
