Embedding Locality IDentifier - encode embeddings into sortable string IDs for vector search, plus fast string similarity algorithms
ELID - Embedding Locality IDentifier
ELID enables vector search without a vector store by encoding high-dimensional embeddings into sortable string IDs that preserve locality. Similar vectors produce similar IDs, allowing you to use standard database indexes for similarity search.
ELID also includes a complete suite of fast string similarity algorithms.
Features
Embedding Encoding (Vector Search Without Vector Stores)
Convert embeddings from any ML model into compact, sortable identifiers:
| Profile | Output | Best For |
|---|---|---|
| Mini128 | 26-char base32hex | Fast similarity via Hamming distance |
| Morton10x10 | 20-char base32hex | Database range queries (Z-order) |
| Hilbert10x10 | 20-char base32hex | Maximum locality preservation |
Key benefits:
- Similar vectors produce similar IDs (locality preservation)
- IDs are lexicographically sortable for database indexing
- No vector store required - use any database with string indexes
- Deterministic: same embedding always produces the same ID
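To make the locality and determinism claims concrete, here is a minimal Python sketch of a SimHash-style encoder in the spirit of the Mini128 profile. It is an illustration only, not ELID's actual implementation: the hyperplane generation, the seed value, and the helper names (`simhash_bits`, `to_base32hex`, `hamming`) are all assumptions made for this example.

```python
import random

BASE32HEX = "0123456789abcdefghijklmnopqrstuv"  # RFC 4648 base32hex, lowercase

def simhash_bits(embedding, n_bits=128, seed=0x454C4944):
    """One bit per pseudo-random hyperplane: the sign of the dot product."""
    rng = random.Random(seed)  # fixed seed, so the hyperplanes never change
    bits = 0
    for _ in range(n_bits):
        plane = [rng.gauss(0.0, 1.0) for _ in embedding]
        dot = sum(p * x for p, x in zip(plane, embedding))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits

def to_base32hex(value, n_bits=128):
    """Encode an n_bits-wide integer as a fixed-width base32hex string."""
    n_chars = -(-n_bits // 5)        # ceil(128 / 5) = 26 characters
    value <<= n_chars * 5 - n_bits   # left-align so strings sort like the bits
    return "".join(BASE32HEX[(value >> (5 * i)) & 31]
                   for i in reversed(range(n_chars)))

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Toy 8-dimensional vectors stand in for real model embeddings.
base = [0.12, -0.40, 0.33, 0.91, -0.27, 0.05, 0.64, -0.18]
near = [x + 0.01 for x in base]   # small perturbation of base
far = [-x for x in base]          # points in the opposite direction

h_base, h_near, h_far = (simhash_bits(v) for v in (base, near, far))
print(to_base32hex(h_base))
print(hamming(h_base, h_near), hamming(h_base, h_far))
```

The perturbed vector lands within a few bits of the original, while the opposite vector flips nearly every bit. Because the seed is fixed, re-encoding the same vector always yields the same string.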
String Similarity Algorithms
| Algorithm | Type | Best For |
|---|---|---|
| Levenshtein | Edit distance | General-purpose comparison, spell checking |
| Normalized Levenshtein | Similarity (0-1) | When you need a percentage match |
| Jaro | Similarity (0-1) | Short strings |
| Jaro-Winkler | Similarity (0-1) | Names and record linkage |
| Hamming | Distance | Fixed-length strings, DNA, error codes |
| OSA | Edit distance | Typo detection (counts transpositions) |
| SimHash | LSH fingerprint | Database-queryable similarity, near-duplicate detection |
| Best Match | Composite (0-1) | When unsure which algorithm fits |
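As a rough illustration of why Jaro-Winkler suits names, the sketch below implements the textbook definition in plain Python. It is not ELID's code; the function names and the defaults (`prefix_scale=0.1`, `max_prefix=4`) follow the standard formulation and are assumptions as far as this library is concerned.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they sit within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    a = [c for c, m in zip(s1, matched1) if m]
    b = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(x != y for x, y in zip(a, b)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro score for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:max_prefix], s2[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)

print(round(jaro_winkler("Martha", "Marhta"), 3))  # 0.961
```

The prefix bonus is what separates it from plain Jaro: names that start the same way but contain a swapped pair of letters still score close to 1.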
Installation
Rust
# String similarity only (zero dependencies)
[dependencies]
elid = "0.1"
# Embedding encoding
[dependencies]
elid = { version = "0.1", features = ["embeddings"] }
# Both features
[dependencies]
elid = { version = "0.1", features = ["strings", "embeddings"] }
Python
pip install elid
JavaScript (WASM)
npm install elid-wasm
C/C++
Build with cargo build --release --features ffi to get libelid.so and elid.h.
Quick Start
Embedding Encoding (Rust)
use elid::embeddings::{encode, Profile, Elid};
// Get an embedding from your ML model (e.g., OpenAI, Cohere, sentence-transformers)
let embedding: Vec<f32> = model.embed("Hello, world!")?;
// Encode to a sortable ELID
let profile = Profile::default(); // Mini128
let elid: Elid = encode(&embedding, &profile)?;
println!("ELID: {}", elid); // a 26-character base32hex string under the default Mini128 profile
// Similar texts produce similar ELIDs
let elid2 = encode(&model.embed("Hello, universe!")?, &profile)?;
// Compare similarity via Hamming distance
use elid::embeddings::hamming_distance;
let distance = hamming_distance(&elid, &elid2)?; // Lower = more similar
Encoding Profiles
use elid::embeddings::Profile;
// Mini128: 128-bit SimHash (default)
// Best for: Fast similarity search via Hamming distance
let mini = Profile::Mini128 {
seed: 0x454c4944_53494d48, // Deterministic seed
};
// Morton10x10: Z-order curve encoding
// Best for: Database range queries
let morton = Profile::Morton10x10 {
dims: 10,
bits_per_dim: 10,
transform_id: None,
};
// Hilbert10x10: Hilbert curve encoding
// Best for: Maximum locality preservation
let hilbert = Profile::Hilbert10x10 {
dims: 10,
bits_per_dim: 10,
transform_id: None,
};
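The idea behind a 10x10 Z-order profile can be sketched in a few lines of Python: quantize 10 coordinates to 10 bits each, interleave the bits, and encode the resulting 100 bits as 20 base32hex characters. This is a sketch under stated assumptions, not ELID's implementation: the input range of [-1, 1], the helper names, and the absence of any dimensionality-reduction step are all choices made for this example.

```python
BASE32HEX = "0123456789abcdefghijklmnopqrstuv"

def quantize(x, bits=10, lo=-1.0, hi=1.0):
    """Map x from [lo, hi] to an integer in [0, 2**bits - 1], clamping outliers."""
    levels = (1 << bits) - 1
    clamped = min(max(x, lo), hi)
    return round((clamped - lo) / (hi - lo) * levels)

def interleave(coords, bits=10):
    """Morton order: the top bit of every dimension, then the next bit, and so on."""
    code = 0
    for b in reversed(range(bits)):
        for c in coords:
            code = (code << 1) | ((c >> b) & 1)
    return code

def morton_id(vector, bits=10):
    """Quantize, interleave, and encode as fixed-width base32hex."""
    coords = [quantize(x, bits) for x in vector]
    code = interleave(coords, bits)
    total_bits = bits * len(coords)      # 10 dims * 10 bits = 100 bits
    n_chars = -(-total_bits // 5)        # 100 / 5 = 20 characters
    code <<= n_chars * 5 - total_bits    # no padding needed for 100 bits
    return "".join(BASE32HEX[(code >> (5 * i)) & 31]
                   for i in reversed(range(n_chars)))

a = [0.5] * 10
b = [0.5] * 9 + [0.4976]  # differs from a by one quantization step in one dim
print(morton_id(a))
print(morton_id(b))
```

Here the two IDs share their first 19 characters, which is what makes prefix and range queries useful. Z-order curves do have discontinuities, though: two points on opposite sides of a high-order cell boundary can be close in space yet far apart in sort order, which is the weakness a Hilbert curve is designed to reduce.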
String Similarity (Rust)
use elid::*;
// Edit distance
let distance = levenshtein("kitten", "sitting"); // 3
// Normalized similarity (0.0 to 1.0)
let similarity = normalized_levenshtein("hello", "hallo"); // 0.8
// Name matching
let similarity = jaro_winkler("Martha", "Marhta"); // 0.961
// SimHash for database queries
let hash = simhash("iPhone 14");
let sim = simhash_similarity("iPhone 14", "iPhone 15"); // ~0.92
// Find best match in a list
let candidates = vec!["apple", "application", "apply"];
let (idx, score) = find_best_match("app", &candidates);
Python
import elid
# String similarity
elid.levenshtein("kitten", "sitting") # 3
elid.jaro_winkler("martha", "marhta") # 0.961
elid.simhash_similarity("iPhone 14", "iPhone 15") # 0.922
# Embedding encoding (with embeddings feature)
embedding = model.embed("Hello, world!")
elid_str = elid.encode_embedding(embedding)
JavaScript
import init, { levenshtein, jaroWinkler, simhashSimilarity } from 'elid-wasm';
await init();
levenshtein("kitten", "sitting"); // 3
jaroWinkler("martha", "marhta"); // 0.961
simhashSimilarity("iPhone 14", "iPhone 15"); // 0.922
Configuration
Use SimilarityOpts for case-insensitive or whitespace-trimmed comparisons:
use elid::{levenshtein_with_opts, SimilarityOpts};
let opts = SimilarityOpts {
case_sensitive: false,
trim_whitespace: true,
..Default::default()
};
let distance = levenshtein_with_opts(" HELLO ", "hello", &opts); // 0
Feature Flags
| Feature | Description | Dependencies |
|---|---|---|
| strings | String similarity algorithms (default) | None |
| embeddings | Embedding encoding (default) | rand, blake3, etc. |
| wasm | WebAssembly bindings (includes embeddings) | wasm-bindgen, js-sys, getrandom |
| python | Python bindings via PyO3 (includes embeddings) | pyo3, numpy, rayon |
| ffi | C FFI bindings | None (enables unsafe) |
Performance
- Zero external dependencies for string-only use
- Space-optimized Levenshtein that needs only O(min(m,n)) memory
- 1.4M+ string comparisons per second (Python benchmarks)
- ~96KB WASM binary (strings only)
- Embedding encoding: <1ms per vector
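The O(min(m,n)) space bound for Levenshtein comes from keeping a single previous row of the dynamic-programming table, sized to the shorter string. The Python sketch below shows that standard technique; it is illustrative and makes no claim about how ELID's Rust code is actually written.

```python
def levenshtein(a, b):
    """Edit distance using one previous row instead of the full table."""
    if len(a) < len(b):
        a, b = b, a  # the row is sized by the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalized_levenshtein(a, b):
    """Similarity in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))          # 3
print(normalized_levenshtein("hello", "hallo"))  # 0.8
```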
Use Cases
Vector Search Without Vector Stores
Store ELIDs directly in PostgreSQL, SQLite, or any database:
-- Create index on ELID column
CREATE INDEX idx_documents_elid ON documents(elid);
-- Find similar documents using string prefix matching
SELECT * FROM documents
WHERE elid LIKE 'abc%' -- Prefix match for locality
ORDER BY elid;
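The same prefix query can be tried end to end with Python's built-in sqlite3 module. The table layout mirrors the SQL above; the titles and the short 5-character IDs are made-up placeholders rather than real ELIDs, and `neighbours` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, elid TEXT)"
)
conn.execute("CREATE INDEX idx_documents_elid ON documents(elid)")

# Placeholder rows: the IDs are shortened, invented strings for illustration.
rows = [
    ("intro", "abc01"),
    ("overview", "abc9v"),
    ("appendix", "abd00"),
    ("changelog", "q7k2m"),
]
conn.executemany("INSERT INTO documents (title, elid) VALUES (?, ?)", rows)

def neighbours(conn, prefix):
    """Return rows whose ID starts with the given prefix, in ID order."""
    cur = conn.execute(
        "SELECT title, elid FROM documents WHERE elid LIKE ? ORDER BY elid",
        (prefix + "%",),
    )
    return cur.fetchall()

print(neighbours(conn, "abc"))
```

A shorter prefix widens the search and a longer one narrows it, so the prefix length acts as a coarse similarity threshold.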
Deduplication
Use SimHash to find near-duplicate content:
let hash1 = simhash("The quick brown fox");
let hash2 = simhash("The quick brown dog");
let similarity = simhash_similarity_from_hashes(hash1, hash2);
if similarity > 0.9 {
println!("Likely duplicates!");
}
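For readers curious what a text SimHash does under the hood, here is a minimal Python sketch of the general technique: hash each feature, let every hash vote on each bit, and compare fingerprints by the fraction of matching bits. The feature choice (lowercased character trigrams), the hash function (BLAKE2b truncated to 64 bits), and the function names are assumptions for this example, so the scores will not match ELID's.

```python
import hashlib

BITS = 64

def features(text, n=3):
    """Lowercased character n-grams; short inputs become a single feature."""
    text = text.lower()
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def simhash(text):
    """64-bit fingerprint: each feature hash votes +1 or -1 on every bit."""
    votes = [0] * BITS
    for feature in features(text):
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def similarity_from_hashes(h1, h2):
    """Fraction of bits on which the two fingerprints agree."""
    return 1.0 - bin(h1 ^ h2).count("1") / BITS

hash1 = simhash("The quick brown fox")
hash2 = simhash("The quick brown dog")
print(similarity_from_hashes(hash1, hash2))
```

Because the fingerprint is a plain integer, it can be stored in an ordinary database column and compared later without access to the original text.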
Fuzzy Search
Find matches with typo tolerance:
let candidates = vec!["apple", "application", "apply", "banana"];
let matches = find_matches_above_threshold("aple", &candidates, 0.7);
// Returns: [("apple", 0.8), ...]
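Threshold-based fuzzy matching boils down to scoring every candidate and keeping those at or above the cutoff. The Python sketch below assumes normalized Levenshtein as the score, which may differ from the metric ELID uses internally; `matches_above` is a hypothetical stand-in, not the library function.

```python
from functools import lru_cache

def levenshtein(a, b):
    """Memoized recursive edit distance; fine for short strings only."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0 or j == 0:
            return i + j  # one string is exhausted: insert or delete the rest
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))

def matches_above(query, candidates, threshold):
    """Return (candidate, score) pairs scoring >= threshold, best first."""
    scored = []
    for candidate in candidates:
        longest = max(len(query), len(candidate))
        score = 1.0 if longest == 0 else 1.0 - levenshtein(query, candidate) / longest
        if score >= threshold:
            scored.append((candidate, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

candidates = ["apple", "application", "apply", "banana"]
matches = matches_above("aple", candidates, 0.7)
print(matches)
```

Under this particular metric only "apple" clears 0.7, at a score of 0.8; "apply" scores 0.6 and falls just short.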
Building
git clone https://github.com/ZachHandley/ELID.git
cd ELID
cargo build --release
cargo test
cargo bench
cargo run --example basic_usage
License
Dual-licensed under MIT or Apache-2.0 at your option.