Main project hub for the uubed high-performance embedding encoding library
uubed - High-Performance Position-Safe Embeddings
uubed (pronounced "you-you-bed") is a high-performance library for encoding embedding vectors into position-safe strings that solve the "substring pollution" problem in search systems.
🏗️ Project Structure
This is the main repository for the uubed project. The implementation is split across multiple repositories:
- uubed (this repo) - Project coordination and documentation
- uubed-rs - High-performance Rust implementation
- uubed-py - Python bindings and API
- uubed-docs - Comprehensive documentation and book
🚀 Key Features
- Position-Safe Encoding: QuadB64 family prevents false substring matches
- Blazing Fast: 40-105x faster than pure Python with Rust acceleration
- Multiple Encoding Methods: Full precision, SimHash, Top-k, Z-order
- Search Engine Friendly: No more substring pollution in Elasticsearch/Solr
- Easy Integration: Simple API, works with any vector database
📊 Performance
With native Rust acceleration:
- Eq64 encoding: 40-105x speedup (>230 MB/s throughput)
- Shq64 (SimHash): 1.7-9.7x faster with parallel processing
- Zoq64 (Z-order): 60-1600x faster with efficient bit manipulation
- T8q64 (Top-k): Optimized sparse vector handling
Benchmark Results
Performance comparison
Embedding Size: 1024 bytes (256 dimensions × 4 bytes)
Hardware: Apple M1 Pro / Intel i7-9750H
=============================================================
Method  Implementation  Time (μs)  Throughput (MB/s)  Speedup
-------------------------------------------------------------
Eq64    Pure Python        464.82               2.20     1.0x
Eq64    Native Rust          4.37             234.42   105.4x
Shq64   Pure Python       1431.33               0.72     1.0x
Shq64   Native Rust        139.79               7.33    10.2x
Zoq64   Pure Python         73.59              13.91     1.0x
Zoq64   Native Rust          0.63            1631.92   116.8x
T8q64   Pure Python        892.45               1.15     1.0x
T8q64   Native Rust         42.18              24.31    21.2x
Installation
Install the latest release from PyPI:
pip install uubed
Or, to install the latest development version from this repository:
pip install git+https://github.com/twardoch/uubed.git
Development
To set up a development environment, you will need Python 3.10+ and Rust.
- Clone the repository:
git clone https://github.com/twardoch/uubed.git
cd uubed
- Create a virtual environment:
python3 -m venv .venv
source .venv/bin/activate
- Install the package in editable mode:
maturin develop
- Run the tests:
pytest
🎯 Quick Start
import numpy as np
from uubed import encode, decode
# Create an embedding
embedding = np.random.randint(0, 256, 256, dtype=np.uint8)
# Full precision encoding
full_code = encode(embedding, method="eq64")
print(f"Full: {full_code[:50]}...") # AQgxASgz...
# Compact similarity hash
compact_code = encode(embedding, method="shq64")
print(f"Compact: {compact_code}") # 16 chars preserving similarity
# Decode back to original
decoded = decode(full_code)
assert np.array_equal(embedding, np.frombuffer(decoded, dtype=np.uint8))
🧩 Encoding Methods
Eq64 - Full Embeddings
- Use case: Need exact values
- Size: 2 chars per byte
- Features: Lossless, supports decode
Shq64 - SimHash
- Use case: Fast similarity search
- Size: 16 characters (64-bit hash)
- Features: Preserves cosine similarity
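A minimal sketch of the general SimHash idea, not the uubed implementation: project the vector onto 64 random hyperplanes and keep one sign bit per plane, so vectors at a small angle disagree on few bits. The names `simhash64_sketch` and `hamming`, the seed handling, and the Gaussian planes are all assumptions made for illustration.

```python
import numpy as np

def simhash64_sketch(vector: np.ndarray, seed: int = 0) -> int:
    """Return a 64-bit hash: one sign bit per random hyperplane."""
    # A fixed seed means every vector is projected onto the same planes.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((64, vector.shape[0]))
    bits = (planes @ vector.astype(np.float64)) > 0
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

rng = np.random.default_rng(42)
base = rng.standard_normal(256)
near = base + 0.05 * rng.standard_normal(256)   # small perturbation
far = rng.standard_normal(256)                  # unrelated vector

print(hamming(simhash64_sketch(base), simhash64_sketch(near)))
print(hamming(simhash64_sketch(base), simhash64_sketch(far)))
```

The Hamming distance between two hashes grows with the angle between the vectors, which is why a 64-bit hash (16 characters at 4 bits each) can stand in for cosine similarity in a coarse first-pass search.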
T8q64 - Top-k Indices
- Use case: Sparse representations
- Size: 16 characters (8 indices)
- Features: Captures most important dimensions
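As a rough illustration of the top-k idea, the sketch below picks the 8 dimensions with the largest values and stores each index in one byte, which would give 16 characters at 2 characters per byte. Whether uubed ranks by value or by magnitude, and how it orders the indices, is not stated here, so `top_k_indices_sketch` and its choices are assumptions.

```python
import numpy as np

def top_k_indices_sketch(vector: np.ndarray, k: int = 8) -> bytes:
    """Return the indices of the k largest entries, sorted ascending, one byte each."""
    if vector.shape[0] > 256:
        raise ValueError("one byte per index only covers 256 dimensions")
    # Stable sort on negated values: largest first, ties go to the lower index.
    order = np.argsort(-vector.astype(np.float64), kind="stable")
    top = np.sort(order[:k])
    return bytes(int(i) for i in top)

v = np.zeros(256)
v[[3, 200, 17, 42, 99, 128, 255, 64]] = [9, 8, 7, 6, 5, 4, 3, 2]
print(list(top_k_indices_sketch(v)))  # [3, 17, 42, 64, 99, 128, 200, 255]
```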
Zoq64 - Z-order
- Use case: Spatial/prefix search
- Size: 8 characters
- Features: Nearby points share prefixes
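The prefix property comes from bit interleaving (a Morton code). The sketch below is a generic version under assumed parameters: `morton_interleave_sketch` is a hypothetical helper, and the choice of 4 coordinates with 8 bits each (32 bits, or 8 characters at 4 bits per character) is an assumption, not a description of how Zoq64 quantizes an embedding.

```python
def morton_interleave_sketch(coords: list[int], bits: int = 8) -> int:
    """Interleave the low `bits` bits of each coordinate, most significant bit first."""
    code = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            code = (code << 1) | ((c >> bit) & 1)
    return code

print(morton_interleave_sketch([3, 5], bits=3))  # 27 (binary 011011)

# Two points that differ only in the lowest bit of one coordinate.
a = morton_interleave_sketch([200, 100, 50, 25])
b = morton_interleave_sketch([201, 100, 50, 25])
print(format(a, "032b"))
print(format(b, "032b"))
```

Because the most significant bits of every coordinate come first, the two codes printed at the end agree on their first 28 bits and differ only in the final group of 4, so a prefix query retrieves both.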
💡 Why QuadB64?
The Problem
Regular Base64 encoding in search engines causes substring pollution:
# Regular Base64
encode("Hello") → "SGVsbG8="
search("Vsb") → Matches! (false positive)
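The effect is easy to reproduce with the standard library. In this small sketch the three "unrelated" bytes are picked by hand so that their Base64 text starts with the same three characters; they are illustrative only.

```python
import base64

# Base64 text of the example string above.
hello_code = base64.b64encode(b"Hello").decode("ascii")

# Three unrelated bytes, chosen so their Base64 text begins with "Vsb".
other_code = base64.b64encode(bytes([0x56, 0xC6, 0xC0])).decode("ascii")

query = "Vsb"
print(hello_code, query in hello_code)  # SGVsbG8= True
print(other_code, query in other_code)  # VsbA True
```

Both documents match the same substring query even though the underlying bytes have nothing in common, because standard Base64 uses one alphabet at every character position.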
The Solution
QuadB64 uses position-dependent alphabets:
# QuadB64
encode("Hello") → "EYm1.Gcm8.Gf" # Note the dot separator
search("Ym1") → No match (different positions use different alphabets)
Position-safe alphabets:
- Position 0,4,8...: ABCDEFGHIJKLMNOP
- Position 1,5,9...: QRSTUVWXYZabcdef
- Position 2,6,10...: ghijklmnopqrstuv
- Position 3,7,11...: wxyz0123456789-_
The dot separator every 4 characters ensures position alignment and prevents arbitrary substring matches.
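The scheme can be sketched in a few lines of Python. This is an illustration built only from the four alphabets listed above, not the uubed encoder itself: the nibble order (high nibble first), the dot placement, and the names `q64_sketch_encode` and `q64_sketch_decode` are assumptions.

```python
# Character position i draws from ALPHABETS[i % 4].
ALPHABETS = [
    "ABCDEFGHIJKLMNOP",
    "QRSTUVWXYZabcdef",
    "ghijklmnopqrstuv",
    "wxyz0123456789-_",
]

def q64_sketch_encode(data: bytes) -> str:
    """Encode each byte as two characters: high nibble, then low nibble."""
    chars = []
    for i, byte in enumerate(data):
        for j, nibble in enumerate((byte >> 4, byte & 0x0F)):
            position = 2 * i + j
            chars.append(ALPHABETS[position % 4][nibble])
    # Assumption: a dot between every group of 4 characters, as described above.
    groups = ["".join(chars[k:k + 4]) for k in range(0, len(chars), 4)]
    return ".".join(groups)

def q64_sketch_decode(code: str) -> bytes:
    """Invert q64_sketch_encode by looking each character up in its alphabet."""
    chars = code.replace(".", "")
    nibbles = [ALPHABETS[pos % 4].index(ch) for pos, ch in enumerate(chars)]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

code = q64_sketch_encode(b"Hell")
print(code)  # EYm1.Gcm8
```

Since the four alphabets are disjoint, every character reveals its position modulo 4, so a fragment taken from one offset cannot match text at a shifted offset.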
🔌 Integration Examples
With Elasticsearch
# Index embeddings without substring pollution
doc = {
    "id": "123",
    "embedding_code": encode(embedding, method="eq64")
}
# Search works correctly - no false matches
es.search(body={
    "query": {"term": {"embedding_code": target_code}}
})
With Vector Databases
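# Store the code as metadata next to the vector (Pinecone-style upsert shown;
# Weaviate and Qdrant clients use different call shapes)
encoded = encode(embedding, method="shq64")
index.upsert(
    vectors=[{
        "id": "doc123",
        "values": embedding.tolist(),
        "metadata": {"q64_code": encoded},
    }]
)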
🛠️ Development
# Setup development environment
pip install hatch
hatch shell
# Run tests
hatch test
# Run benchmarks
python benchmarks/bench_encoders.py
# Build native module
maturin develop --release
📈 Latest Benchmarks
Performance tests run nightly via GitHub Actions. View the latest benchmark results.
Memory Efficiency
Memory usage comparison
Method  Input Size  Encoded Size  Size Ratio (encoded/input)  Memory Overhead
-----------------------------------------------------------------------------
Eq64    1024 bytes  2048 chars    2.0x                        < 1%
Shq64   1024 bytes  16 chars      0.016x                      < 1%
T8q64   1024 bytes  16 chars      0.016x                      < 1%
Zoq64   1024 bytes  8 chars       0.008x                      < 1%
🤝 Contributing
Contributions welcome! Please read our Contributing Guide for details.
📜 License
MIT License - see LICENSE for details.
🙏 Acknowledgments
Built with:
- PyO3 - Rust bindings for Python
- Maturin - Build and publish Rust Python extensions
- Rayon - Data parallelism for Rust