Main project hub for the uubed high-performance embedding encoding library

uubed - High-Performance Position-Safe Embeddings

uubed (pronounced "you-you-bed") is a high-performance library for encoding embedding vectors into position-safe strings. These strings solve the "substring pollution" problem in search systems, where a query falsely matches a fragment of an unrelated encoded value.

🏗️ Project Structure

This is the main repository for the uubed project. The implementation is split across multiple repositories:

  • uubed (this repo) - Project coordination and documentation
  • uubed-rs - High-performance Rust implementation
  • uubed-py - Python bindings and API
  • uubed-docs - Comprehensive documentation and book

🚀 Key Features

  • Position-Safe Encoding: QuadB64 family prevents false substring matches
  • Blazing Fast: 40-105x faster than pure Python with Rust acceleration
  • Multiple Encoding Methods: Full precision, SimHash, Top-k, Z-order
  • Search Engine Friendly: No more substring pollution in Elasticsearch/Solr
  • Easy Integration: Simple API, works with any vector database

📊 Performance

With native Rust acceleration:

  • Eq64 encoding: 40-105x speedup (>230 MB/s throughput)
  • Shq64 (SimHash): 1.7-9.7x faster with parallel processing
  • Zoq64 (Z-order): 60-1600x faster with efficient bit manipulation
  • T8q64 (Top-k): Optimized sparse vector handling

Benchmark Results

Performance comparison
Embedding Size: 1024 bytes (256 dimensions × 4 bytes)
Hardware: Apple M1 Pro / Intel i7-9750H
=====================================
Method    Implementation    Time (μs)    Throughput (MB/s)    Speedup
---------------------------------------------------------------------
Eq64      Pure Python       464.82       2.20                 1.0x
Eq64      Native Rust       4.37         234.42               105.4x

Shq64     Pure Python       1431.33      0.72                 1.0x  
Shq64     Native Rust       139.79       7.33                 10.2x

Zoq64     Pure Python       73.59        13.91                1.0x
Zoq64     Native Rust       0.63         1631.92              116.8x

T8q64     Pure Python       892.45       1.15                 1.0x
T8q64     Native Rust       42.18        24.31                21.2x
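
The throughput column is consistent with input size divided by time per call: for example, 1024 bytes / 4.37 μs ≈ 234 MB/s. The following is a minimal timing sketch for reproducing that kind of measurement. The `measure` helper is hypothetical and not part of uubed, and `bytes.hex` is only a stand-in for a real encoder.

```python
import time

def measure(encoder, payload: bytes, repeats: int = 1000):
    """Return (microseconds per call, MB/s) for encoder(payload).

    Hypothetical helper for illustration; not part of the uubed API.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        encoder(payload)
    elapsed = time.perf_counter() - start
    us_per_call = elapsed / repeats * 1e6
    # Bytes per microsecond is numerically equal to MB/s (1 MB = 10^6 bytes).
    mb_per_s = len(payload) / us_per_call
    return us_per_call, mb_per_s

# 1024-byte payload, matching the embedding size used in the table.
payload = bytes(1024)
us, mbps = measure(bytes.hex, payload)  # bytes.hex is a stand-in encoder
print(f"{us:.2f} us/call, {mbps:.2f} MB/s")
```

To benchmark uubed itself, one would presumably pass something like `lambda b: encode(b, method="eq64")` as the encoder.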

Installation

Install the latest release from PyPI:

pip install uubed

Or, to install the latest development version from this repository:

pip install git+https://github.com/twardoch/uubed.git

Development

To set up a development environment, you will need Python 3.10+ and Rust.

  1. Clone the repository:

    git clone https://github.com/twardoch/uubed.git
    cd uubed
    
  2. Create a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
    
  3. Install the package in editable mode:

    maturin develop
    
  4. Run the tests:

    pytest
    

🎯 Quick Start

import numpy as np
from uubed import encode, decode

# Create an embedding
embedding = np.random.randint(0, 256, 256, dtype=np.uint8)

# Full precision encoding
full_code = encode(embedding, method="eq64")
print(f"Full: {full_code[:50]}...")  # AQgxASgz...

# Compact similarity hash
compact_code = encode(embedding, method="shq64")
print(f"Compact: {compact_code}")  # 16 chars preserving similarity

# Decode back to original
decoded = decode(full_code)
assert np.array_equal(embedding, np.frombuffer(decoded, dtype=np.uint8))

🧩 Encoding Methods

Eq64 - Full Embeddings

  • Use case: Need exact values
  • Size: 2 chars per byte
  • Features: Lossless, supports decode

Shq64 - SimHash

  • Use case: Fast similarity search
  • Size: 16 characters (64-bit hash)
  • Features: Preserves cosine similarity
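
As a rough illustration of how a 64-bit SimHash can preserve cosine similarity, the sketch below takes one bit per random hyperplane (the sign of the dot product). It is a generic random-projection SimHash in pure Python, not uubed's implementation; the function names and the seed parameter are assumptions.

```python
import random

def simhash64_sketch(vector, seed=0):
    """Illustrative SimHash: one bit per random hyperplane.

    Hypothetical helper; uubed's Shq64 may differ in every detail.
    """
    rng = random.Random(seed)  # fixed seed: every vector sees the same hyperplanes
    bits = 0
    for _ in range(64):
        plane = [rng.gauss(0.0, 1.0) for _ in vector]
        dot = sum(p * x for p, x in zip(plane, vector))
        bits = (bits << 1) | (1 if dot > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

v = [0.2, -1.0, 0.7, 0.1]
near = [0.21, -0.98, 0.69, 0.12]   # small perturbation of v
far = [-x for x in v]              # opposite direction

print(hamming(simhash64_sketch(v), simhash64_sketch(near)))
print(hamming(simhash64_sketch(v), simhash64_sketch(far)))
```

Vectors pointing in similar directions fall on the same side of most hyperplanes, so their hashes differ in few bits. A 64-bit result at 4 bits per character gives the 16 characters listed above.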

T8q64 - Top-k Indices

  • Use case: Sparse representations
  • Size: 16 characters (8 indices)
  • Features: Captures most important dimensions
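
A minimal sketch of top-k index selection, assuming "most important" means largest magnitude (the README does not specify the ranking rule). The helper name is hypothetical.

```python
def top_k_indices_sketch(vector, k=8):
    """Return the indices of the k largest-magnitude values, in ascending order.

    Hypothetical helper; the ranking rule is an assumption.
    """
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    return sorted(ranked[:k])

print(top_k_indices_sketch([0.1, -5, 3, 0.2, 4], k=2))
```

With up to 256 dimensions each index fits in one byte, so 8 indices at 2 characters per byte would account for the 16 characters listed above.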

Zoq64 - Z-order

  • Use case: Spatial/prefix search
  • Size: 8 characters
  • Features: Nearby points share prefixes
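
The prefix-sharing property comes from bit interleaving (a Morton code). The sketch below is a generic Z-order interleave, not uubed's actual quantization or bit layout; the function name and parameters are assumptions.

```python
def morton_interleave_sketch(coords, bits_per_coord):
    """Interleave the bits of each coordinate, most significant bits first.

    Hypothetical helper illustrating Z-order; not the uubed implementation.
    """
    code = 0
    for bit in range(bits_per_coord - 1, -1, -1):
        for c in coords:
            code = (code << 1) | ((c >> bit) & 1)
    return code

# Two nearby 2-D points quantized to 8 bits per coordinate.
a = morton_interleave_sketch([200, 100], 8)
b = morton_interleave_sketch([201, 101], 8)
print(format(a, "016b"))
print(format(b, "016b"))
```

Because the high-order bits of every coordinate come first, points that agree in their high-order bits produce codes with a long common prefix, which is what makes prefix search meaningful.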

💡 Why QuadB64?

The Problem

Regular Base64 encoding in search engines causes substring pollution:

Substring Pollution Problem
# Regular Base64
encode("Hello")  →  "SGVsbG8="
search("Vsb")  →  Matches! (false positive)
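
The effect can be reproduced with Python's standard `base64` module. In this sketch the second byte string is an arbitrary value chosen so that its encoding also contains "Vsb"; the `b64` helper is just a local convenience.

```python
import base64

def b64(data: bytes) -> str:
    """Standard Base64 as text."""
    return base64.b64encode(data).decode("ascii")

hello = b64(b"Hello")          # "SGVsbG8="
other = b64(b"\x01\x5b\x1b")   # unrelated bytes whose encoding is "AVsb"

# The same three characters appear in both strings, at different offsets.
for code in (hello, other):
    print(code, "Vsb" in code, code.find("Vsb"))
```

A substring query for "Vsb" matches both values even though the underlying bytes have nothing in common.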

The Solution

QuadB64 uses position-dependent alphabets:

QuadB64 Solution
# QuadB64
encode("Hello")  →  "EYm1.Gcm8"  # Note the dot separator
search("Ym1")  →  No match (different positions use different alphabets)

Position-safe alphabets:

  • Position 0,4,8...: ABCDEFGHIJKLMNOP
  • Position 1,5,9...: QRSTUVWXYZabcdef
  • Position 2,6,10..: ghijklmnopqrstuv
  • Position 3,7,11..: wxyz0123456789-_

The dot separator every 4 characters ensures position alignment and prevents arbitrary substring matches.
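
A minimal sketch of the idea, assuming each character carries one 4-bit nibble (consistent with the 16-character alphabets and Eq64's 2 characters per byte) and a dot after every 4 characters. The function names are hypothetical and the real uubed encoder may differ in details.

```python
ALPHABETS = [
    "ABCDEFGHIJKLMNOP",  # positions 0, 4, 8, ...
    "QRSTUVWXYZabcdef",  # positions 1, 5, 9, ...
    "ghijklmnopqrstuv",  # positions 2, 6, 10, ...
    "wxyz0123456789-_",  # positions 3, 7, 11, ...
]

def q64_encode_sketch(data: bytes) -> str:
    """Illustrative only: encode each nibble with the alphabet for its position."""
    chars = []
    for byte in data:
        for nibble in (byte >> 4, byte & 0x0F):
            chars.append(ALPHABETS[len(chars) % 4][nibble])
    # Join groups of 4 characters with a dot separator.
    groups = ["".join(chars[i:i + 4]) for i in range(0, len(chars), 4)]
    return ".".join(groups)

def q64_decode_sketch(code: str) -> bytes:
    """Inverse of q64_encode_sketch: strip dots, map characters back to nibbles."""
    chars = code.replace(".", "")
    nibbles = [ALPHABETS[i % 4].index(c) for i, c in enumerate(chars)]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

print(q64_encode_sketch(b"Hell"))
print(q64_decode_sketch(q64_encode_sketch(b"Hello")))
```

Under these assumptions the four bytes of "Hell" encode to "EYm1.Gcm8". Because the four alphabets are disjoint, a character valid at one position can never appear at another, so a query fragment cannot match at a shifted offset.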

🔌 Integration Examples

With Elasticsearch

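# Index embeddings without substring pollution
# Assumes `es` is an elasticsearch.Elasticsearch client (7.15+ keyword API);
# the index name "embeddings" is illustrative.
doc = {
    "id": "123",
    "embedding_code": encode(embedding, method="eq64")
}
es.index(index="embeddings", id=doc["id"], document=doc)

# Search works correctly - no false matches
target_code = encode(embedding, method="eq64")
es.search(index="embeddings", query={
    "term": {"embedding_code": target_code}
})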

With Vector Databases

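# Store the code as metadata next to the raw vector.
# Pinecone-style upsert shown, assuming `index` is a Pinecone index handle;
# Weaviate and Qdrant clients use different call shapes.
encoded = encode(embedding, method="shq64")
index.upsert(vectors=[{
    "id": "doc123",
    "values": embedding.tolist(),
    "metadata": {"q64_code": encoded},
}])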

🛠️ Development

# Setup development environment
pip install hatch
hatch shell

# Run tests
hatch test

# Run benchmarks
python benchmarks/bench_encoders.py

# Build native module
maturin develop --release

📈 Latest Benchmarks

Performance tests run nightly via GitHub Actions. View the latest benchmark results.

Memory Efficiency

Memory usage comparison
Method      Input Size    Encoded Size    Size Ratio     Memory Overhead
------------------------------------------------------------------------
Eq64        1024 bytes    2048 chars      2.0x           < 1%
Shq64       1024 bytes    16 chars        0.016x         < 1%
T8q64       1024 bytes    16 chars        0.016x         < 1%
Zoq64       1024 bytes    8 chars         0.008x         < 1%

🤝 Contributing

Contributions welcome! Please read our Contributing Guide for details.

📜 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

Built with:

  • PyO3 - Rust bindings for Python
  • Maturin - Build and publish Rust Python extensions
  • Rayon - Data parallelism for Rust
