Smaller & Faster Single-File Vector Search Engine from Unum

USearch

Faster & Smaller Single-File
Search Engine for Vectors & Texts


Euclidean • Angular • Bitwise • Haversine • User-Defined Metrics
C++ 11 • Python 3 • JavaScript • Java • Rust • C 99 • Objective-C • Swift • C# • GoLang • Wolfram
Linux • MacOS • Windows • iOS • Docker • WebAssembly


Comparison with FAISS

FAISS is a widely recognized standard for high-performance vector search engines. USearch and FAISS both employ the same HNSW algorithm, but they differ significantly in their design principles. USearch is compact and broadly compatible without sacrificing performance, with a primary focus on user-defined metrics and fewer dependencies.

                      FAISS                  USearch
Implementation        84 K SLOC in faiss/    3 K SLOC in usearch/
Supported metrics     9 fixed metrics        Any User-Defined metrics
Supported languages   C++, Python            10 languages
Supported ID types    uint32_t, uint64_t     uint32_t, uint40_t, uint64_t
Dependencies          BLAS, OpenMP           None
Bindings              SWIG                   Native
Acceleration          Learned Quantization   Downcasting

Base functionality is identical to FAISS, and the interface should be familiar if you have ever investigated Approximate Nearest Neighbors search:

$ pip install usearch

import numpy as np
from usearch.index import Index

index = Index(
    ndim=3, # Define the number of dimensions in input vectors
    metric='cos', # Choose 'l2sq', 'haversine' or other metric, default = 'ip'
    dtype='f32', # Quantize to 'f16' or 'i8' if needed, default = 'f32'
    connectivity=16, # Optional: How frequent should the connections in the graph be
    expansion_add=128, # Optional: Control the recall of indexing
    expansion_search=64, # Optional: Control the quality of search
)

vector = np.array([0.2, 0.6, 0.4])
index.add(42, vector)
matches = index.search(vector, 10)

assert len(index) == 1
assert len(matches) == 1
assert matches[0].key == 42
assert matches[0].distance <= 0.001
assert np.allclose(index[42], vector)

Comparing the performance of FAISS against USearch on 1 Million 96-dimensional vectors from the famous Deep1B dataset, one can expect the following numbers on modern AWS c7g.metal instances.

               FAISS, f32   USearch, f32   USearch, f16   USearch, i8
Batch Insert   16 K/s       73 K/s         100 K/s        104 K/s  +550%
Batch Search   82 K/s       103 K/s        113 K/s        134 K/s  +63%
Bulk Insert    76 K/s       105 K/s        115 K/s        202 K/s  +165%
Bulk Search    118 K/s      174 K/s        173 K/s        304 K/s  +157%
Recall @ 10    99%          99.2%          99.1%          99.2%

HNSW was configured with identical hyper-parameters: connectivity M=16, expansion @ construction efConstruction=128, and expansion @ search ef=64. Batch size is 256. Jump to the Performance Tuning section to read about the effects of those hyper-parameters.

User-Defined Functions

While most vector search packages concentrate on just a couple of metrics - "Inner Product distance" and "Euclidean distance," USearch extends this list to include any user-defined metrics. This flexibility allows you to customize your search for a myriad of applications, from computing geo-spatial coordinates with the rare Haversine distance to creating custom metrics for composite embeddings from multiple AI models.
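To make the geo-spatial case concrete, here is a minimal sketch of the Haversine distance in plain Python. It only illustrates what such a metric computes and is independent of the USearch API; the function name `haversine_km` and the Earth radius constant are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    half_chord = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding errors near antipodal points
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(half_chord)))


# Two points on the equator, a quarter of the globe apart
quarter_turn = haversine_km((0.0, 0.0), (0.0, 90.0))
```

Any function of this shape (two vectors in, one non-negative number out) is the kind of callback a user-defined metric would wrap.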

USearch: Vector Search Approaches

Unlike older approaches indexing high-dimensional spaces, like KD-Trees and Locality Sensitive Hashing, HNSW doesn't require vectors to be identical in length. They only have to be comparable. So you can apply it in obscure applications, like searching for similar sets or fuzzy text matching, using GZip as a distance function.
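As an illustration of a compression-based distance between variable-length inputs, the sketch below uses the Normalized Compression Distance (NCD) with `zlib` from the standard library. NCD is one common formulation and is an assumption here; the source does not say which formula a GZip-based USearch metric would use.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # Length of the DEFLATE-compressed representation
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized Compression Distance: small for similar inputs, near 1 for unrelated ones."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


text = b"the quick brown fox jumps over the lazy dog " * 8
near_duplicate = text.replace(b"dog", b"cat")
unrelated = bytes(range(256))  # different length, barely compressible

similar_score = ncd(text, near_duplicate)
different_score = ncd(text, unrelated)
```

The two inputs never need the same length; they only need to be comparable, which is all a graph-based index requires from a metric.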

Read more about JIT and UDF in USearch Python SDK.

Memory Efficiency, Downcasting, and Quantization

Training a quantization model and applying dimension reduction are common approaches to accelerating vector search. Those, however, are not always reliable, can significantly affect the statistical properties of your data, and require regular adjustments if your distribution shifts.

USearch uint40_t support

Instead, we have focused on high-precision arithmetic over low-precision downcasted vectors. The same index, with the same add and search operations, will automatically down-cast or up-cast between f64_t, f32_t, f16_t, and i8_t representations, even if the hardware doesn't natively support them. Continuing the topic of memory efficiency, we provide a uint40_t ID type to allow collections with 4B+ vectors without allocating 8 bytes for every neighbor reference in the proximity graph.
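The trade-off can be sketched with the standard library alone: store vectors in fewer bytes, but accumulate the distance in full precision. The i8 scheme below (scaling unit-length components by 127) is an assumption for illustration and not necessarily the scheme USearch applies internally; the helper names are hypothetical.

```python
import math
import random
import struct
from array import array

random.seed(0)
ndim = 96


def normalized(values):
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


def to_f16(values):
    # Round-trip through IEEE 754 half precision (struct format 'e')
    fmt = f"{len(values)}e"
    return list(struct.unpack(fmt, struct.pack(fmt, *values)))


def to_i8(values):
    # Assumed scheme: map unit-length components into [-127, 127]
    return array("b", (round(v * 127) for v in values))


def cos_distance(x, y):
    # Accumulate in Python floats (f64), regardless of the storage type
    dot = sum(float(p) * float(q) for p, q in zip(x, y))
    norm_x = math.sqrt(sum(float(p) ** 2 for p in x))
    norm_y = math.sqrt(sum(float(q) ** 2 for q in y))
    return 1.0 - dot / (norm_x * norm_y)


a = normalized([random.random() for _ in range(ndim)])
b = normalized([random.random() for _ in range(ndim)])

exact = cos_distance(a, b)
error_f16 = abs(exact - cos_distance(to_f16(a), to_f16(b)))
error_i8 = abs(exact - cos_distance(to_i8(a), to_i8(b)))

# Bytes needed to store one vector in each representation
bytes_f32 = array("f", a).itemsize * ndim
bytes_f16 = struct.calcsize("e") * ndim
bytes_i8 = to_i8(a).itemsize * ndim

# A 5-byte (40-bit) ID addresses far more than the ~4.29 B limit of 4 bytes
ids_with_4_bytes, ids_with_5_bytes = 2 ** 32, 2 ** 40
```

The same reasoning applies to IDs: 5 bytes per neighbor reference instead of 8 saves 37.5% of that part of the graph while still addressing about a trillion entries.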

Serialization & Serving Index from Disk

USearch supports multiple forms of serialization:

  • Into a file defined with a path.
  • Into a stream defined with a callback, serializing or reconstructing incrementally.
  • Into a buffer of fixed length, or a memory-mapped file, that supports random access.

The latter allows you to serve indexes from external memory, enabling you to optimize your server choices for indexing speed and serving costs. This can result in a 20x cost reduction on AWS and other public clouds.

index.save("index.usearch")

loaded_copy = index.load("index.usearch")
view = Index.restore("index.usearch", view=True)

other_view = Index(ndim=..., metric=CompiledMetric(...))
other_view.view("index.usearch")
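The idea behind viewing a file instead of loading it can be sketched with Python's `mmap` module: the operating system pages data in on demand, so a random lookup touches only the bytes it needs. The flat record layout below is an assumption made for this example and is not the USearch file format.

```python
import mmap
import os
import struct
import tempfile

ndim = 4
record = struct.Struct(f"<{ndim}f")  # one little-endian f32 vector per record
vectors = [[float(i + j) for j in range(ndim)] for i in range(1000)]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "vectors.bin")

    # Serialize all vectors into a flat binary file
    with open(path, "wb") as file:
        for vector in vectors:
            file.write(record.pack(*vector))

    # Map the file read-only and fetch a single record by offset
    with open(path, "rb") as file, mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as view:
        row = 742
        loaded = list(record.unpack_from(view, row * record.size))
        mapped_bytes = len(view)
```

Only the pages backing the requested record have to be resident in RAM, which is what makes serving large indexes from disk-backed machines feasible.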

Exact vs. Approximate Search

Approximate search methods, such as HNSW, are predominantly used when an exact brute-force search becomes too resource-intensive. This typically occurs when you have millions of entries in a collection. For smaller collections, we offer a more direct approach with the search function.

from usearch.index import search, MetricKind, Matches, BatchMatches
import numpy as np

# Generate 10'000 random vectors with 1024 dimensions
vectors = np.random.rand(10_000, 1024).astype(np.float32)
vector = np.random.rand(1024).astype(np.float32)

one_in_many: Matches = search(vectors, vector, 50, MetricKind.L2sq, exact=True)
many_in_many: BatchMatches = search(vectors, vectors, 50, MetricKind.L2sq, exact=True)

By passing the exact=True argument, the system bypasses indexing altogether and performs a brute-force search through the entire dataset using SIMD-optimized similarity metrics from SimSIMD. When compared to FAISS's IndexFlatL2 in Google Colab, USearch may offer up to a 20x performance improvement:

  • faiss.IndexFlatL2: 55.3 ms.
  • usearch.index.search: 2.54 ms.
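For reference, this is what a brute-force search does conceptually: score every entry against the query and keep the closest few. The sketch below is plain Python without SIMD, so it shows the logic rather than the performance; the names `l2sq` and `exact_search` are hypothetical.

```python
import heapq


def l2sq(a, b):
    # Squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(dataset, query, count):
    """Return the `count` closest entries as (key, distance) pairs, nearest first."""
    scored = ((l2sq(vector, query), key) for key, vector in enumerate(dataset))
    return [(key, distance) for distance, key in heapq.nsmallest(count, scored)]


dataset = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
closest = exact_search(dataset, [0.0, 0.0], 2)
```

Every query costs one pass over the whole dataset, which is why this approach is reserved for smaller collections.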

Indexes for Multi-Index Lookups

For larger workloads targeting billions or even trillions of vectors, parallel multi-index lookups become invaluable. These lookups prevent the need to construct a single, massive index, allowing users to query multiple smaller ones instead.

from usearch.index import Indexes

multi_index = Indexes(
    indexes=[...], # Iterable[usearch.index.Index]
    paths=[...], # Iterable[os.PathLike]
    view=False, # bool
    threads=0, # int
)
multi_index.search(...)
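The principle behind a multi-index lookup can be sketched as a scatter-gather: query every shard for its own top results, then merge the sorted partial lists. This is an illustration in plain Python, not the Indexes implementation; each shard is modeled as a hypothetical dict of key to vector.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def l2sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(shard, query, count):
    # Sorted list of (distance, key) for the closest entries within one shard
    candidates = ((l2sq(vector, query), key) for key, vector in shard.items())
    return heapq.nsmallest(count, candidates)


def search_shards(shards, query, count):
    # Scatter: query every shard concurrently
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda shard: search_shard(shard, query, count), shards))
    # Gather: merge the already-sorted partial results and keep the global top `count`
    return list(islice(heapq.merge(*partial), count))


shards = [
    {0: [0.0, 0.0], 1: [5.0, 5.0]},
    {2: [1.0, 1.0], 3: [9.0, 9.0]},
]
top = search_shards(shards, [0.0, 0.0], 2)
```

Asking each shard for `count` results is enough to guarantee the merged top `count` is correct, since no shard can contribute more than that many winners.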

Clustering

Once the index is constructed, it can be used to cluster entries much faster. In essence, the Index itself can be seen as a clustering, and it allows iterative deepening.

clustering = index.cluster(
    min_count=10, # Optional
    max_count=15, # Optional
    threads=..., # Optional
)

# Get the clusters and their sizes
centroid_keys, sizes = clustering.centroids_popularity

# Use Matplotlib to draw a histogram
clustering.plot_centroids_popularity()

# Export a NetworkX graph of the clusters
g = clustering.network

# Get members of a specific cluster
first_members = clustering.members_of(centroid_keys[0])

# Deepen into that cluster, splitting it into more parts, all same arguments supported
sub_clustering = clustering.subcluster(min_count=..., max_count=...)

Using Scikit-Learn, on a 1 Million point dataset, one may expect queries to take anywhere from minutes to hours, depending on the number of clusters you want to highlight. For 50'000 clusters the performance difference between USearch and conventional clustering methods may easily reach 100x.

Joins, One-to-One, One-to-Many, and Many-to-Many Mappings

One of the big questions these days is how AI will change the world of databases and data management. Most databases are still struggling to implement high-quality fuzzy search, and the only kind of joins they know are deterministic. A join is different from searching for every entry, as it requires a one-to-one mapping, banning collisions among separate search results.

Exact Search   Fuzzy Search   Semantic Search ?
Exact Join     Fuzzy Join ?   Semantic Join ??

Using USearch, one can implement approximate, fuzzy, and semantic joins with sub-quadratic complexity. This comes in handy in any fuzzy-matching task common to Database Management Software.

men = Index(...)
women = Index(...)
pairs: dict = men.join(women, max_proposals=0, exact=False)
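To show what a one-to-one mapping without collisions means, here is a brute-force sketch in the spirit of the Gale-Shapley stable-marriage procedure, using distance as the preference order. It is quadratic and purely illustrative; a sub-quadratic join would draw its proposals from index lookups instead. The name `stable_join` and the dict-based inputs are hypothetical.

```python
def l2sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stable_join(left, right):
    """Map every key in `left` to a distinct key in `right`, preferring closer vectors."""
    # Each left entry ranks all right entries, closest first
    preferences = {
        left_key: sorted(right, key=lambda right_key: l2sq(left_vector, right[right_key]))
        for left_key, left_vector in left.items()
    }
    next_choice = {left_key: 0 for left_key in left}
    engaged = {}  # right key -> left key currently holding it
    free = list(left)
    while free:
        proposer = free.pop()
        if next_choice[proposer] >= len(preferences[proposer]):
            continue  # no candidates left, the entry stays unmatched
        target = preferences[proposer][next_choice[proposer]]
        next_choice[proposer] += 1
        holder = engaged.get(target)
        if holder is None:
            engaged[target] = proposer
        elif l2sq(left[proposer], right[target]) < l2sq(left[holder], right[target]):
            engaged[target] = proposer  # the closer proposer wins the collision
            free.append(holder)
        else:
            free.append(proposer)
    return {left_key: right_key for right_key, left_key in engaged.items()}


left = {1: [0.0, 0.0], 2: [1.0, 1.0]}
right = {"a": [0.0, 0.0], "b": [5.0, 5.0]}
pairs = stable_join(left, right)
```

Both left entries are closest to "a", so a plain nearest-neighbor search would return it twice; the join resolves that collision and assigns "b" to the entry that lost.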

Read more in post: From Dating to Vector Search - "Stable Marriages" on a Planetary Scale 👩‍❤️‍👨

Functionality

By now, the core functionality is supported across all bindings. Broader functionality is ported per request.

C++ 11 Python 3 C 99 Java JavaScript Rust GoLang Swift
Add, search
Save, load, view
User-defined metrics
Joins
Variable-length vectors
4B+ capacities

Application Examples

USearch + AI = Multi-Modal Semantic Search

USearch Semantic Image Search

AI has a growing number of applications, but one of the coolest classic ideas is to use it for Semantic Search. One can take an encoder model, like the multi-modal UForm, and a web-programming framework, like UCall, and build a text-to-image search platform in just 20 lines of Python.

import ucall
import uform

import numpy as np
from PIL import Image  # a bare `import PIL` does not load the Image submodule
from usearch.index import Index

server = ucall.Server()
model = uform.get_model('unum-cloud/uform-vl-multilingual')
index = Index(ndim=256)

@server
def add(key: int, photo: Image.Image):
    image = model.preprocess_image(photo)
    vector = model.encode_image(image).detach().numpy()
    index.add(key, vector.flatten(), copy=True)

@server
def search(query: str) -> np.ndarray:
    tokens = model.preprocess_text(query)
    vector = model.encode_text(tokens).detach().numpy()
    matches = index.search(vector.flatten(), 3)
    return matches.keys

server.run()

A more complete demo with Streamlit is available on GitHub. We have pre-processed some commonly used datasets, cleaned the images, produced the vectors, and pre-built the index.

Dataset               Modalities              Images   Download
Unsplash              Images & Descriptions   25 K     HuggingFace / Unum
Conceptual Captions   Images & Descriptions   3 M      HuggingFace / Unum
Arxiv                 Titles & Abstracts      2 M      HuggingFace / Unum

USearch + RDKit = Molecular Search

Comparing molecule graphs and searching for similar structures is expensive and slow. It can be seen as a special case of the NP-Complete Subgraph Isomorphism problem. Luckily, domain-specific approximate methods exist. One commonly used in chemistry is to generate structures from SMILES and later hash them into binary fingerprints. The latter are searchable with bitwise similarity metrics, like the Tanimoto coefficient. Below is an example using the RDKit package.

from usearch.index import Index, MetricKind
from rdkit import Chem
from rdkit.Chem import AllChem

import numpy as np

molecules = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO')]
encoder = AllChem.GetRDKitFPGenerator()

fingerprints = np.vstack([encoder.GetFingerprint(x) for x in molecules])
fingerprints = np.packbits(fingerprints, axis=1)

index = Index(ndim=2048, metric=MetricKind.Tanimoto)
keys = np.arange(len(molecules))

index.add(keys, fingerprints)
matches = index.search(fingerprints, 10)
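For intuition, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. A minimal sketch, with fingerprints modeled as Python integers (an assumption for readability, independent of RDKit and USearch):

```python
def popcount(bits: int) -> int:
    # Number of set bits in a non-negative integer
    return bin(bits).count("1")


def tanimoto_similarity(a: int, b: int) -> float:
    union = popcount(a | b)
    # Two empty fingerprints are treated as identical
    return popcount(a & b) / union if union else 1.0


def tanimoto_distance(a: int, b: int) -> float:
    return 1.0 - tanimoto_similarity(a, b)


first = 0b1100
second = 0b1010
similarity = tanimoto_similarity(first, second)  # 1 shared bit out of 3 set bits
```

A search index works with the distance form, so identical fingerprints score 0 and fingerprints with no bits in common score 1.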

USearch + POI Coordinates = GIS Applications... on iOS?

USearch Maps with SwiftUI

With Objective-C and Swift iOS bindings, USearch can be easily used in mobile applications. The SwiftVectorSearch project illustrates how to build a dynamic, real-time search system on iOS. In this example, we use 2-dimensional vectors (encoded as latitude and longitude) to find the closest Points of Interest (POIs) on a map. The search is based on the Haversine distance metric, but can easily be extended to support high-dimensional vectors.

Integrations

Citations

@software{Vardanian_USearch_2023,
doi = {10.5281/zenodo.7949416},
author = {Vardanian, Ash},
title = {{USearch by Unum Cloud}},
url = {https://github.com/unum-cloud/usearch},
version = {2.7.8},
year = {2023},
month = oct,
}
