High-Performance Disk-Aware Vector Search Library
Caliby
High-Performance Vector Search with Efficient Larger-Than-Memory Support
Caliby is a high-performance vector similarity search library that efficiently handles datasets larger than available memory. Built on an innovative buffer pool design, Caliby delivers best-in-class in-memory performance when data fits in RAM and degrades gracefully when it doesn't, without requiring expensive hardware or complex distributed systems.
Key Features
- In-Memory Speed: Matches or exceeds HNSWLib/Faiss/Usearch performance when data fits in memory
- Larger-Than-Memory: Seamlessly handles datasets that exceed RAM with minimal performance loss
- Multiple Index Types: HNSW, DiskANN, and IVF+PQ
- Embeddable: Single-process library, no server required
Quick Start
Prerequisites
Caliby requires the following system dependencies:
- C++17 compatible compiler (GCC 9+ or Clang 10+)
- CMake 3.15+
- OpenMP
- Abseil C++ library
- Python 3.8+
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y build-essential cmake libomp-dev libabsl-dev python3-dev
Fedora/RHEL:
sudo dnf install -y gcc-c++ cmake libomp-devel abseil-cpp-devel python3-devel
Installation
Build from source:
git clone --recursive https://github.com/zxjcarrot/caliby.git
cd caliby
pip install -e .
Note: The --recursive flag is required to initialize the pybind11 submodule. If you already cloned without it, run:
git submodule update --init --recursive
Basic Usage
import caliby
import numpy as np
# Initialize the system and configure buffer pool
caliby.set_buffer_config(size_gb=1.0) # Set buffer pool size
caliby.open('/tmp/caliby_data') # Initialize catalog
# Create an HNSW index
index = caliby.HnswIndex(
    max_elements=1_000_000,  # Maximum number of vectors
    dim=128,  # Vector dimension
    M=16,  # HNSW parameter (connections per node)
    ef_construction=200,  # Construction-time search depth
    enable_prefetch=True,  # Enable prefetching for performance
    skip_recovery=False,  # Whether to skip recovery from disk
    index_id=0,  # Unique index identifier for multi-index
    name='user_embeddings',  # Optional human-readable name
)
# Add vectors (batch)
vectors = np.random.rand(10000, 128).astype(np.float32)
index.add_points(vectors, num_threads=4) # Parallel insertion
# Get index info
print(f"Index name: {index.get_name()}") # Output: 'user_embeddings'
print(f"Dimension: {index.get_dim()}")
# Search (single query)
query = np.random.rand(128).astype(np.float32)
labels, distances = index.search_knn(query, k=10, ef_search_param=50)
# Batch search (parallel)
queries = np.random.rand(100, 128).astype(np.float32)
results = index.search_knn_parallel(queries, k=10, ef_search_param=50, num_threads=4)
# Close when done
caliby.close()
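When choosing a value for size_gb, it can help to estimate the raw footprint of the index first. The sketch below is a back-of-the-envelope estimate, not Caliby's actual storage layout: it assumes float32 vectors plus up to 2 * M four-byte neighbor IDs per vector at the base HNSW layer (a common HNSW layout), ignores upper layers and page overhead, and treats size_gb as GiB. The helper names estimate_hnsw_bytes and fits_in_buffer are hypothetical and not part of the Caliby API.

```python
def estimate_hnsw_bytes(num_vectors: int, dim: int, M: int) -> int:
    """Rough HNSW footprint: float32 vectors plus base-layer neighbor lists.

    Assumption: 4 bytes per float, 4 bytes per neighbor ID, and up to
    2 * M neighbors per node at layer 0. Upper layers and page/metadata
    overhead are ignored, so treat the result as a lower bound.
    """
    vector_bytes = num_vectors * dim * 4
    link_bytes = num_vectors * 2 * M * 4
    return vector_bytes + link_bytes


def fits_in_buffer(num_vectors: int, dim: int, M: int, size_gb: float) -> bool:
    """True if the estimated footprint fits in a pool of size_gb GiB (assumed unit)."""
    return estimate_hnsw_bytes(num_vectors, dim, M) <= size_gb * 1024**3


# The Basic Usage settings: up to 1M vectors, dim=128, M=16, 1 GB buffer pool.
total = estimate_hnsw_bytes(1_000_000, 128, 16)
print(f"estimated footprint: {total / 1024**3:.2f} GiB")
print("fits in buffer pool:", fits_in_buffer(1_000_000, 128, 16, size_gb=1.0))
```

If the estimate exceeds the buffer pool size, the index falls into the larger-than-memory case described above, so expect some queries to touch disk.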
Index Types
HNSW (Hierarchical Navigable Small World)
Best for: High recall requirements, moderate to large dataset sizes
import caliby
import numpy as np
# Initialize system
caliby.set_buffer_config(size_gb=2.0)
caliby.open('/tmp/caliby_data')
index = caliby.HnswIndex(
    max_elements=1_000_000,
    dim=128,
    M=16,  # Higher = better recall, more memory
    ef_construction=200,  # Higher = better graph quality, slower build
    enable_prefetch=True,  # Enable prefetching
    skip_recovery=False,
    index_id=0,  # Unique ID for multi-index support
    name='my_vectors',  # Optional human-readable name
)
# Add points
vectors = np.random.rand(100000, 128).astype(np.float32)
index.add_points(vectors, num_threads=4)
# Search with ef_search_param
query = np.random.rand(128).astype(np.float32)
labels, distances = index.search_knn(query, k=10, ef_search_param=100)
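To tune M, ef_construction, and ef_search_param, it is useful to measure recall against exact nearest neighbors on a small sample. The following is a minimal, library-independent sketch: exact_knn and recall_at_k are hypothetical helpers, and squared L2 distance is an assumption, since this README does not state which metric search_knn uses. In practice the approximate labels would come from index.search_knn; here a hand-made stand-in is used so the snippet runs on its own.

```python
import heapq
import random


def exact_knn(query, vectors, k):
    """Return indices of the k nearest vectors by squared L2 distance (brute force)."""
    def sq_l2(v):
        return sum((a - b) ** 2 for a, b in zip(query, v))
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: sq_l2(vectors[i]))


def recall_at_k(approx_labels, exact_labels):
    """Fraction of the exact neighbors that also appear in the approximate result."""
    exact = set(exact_labels)
    return len(exact & set(approx_labels)) / len(exact)


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(200)]
query = [random.random() for _ in range(8)]

truth = exact_knn(query, vectors, k=10)

# Stand-in for an approximate result that missed two of the true neighbors.
non_neighbors = [i for i in range(len(vectors)) if i not in truth]
approx = truth[:8] + non_neighbors[:2]

print(f"recall@10 = {recall_at_k(approx, truth):.2f}")
```

Averaging this value over a few hundred sample queries gives a recall figure that can be compared across parameter settings.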
DiskANN (Vamana Graph)
Best for: Filtered search, dynamic updates, very large graphs with tags/labels
import caliby
import numpy as np
# Initialize system
caliby.set_buffer_config(size_gb=2.0)
caliby.open('/tmp/caliby_data')
# Create DiskANN index
index = caliby.DiskANN(
    dimensions=128,
    max_elements=1_000_000,
    R_max_degree=64,  # Max graph degree (R)
    is_dynamic=True  # Enable dynamic inserts/deletes
)
# Build index with tags for filtering
vectors = np.random.rand(100000, 128).astype(np.float32)
tags = [[i % 100] for i in range(100000)] # Tags for filtering
params = caliby.BuildParams()
params.L_build = 100 # Build-time search depth
params.alpha = 1.2 # Alpha parameter for Vamana
params.num_threads = 4
index.build(vectors, tags, params)
# Search with params
search_params = caliby.SearchParams(L_search=50)
search_params.beam_width = 4
query = np.random.rand(128).astype(np.float32)
labels, distances = index.search(query, K=10, params=search_params)
# Filtered search (only return vectors with specific tag)
labels, distances = index.search_with_filter(query, filter_label=42, K=10, params=search_params)
# Dynamic operations (if is_dynamic=True)
new_point = np.random.rand(128).astype(np.float32)
index.insert_point(new_point, tags=[99], external_id=100000)
index.lazy_delete(external_id=100000)
index.consolidate_deletes(params)
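A filtered search can return at most as many results as there are vectors carrying the requested tag, so it is worth checking the tag distribution before choosing K. The sketch below uses only the standard library and the same tagging scheme as the example above; it does not call Caliby.

```python
from collections import Counter

num_vectors = 100_000
# Same scheme as the build example: vector i gets the single tag i % 100.
tags = [[i % 100] for i in range(num_vectors)]

# Count how many vectors carry each tag.
tag_counts = Counter(tag for vector_tags in tags for tag in vector_tags)

filter_label, K = 42, 10
matches = tag_counts[filter_label]
print(f"tag {filter_label} is carried by {matches} of {num_vectors} vectors")

if matches < K:
    print(f"warning: only {matches} candidates for K={K}")
```

With this scheme every tag covers 1% of the dataset, so a filter is fairly selective; rarer tags generally need a larger search depth to reach the same recall.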
IVF+PQ (Inverted File with Product Quantization)
Best for: Very large datasets (10M+ vectors), memory-constrained environments
import caliby
import numpy as np
# Initialize system with buffer pool
caliby.set_buffer_config(size_gb=0.5) # Small buffer for large datasets
caliby.open('/tmp/caliby_data')
index = caliby.IVFPQIndex(
    max_elements=10_000_000,
    dim=128,
    num_clusters=256,  # Number of IVF clusters (K)
    num_subquantizers=8,  # Number of PQ subquantizers (M), dim must be divisible by this
    retrain_interval=10000,  # Retrain centroids every N insertions
    skip_recovery=False,
    index_id=0,
    name='large_dataset'
)
# Train the index first (required for IVF+PQ)
training_data = np.random.rand(50000, 128).astype(np.float32)
index.train(training_data)
# Add points (after training)
vectors = np.random.rand(1000000, 128).astype(np.float32)
index.add_points(vectors, num_threads=4)
# Search with nprobe parameter
query = np.random.rand(128).astype(np.float32)
labels, distances = index.search_knn(query, k=10, nprobe=8)
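The divisibility requirement on num_subquantizers and the memory savings of product quantization can be worked out with a little arithmetic. This sketch assumes 8 bits per sub-quantizer code (256 centroids per sub-space), which is the common PQ setting; the value Caliby actually uses is not stated in this README, and pq_layout is a hypothetical helper.

```python
def pq_layout(dim: int, num_subquantizers: int, bits_per_code: int = 8):
    """Return (sub-vector dimension, bytes per encoded vector).

    Assumption: each sub-quantizer emits one code of bits_per_code bits.
    """
    if dim % num_subquantizers != 0:
        raise ValueError(
            f"dim={dim} is not divisible by num_subquantizers={num_subquantizers}"
        )
    sub_dim = dim // num_subquantizers
    code_bytes = (num_subquantizers * bits_per_code + 7) // 8  # round up to bytes
    return sub_dim, code_bytes


sub_dim, code_bytes = pq_layout(dim=128, num_subquantizers=8)
raw_bytes = 128 * 4  # one float32 vector of dimension 128

print(f"each sub-quantizer covers {sub_dim} dimensions")
print(f"{raw_bytes} raw bytes -> {code_bytes} code bytes per vector")
print(f"compression ratio: {raw_bytes // code_bytes}x")
```

Under these assumptions, 10M vectors of dimension 128 shrink from about 5.1 GB of raw floats to about 80 MB of codes, which is why a small buffer pool can still serve a large IVF+PQ index.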
Advanced Configuration
Multi-Index Support
Create and manage multiple independent indexes with unique IDs and names:
import caliby
import numpy as np
# Initialize system once
caliby.set_buffer_config(size_gb=2.0)
caliby.open('/tmp/caliby_data')
# Create multiple indexes with unique IDs and names
user_index = caliby.HnswIndex(
    max_elements=100_000, dim=128, M=16, ef_construction=200,
    enable_prefetch=True, skip_recovery=True, index_id=1, name='user_embeddings'
)
product_index = caliby.HnswIndex(
    max_elements=200_000, dim=256, M=16, ef_construction=200,
    enable_prefetch=True, skip_recovery=True, index_id=2, name='product_embeddings'
)
# Inspect an index's name and dimension
print(f"Working with: {user_index.get_name()}")
print(f"Dimension: {user_index.get_dim()}")
# Each index operates independently
user_vectors = np.random.rand(10000, 128).astype(np.float32)
product_vectors = np.random.rand(15000, 256).astype(np.float32)
user_index.add_points(user_vectors, num_threads=4)
product_index.add_points(product_vectors, num_threads=4)
Persistence & Recovery
import caliby
# Indexes are automatically persisted via the buffer pool
caliby.set_buffer_config(size_gb=1.0)
caliby.open('/path/to/caliby_data') # Data directory for persistent storage
# Create index (will be persisted automatically)
index = caliby.HnswIndex(
    max_elements=1_000_000,
    dim=128,
    M=16,
    ef_construction=200,
    enable_prefetch=True,
    skip_recovery=False,  # Set to False to enable recovery
    index_id=1,
    name='my_index'
)
# Manual flush to ensure all data is written
index.flush()
# Recovery happens automatically when reopening with same directory
caliby.close()
# Later: reopen and recover
caliby.open('/path/to/caliby_data')
recovered_index = caliby.HnswIndex(
    max_elements=1_000_000,
    dim=128,
    M=16,
    ef_construction=200,
    enable_prefetch=True,
    skip_recovery=False,  # Will recover existing index
    index_id=1,  # Must match original
    name='my_index'
)
if recovered_index.was_recovered():
    print("Index successfully recovered from disk!")
Concurrent Access
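# Thread-safe by default
from concurrent.futures import ThreadPoolExecutor
# Assumes `index` (an HnswIndex) and `queries` from the Basic Usage example
def search_worker(query):
    # HnswIndex exposes search_knn, as used in the examples above
    return index.search_knn(query, k=10, ef_search_param=50)
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(search_worker, queries))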
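The fan-out pattern can be tried without an index. In this sketch a tiny brute-force function stands in for the real search call (the points list and this search_worker are illustrative only); it shows that executor.map returns results in the same order as the input queries, regardless of which thread finishes first. Whether threads give a real speedup with Caliby depends on whether the Python binding releases the GIL during search, which this README does not state.

```python
from concurrent.futures import ThreadPoolExecutor

# Tiny in-memory stand-in for an index; a real worker would call the index.
points = [0.0, 1.0, 2.0, 3.0, 4.0]


def search_worker(query: float) -> int:
    """Return the position of the point nearest to query (1-D brute force)."""
    return min(range(len(points)), key=lambda i: abs(points[i] - query))


queries = [3.9, 0.2, 2.1, 1.4]

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map yields results in input order, so results[i]
    # always belongs to queries[i].
    results = list(executor.map(search_worker, queries))

for query, nearest in zip(queries, results):
    print(f"query {query} -> point index {nearest}")
```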
Project Structure
caliby/
├── include/caliby/       # C++ headers
│   ├── calico.hpp        # Core buffer pool system
│   ├── hnsw.hpp          # HNSW index
│   ├── ivfpq.hpp         # IVF+PQ index
│   ├── diskann.hpp       # DiskANN index (experimental)
│   ├── catalog.hpp       # Index catalog management
│   └── distance.hpp      # Distance functions
├── src/                  # C++ implementation
│   ├── bindings.cpp      # Python bindings
│   ├── hnsw.cpp
│   ├── ivfpq.cpp
│   └── calico.cpp
├── examples/             # Usage examples
├── benchmark/            # Performance benchmarks
├── tests/                # Python tests
└── third_party/          # Dependencies
    └── pybind11/         # Python binding library (submodule)
Building from Source
Prerequisites
- Linux (Ubuntu 20.04+ recommended)
- GCC 10+ or Clang 12+
- CMake 3.16+
- Python 3.8+ with development headers
- libaio-dev
# Ubuntu/Debian
sudo apt-get install build-essential cmake python3-dev libaio-dev
# Enable huge pages (recommended for performance)
sudo sysctl -w vm.nr_hugepages=1024
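The value 1024 above can be related to a buffer pool size with simple arithmetic. This sketch assumes the default 2 MiB huge page size of x86-64 Linux (check the Hugepagesize line in /proc/meminfo on your machine) and assumes the buffer pool is what the huge pages are meant to back, which this README implies but does not spell out. hugepages_for is a hypothetical helper.

```python
import math

HUGE_PAGE_MIB = 2  # default huge page size on x86-64 Linux (assumption for your system)


def hugepages_for(size_gb: float, page_mib: int = HUGE_PAGE_MIB) -> int:
    """Number of huge pages needed to back size_gb GiB of memory."""
    return math.ceil(size_gb * 1024 / page_mib)


# 1024 pages of 2 MiB reserve 2 GiB in total.
print("reserved GiB with 1024 pages:", 1024 * HUGE_PAGE_MIB / 1024)

# Pages needed for the buffer pool sizes used in this README.
for size_gb in (0.5, 1.0, 2.0):
    print(f"size_gb={size_gb}: {hugepages_for(size_gb)} huge pages")
```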
Build
git clone --recursive https://github.com/zxjcarrot/caliby.git
cd caliby
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Install Python package
cd ..
pip install -e .
Run Tests
# C++ tests
cd build && ctest --output-on-failure
# Python tests
pytest python/tests/
Documentation [WORK IN PROGRESS]
Contributing
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: xinjing@mit.edu
If you find Caliby useful, please consider giving it a star!
Source Distribution
File details
Details for the file caliby-0.1.0.tar.gz.
File metadata
- Download URL: caliby-0.1.0.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a1bd8602ac623e70cf6c92038cd09d27aaa6553acdba803bd22570a4da1940a |
| MD5 | be1ce5e54f0918f31e2a769414075d9e |
| BLAKE2b-256 | 1df24376fa2f103e98ae4cf2d629dc3d7828519ce0483f94cf11505c8c7bfe78 |