High-Performance Disk-Aware Vector Search Library

Caliby 🚀

High-Performance Vector Search with Efficient Larger-Than-Memory Support

Caliby is a high-performance vector similarity search library that efficiently handles datasets larger than available memory. Built on a buffer pool design, it matches or exceeds the in-memory performance of HNSWLib, Faiss, and Usearch when data fits in RAM and degrades gracefully when it doesn't, without requiring expensive hardware or complex distributed systems.

✨ Key Features

🔥 In-Memory Speed: Matches or exceeds HNSWLib/Faiss/Usearch performance when data fits in memory
💾 Larger-Than-Memory: Seamlessly handles datasets that exceed RAM with minimal performance loss
🎯 Multiple Index Types: HNSW, DiskANN, and IVF+PQ
🔧 Embeddable: Single-process library, no server required

🚀 Quick Start

Prerequisites

Caliby requires the following system dependencies:

C++17 compatible compiler (GCC 9+ or Clang 10+)
CMake 3.15+
OpenMP
Abseil C++ library
Python 3.8+

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y build-essential cmake libomp-dev libabsl-dev python3-dev

Fedora/RHEL:

sudo dnf install -y gcc-c++ cmake libomp-devel abseil-cpp-devel python3-devel

Installation

Build from source:

git clone --recursive https://github.com/zxjcarrot/caliby.git
cd caliby
pip install -e .

Note: The --recursive flag is required to initialize the pybind11 submodule. If you already cloned without it, run:

git submodule update --init --recursive
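
After installing, one quick way to confirm that the package is visible to the active Python environment is a standard-library lookup. The helper name `is_importable` is illustrative and not part of Caliby; this is only a minimal sketch.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "caliby" is the package name used throughout this README.
    print("caliby importable:", is_importable("caliby"))
```

If this prints `False` after `pip install -e .`, the install most likely went into a different interpreter or virtual environment than the one running the check.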

Basic Usage

import caliby
import numpy as np

# Initialize the system and configure buffer pool
caliby.set_buffer_config(size_gb=1.0)  # Set buffer pool size
caliby.open('/tmp/caliby_data')  # Initialize catalog

# Create an HNSW index
index = caliby.HnswIndex(
    max_elements=1_000_000,     # Maximum number of vectors
    dim=128,                    # Vector dimension
    M=16,                       # HNSW parameter (connections per node)
    ef_construction=200,        # Construction-time search depth
    enable_prefetch=True,       # Enable prefetching for performance
    skip_recovery=False,        # Whether to skip recovery from disk
    index_id=0,                 # Unique index identifier for multi-index
    name='user_embeddings',     # Optional human-readable name
)

# Add vectors (batch)
vectors = np.random.rand(10000, 128).astype(np.float32)
index.add_points(vectors, num_threads=4)  # Parallel insertion

# Get index info
print(f"Index name: {index.get_name()}")  # Prints: Index name: user_embeddings
print(f"Dimension: {index.get_dim()}")

# Search (single query)
query = np.random.rand(128).astype(np.float32)
labels, distances = index.search_knn(query, k=10, ef_search_param=50)

# Batch search (parallel)
queries = np.random.rand(100, 128).astype(np.float32)
results = index.search_knn_parallel(queries, k=10, ef_search_param=50, num_threads=4)

# Close when done
caliby.close()
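
When choosing a value for `size_gb`, it can help to estimate how large the raw vectors are. The sketch below is plain arithmetic, not a Caliby API: the function names are hypothetical, 1 GB is assumed to mean 2**30 bytes, and index overhead (graph links, metadata) is ignored, so the result is an optimistic lower bound.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_vectors * dim * bytes_per_value


def fits_in_buffer(num_vectors: int, dim: int, size_gb: float) -> bool:
    """Compare raw vector size against a buffer pool of `size_gb` GiB.

    Assumption: 1 GB is treated as 2**30 bytes, and index overhead
    (graph links, metadata) is ignored.
    """
    return raw_vector_bytes(num_vectors, dim) <= size_gb * 2**30


if __name__ == "__main__":
    # The Basic Usage example: 10,000 vectors of dimension 128, 1 GB buffer.
    print(raw_vector_bytes(10_000, 128))         # 5120000
    print(fits_in_buffer(10_000, 128, 1.0))      # True
    print(fits_in_buffer(10_000_000, 128, 0.5))  # False
```

When the estimate exceeds the buffer size, the workload is in the larger-than-memory regime described in Key Features.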

🏗️ Index Types

HNSW (Hierarchical Navigable Small World)

Best for: High recall requirements, moderate to large dataset sizes

import caliby
import numpy as np

# Initialize system
caliby.set_buffer_config(size_gb=2.0)
caliby.open('/tmp/caliby_data')

index = caliby.HnswIndex(
    max_elements=1_000_000,
    dim=128,
    M=16,                    # Higher = better recall, more memory
    ef_construction=200,     # Higher = better graph quality, slower build
    enable_prefetch=True,    # Enable prefetching
    skip_recovery=False,
    index_id=0,              # Unique ID for multi-index support
    name='my_vectors',       # Optional human-readable name
)

# Add points
vectors = np.random.rand(100000, 128).astype(np.float32)
index.add_points(vectors, num_threads=4)

# Search with ef_search_param
query = np.random.rand(128).astype(np.float32)
labels, distances = index.search_knn(query, k=10, ef_search_param=100)
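
In HNSW generally, a larger search-time ef explores more candidates, which tends to raise recall at the cost of latency. One way to pick `ef_search_param` is to measure recall against a brute-force baseline on a small sample. The sketch below uses only the standard library; the helper names are hypothetical, and squared L2 distance is an assumption, since this README does not state which metric Caliby uses.

```python
import random


def exact_knn(vectors, query, k):
    """Brute-force k nearest neighbours by squared L2 distance; returns indices."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, query))
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i]))
    return order[:k]


def recall_at_k(approx_labels, exact_labels):
    """Fraction of the exact neighbours that the approximate result also found."""
    exact = set(exact_labels)
    return len(exact & set(approx_labels)) / len(exact)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.random() for _ in range(8)] for _ in range(200)]
    query = [rng.random() for _ in range(8)]
    truth = exact_knn(data, query, k=10)
    # Stand-in for an approximate result that missed two true neighbours.
    approx = truth[:8] + [-1, -2]
    print(recall_at_k(approx, truth))  # 0.8
```

In practice `approx` would be the labels returned by `index.search_knn(...)`, converted to a list, and the measurement would be averaged over many queries.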

DiskANN (Vamana Graph)

Best for: Filtered search, dynamic updates, very large graphs with tags/labels

import caliby
import numpy as np

# Initialize system
caliby.set_buffer_config(size_gb=2.0)
caliby.open('/tmp/caliby_data')

# Create DiskANN index
index = caliby.DiskANN(
    dimensions=128,
    max_elements=1_000_000,
    R_max_degree=64,    # Max graph degree (R)
    is_dynamic=True     # Enable dynamic inserts/deletes
)

# Build index with tags for filtering
vectors = np.random.rand(100000, 128).astype(np.float32)
tags = [[i % 100] for i in range(100000)]  # Tags for filtering

params = caliby.BuildParams()
params.L_build = 100       # Build-time search depth
params.alpha = 1.2         # Alpha parameter for Vamana
params.num_threads = 4

index.build(vectors, tags, params)

# Search with params
search_params = caliby.SearchParams(L_search=50)
search_params.beam_width = 4

query = np.random.rand(128).astype(np.float32)
labels, distances = index.search(query, K=10, params=search_params)

# Filtered search (only return vectors with specific tag)
labels, distances = index.search_with_filter(query, filter_label=42, K=10, params=search_params)

# Dynamic operations (if is_dynamic=True)
new_point = np.random.rand(128).astype(np.float32)
index.insert_point(new_point, tags=[99], external_id=100000)
index.lazy_delete(external_id=100000)
index.consolidate_deletes(params)
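
To sanity-check filtered results on a small sample, a brute-force reference can be useful. The sketch below assumes that a filtered search returns only vectors whose tag list contains the filter label, and that distances are squared L2; both are assumptions about semantics, not a description of how Caliby implements `search_with_filter`. The function name `filtered_knn` is hypothetical.

```python
def filtered_knn(vectors, tags, query, filter_label, k):
    """Brute-force k-NN restricted to vectors whose tag list contains filter_label.

    Returns (labels, squared_l2_distances), nearest first.
    """
    candidates = []
    for idx, (vec, vec_tags) in enumerate(zip(vectors, tags)):
        if filter_label in vec_tags:
            dist = sum((a - b) ** 2 for a, b in zip(vec, query))
            candidates.append((dist, idx))
    candidates.sort()
    top = candidates[:k]
    return [idx for _, idx in top], [dist for dist, _ in top]


if __name__ == "__main__":
    vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
    tags = [[1], [2], [1], [2]]  # one tag list per vector, as in the build example
    labels, dists = filtered_knn(vectors, tags, query=[0.9, 0.1], filter_label=2, k=2)
    print(labels)  # [1, 3]
```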

IVF+PQ (Inverted File with Product Quantization)

Best for: Very large datasets (10M+ vectors), memory-constrained environments

import caliby
import numpy as np

# Initialize system with buffer pool
caliby.set_buffer_config(size_gb=0.5)  # Small buffer for large datasets
caliby.open('/tmp/caliby_data')

index = caliby.IVFPQIndex(
    max_elements=10_000_000,
    dim=128,
    num_clusters=256,           # Number of IVF clusters (K)
    num_subquantizers=8,        # Number of PQ subquantizers (M), dim must be divisible by this
    retrain_interval=10000,     # Retrain centroids every N insertions
    skip_recovery=False,
    index_id=0,
    name='large_dataset'
)

# Train the index first (required for IVF+PQ)
training_data = np.random.rand(50000, 128).astype(np.float32)
index.train(training_data)

# Add points (after training)
vectors = np.random.rand(1000000, 128).astype(np.float32)
index.add_points(vectors, num_threads=4)

# Search with nprobe parameter
query = np.random.rand(128).astype(np.float32)
labels, distances = index.search_knn(query, k=10, nprobe=8)
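
The divisibility requirement on `num_subquantizers` follows from how product quantization splits each vector into equal-width sub-vectors. The arithmetic can be sketched as follows; `pq_layout` is a hypothetical helper, and the 8-bit code size (256 centroids per subspace) is an assumption, since this README does not state what Caliby uses.

```python
def pq_layout(dim: int, num_subquantizers: int, bits_per_code: int = 8):
    """Return (sub_dim, code_bytes, compression_ratio) for a PQ configuration.

    Assumption: each subquantizer emits a `bits_per_code`-bit code.
    """
    if dim % num_subquantizers != 0:
        raise ValueError("dim must be divisible by num_subquantizers")
    sub_dim = dim // num_subquantizers
    code_bytes = num_subquantizers * bits_per_code / 8
    raw_bytes = dim * 4  # float32 input vectors
    return sub_dim, code_bytes, raw_bytes / code_bytes


if __name__ == "__main__":
    # dim=128 and num_subquantizers=8, the values used in this section.
    print(pq_layout(128, 8))  # (16, 8.0, 64.0)
```

Under these assumptions each 512-byte float32 vector is stored as an 8-byte code, which is why IVF+PQ suits memory-constrained environments.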

🔧 Advanced Configuration

Multi-Index Support

Create and manage multiple independent indexes with unique IDs and names:

import caliby
import numpy as np

# Initialize system once
caliby.set_buffer_config(size_gb=2.0)
caliby.open('/tmp/caliby_data')

# Create multiple indexes with unique IDs and names
user_index = caliby.HnswIndex(
    max_elements=100_000, dim=128, M=16, ef_construction=200,
    enable_prefetch=True, skip_recovery=True, index_id=1, name='user_embeddings'
)

product_index = caliby.HnswIndex(
    max_elements=200_000, dim=256, M=16, ef_construction=200,
    enable_prefetch=True, skip_recovery=True, index_id=2, name='product_embeddings'
)

# Inspect index metadata
print(f"Working with: {user_index.get_name()}")
print(f"Dimension: {user_index.get_dim()}")

# Each index operates independently
user_vectors = np.random.rand(10000, 128).astype(np.float32)
product_vectors = np.random.rand(15000, 256).astype(np.float32)
user_index.add_points(user_vectors, num_threads=4)
product_index.add_points(product_vectors, num_threads=4)

Persistence & Recovery

import caliby

# Indexes are automatically persisted via the buffer pool
caliby.set_buffer_config(size_gb=1.0)
caliby.open('/path/to/caliby_data')  # Data directory for persistent storage

# Create index (will be persisted automatically)
index = caliby.HnswIndex(
    max_elements=1_000_000,
    dim=128,
    M=16,
    ef_construction=200,
    enable_prefetch=True,
    skip_recovery=False,  # Set to False to enable recovery
    index_id=1,
    name='my_index'
)

# Manual flush to ensure all data is written
index.flush()

# Close the system; recovery happens automatically when the same directory is reopened
caliby.close()

# Later: reopen and recover
caliby.open('/path/to/caliby_data')
recovered_index = caliby.HnswIndex(
    max_elements=1_000_000,
    dim=128,
    M=16,
    ef_construction=200,
    enable_prefetch=True,
    skip_recovery=False,  # Will recover existing index
    index_id=1,  # Must match original
    name='my_index'
)

if recovered_index.was_recovered():
    print("Index successfully recovered from disk!")
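
Because recovery depends on reopening the index with a matching `index_id`, it can be convenient to record the constructor arguments somewhere durable. The sketch below stores them in a JSON file using only the standard library. The helper names and the file name are hypothetical, and nothing here is a Caliby feature.

```python
import json
import os
import tempfile


def save_index_params(path: str, params: dict) -> None:
    """Write index constructor arguments to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2, sort_keys=True)


def load_index_params(path: str) -> dict:
    """Read back the arguments saved by save_index_params."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    params = {
        "max_elements": 1_000_000, "dim": 128, "M": 16, "ef_construction": 200,
        "enable_prefetch": True, "skip_recovery": False,
        "index_id": 1, "name": "my_index",
    }
    with tempfile.TemporaryDirectory() as work_dir:
        sidecar = os.path.join(work_dir, "my_index.params.json")  # hypothetical file name
        save_index_params(sidecar, params)
        restored = load_index_params(sidecar)
        print(restored == params)  # True
        # With Caliby installed, the index could then be reopened with
        # caliby.HnswIndex(**restored), assuming the constructor accepts
        # these keyword arguments exactly as written in this section.
```

Keeping the file outside the Caliby data directory avoids any chance of interfering with the library's own files.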

Concurrent Access

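# Thread-safe by default. Assumes `index` is an HnswIndex and `queries`
# is a 2-D float32 array, both created as in Basic Usage.
from concurrent.futures import ThreadPoolExecutor

def search_worker(query):
    # HnswIndex exposes search_knn; DiskANN uses search(query, K=..., params=...)
    return index.search_knn(query, k=10, ef_search_param=50)

with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(search_worker, queries))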
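
The thread-pool pattern can be tried without an index by substituting a stand-in search function. In the sketch below, `stub_search` and `make_search_worker` are hypothetical names and the stub does not perform a real vector search. Whether real searches run in parallel depends on whether the native binding releases the GIL, which this README does not state.

```python
from concurrent.futures import ThreadPoolExecutor


def make_search_worker(search_fn, k):
    """Bind `k` so the worker takes a single query, as executor.map expects."""
    def worker(query):
        return search_fn(query, k)
    return worker


def stub_search(query, k):
    """Stand-in for an index search: positions of the k smallest coordinates."""
    return sorted(range(len(query)), key=lambda i: query[i])[:k]


if __name__ == "__main__":
    queries = [[3.0, 1.0, 2.0], [0.5, 0.9, 0.1]]
    worker = make_search_worker(stub_search, k=2)
    with ThreadPoolExecutor(max_workers=8) as executor:
        # executor.map yields results in the same order as `queries`.
        results = list(executor.map(worker, queries))
    print(results)  # [[1, 2], [2, 0]]
```

Since `executor.map` preserves input order, `results[i]` always corresponds to `queries[i]` regardless of which thread finishes first.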

📁 Project Structure

caliby/
├── include/caliby/          # C++ headers
│   ├── calico.hpp           # Core buffer pool system
│   ├── hnsw.hpp             # HNSW index
│   ├── ivfpq.hpp            # IVF+PQ index
│   ├── diskann.hpp          # DiskANN index (experimental)
│   ├── catalog.hpp          # Index catalog management
│   └── distance.hpp         # Distance functions
├── src/                     # C++ implementation
│   ├── bindings.cpp         # Python bindings
│   ├── hnsw.cpp
│   ├── ivfpq.cpp
│   └── calico.cpp
├── examples/                # Usage examples
├── benchmark/               # Performance benchmarks
├── tests/                   # Python tests
└── third_party/             # Dependencies
    └── pybind11/            # Python binding library (submodule)

🛠️ Building from Source

Prerequisites

Linux (Ubuntu 20.04+ recommended)
GCC 10+ or Clang 12+
CMake 3.16+
Python 3.8+ with development headers
libaio-dev

# Ubuntu/Debian
sudo apt-get install build-essential cmake python3-dev libaio-dev

# Enable huge pages (recommended for performance)
sudo sysctl -w vm.nr_hugepages=1024
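
To confirm that the huge-page setting took effect, the counters in `/proc/meminfo` can be inspected. The sketch below parses that text with the standard library; `parse_hugepages` is a hypothetical helper, and the sample values are made up for illustration. With a 2048 kB page size, 1024 pages reserve 2 GiB, though the page size varies by system.

```python
def parse_hugepages(meminfo_text: str) -> dict:
    """Extract HugePages_* counters and Hugepagesize (in kB) from /proc/meminfo text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("HugePages_") or key == "Hugepagesize":
            info[key] = int(value.split()[0])
    return info


if __name__ == "__main__":
    sample = (
        "MemTotal:       16384000 kB\n"
        "HugePages_Total:    1024\n"
        "HugePages_Free:     1000\n"
        "Hugepagesize:       2048 kB\n"
    )
    print(parse_hugepages(sample))
    # {'HugePages_Total': 1024, 'HugePages_Free': 1000, 'Hugepagesize': 2048}
    # On Linux, pass open("/proc/meminfo").read() instead of the sample text.
```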

Build

git clone --recursive https://github.com/zxjcarrot/caliby.git
cd caliby
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Install Python package
cd ..
pip install -e .

Run Tests

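# C++ tests (run in a subshell so the working directory is unchanged)
(cd build && ctest --output-on-failure)

# Python tests (tests/ is the directory shown in Project Structure)
pytest tests/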

📚 Documentation (work in progress)

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📬 Contact

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: xinjing@mit.edu

⭐ If you find Caliby useful, please consider giving it a star!

Release history

0.1.3

May 28, 2026

0.1.2

Feb 28, 2026

0.1.1

Jan 23, 2026

This version

0.1.0

Jan 23, 2026

0.1.0.dev20260210001438 pre-release

Feb 10, 2026

0.1.0.dev20260131031530 pre-release

Jan 31, 2026

0.1.0.dev20260129183920 pre-release

Jan 31, 2026

Download files

Source Distribution

caliby-0.1.0.tar.gz (1.1 MB view details)

Uploaded Jan 23, 2026 Source

File details

Details for the file caliby-0.1.0.tar.gz.

File metadata

Download URL: caliby-0.1.0.tar.gz
Upload date: Jan 23, 2026
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for caliby-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2a1bd8602ac623e70cf6c92038cd09d27aaa6553acdba803bd22570a4da1940a`
MD5	`be1ce5e54f0918f31e2a769414075d9e`
BLAKE2b-256	`1df24376fa2f103e98ae4cf2d629dc3d7828519ce0483f94cf11505c8c7bfe78`
