Python SDK for Lumina vector search engine

lumina-data

Python SDK for the Lumina vector search engine. Provides low-overhead ctypes bindings to the Lumina C++ library for building and searching vector indexes (DiskANN, Bruteforce, IVF).

Requirements

  • Linux x86_64
  • Python >= 3.6

Install

pip install .

Run this from the root of a source checkout. Pre-built native libraries are bundled in the package, so no compilation is needed.

Usage

High-level API (list in, list out)

import random

from lumina_data import LuminaBuilder, LuminaSearcher

options = {
    "index.type": "diskann",
    "index.dimension": "128",
    "distance.metric": "l2",
    "encoding.type": "rawf32",
}

# Build
n, dim = 10000, 128
# Placeholder data: replace with your own flat list of n*dim floats.
vectors = [random.random() for _ in range(n * dim)]
ids = list(range(n))

builder = LuminaBuilder(options)
builder.pretrain_from_list(vectors, n, dim)
builder.insert_from_list(vectors, ids, n, dim)
builder.dump("/path/to/index.lmi")
builder.close()

# Search
searcher = LuminaSearcher(options)
searcher.open("/path/to/index.lmi")

query = [0.1] * dim  # placeholder query: a list of dim floats
distances, labels = searcher.search_list(query, n=1, k=10)

for label, distance in zip(labels, distances):
    print("id=%d  distance=%.4f" % (label, distance))

searcher.close()
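
The high-level methods above take one flat list of n*dim floats rather than a list of vectors. If your data starts out as one list per vector, a small helper can check the dimension and flatten it first. This is a minimal sketch; `flatten_vectors` is a hypothetical helper and not part of lumina_data.

```python
def flatten_vectors(rows, dim):
    """Flatten a list of per-vector lists into one flat list of floats.

    Hypothetical helper (not part of lumina_data): checks that every
    row has exactly `dim` components before flattening.
    """
    flat = []
    for i, row in enumerate(rows):
        if len(row) != dim:
            raise ValueError("vector %d has %d components, expected %d"
                             % (i, len(row), dim))
        flat.extend(float(x) for x in row)
    return flat


rows = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
flat = flatten_vectors(rows, dim=3)
n = len(rows)
```

The result could then be passed as `builder.insert_from_list(flat, ids, n, 3)`.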

Raw ctypes API (zero-copy, for performance-critical code)

import ctypes
import random

from lumina_data import LuminaBuilder, LuminaSearcher

options = {
    "index.type": "diskann",
    "index.dimension": "128",
    "distance.metric": "l2",
    "encoding.type": "rawf32",
}

n, dim, k = 10000, 128, 10

# Build
# Placeholder data: replace with your own iterable of n*dim floats.
data = [random.random() for _ in range(n * dim)]
vectors = (ctypes.c_float * (n * dim))(*data)
ids = (ctypes.c_uint64 * n)(*range(n))

with LuminaBuilder(options) as builder:
    builder.pretrain(vectors, n, dim)
    builder.insert(vectors, ids, n, dim)
    builder.dump("/path/to/index.lmi")

# Search
with LuminaSearcher(options) as searcher:
    searcher.open("/path/to/index.lmi")

    query_data = [0.1] * dim  # placeholder query values
    query = (ctypes.c_float * dim)(*query_data)
    distances = (ctypes.c_float * k)()
    labels = (ctypes.c_uint64 * k)()

    searcher.search(query, 1, k, distances, labels,
                    {"diskann.search.list_size": "32"})

    for i in range(k):
        print("id=%d  distance=%.4f" % (labels[i], distances[i]))
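
Constructing a ctypes array with `(ctypes.c_float * n)(*data)` copies every element out of a Python iterable. When the floats already sit in a writable buffer, such as an `array.array("f")`, the standard `from_buffer()` class method wraps that memory without copying. The sketch below uses only the standard library. The result is an ordinary ctypes array, so it should be usable wherever the examples above pass one, but that has not been verified against lumina_data here.

```python
import array
import ctypes

dim = 4
# A typed array of 32-bit floats, e.g. filled by reading a binary file.
buf = array.array("f", [0.5, 1.5, 2.5, 3.5])

# from_buffer() creates a ctypes array that shares memory with `buf`
# instead of copying it element by element.
query = (ctypes.c_float * dim).from_buffer(buf)

# A write through the ctypes view is visible in the original array.
query[0] = 9.0
shared = buf[0]
```

While the ctypes view is alive, `buf` cannot be resized, which keeps the underlying memory stable for the native call.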

Filtered Search

# High-level
distances, labels = searcher.search_with_filter_list(
    query, n=1, k=10, filter_ids=[0, 2, 4, 6, 8])

# Raw ctypes (query, distances, and labels are ctypes arrays, as in the raw example above)
filter_arr = (ctypes.c_uint64 * 5)(0, 2, 4, 6, 8)
searcher.search_with_filter(
    query, 1, k, filter_arr, 5, distances, labels)
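
The raw filtered call takes both the ID array and its length, so hard-coding `5` in two places is easy to get wrong. A minimal sketch of building both from one Python list follows; `make_filter_array` is a hypothetical helper, not part of lumina_data. It rejects negative IDs because ctypes does not range-check integers and would silently wrap them into large unsigned values.

```python
import ctypes


def make_filter_array(filter_ids):
    """Return (ctypes uint64 array, count) for a list of allowed IDs.

    Hypothetical helper, not part of lumina_data.
    """
    ids = list(filter_ids)
    for value in ids:
        if value < 0:
            # ctypes would wrap a negative value instead of raising.
            raise ValueError("filter IDs must be non-negative, got %d" % value)
    arr = (ctypes.c_uint64 * len(ids))(*ids)
    return arr, len(ids)


filter_arr, filter_count = make_filter_array([0, 2, 4, 6, 8])
```

The pair can then stand in for the `filter_arr, 5` arguments of `search_with_filter`.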

Batch Queries

# High-level
all_queries = [...]  # list of n_queries * dim floats
distances, labels = searcher.search_list(all_queries, n=5, k=10)

# Raw ctypes
queries = (ctypes.c_float * (5 * dim))(*all_queries)
distances = (ctypes.c_float * (5 * k))()
labels = (ctypes.c_uint64 * (5 * k))()
searcher.search(queries, 5, k, distances, labels)
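
Batch results come back in flat buffers of n*k entries. The sketch below groups them into one list of (label, distance) pairs per query. It assumes that the results for query i occupy positions i*k through (i+1)*k - 1, which matches the buffer sizes above but is not stated in this README. `split_results` is a hypothetical helper that works on both Python lists and ctypes arrays.

```python
def split_results(distances, labels, n_queries, k):
    """Group flat result buffers into per-query lists of (label, distance).

    Hypothetical helper. Assumes a row-major layout: the k results for
    query i start at index i * k.
    """
    results = []
    for q in range(n_queries):
        start = q * k
        pairs = list(zip(labels[start:start + k],
                         distances[start:start + k]))
        results.append(pairs)
    return results


# Stand-in buffers for 2 queries with k=2 results each.
flat_distances = [0.1, 0.2, 0.3, 0.4]
flat_labels = [7, 3, 9, 1]
per_query = split_results(flat_distances, flat_labels, n_queries=2, k=2)
```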

Metadata

from lumina_data import LuminaIndexMeta

# Serialize (compatible with paimon-lumina Java and paimon-cpp)
meta = LuminaIndexMeta({
    "index.dimension": "128",
    "distance.metric": "l2",
    "index.type": "diskann",
    "encoding.type": "rawf32",
})
data = meta.serialize()       # -> bytes (JSON)

# Deserialize
meta = LuminaIndexMeta.deserialize(data)
print(meta.dim, meta.metric)  # 128, MetricType.L2
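
Because `serialize()` returns JSON as bytes, the payload can be inspected with the standard library. The sketch below uses a stand-in payload, since the exact schema that LuminaIndexMeta writes is not described here. The key names are assumptions copied from the options above and may differ from the real output.

```python
import json

# Stand-in for the bytes returned by meta.serialize(); the real field
# names and structure may differ.
payload = json.dumps({
    "index.dimension": "128",
    "distance.metric": "l2",
}).encode("utf-8")

# Decode the bytes and read a field back.
decoded = json.loads(payload.decode("utf-8"))
dimension = int(decoded["index.dimension"])
```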

API Reference

LuminaBuilder

Method                                  Input          Description
__init__(options)                       dict           Create builder with native Lumina options.
pretrain(vectors, n, dim)               ctypes arrays  Pretrain with n vectors.
insert(vectors, ids, n, dim)            ctypes arrays  Insert vectors with IDs.
pretrain_from_list(vectors, n, dim)     Python lists   High-level pretrain.
insert_from_list(vectors, ids, n, dim)  Python lists   High-level insert.
dump(path)                              str            Write index to file.
close()                                 -              Release native resources. The builder also works as a context manager (with).

LuminaSearcher

Method                                                     Input/Output       Description
__init__(options)                                          dict               Create searcher.
open(path)                                                 str                Load index from file.
search(q, n, k, dist, labels, opts)                        ctypes in/out      Raw search.
search_with_filter(q, n, k, fids, fc, dist, labels, opts)  ctypes in/out      Raw filtered search.
search_list(q, n, k, opts)                                 list in, list out  High-level search.
search_with_filter_list(q, n, k, fids, opts)               list in, list out  High-level filtered search.
get_count()                                                -                  Number of vectors in index.
get_dimension()                                            -                  Vector dimension.
close()                                                    -                  Release native resources. The searcher also works as a context manager (with).

Index Options

Key                            Values                     Default
index.type                     bruteforce, diskann, ivf   diskann
index.dimension                integer                    128
distance.metric                l2, cosine, inner_product  inner_product
encoding.type                  rawf32, sq8, pq            pq
diskann.build.ef_construction  integer                    1024
diskann.build.neighbor_count   integer                    64
diskann.build.thread_count     integer                    32
diskann.search.list_size       integer                    auto (1.5x top_k)
diskann.search.beam_width      integer                    4
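
A typo in an option value is easier to catch before it reaches the native library. The sketch below checks an options dict against the values listed in the table above. `check_options` is a hypothetical helper, not part of lumina_data. It assumes that all values are passed as strings, as in the usage examples, and it lets unknown keys through unchecked.

```python
# Allowed values and integer-valued keys, copied from the options table.
ALLOWED_VALUES = {
    "index.type": {"bruteforce", "diskann", "ivf"},
    "distance.metric": {"l2", "cosine", "inner_product"},
    "encoding.type": {"rawf32", "sq8", "pq"},
}
INTEGER_KEYS = {
    "index.dimension",
    "diskann.build.ef_construction",
    "diskann.build.neighbor_count",
    "diskann.build.thread_count",
    "diskann.search.list_size",
    "diskann.search.beam_width",
}


def check_options(options):
    """Return a list of problems found in an options dict (empty if none)."""
    problems = []
    for key, value in options.items():
        if not isinstance(value, str):
            # Assumption: values are strings, as in the README examples.
            problems.append("%s: expected a string, got %s"
                            % (key, type(value).__name__))
        elif key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            problems.append("%s: unknown value %r" % (key, value))
        elif key in INTEGER_KEYS and not value.isdigit():
            problems.append("%s: expected an integer string, got %r"
                            % (key, value))
    return problems


good = {"index.type": "diskann", "index.dimension": "128",
        "distance.metric": "l2", "encoding.type": "rawf32"}
bad = {"index.type": "hnsw", "index.dimension": 128}
good_problems = check_options(good)
bad_problems = check_options(bad)
```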

Performance

Query latency compared to native C++ (DiskANN, 100K vectors, dim=128, top-10):

API             Avg latency  Throughput  Latency vs C++
C++ native      0.367 ms     2724 qps    baseline
Raw ctypes      0.370 ms     2705 qps    +0.8%
High-level API  0.494 ms     2024 qps    +34%

Raw ctypes calls add less than 1% latency over native C++. The high-level API's extra latency (about 34%) comes from converting Python lists to ctypes arrays on every call.
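
To get a feel for the conversion cost on your own machine, the per-call allocation can be timed in isolation with `timeit`. This sketch only measures building the ctypes buffers for one query with dim=128 and k=10. It is an assumption that the high-level API performs an equivalent conversion internally, and the sketch does not call lumina_data at all.

```python
import ctypes
import timeit

dim, k = 128, 10
query = [0.1] * dim


def convert():
    """Allocate the buffers one raw search call would need."""
    q = (ctypes.c_float * dim)(*query)   # copies the Python list
    distances = (ctypes.c_float * k)()   # zero-initialized output buffer
    labels = (ctypes.c_uint64 * k)()     # zero-initialized output buffer
    return q, distances, labels


runs = 10000
seconds = timeit.timeit(convert, number=runs)
per_call_us = seconds / runs * 1e6
print("list -> ctypes conversion: %.2f us per call" % per_call_us)
```

Reusing pre-allocated buffers across calls, as the raw ctypes API allows, avoids paying this cost on every query.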

Packaging & Publishing

Build wheel

pip install wheel setuptools
python setup.py bdist_wheel

Upload to PyPI

pip install twine

# Rename for PyPI (requires manylinux tag)
mv dist/lumina_data-0.1.0-*.whl dist/lumina_data-0.1.0-cp36-cp36m-manylinux1_x86_64.whl

# Upload
twine upload dist/*.whl

Install from PyPI

pip install lumina-data

Tests

python3 tests/test_lumina_index.py

License

Apache License 2.0

