Python SDK for Lumina vector search engine

lumina-data

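Python SDK for the Lumina vector search engine. Provides low-overhead ctypes bindings to the Lumina C++ library for building and searching vector indexes (DiskANN, Bruteforce, IVF). In the benchmark in the Performance section, the raw ctypes API adds under 1% latency over native C++.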

Requirements

  • Linux x86_64
  • Python >= 3.6

Install

pip install .

Pre-built native libraries are bundled in the package. No compilation needed.

Usage

High-level API (list in, list out)

from lumina_data import LuminaBuilder, LuminaSearcher

options = {
    "index.type": "diskann",
    "index.dimension": "128",
    "distance.metric": "l2",
    "encoding.type": "rawf32",
}

# Build
n, dim = 10000, 128
vectors = [...]  # list of n*dim floats
ids = list(range(n))

builder = LuminaBuilder(options)
builder.pretrain_from_list(vectors, n, dim)
builder.insert_from_list(vectors, ids, n, dim)
builder.dump("/path/to/index.lmi")
builder.close()

# Search
searcher = LuminaSearcher(options)
searcher.open("/path/to/index.lmi")

query = [0.1, 0.2, ...]  # list of dim floats
distances, labels = searcher.search_list(query, n=1, k=10)

for i in range(len(labels)):
    print("id=%d  distance=%.4f" % (labels[i], distances[i]))

searcher.close()

Raw ctypes API (zero-copy, for performance-critical code)

import ctypes
from lumina_data import LuminaBuilder, LuminaSearcher

options = {
    "index.type": "diskann",
    "index.dimension": "128",
    "distance.metric": "l2",
    "encoding.type": "rawf32",
}

n, dim, k = 10000, 128, 10

# Build
# data: flat sequence of n * dim floats, supplied by the caller
vectors = (ctypes.c_float * (n * dim))(*data)
ids = (ctypes.c_uint64 * n)(*range(n))

with LuminaBuilder(options) as builder:
    builder.pretrain(vectors, n, dim)
    builder.insert(vectors, ids, n, dim)
    builder.dump("/path/to/index.lmi")

# Search
with LuminaSearcher(options) as searcher:
    searcher.open("/path/to/index.lmi")

    # query_data: sequence of dim floats, supplied by the caller
    query = (ctypes.c_float * dim)(*query_data)
    distances = (ctypes.c_float * k)()
    labels = (ctypes.c_uint64 * k)()

    searcher.search(query, 1, k, distances, labels,
                    {"diskann.search.list_size": "32"})

    for i in range(k):
        print("id=%d  distance=%.4f" % (labels[i], distances[i]))
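The raw example builds its ctypes arrays by unpacking a Python sequence, which copies every element. If the data already sits in a writable float32 buffer, it can be wrapped instead of copied. The sketch below uses only the standard library (array and ctypes.from_buffer). The helper name as_c_float_array is illustrative and not part of lumina_data, and it is an assumption that Lumina treats such an array the same as one built with (ctypes.c_float * n)(*data).

```python
import array
import ctypes


def as_c_float_array(buf):
    """Wrap a writable float32 buffer as a ctypes array without copying.

    Hypothetical helper, not part of lumina_data.
    """
    if buf.itemsize != ctypes.sizeof(ctypes.c_float):
        raise ValueError("buffer must hold 32-bit floats")
    return (ctypes.c_float * len(buf)).from_buffer(buf)


n, dim = 4, 3
data = array.array("f", [float(i) for i in range(n * dim)])
vectors = as_c_float_array(data)

# Both objects share memory: a write through one is visible in the other.
vectors[0] = 42.0
print(len(vectors), data[0])
```

While the ctypes view exists, the underlying array.array cannot be resized, so build the buffer to its final size first.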

Filtered Search

# High-level
distances, labels = searcher.search_with_filter_list(
    query, n=1, k=10, filter_ids=[0, 2, 4, 6, 8])

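# Raw ctypes (query is the list of dim floats used above)
query_arr = (ctypes.c_float * dim)(*query)
filter_arr = (ctypes.c_uint64 * 5)(0, 2, 4, 6, 8)
distances = (ctypes.c_float * k)()
labels = (ctypes.c_uint64 * k)()
searcher.search_with_filter(
    query_arr, 1, k, filter_arr, 5, distances, labels)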
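In the raw filtered call, the filter array and its element count are passed separately, so the two can drift apart when the ID list changes. A small helper can derive both from one Python list. This is a sketch using only ctypes; to_uint64_array is a hypothetical name, not a lumina_data function. The sign check is there because ctypes does not range-check integers and would silently wrap a negative ID to a large unsigned value.

```python
import ctypes


def to_uint64_array(ids):
    """Copy a sequence of non-negative ints into a ctypes uint64 array.

    Returns (array, count). Hypothetical helper, not part of lumina_data.
    """
    ids = list(ids)
    if any(i < 0 for i in ids):
        raise ValueError("IDs must be non-negative")
    return (ctypes.c_uint64 * len(ids))(*ids), len(ids)


filter_arr, filter_count = to_uint64_array([0, 2, 4, 6, 8])
print(filter_count, list(filter_arr))
```

The returned pair can then stand in for the filter array and count arguments of search_with_filter.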

Batch Queries

# High-level
all_queries = [...]  # list of n_queries * dim floats
distances, labels = searcher.search_list(all_queries, n=5, k=10)

# Raw ctypes (5 queries; all_queries holds 5 * dim floats)
queries = (ctypes.c_float * (5 * dim))(*all_queries)
distances = (ctypes.c_float * (5 * k))()
labels = (ctypes.c_uint64 * (5 * k))()
searcher.search(queries, 5, k, distances, labels)
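A batch search fills flat output buffers of n * k entries. The sketch below groups them back into one result list per query. It assumes that query i owns slots i*k through (i+1)*k - 1, which the buffer sizes suggest but the text does not state. split_results is a hypothetical helper, and the buffers are filled by hand here rather than by searcher.search().

```python
import ctypes


def split_results(distances, labels, n, k):
    """Group flat n*k result buffers into per-query lists of (label, distance).

    Assumes query i owns slots [i*k, (i+1)*k). Hypothetical helper.
    """
    if len(distances) != n * k or len(labels) != n * k:
        raise ValueError("expected buffers of length n * k")
    return [
        [(labels[j], distances[j]) for j in range(i * k, (i + 1) * k)]
        for i in range(n)
    ]


# Stand-in buffers filled by hand instead of by searcher.search().
n, k = 2, 3
distances = (ctypes.c_float * (n * k))(0.0, 0.5, 1.0, 0.25, 0.75, 1.5)
labels = (ctypes.c_uint64 * (n * k))(7, 3, 9, 1, 4, 8)

per_query = split_results(distances, labels, n, k)
print(per_query[1])
```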

Metadata

from lumina_data import LuminaIndexMeta

# Serialize (compatible with paimon-lumina Java and paimon-cpp)
meta = LuminaIndexMeta({
    "index.dimension": "128",
    "distance.metric": "l2",
    "index.type": "diskann",
    "encoding.type": "rawf32",
})
data = meta.serialize()       # -> bytes (JSON)

# Deserialize
meta = LuminaIndexMeta.deserialize(data)
print(meta.dim, meta.metric)  # 128, MetricType.L2
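Because serialize() is described as returning JSON bytes, the payload can be inspected with the standard json module. The sketch below uses a stand-in payload built by hand. The real schema written by LuminaIndexMeta is not documented here and may differ, so the keys shown are illustrative only.

```python
import json

# Stand-in for the bytes returned by meta.serialize(); the real payload may differ.
data = json.dumps({
    "index.dimension": "128",
    "distance.metric": "l2",
}).encode("utf-8")

# Decode the bytes and list the stored keys in a stable order.
decoded = json.loads(data.decode("utf-8"))
for key in sorted(decoded):
    print(key, "=", decoded[key])
```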

API Reference

LuminaBuilder

| Method | Input | Description |
| --- | --- | --- |
| __init__(options) | dict | Create builder with native Lumina options. |
| pretrain(vectors, n, dim) | ctypes arrays | Pretrain with n vectors. |
| insert(vectors, ids, n, dim) | ctypes arrays | Insert vectors with IDs. |
| pretrain_from_list(vectors, n, dim) | Python lists | High-level pretrain. |
| insert_from_list(vectors, ids, n, dim) | Python lists | High-level insert. |
| dump(path) | str | Write index to file. |
| close() | | Release native resources. Supports with. |

LuminaSearcher

| Method | Input/Output | Description |
| --- | --- | --- |
| __init__(options) | dict | Create searcher. |
| open(path) | str | Load index from file. |
| search(q, n, k, dist, labels, opts) | ctypes in/out | Raw search. |
| search_with_filter(q, n, k, fids, fc, dist, labels, opts) | ctypes in/out | Raw filtered search. |
| search_list(q, n, k, opts) | list in, list out | High-level search. |
| search_with_filter_list(q, n, k, fids, opts) | list in, list out | High-level filtered search. |
| get_count() | | Number of vectors in index. |
| get_dimension() | | Vector dimension. |
| close() | | Release native resources. Supports with. |

Index Options

| Key | Values | Default |
| --- | --- | --- |
| index.type | bruteforce, diskann, ivf | diskann |
| index.dimension | integer | 128 |
| distance.metric | l2, cosine, inner_product | inner_product |
| encoding.type | rawf32, sq8, pq | pq |
| diskann.build.ef_construction | integer | 1024 |
| diskann.build.neighbor_count | integer | 64 |
| diskann.build.thread_count | integer | 32 |
| diskann.search.list_size | integer | auto (1.5x top_k) |
| diskann.search.beam_width | integer | 4 |
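Every example in this README passes option values as strings, including integers such as "128". The sketch below normalizes a mixed dict to strings and checks the enumerated keys against the values listed in the table. It is a hypothetical helper, not part of lumina_data, and it is an assumption that the library requires string values rather than merely accepting them.

```python
# Allowed values copied from the Index Options table.
ALLOWED = {
    "index.type": {"bruteforce", "diskann", "ivf"},
    "distance.metric": {"l2", "cosine", "inner_product"},
    "encoding.type": {"rawf32", "sq8", "pq"},
}


def make_options(raw):
    """Return a copy of raw with str values, validating enumerated keys.

    Hypothetical helper, not part of lumina_data.
    """
    options = {}
    for key, value in raw.items():
        text = str(value)
        allowed = ALLOWED.get(key)
        if allowed is not None and text not in allowed:
            raise ValueError("%s must be one of %s, got %r"
                             % (key, sorted(allowed), text))
        options[key] = text
    return options


options = make_options({
    "index.type": "diskann",
    "index.dimension": 128,
    "distance.metric": "l2",
    "diskann.search.list_size": 32,
})
print(options)
```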

Performance

Query latency compared to native C++ (DiskANN, 100K vectors, dim=128, top-10):

| API | Avg Latency | Throughput | vs C++ |
| --- | --- | --- | --- |
| C++ native | 0.367 ms | 2724 qps | baseline |
| Raw ctypes | 0.370 ms | 2705 qps | +0.8% |
| High-level API | 0.494 ms | 2024 qps | +34% |

Raw ctypes adds < 1% overhead. High-level API overhead comes from list -> ctypes conversion per call.
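To get a feel for the conversion cost on a given machine, the list-to-ctypes step can be timed on its own with the standard library. This is a rough sketch, not the benchmark behind the table above. It measures only the input conversion for one 128-dimensional query, and the numbers will vary by interpreter and hardware.

```python
import ctypes
import timeit

dim = 128
query = [0.1] * dim


def convert():
    """Build a fresh ctypes float array from the Python list."""
    return (ctypes.c_float * dim)(*query)


# Converting once and reusing the array avoids paying this cost on every call.
prebuilt = convert()

calls = 10000
per_call_us = timeit.timeit(convert, number=calls) / calls * 1e6
print("list -> ctypes conversion: %.2f us per call" % per_call_us)
```

If the same query buffer is reused across calls, or the data never lives in a Python list at all, this per-call cost does not apply, which is the case the raw ctypes API targets.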

Packaging & Publishing

Build wheel

pip install wheel setuptools
python setup.py bdist_wheel

Upload to PyPI

pip install twine

# Rename for PyPI (requires manylinux tag)
cd dist
mv lumina_data-0.1.0-*.whl lumina_data-0.1.0-cp36-cp36m-manylinux1_x86_64.whl
cd ..

# Upload (run from the project root)
twine upload dist/*.whl

Install from PyPI

pip install lumina-data

Tests

python3 tests/test_lumina_index.py

License

Apache License 2.0

Built Distribution

lumina_data-0.1.0-py3-none-manylinux2014_x86_64.whl (14.8 MB, uploaded for Python 3)

File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | d97cb104af0d8490c865dc7631caad1f596e1c0acd50d3fa8ce379d00f5c7122 |
| MD5 | 0100b38c79f0ea504c430e809df400b0 |
| BLAKE2b-256 | 240fd73457c90419e7350c9e894e1a891a5f3a13731d4f5259aeccf1e3217565 |
