Unified tabular + vector storage in a single Iceberg-compatible file

ailake — AI-Lake Format Python SDK

Unified storage for tabular data, embeddings, and an HNSW vector index in a single Parquet-compatible file. 100% compatible with Apache Iceberg Spec v2.

Install

pip install ailake

Requires Python ≥ 3.9. Dependencies: pyarrow >= 14.0, numpy >= 1.24.

Quickstart

Write

import ailake
import numpy as np

writer = ailake.TableWriter(
    path="./my_table",
    vector_column="embedding",  # default
    dim=1536,                   # default
    metric="cosine",            # cosine | euclidean | dot_product
    pre_normalize=True,         # normalize to unit L2 at write time (recommended for cosine)
                                # enables NormalizedCosine fast path: 1-dot(a,b), no sqrt
    hnsw_m=16,                  # HNSW connections per node (default 16; 32 = higher recall)
    hnsw_ef_construction=150,   # HNSW build quality (default 150; 400 = max quality)
)

texts = ["Document about AI", "Another document"]
embeddings = np.random.rand(2, 1536).astype(np.float32).tolist()

writer.write_batch(texts=texts, embeddings=embeddings)
snapshot_id = writer.commit()

TableWriter parameters

| Parameter | Default | Description |
|---|---|---|
| path | required | Table root path (local or s3://, gs://, az://) |
| vector_column | "embedding" | Vector column name |
| dim | 1536 | Vector dimension |
| metric | "cosine" | cosine, euclidean, dot_product |
| pre_normalize | False | Normalize to unit L2 at write time (recommended for cosine). Enables 1-dot(a,b) fast path. |
| hnsw_m | None (=16) | HNSW connections per node. Higher → better recall, more memory. |
| hnsw_ef_construction | None (=150) | HNSW build pool size. Higher → better quality, slower build. |
| rabitq | False | Use RaBitQ flat index instead of HNSW: 1 bit/dim = 16× smaller than F16. Better recall than naive binary quantization. Use with rerank_factor ≥ 3 at search. |
| rabitq_seed | 0 | Seed for RaBitQ random rotation matrix. |
| rabitq_keep_raw | True | Keep raw F16 vectors for exact reranking (recommended). |
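
The pre_normalize fast path rests on a simple identity: for unit-length vectors, cosine distance reduces to 1-dot(a,b). The NumPy sketch below illustrates the identity only. It is not ailake's internal code, and the helper names are hypothetical.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0.0, 1.0, norms)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """General cosine distance: needs both norms, hence square roots."""
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

def normalized_cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance for unit vectors: one dot product, no sqrt."""
    return 1.0 - float(a @ b)

rng = np.random.default_rng(0)
raw = rng.random((2, 1536)).astype(np.float32)
unit = l2_normalize(raw)  # what normalizing at write time would store

slow = cosine_distance(raw[0], raw[1])
fast = normalized_cosine_distance(unit[0], unit[1])
print(slow, fast)  # the two values agree up to float32 rounding
```

Normalizing once at write time moves the norm computation out of every distance evaluation at search time.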

HNSW tuning guide:

| Goal | hnsw_m | hnsw_ef_construction |
|---|---|---|
| Low latency / high QPS | 8 | 100 |
| General purpose (default) | 16 | 150 |
| High recall (RAG) | 24 | 200 |
| Max recall (medical, legal) | 32 | 400 |
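
One way to keep these presets in code is a small lookup that produces keyword arguments for TableWriter. The preset keys and the writer_kwargs helper below are hypothetical conveniences, not part of the ailake API. Only the parameter names and values come from the table above.

```python
# Values copied from the tuning guide; the dict keys are made-up labels.
HNSW_PRESETS = {
    "low_latency": {"hnsw_m": 8, "hnsw_ef_construction": 100},
    "general": {"hnsw_m": 16, "hnsw_ef_construction": 150},
    "high_recall": {"hnsw_m": 24, "hnsw_ef_construction": 200},
    "max_recall": {"hnsw_m": 32, "hnsw_ef_construction": 400},
}

def writer_kwargs(path, goal="general", **overrides):
    """Build TableWriter keyword arguments from a named preset."""
    if goal not in HNSW_PRESETS:
        raise ValueError(f"unknown goal {goal!r}; expected one of {sorted(HNSW_PRESETS)}")
    kwargs = {"path": path, **HNSW_PRESETS[goal]}
    kwargs.update(overrides)  # explicit arguments win over the preset
    return kwargs

kwargs = writer_kwargs("./my_table", goal="high_recall", metric="cosine")
print(kwargs)
# With ailake installed, the writer could then be created as:
# writer = ailake.TableWriter(**kwargs)
```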

RaBitQ — extreme compression (1 bit/dim)

RaBitQ is a flat index with no graph construction. Each vector is stored at 1 bit/dim after a modified Gram-Schmidt orthonormal rotation, and an unbiased XOR/popcount inner-product estimator gives better recall than naive binary quantization. Write throughput is ~163k vec/s (measured on SIFT-1M; no k-means, no graph). Storage is 200 bytes/vector at dim=1536 (15× smaller than F16). Search is a sequential O(N) flat scan; shard-level parallelism is handled automatically.

Use RaBitQ when storage is the primary constraint or write throughput matters more than recall. It is designed for cosine workloads; recall on Euclidean datasets is lower (~0.67 at rerank_factor=3 on SIFT-1M). Pair it with rerank_factor ≥ 3 (cosine) or ≥ 10 (Euclidean/complex) to recover precision using the stored raw F16 vectors.
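
The storage figures can be checked with a little arithmetic. The sketch below computes the size of the bit codes alone and packs sign bits with NumPy. The rotation here comes from a QR decomposition as a stand-in for the library's modified Gram-Schmidt step, and the code is illustrative rather than ailake's implementation. The 8-byte gap between the 192 bytes of codes and the ~200 bytes/vector quoted in this section is presumably per-vector metadata, which is an assumption.

```python
import numpy as np

dim = 1536
code_bytes = dim // 8           # 1 bit per dimension  -> 192 bytes
f16_bytes = dim * 2             # 2 bytes per dimension -> 3072 bytes
ratio = f16_bytes / code_bytes  # 16.0 for the bit codes alone

rng = np.random.default_rng(42)  # fixed seed so the rotation is reproducible

# Random orthonormal rotation via QR (stand-in for the library's rotation).
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

vectors = rng.standard_normal((4, dim)).astype(np.float32)
rotated = vectors @ rotation

# Keep only the sign of each rotated coordinate and pack 8 signs per byte.
codes = np.packbits(rotated > 0, axis=1)

print(code_bytes, f16_bytes, ratio, codes.shape)
```

Reusing the same seed for every shard gives every shard the same rotation, which is why the writer example in this section fixes rabitq_seed.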

import ailake
import numpy as np

# Sample data (random vectors stand in for real embeddings)
texts = ["Document about AI", "Another document"]
embeddings = np.random.rand(2, 1536).astype(np.float32).tolist()
query = np.random.rand(1536).astype(np.float32).tolist()

# Write with RaBitQ (rabitq_keep_raw=True stores F16 vectors for reranking)
writer = ailake.TableWriter(
    path="./rabitq_table",
    dim=1536,
    metric="cosine",
    rabitq=True,
    rabitq_seed=42,       # same seed across all shards → comparable distances
    rabitq_keep_raw=True, # recommended: enables reranking
)
writer.write_batch(texts=texts, embeddings=embeddings)
writer.commit()

# Search with reranking for best recall
results = ailake.search(
    path="./rabitq_table",
    query=query,
    top_k=10,
    rerank_factor=10,  # recommended: ≥ 3 for most cosine, ≥ 10 for complex datasets
)

| Index | Bytes/vector (dim=1536) | Recall@10, cosine (rerank ≥ 3) | Write (vec/s) |
|---|---|---|---|
| HNSW (F16) | ~3,200 | ≥ 0.95 | ~50k |
| IVF-PQ (M=48) | ~50 | 0.90–0.95 | ~200k |
| RaBitQ (no raw) | 192 | 0.70–0.85 | ~163k |
| RaBitQ + raw F16 | ~3,264 | 0.85–0.95 | ~163k |
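
To make the role of reranking concrete, here is a minimal two-stage search sketch. It assumes rerank_factor means "fetch top_k × rerank_factor candidates with the cheap codes, then rescore them exactly", which this page does not state. Stage 1 uses a naive sign-bit Hamming distance rather than RaBitQ's unbiased estimator, and the function name is hypothetical.

```python
import numpy as np

def two_stage_search(query, vectors, codes, query_code, top_k=10, rerank_factor=3):
    """Approximate candidate selection followed by exact reranking."""
    # Stage 1: Hamming distance between 1-bit codes (XOR, then popcount).
    hamming = np.unpackbits(codes ^ query_code, axis=1).sum(axis=1)
    n_candidates = min(len(vectors), top_k * rerank_factor)
    candidates = np.argsort(hamming, kind="stable")[:n_candidates]

    # Stage 2: exact cosine distance on the candidates only.
    # Vectors are unit length, so the distance is 1 - dot(a, b).
    exact = 1.0 - vectors[candidates] @ query
    order = np.argsort(exact, kind="stable")[:top_k]
    return candidates[order], exact[order]

rng = np.random.default_rng(0)
dim, n = 64, 500
vectors = rng.standard_normal((n, dim)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
codes = np.packbits(vectors > 0, axis=1)  # naive sign-bit codes, 8 bytes each

query, query_code = vectors[7], codes[7]  # search for a vector already in the set
ids, dists = two_stage_search(query, vectors, codes, query_code, top_k=5, rerank_factor=3)
print(ids, dists)
```

A larger rerank_factor widens the candidate pool, so a true neighbour that the coarse codes ranked poorly still has a chance to be rescored.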

Search

import ailake
import numpy as np

query = np.random.rand(1536).astype(np.float32).tolist()

results = ailake.search(
    path="./my_table",
    query=query,
    top_k=10,
)

for r in results:
    print(r["row_id"], r["distance"], r["file"])
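
Because results are plain dicts, post-processing needs no special API. The sketch below filters by a distance cutoff and groups row IDs by source file. The mock results, file names, and the filter_and_group helper are invented for illustration; only the three keys come from the example above.

```python
from collections import defaultdict

# Mock results in the same shape as ailake.search output (values are made up).
results = [
    {"row_id": 12, "distance": 0.08, "file": "a.parquet"},
    {"row_id": 3, "distance": 0.41, "file": "a.parquet"},
    {"row_id": 57, "distance": 0.15, "file": "b.parquet"},
    {"row_id": 9, "distance": 0.22, "file": "a.parquet"},
]

def filter_and_group(results, max_distance):
    """Keep hits within max_distance and group their row IDs by file."""
    grouped = defaultdict(list)
    for r in sorted(results, key=lambda r: r["distance"]):
        if r["distance"] <= max_distance:
            grouped[r["file"]].append(r["row_id"])
    return dict(grouped)

by_file = filter_and_group(results, max_distance=0.30)
print(by_file)
```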

Assemble context for LLMs

import ailake

chunks = [
    {
        "document_id": "doc-1",
        "chunk_index": 0,
        "chunk_text": "AI-Lake stores vectors and tabular data together.",
        "document_title": "AI-Lake Overview",
        "section_path": "Introduction",
        "source_uri": "s3://my-lake/docs/overview.pdf",
        "distance": 0.12,
    },
]

context_xml = ailake.assemble_context(
    chunks=chunks,
    max_tokens=4096,       # token budget (4 chars ≈ 1 token)
    dedup_threshold=0.05,  # drop near-duplicate chunks
)

# Pass context_xml directly to Claude / GPT-4 as a user message
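
The token budget relies on the "4 chars ≈ 1 token" heuristic noted above. A minimal sketch of how such a budget could be applied follows. It assumes chunks are taken in order of increasing distance and that a chunk which does not fit is skipped. Neither behaviour is confirmed for assemble_context, and both helper names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the 4-characters-per-token heuristic."""
    return max(1, len(text) // 4)

def select_within_budget(chunks, max_tokens):
    """Greedily pick the closest chunks whose estimated tokens fit the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["distance"]):
        cost = estimate_tokens(chunk["chunk_text"])
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a later chunk may fit
        selected.append(chunk)
        used += cost
    return selected, used

demo_chunks = [
    {"chunk_text": "a" * 40, "distance": 0.1},  # ~10 tokens
    {"chunk_text": "b" * 80, "distance": 0.2},  # ~20 tokens
    {"chunk_text": "c" * 20, "distance": 0.3},  # ~5 tokens
]
selected, used = select_within_budget(demo_chunks, max_tokens=16)
print(len(selected), used)
```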

API

TableWriter(path, vector_column="embedding", dim=1536, metric="cosine")

Opens or creates an AI-Lake table at path. Local filesystem only in this release.

| Method | Description |
|---|---|
| write_batch(texts, embeddings) | Stage a batch of rows. texts: list[str], embeddings: list[list[float]] |
| commit() -> int | Commit staged batches as a new Iceberg snapshot. Returns snapshot ID. |

search(path, query, top_k=10) -> list[dict]

Returns up to top_k nearest neighbours. Each result: {"row_id": int, "distance": float, "file": str}.

assemble_context(chunks, max_tokens=4096, dedup_threshold=0.05) -> str

Assembles a list of chunk dicts into structured XML ready for LLM input. Deduplicates near-identical chunks and respects the token budget.

Iceberg compatibility

Tables written by ailake are valid Apache Iceberg Spec v2 tables. Any Iceberg-compatible engine (Spark, Trino, DuckDB, PyIceberg) reads the tabular columns normally. The HNSW index lives in an AI-Lake extension section that standard Parquet readers silently ignore.

License

MIT OR Apache-2.0
