Unified tabular + vector storage in a single Iceberg-compatible file

ailake — AI-Lake Format Python SDK

Unified storage for tabular data, embeddings, and an HNSW vector index in a single Parquet-compatible file. 100% compatible with Apache Iceberg Spec v2.

Install

pip install ailake

Requires Python ≥ 3.9. Dependencies: pyarrow >= 14.0, numpy >= 1.24.

Quickstart

Write

import ailake
import numpy as np

writer = ailake.TableWriter(
    path="./my_table",
    vector_column="embedding",  # default
    dim=1536,                   # default
    metric="cosine",            # cosine | euclidean | dot_product
    pre_normalize=True,         # normalize to unit L2 at write time (recommended for cosine)
                                # enables NormalizedCosine fast path: 1-dot(a,b), no sqrt
    hnsw_m=16,                  # HNSW connections per node (default 16; 32 = higher recall)
    hnsw_ef_construction=150,   # HNSW build quality (default 150; 400 = max quality)
)

texts = ["Document about AI", "Another document"]
embeddings = np.random.rand(2, 1536).astype(np.float32).tolist()

writer.write_batch(texts=texts, embeddings=embeddings)
snapshot_id = writer.commit()

TableWriter parameters

| Parameter | Default | Description |
| --- | --- | --- |
| path | required | Table root path (local or s3://, gs://, az://) |
| vector_column | "embedding" | Vector column name |
| dim | 1536 | Vector dimension |
| metric | "cosine" | cosine, euclidean, dot_product |
| pre_normalize | False | Normalize to unit L2 at write time (recommended for cosine). Enables 1-dot(a,b) fast path. |
| hnsw_m | None (=16) | HNSW connections per node. Higher → better recall, more memory. |
| hnsw_ef_construction | None (=150) | HNSW build pool size. Higher → better quality, slower build. |
| rabitq | False | Use RaBitQ flat index instead of HNSW: 1 bit/dim = 16× smaller than F16. Better recall than naive binary quantization. Use with rerank_factor ≥ 3 at search. |
| rabitq_seed | 0 | Seed for RaBitQ random rotation matrix. |
| rabitq_keep_raw | True | Keep raw F16 vectors for exact reranking (recommended). |
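
The pre_normalize fast path rests on a simple identity: once both vectors have unit L2 norm, cosine distance reduces to 1 - dot(a, b), so no norms or square roots are needed at query time. A minimal sketch in plain Python follows; the helper names are illustrative and are not part of the ailake API.

```python
import math

def l2_normalize(v):
    """Scale v to unit L2 norm (what pre_normalize is described as doing at write time)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_distance(a, b):
    """General cosine distance: needs both norms, hence two square roots."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def normalized_cosine_distance(a_unit, b_unit):
    """Fast path for unit vectors: 1 - dot(a, b), no sqrt at query time."""
    return 1.0 - sum(x * y for x, y in zip(a_unit, b_unit))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
slow = cosine_distance(a, b)
fast = normalized_cosine_distance(l2_normalize(a), l2_normalize(b))
```

Both paths give the same distance up to floating-point rounding; the difference is that the normalization cost is paid once at write time instead of on every comparison.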

HNSW tuning guide:

| Goal | hnsw_m | hnsw_ef_construction |
| --- | --- | --- |
| Low latency / high QPS | 8 | 100 |
| General purpose (default) | 16 | 150 |
| High recall (RAG) | 24 | 200 |
| Max recall (medical, legal) | 32 | 400 |
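
One way to keep these presets in code is a small lookup table. The sketch below copies the values from the tuning guide; `HNSW_PRESETS` and `hnsw_params` are hypothetical helper names, not part of ailake.

```python
# Hypothetical helper: maps a workload goal to the hnsw_m / hnsw_ef_construction
# pairs listed in the tuning guide. Not part of the ailake API.
HNSW_PRESETS = {
    "low_latency": {"hnsw_m": 8, "hnsw_ef_construction": 100},
    "general": {"hnsw_m": 16, "hnsw_ef_construction": 150},
    "high_recall": {"hnsw_m": 24, "hnsw_ef_construction": 200},
    "max_recall": {"hnsw_m": 32, "hnsw_ef_construction": 400},
}

def hnsw_params(goal="general"):
    """Return a copy of the preset so callers can tweak it safely."""
    try:
        return dict(HNSW_PRESETS[goal])
    except KeyError:
        raise ValueError(
            f"unknown goal {goal!r}; choose from {sorted(HNSW_PRESETS)}"
        ) from None

params = hnsw_params("high_recall")
# The dict can then be unpacked into the writer, e.g.
# ailake.TableWriter(path="./my_table", dim=1536, **params)
```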

RaBitQ — extreme compression (1 bit/dim)

RaBitQ is a flat index with no graph construction. Each vector is stored at 1 bit/dim after a modified Gram-Schmidt orthonormal rotation, and an unbiased XOR/popcount inner-product estimator yields better recall than naive binary quantization. Write throughput is ~163k vec/s (no k-means, no graph; measured on SIFT-1M). Storage is 200 bytes/vector at dim=1536 (15× smaller than F16). Search is a sequential O(N) flat scan; shard-level parallelism is handled automatically.

Use RaBitQ when storage is the primary constraint or write throughput matters more than recall. It is designed for cosine workloads; recall on Euclidean datasets is lower (~0.67 at rerank_factor=3 on SIFT-1M). Pair it with rerank_factor ≥ 3 (cosine) or ≥ 10 (Euclidean/complex) to recover precision using the stored raw F16 vectors.
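
To make the storage figures concrete: at 1 bit per dimension, a 1536-dimensional vector needs 1536 / 8 = 192 bytes of codes, against 1536 × 2 = 3072 bytes as F16, a 16× reduction before any per-vector metadata. The sketch below shows only the sign-bit packing and the XOR/popcount comparison in plain Python. It deliberately omits the rotation and RaBitQ's unbiased estimator, and the function names are illustrative rather than taken from ailake.

```python
def sign_code(vector):
    """Pack one sign bit per dimension into an int (bit i is set if vector[i] >= 0)."""
    code = 0
    for i, x in enumerate(vector):
        if x >= 0.0:
            code |= 1 << i
    return code

def hamming(code_a, code_b):
    """Number of differing sign bits: XOR, then popcount."""
    return bin(code_a ^ code_b).count("1")

def code_bytes(dim):
    """Bytes needed for a 1-bit-per-dimension code."""
    return (dim + 7) // 8

dim = 1536
rabitq_code_size = code_bytes(dim)       # bytes of 1-bit codes
f16_size = dim * 2                       # bytes as 16-bit floats
ratio = f16_size // rabitq_code_size     # compression factor for codes alone

a = [0.5, -1.0, 0.25, -0.75]
b = [0.4, 1.0, 0.30, -0.10]
differing_bits = hamming(sign_code(a), sign_code(b))  # only dimension 1 differs in sign
```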

import ailake
import numpy as np

# Sample data so the snippet runs on its own
texts = ["Document about AI", "Another document"]
embeddings = np.random.rand(2, 1536).astype(np.float32).tolist()

# Write with RaBitQ (rabitq_keep_raw=True stores F16 vectors for reranking)
writer = ailake.TableWriter(
    path="./rabitq_table",
    dim=1536,
    metric="cosine",
    rabitq=True,
    rabitq_seed=42,       # same seed across all shards → comparable distances
    rabitq_keep_raw=True, # recommended: enables reranking
)
writer.write_batch(texts=texts, embeddings=embeddings)
writer.commit()

# Search with reranking for best recall
query = np.random.rand(1536).astype(np.float32).tolist()
results = ailake.search(
    path="./rabitq_table",
    query=query,
    top_k=10,
    rerank_factor=10,  # recommended: ≥ 3 for most cosine, ≥ 10 for complex datasets
)

| Index | Bytes/vector (dim=1536) | Recall@10 cosine (rerank≥3) | Write (vec/s) |
| --- | --- | --- | --- |
| HNSW (F16) | ~3,200 | ≥ 0.95 | ~50k |
| IVF-PQ (M=48) | ~50 | 0.90–0.95 | ~200k |
| RaBitQ (no raw) | 192 | 0.70–0.85 | ~163k |
| RaBitQ + raw F16 | ~3,264 | 0.85–0.95 | ~163k |
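
The role of rerank_factor can be pictured as a two-stage search: a cheap approximate pass keeps top_k × rerank_factor candidates, and an exact pass over the stored raw vectors picks the final top_k. The following is a sketch of that general pattern in plain Python, not ailake's internal implementation; `two_stage_search`, `approx_distance`, and `exact_distance` are hypothetical names, and the toy approximate distance is chosen only to make the effect visible.

```python
import heapq

def two_stage_search(query, vectors, approx_distance, exact_distance,
                     top_k=10, rerank_factor=3):
    """Approximate shortlist followed by exact reranking.

    vectors: list of raw vectors, indexed by row id.
    approx_distance / exact_distance: callables (query, vector) -> float.
    """
    shortlist_size = min(len(vectors), top_k * rerank_factor)
    # Stage 1: cheap scan keeps the best shortlist_size candidates.
    shortlist = heapq.nsmallest(
        shortlist_size,
        range(len(vectors)),
        key=lambda i: approx_distance(query, vectors[i]),
    )
    # Stage 2: exact distances on the shortlist only.
    scored = sorted((exact_distance(query, vectors[i]), i) for i in shortlist)
    return [(i, d) for d, i in scored[:top_k]]

def crude(q, v):
    """Toy approximate distance: looks at the first coordinate only."""
    return abs(q[0] - v[0])

def squared_l2(q, v):
    """Exact squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(q, v))

vectors = [[0.0, 0.0], [1.0, 5.0], [1.0, 0.0], [3.0, 3.0]]
query = [1.0, 0.0]

no_rerank = two_stage_search(query, vectors, crude, squared_l2, top_k=1, rerank_factor=1)
with_rerank = two_stage_search(query, vectors, crude, squared_l2, top_k=1, rerank_factor=2)
```

With rerank_factor=1 the crude pass settles on row 1, which is far from the query; widening the shortlist to two candidates lets the exact pass recover row 2, the true nearest neighbour.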

Search

import ailake
import numpy as np

query = np.random.rand(1536).astype(np.float32).tolist()

results = ailake.search(
    path="./my_table",
    query=query,
    top_k=10,
)

for r in results:
    print(r["row_id"], r["distance"], r["file"])
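
Because each result is a plain dict, post-processing needs no special API. As an illustration, this sketch groups hits by the data file they came from and keeps only those under a distance cutoff. The sample results, the file names, the cutoff value, and the `group_hits_by_file` name are all made up for the example.

```python
from collections import defaultdict

def group_hits_by_file(results, max_distance):
    """Group row ids by source file, dropping hits farther than max_distance."""
    grouped = defaultdict(list)
    for r in results:
        if r["distance"] <= max_distance:
            grouped[r["file"]].append(r["row_id"])
    return dict(grouped)

# Made-up results in the documented shape: row_id, distance, file.
sample = [
    {"row_id": 7, "distance": 0.08, "file": "data-0001.parquet"},
    {"row_id": 2, "distance": 0.15, "file": "data-0002.parquet"},
    {"row_id": 9, "distance": 0.41, "file": "data-0001.parquet"},
]
hits = group_hits_by_file(sample, max_distance=0.2)
```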

Assemble context for LLMs

import ailake

chunks = [
    {
        "document_id": "doc-1",
        "chunk_index": 0,
        "chunk_text": "AI-Lake stores vectors and tabular data together.",
        "document_title": "AI-Lake Overview",
        "section_path": "Introduction",
        "source_uri": "s3://my-lake/docs/overview.pdf",
        "distance": 0.12,
    },
]

context_xml = ailake.assemble_context(
    chunks=chunks,
    max_tokens=4096,       # token budget (4 chars ≈ 1 token)
    dedup_threshold=0.05,  # drop near-duplicate chunks
)

# Pass context_xml directly to Claude / GPT-4 as a user message

API

TableWriter(path, vector_column="embedding", dim=1536, metric="cosine")

Opens or creates an AI-Lake table at path. Local filesystem only in this release.

| Method | Description |
| --- | --- |
| write_batch(texts, embeddings) | Stage a batch of rows. texts: list[str], embeddings: list[list[float]] |
| commit() -> int | Commit staged batches as a new Iceberg snapshot. Returns snapshot ID. |

search(path, query, top_k=10) -> list[dict]

Returns up to top_k nearest neighbours. Each result: {"row_id": int, "distance": float, "file": str}. As the RaBitQ example shows, search also accepts a rerank_factor keyword argument.

assemble_context(chunks, max_tokens=4096, dedup_threshold=0.05) -> str

Assembles a list of chunk dicts into structured XML ready for LLM input. Deduplicates near-identical chunks and respects the token budget.
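
The token budget can be approximated with the 4 characters ≈ 1 token rule mentioned in the example above. The sketch below shows one plausible way to apply such a budget, greedily keeping the closest chunks first. It is an assumption about the general approach, not ailake's actual algorithm, and `estimate_tokens` and `fit_to_budget` are hypothetical names.

```python
def estimate_tokens(text):
    """Rough token count using the 4 characters ≈ 1 token heuristic (rounded up)."""
    return (len(text) + 3) // 4

def fit_to_budget(chunks, max_tokens):
    """Keep chunks in order of increasing distance while they fit the budget."""
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["distance"]):
        cost = estimate_tokens(chunk["chunk_text"])
        if used + cost > max_tokens:
            continue  # skip this chunk; a smaller one later may still fit
        kept.append(chunk)
        used += cost
    return kept, used

chunks = [
    {"chunk_text": "a" * 40, "distance": 0.30},  # about 10 tokens
    {"chunk_text": "b" * 20, "distance": 0.10},  # about 5 tokens
    {"chunk_text": "c" * 12, "distance": 0.20},  # about 3 tokens
]
kept, used = fit_to_budget(chunks, max_tokens=9)
```

Here the two closest chunks fit within the 9-token budget and the 10-token chunk is dropped.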

Iceberg compatibility

Tables written by ailake are valid Apache Iceberg Spec v2 tables. Any Iceberg-compatible engine (Spark, Trino, DuckDB, PyIceberg) reads the tabular columns normally. The HNSW index lives in an AI-Lake extension section that standard Parquet readers silently ignore.

License

MIT OR Apache-2.0
