Unified tabular + vector storage in a single Iceberg-compatible file

ailake — AI-Lake Format Python SDK

Unified storage for tabular data, embeddings, and an HNSW vector index in a single Parquet-compatible file. 100% Apache Iceberg Spec v2 compatible.

Install

pip install ailake

Requires Python ≥ 3.9. Dependencies: pyarrow >= 14.0, numpy >= 1.24.

Quickstart

Write

import ailake
import numpy as np

writer = ailake.TableWriter(
    path="./my_table",
    vector_column="embedding",  # default
    dim=1536,                   # default
    metric="cosine",            # cosine | euclidean | dot_product
    pre_normalize=True,         # normalize to unit L2 at write time (recommended for cosine)
                                # enables NormalizedCosine fast path: 1-dot(a,b), no sqrt
    hnsw_m=16,                  # HNSW connections per node (default 16; 32 = higher recall)
    hnsw_ef_construction=150,   # HNSW build quality (default 150; 400 = max quality)
)

texts = ["Document about AI", "Another document"]
embeddings = np.random.rand(2, 1536).astype(np.float32).tolist()

writer.write_batch(texts=texts, embeddings=embeddings)
snapshot_id = writer.commit()

TableWriter parameters

| Parameter | Default | Description |
| --- | --- | --- |
| path | required | Table root path (local or s3://, gs://, az://) |
| vector_column | "embedding" | Vector column name |
| dim | 1536 | Vector dimension |
| metric | "cosine" | cosine, euclidean, dot_product |
| pre_normalize | False | Normalize to unit L2 at write time (recommended for cosine). Enables 1-dot(a,b) fast path. |
| hnsw_m | None (=16) | HNSW connections per node. Higher → better recall, more memory. |
| hnsw_ef_construction | None (=150) | HNSW build pool size. Higher → better quality, slower build. |
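
The pre_normalize fast path rests on a simple identity: once both vectors have unit L2 norm, cosine distance reduces to 1-dot(a,b). The sketch below checks this with NumPy only. It does not call ailake, and the helper names are illustrative rather than part of the SDK.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0.0, 1.0, norms)


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """General form: both norms are needed, so two square roots per pair."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def normalized_cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Unit-vector form: the norms are 1, so only the dot product remains."""
    return float(1.0 - np.dot(a, b))


rng = np.random.default_rng(0)
raw = rng.random((2, 1536))   # float64 here so the comparison is tight
unit = l2_normalize(raw)      # what normalizing at write time amounts to

full = cosine_distance(raw[0], raw[1])
fast = normalized_cosine_distance(unit[0], unit[1])
print(abs(full - fast))       # difference is at floating-point noise level
```

Queries would need the same normalization for the identity to hold; whether ailake normalizes the query for you is not stated here, so normalizing it yourself is the safe assumption.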

HNSW tuning guide:

| Goal | hnsw_m | hnsw_ef_construction |
| --- | --- | --- |
| Low latency / high QPS | 8 | 100 |
| General purpose (default) | 16 | 150 |
| High recall (RAG) | 24 | 200 |
| Max recall (medical, legal) | 32 | 400 |
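
One way to keep these settings consistent across a codebase is to store the guide as named profiles and expand one into keyword arguments. This is only a sketch: the profile names and the hnsw_kwargs helper are hypothetical, and only the numeric values come from the table above.

```python
from typing import Dict

# Values copied from the tuning guide; the profile keys are made-up labels.
HNSW_PROFILES: Dict[str, Dict[str, int]] = {
    "low_latency": {"hnsw_m": 8, "hnsw_ef_construction": 100},
    "general": {"hnsw_m": 16, "hnsw_ef_construction": 150},
    "high_recall": {"hnsw_m": 24, "hnsw_ef_construction": 200},
    "max_recall": {"hnsw_m": 32, "hnsw_ef_construction": 400},
}


def hnsw_kwargs(profile: str = "general") -> Dict[str, int]:
    """Return a copy of the HNSW settings for a named profile."""
    try:
        return dict(HNSW_PROFILES[profile])
    except KeyError:
        known = ", ".join(sorted(HNSW_PROFILES))
        raise ValueError(f"unknown profile {profile!r}; expected one of: {known}") from None


# Intended use (not executed here, since it needs ailake installed):
# writer = ailake.TableWriter(path="./my_table", **hnsw_kwargs("high_recall"))
print(hnsw_kwargs("high_recall"))
```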

Search

import ailake
import numpy as np

query = np.random.rand(1536).astype(np.float32).tolist()

results = ailake.search(
    path="./my_table",
    query=query,
    top_k=10,
)

for r in results:
    print(r["row_id"], r["distance"], r["file"])
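
Search results carry only row_id, distance, and file, so an application that wants to feed them to assemble_context needs some way to map a row back to its text and document metadata. The sketch below assumes you keep such a mapping yourself. The metadata_by_row dict, the results_to_chunks helper, and the part-0.parquet file name are hypothetical and not part of ailake.

```python
from typing import Any, Dict, List


def results_to_chunks(
    results: List[Dict[str, Any]],
    metadata_by_row: Dict[int, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Join search hits with app-side metadata and order them by distance."""
    chunks = []
    for hit in results:
        meta = metadata_by_row.get(hit["row_id"])
        if meta is None:
            continue  # no metadata recorded for this row, so skip it
        chunk = dict(meta)                  # copy so the lookup table stays untouched
        chunk["distance"] = hit["distance"]
        chunks.append(chunk)
    chunks.sort(key=lambda c: c["distance"])
    return chunks


# Hypothetical app-side lookup keyed by row_id.
metadata_by_row = {
    0: {
        "document_id": "doc-1",
        "chunk_index": 0,
        "chunk_text": "AI-Lake stores vectors and tabular data together.",
        "document_title": "AI-Lake Overview",
        "section_path": "Introduction",
        "source_uri": "s3://my-lake/docs/overview.pdf",
    },
    1: {
        "document_id": "doc-1",
        "chunk_index": 1,
        "chunk_text": "Each commit produces a new Iceberg snapshot.",
        "document_title": "AI-Lake Overview",
        "section_path": "Introduction",
        "source_uri": "s3://my-lake/docs/overview.pdf",
    },
}

# Shaped like the output of ailake.search(); the values are placeholders.
results = [
    {"row_id": 1, "distance": 0.31, "file": "part-0.parquet"},
    {"row_id": 0, "distance": 0.12, "file": "part-0.parquet"},
    {"row_id": 7, "distance": 0.40, "file": "part-0.parquet"},
]

chunks = results_to_chunks(results, metadata_by_row)
```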

Assemble context for LLMs

import ailake

chunks = [
    {
        "document_id": "doc-1",
        "chunk_index": 0,
        "chunk_text": "AI-Lake stores vectors and tabular data together.",
        "document_title": "AI-Lake Overview",
        "section_path": "Introduction",
        "source_uri": "s3://my-lake/docs/overview.pdf",
        "distance": 0.12,
    },
]

context_xml = ailake.assemble_context(
    chunks=chunks,
    max_tokens=4096,       # token budget (4 chars ≈ 1 token)
    dedup_threshold=0.05,  # drop near-duplicate chunks
)

# Pass context_xml directly to Claude / GPT-4 as a user message

API

TableWriter(path, vector_column="embedding", dim=1536, metric="cosine", pre_normalize=False, hnsw_m=None, hnsw_ef_construction=None)

Opens or creates an AI-Lake table at path. Local filesystem only in this release.

| Method | Description |
| --- | --- |
| write_batch(texts, embeddings) | Stage a batch of rows. texts: list[str], embeddings: list[list[float]] |
| commit() -> int | Commit staged batches as a new Iceberg snapshot. Returns snapshot ID. |
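
Because write_batch only stages rows and commit() publishes them, a larger corpus can be staged in several batches before a single commit. The sketch below is a hypothetical pre-flight helper, not an ailake function. It checks that texts and embeddings line up and that every vector has the expected dimension, then yields batches in the list[str] / list[list[float]] shape described above. The batch_size default is an arbitrary choice.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(
    texts: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    dim: int = 1536,
    batch_size: int = 1000,
) -> Iterator[Tuple[List[str], List[List[float]]]]:
    """Validate inputs, then yield (texts, embeddings) slices of batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(texts) != len(embeddings):
        raise ValueError(f"{len(texts)} texts but {len(embeddings)} embeddings")
    for i, vec in enumerate(embeddings):
        if len(vec) != dim:
            raise ValueError(f"embedding {i} has length {len(vec)}, expected {dim}")
    for start in range(0, len(texts), batch_size):
        stop = start + batch_size
        yield (
            list(texts[start:stop]),
            [[float(x) for x in vec] for vec in embeddings[start:stop]],
        )


# Small demonstration with 4-dimensional vectors.
demo_texts = ["a", "b", "c"]
demo_vectors = [[0.0, 0.1, 0.2, 0.3]] * 3
batches = list(iter_batches(demo_texts, demo_vectors, dim=4, batch_size=2))

# Intended use (not executed here, since it needs ailake installed):
# for batch_texts, batch_vectors in iter_batches(texts, embeddings):
#     writer.write_batch(texts=batch_texts, embeddings=batch_vectors)
# snapshot_id = writer.commit()
```

Since iter_batches is a generator, validation errors surface when iteration starts, not when the function is called.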

search(path, query, top_k=10) -> list[dict]

Returns up to top_k nearest neighbours. Each result: {"row_id": int, "distance": float, "file": str}.
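
HNSW search is approximate, so it can be useful to compare its results against an exact scan on a small sample. The sketch below computes exact cosine-distance neighbours with NumPy and a recall@k score. It does not call ailake. Comparing against real search output assumes that row_id matches the position of the vector in the array you wrote, which this page does not confirm, so verify that mapping before relying on it.

```python
from typing import Iterable, List, Tuple

import numpy as np


def exact_top_k(vectors: np.ndarray, query: np.ndarray, top_k: int = 10) -> List[Tuple[int, float]]:
    """Exact nearest neighbours by cosine distance, scanning every row."""
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    distances = 1.0 - unit_rows @ unit_query
    order = np.argsort(distances, kind="stable")[:top_k]
    return [(int(i), float(distances[i])) for i in order]


def recall_at_k(approx_ids: Iterable[int], exact_ids: Iterable[int]) -> float:
    """Fraction of the exact neighbours that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)


rng = np.random.default_rng(0)
vectors = rng.standard_normal((100, 8))
truth = exact_top_k(vectors, vectors[7], top_k=3)   # row 7 queried with itself

# With real data (assumption: row_id is the array index):
# approx_ids = [r["row_id"] for r in ailake.search(path="./my_table", query=q, top_k=10)]
# print(recall_at_k(approx_ids, [i for i, _ in exact_top_k(vectors, q, 10)]))
```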

assemble_context(chunks, max_tokens=4096, dedup_threshold=0.05) -> str

Assembles a list of chunk dicts into structured XML ready for LLM input. Deduplicates near-identical chunks and respects the token budget.
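
To get a feel for how many chunks fit in max_tokens, the 4 chars ≈ 1 token rule from the quickstart can be applied by hand. The sketch below is an illustration of that arithmetic only. It is not the library's implementation, and it ignores XML overhead and deduplication. The greedy closest-first selection is an assumption, not documented behaviour.

```python
import math
from typing import Any, Dict, List

CHARS_PER_TOKEN = 4  # rule of thumb quoted in the quickstart


def estimate_tokens(text: str) -> int:
    """Rough token count: one token per four characters, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fit_to_budget(chunks: List[Dict[str, Any]], max_tokens: int = 4096) -> List[Dict[str, Any]]:
    """Greedily keep the closest chunks whose estimated tokens fit the budget."""
    kept = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["distance"]):
        cost = estimate_tokens(chunk["chunk_text"])
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; try the next one
        kept.append(chunk)
        used += cost
    return kept


sample = [
    {"document_id": "b", "chunk_text": "y" * 80, "distance": 0.2},  # about 20 tokens
    {"document_id": "a", "chunk_text": "x" * 40, "distance": 0.1},  # about 10 tokens
    {"document_id": "c", "chunk_text": "z" * 20, "distance": 0.3},  # about 5 tokens
]
kept = fit_to_budget(sample, max_tokens=15)
print([c["document_id"] for c in kept])
```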

Iceberg compatibility

Tables written by ailake are valid Apache Iceberg Spec v2 tables. Any Iceberg-compatible engine (Spark, Trino, DuckDB, PyIceberg) reads the tabular columns normally. The vector index (HNSW or IVF-PQ) lives in an AI-Lake extension section that standard Parquet readers silently ignore.

License

MIT OR Apache-2.0

Download files

Source distribution

ailake-0.0.15.tar.gz (148.3 kB)

Built distributions

ailake-0.0.15-cp39-abi3-win_amd64.whl (3.2 MB): CPython 3.9+, Windows x86-64
ailake-0.0.15-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.7 MB): CPython 3.9+, manylinux (glibc 2.17+), x86-64
ailake-0.0.15-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.4 MB): CPython 3.9+, manylinux (glibc 2.17+), ARM64
