Unified tabular + vector storage in a single Iceberg-compatible file

ailake — AI-Lake Format Python SDK

Unified storage for tabular data, embeddings, and an HNSW vector index in a single Parquet-compatible file. 100% compatible with Apache Iceberg Spec v2.

Install

pip install ailake

Requires Python ≥ 3.9. Dependencies: pyarrow >= 14.0, numpy >= 1.24.

Quickstart

Write + search — fluent API (recommended)

import ailake
import numpy as np

# Open or create a table
table = ailake.open_table(
    "./my_table",
    dim=1536,
    metric="cosine",          # cosine | euclidean | dot_product | normalized_cosine
    pre_normalize=True,       # normalize at write time; enables fast 1-dot(a,b) path
    hnsw_m=16,                # HNSW connections per node (default 16)
    hnsw_ef_construction=150,
)

texts = ["Document about AI", "Another document"]
embeddings = np.random.rand(2, 1536).astype(np.float32)

table.insert(texts, embeddings)   # accepts list or numpy array
snapshot_id = table.commit()

# Pointer-only search (default — backward-compatible)
df      = table.search(embeddings[0], top_k=10).to_pandas()   # row_id, distance, file
lf      = table.search(embeddings[0]).limit(5).to_polars()
results = table.search(embeddings[0]).to_list()   # list[dict]

# Full row data — all Parquet columns + _distance
df_full = table.search(embeddings[0], top_k=10, fetch_data=True).to_pandas()

Async API

import asyncio

import ailake
import numpy as np

async def main():
    texts = ["Document about AI", "Another document"]
    embeddings = np.random.rand(2, 1536).astype(np.float32)
    q1, q2 = embeddings   # reuse the two stored vectors as example queries

    table = ailake.open_table("./my_table", dim=1536)
    await table.insert_async(texts, embeddings)
    await table.commit_async()

    # fluent async chain
    df = await table.search(q1).limit(10).to_pandas_async()

    # parallel searches via asyncio.gather
    r1, r2 = await asyncio.gather(
        table.search(q1).to_list_async(),
        table.search(q2).to_list_async(),
    )

asyncio.run(main())

Module-level search

import ailake
import numpy as np

query = np.random.rand(1536).astype(np.float32)

df     = ailake.search("./my_table", query, top_k=10).to_pandas()
lf     = ailake.search("./my_table", query).limit(5).to_polars()
items  = ailake.search("./my_table", query).to_list()

Assemble context for LLMs

import ailake

chunks = [
    {
        "document_id": "doc-1",
        "chunk_index": 0,
        "chunk_text": "AI-Lake stores vectors and tabular data together.",
        "document_title": "AI-Lake Overview",
        "section_path": "Introduction",
        "source_uri": "s3://my-lake/docs/overview.pdf",
        "distance": 0.12,
    },
]

context_xml = ailake.assemble_context(
    chunks=chunks,
    max_tokens=4096,       # token budget (4 chars ≈ 1 token)
    dedup_threshold=0.05,  # drop near-duplicate chunks
)
# Pass context_xml directly to Claude / GPT-4 as a user message

API reference

open_table(path, *, ...) → Table

Opens or creates an AI-Lake table at path.

| Parameter | Default | Description |
| --- | --- | --- |
| path | required | Table root (local, s3://, gs://, az://) |
| vector_column | "embedding" | Vector column name |
| dim | 1536 | Embedding dimension |
| metric | "cosine" | cosine, euclidean, dot_product, normalized_cosine |
| pre_normalize | False | Normalize to unit L2 norm at write time; enables the 1-dot(a,b) fast path (~12-20% speedup) |
| hnsw_m | None (=16) | HNSW connections per node |
| hnsw_ef_construction | None (=150) | HNSW build pool size |
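
The pre_normalize fast path rests on a standard identity: for unit-length vectors, cosine distance reduces to 1 - dot(a, b), so the norms need not be recomputed per comparison. The following is a minimal pure-Python sketch of that identity, not ailake's internal code; the helper names l2_normalize, cosine_distance, and fast_cosine_distance are illustrative only.

```python
import math

def l2_normalize(v):
    """Scale v to unit L2 length (what normalizing at write time amounts to)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_distance(a, b):
    """General cosine distance: both norms are computed on every comparison."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def fast_cosine_distance(a_unit, b_unit):
    """Valid only for unit vectors: the norms are 1, so only the dot product remains."""
    return 1.0 - sum(x * y for x, y in zip(a_unit, b_unit))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
slow = cosine_distance(a, b)
fast = fast_cosine_distance(l2_normalize(a), l2_normalize(b))
```

Both paths agree up to floating-point rounding; the fast one simply moves the normalization cost from query time to write time.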

Table

| Method | Description |
| --- | --- |
| insert(texts, embeddings) → Table | Buffer a batch. embeddings: list[list[float]] or numpy array. |
| commit() → int | Persist as a new Iceberg snapshot; returns snapshot ID. |
| search(query, top_k=10, fetch_data=False) → SearchQuery | Lazy, chainable search. query: list[float] or numpy array. Set fetch_data=True to return full row data. |
| insert_async(...) | Async variant of insert. |
| commit_async() → int | Async variant of commit. |

Table is a context manager: with ailake.open_table(...) as t: ...

In Jupyter, a Table renders as a styled HTML card showing its path and vector configuration.

SearchQuery

Lazy result set — no I/O until materialised.

| Method | Description |
| --- | --- |
| limit(n) → SearchQuery | Cap to n nearest neighbours (chainable). |
| to_list() → list[dict] | Always pointer-only: [{"row_id": int, "distance": float, "file": str}, ...] |
| to_arrow() → pyarrow.Table | Full row data (all columns + _distance) when fetch_data=True; pointer-only pyarrow.Table with columns row_id, distance, file otherwise. |
| to_pandas() → pd.DataFrame | Full row DataFrame when fetch_data=True; pointer-only otherwise. |
| to_polars() → pl.DataFrame | Full row DataFrame when fetch_data=True; pointer-only otherwise. |
| to_list_async() | Async variant. |
| to_arrow_async() | Async variant. |
| to_pandas_async() | Async variant. |
| to_polars_async() | Async variant. |
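
To make "no I/O until materialised" concrete, here is a toy sketch of the lazy, chainable pattern. It is not ailake's SearchQuery implementation; LazyQuery and fake_fetch are hypothetical stand-ins that only show when the work happens.

```python
class LazyQuery:
    """Toy chainable query: records parameters, runs `fetch` only when materialised."""

    def __init__(self, fetch, top_k=10):
        self._fetch = fetch      # callable(top_k) -> list of result dicts
        self._top_k = top_k

    def limit(self, n):
        # Return a new query instead of mutating, so chains stay independent.
        return LazyQuery(self._fetch, top_k=n)

    def to_list(self):
        # Materialisation point: this is the only place the fetch runs.
        return self._fetch(self._top_k)

calls = []

def fake_fetch(top_k):
    calls.append(top_k)  # stands in for the actual index lookup
    return [{"row_id": i, "distance": i / 10} for i in range(top_k)]

query = LazyQuery(fake_fetch).limit(3)   # nothing fetched yet
calls_before = list(calls)
rows = query.to_list()                   # the fetch happens here
```

Building and chaining the query is cheap; only the to_* call at the end triggers the lookup.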

In Jupyter, a SearchQuery renders as an HTML table once it has been executed and shows a pending state otherwise. When fetch_data=True, the HTML table shows all Parquet columns.

Full-read mode

import ailake
import numpy as np

query = np.random.rand(1536).astype(np.float32)

# Pointer-only (default, backward-compatible)
df = ailake.search("./my_table", query, top_k=10).to_pandas()
# columns: row_id, distance, file

# Full row data: all Parquet columns + _distance
df = ailake.search("./my_table", query, top_k=10, fetch_data=True).to_pandas()
# columns: text, embedding, ..., _distance

# Same via a Table handle
table = ailake.open_table("./my_table", dim=1536)
df = table.search(query, top_k=10, fetch_data=True).to_pandas()

fetch_data=True reads each matching Parquet file once and uses arrow_select::take to extract only the matched rows — no full table scan.
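
The "read each file once" idea can be illustrated with the pointer records that to_list() returns. The sketch below groups matched row IDs by file so that each file needs to be opened only once; it is an illustration of the access pattern, not the library's code, and the file names and the helper group_row_ids_by_file are made up.

```python
from collections import defaultdict

def group_row_ids_by_file(pointers):
    """Map each file to the sorted row IDs matched in it, so each file is opened once."""
    by_file = defaultdict(list)
    for p in pointers:
        by_file[p["file"]].append(p["row_id"])
    return {f: sorted(ids) for f, ids in by_file.items()}

# Pointer records in the shape documented for to_list(); values are invented.
pointers = [
    {"row_id": 7, "distance": 0.11, "file": "part-0.parquet"},
    {"row_id": 2, "distance": 0.15, "file": "part-1.parquet"},
    {"row_id": 3, "distance": 0.19, "file": "part-0.parquet"},
]
plan = group_row_ids_by_file(pointers)
```

Each entry in plan is then a single file read followed by a row-level take on the listed IDs.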

search(path, query, top_k=10, fetch_data=False) → SearchQuery

Module-level search returning the same chainable SearchQuery.

TableWriter (legacy — still supported)

import ailake

# texts and embeddings are the same inputs used in the Quickstart
writer = ailake.TableWriter("./my_table", vector_column="embedding", dim=1536, metric="cosine")
writer.write_batch(texts, embeddings)
snapshot_id = writer.commit()

assemble_context(chunks, max_tokens=4096, dedup_threshold=0.05) → str

Assembles chunk dicts into structured XML for LLM input. Deduplicates near-identical chunks within the token budget.
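
As a rough illustration of the token budget, the sketch below applies the 4 characters ≈ 1 token heuristic mentioned in the Quickstart and keeps the closest chunks that fit. How assemble_context actually orders, deduplicates, or truncates chunks is not documented here, so the greedy by-distance selection and the helper names estimate_tokens and select_within_budget are assumptions.

```python
def estimate_tokens(text):
    """Rough token count using the 4-characters-per-token heuristic."""
    return max(1, len(text) // 4)

def select_within_budget(chunks, max_tokens=4096):
    """Greedily keep the closest chunks whose estimated tokens fit the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["distance"]):
        cost = estimate_tokens(chunk["chunk_text"])
        if used + cost > max_tokens:
            continue  # would overflow the budget; a later, smaller chunk may still fit
        selected.append(chunk)
        used += cost
    return selected, used

chunks = [
    {"document_id": "doc-1", "chunk_text": "a" * 40, "distance": 0.30},
    {"document_id": "doc-2", "chunk_text": "b" * 80, "distance": 0.10},
    {"document_id": "doc-3", "chunk_text": "c" * 20, "distance": 0.20},
]
selected, used = select_within_budget(chunks, max_tokens=25)
```

With a budget of 25 estimated tokens, the two closest chunks fit and the third is skipped.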

HNSW tuning guide

| Goal | hnsw_m | hnsw_ef_construction |
| --- | --- | --- |
| Low latency / high QPS | 8 | 100 |
| General purpose (default) | 16 | 150 |
| High recall (RAG) | 24 | 200 |
| Max recall (medical, legal) | 32 | 400 |
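
One way to keep these settings in one place is a small preset table whose entries are passed to open_table as keyword arguments. The values come from the guide above; the preset names and the hnsw_kwargs helper are hypothetical and not part of ailake.

```python
# Values copied from the tuning guide; the keys are illustrative labels.
HNSW_PRESETS = {
    "low_latency": {"hnsw_m": 8, "hnsw_ef_construction": 100},
    "general": {"hnsw_m": 16, "hnsw_ef_construction": 150},
    "high_recall": {"hnsw_m": 24, "hnsw_ef_construction": 200},
    "max_recall": {"hnsw_m": 32, "hnsw_ef_construction": 400},
}

def hnsw_kwargs(goal="general"):
    """Return a copy of the preset so callers can tweak it without side effects."""
    try:
        return dict(HNSW_PRESETS[goal])
    except KeyError:
        raise ValueError(
            f"unknown goal {goal!r}; choose from {sorted(HNSW_PRESETS)}"
        ) from None

params = hnsw_kwargs("high_recall")
# Intended use: ailake.open_table("./my_table", dim=1536, **params)
```

Higher values in both columns generally trade build time and index size for recall, so it is worth benchmarking on your own data before moving past the defaults.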

Type checking

Ships py.typed (PEP 561) and ailake/_ailake.pyi stubs. mypy and pyright work out of the box with no configuration.

Iceberg compatibility

Tables are valid Apache Iceberg Spec v2. Spark, Trino, DuckDB, and PyIceberg read tabular columns normally; the HNSW index lives in an extension section that standard Parquet readers silently ignore.

License

MIT OR Apache-2.0

