Unified tabular + vector storage in a single Iceberg-compatible file
ailake — AI-Lake Format Python SDK
Unified storage for tabular data, embeddings, and an HNSW vector index in a single Parquet-compatible file. 100% compatible with Apache Iceberg Spec v2.
Install
pip install ailake
Requires Python ≥ 3.9. Dependencies: pyarrow >= 14.0, numpy >= 1.24.
Quickstart
Write + search — fluent API (recommended)
import ailake
import numpy as np
# Open or create a table
table = ailake.open_table(
    "./my_table",
    dim=1536,
    metric="cosine",           # cosine | euclidean | dot_product | normalized_cosine
    pre_normalize=True,        # normalize at write time; enables fast 1-dot(a,b) path
    hnsw_m=16,                 # HNSW connections per node (default 16)
    hnsw_ef_construction=150,  # HNSW build pool size (default 150)
)
texts = ["Document about AI", "Another document"]
embeddings = np.random.rand(2, 1536).astype(np.float32)
table.insert(texts, embeddings) # accepts list or numpy array
snapshot_id = table.commit()
# Pointer-only search (default — backward-compatible)
df = table.search(embeddings[0], top_k=10).to_pandas() # row_id, distance, file
lf = table.search(embeddings[0]).limit(5).to_polars()
results = table.search(embeddings[0]).to_list() # list[dict]
# Full row data — all Parquet columns + _distance
df_full = table.search(embeddings[0], top_k=10, fetch_data=True).to_pandas()
Async API
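import asyncio

import ailake
import numpy as np

async def main():
    table = ailake.open_table("./my_table", dim=1536)

    # Sample data; replace with real texts and embeddings
    texts = ["Document about AI", "Another document"]
    embeddings = np.random.rand(2, 1536).astype(np.float32)
    query_vec = embeddings[0]
    q1, q2 = embeddings[0], embeddings[1]

    await table.insert_async(texts, embeddings)
    await table.commit_async()

    # fluent async chain
    df = await table.search(query_vec).limit(10).to_pandas_async()

    # parallel searches via asyncio.gather
    r1, r2 = await asyncio.gather(
        table.search(q1).to_list_async(),
        table.search(q2).to_list_async(),
    )

asyncio.run(main())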
Module-level search
import ailake
import numpy as np
query = np.random.rand(1536).astype(np.float32)
df = ailake.search("./my_table", query, top_k=10).to_pandas()
lf = ailake.search("./my_table", query).limit(5).to_polars()
items = ailake.search("./my_table", query).to_list()
Assemble context for LLMs
import ailake
chunks = [
    {
        "document_id": "doc-1",
        "chunk_index": 0,
        "chunk_text": "AI-Lake stores vectors and tabular data together.",
        "document_title": "AI-Lake Overview",
        "section_path": "Introduction",
        "source_uri": "s3://my-lake/docs/overview.pdf",
        "distance": 0.12,
    },
]
context_xml = ailake.assemble_context(
    chunks=chunks,
    max_tokens=4096,       # token budget (4 chars ≈ 1 token)
    dedup_threshold=0.05,  # drop near-duplicate chunks
)
# Pass context_xml directly to Claude / GPT-4 as a user message
API reference
open_table(path, *, ...) → Table
Opens or creates an AI-Lake table at path.
| Parameter | Default | Description |
|---|---|---|
| path | required | Table root (local, s3://, gs://, az://) |
| vector_column | "embedding" | Vector column name |
| dim | 1536 | Embedding dimension |
| metric | "cosine" | cosine, euclidean, dot_product, normalized_cosine |
| pre_normalize | False | Normalize to unit L2 at write; enables 1-dot(a,b) fast path (~12-20% speedup) |
| hnsw_m | None (=16) | HNSW connections per node |
| hnsw_ef_construction | None (=150) | HNSW build pool size |
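The pre_normalize fast path rests on a simple identity: for unit-length vectors, cosine distance reduces to 1 - dot(a, b), so the norms only need to be computed once at write time. The sketch below checks that identity with plain Python; the helper names (l2_normalize, dot, cosine_distance) are illustrative and not part of ailake, and it says nothing about how the library implements the path internally.

```python
import math

def l2_normalize(v):
    """Scale v to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """General cosine distance: needs both norms on every comparison."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Normalize once (the "write time" step), then compare with a single dot product.
a_unit, b_unit = l2_normalize(a), l2_normalize(b)
fast = 1.0 - dot(a_unit, b_unit)

# Reference value computed the general way on the raw vectors.
slow = cosine_distance(a, b)
```

Both fast and slow agree up to floating-point rounding, which is why normalizing at write time can skip the per-comparison norm computation.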
Table
| Method | Description |
|---|---|
| insert(texts, embeddings) → Table | Buffer a batch. embeddings: list[list[float]] or numpy array. |
| commit() → int | Persist as a new Iceberg snapshot; returns snapshot ID. |
| search(query, top_k=10, fetch_data=False) → SearchQuery | Lazy, chainable search. query: list[float] or numpy array. Set fetch_data=True to return full row data. |
| insert_async(...) | Async variant of insert. |
| commit_async() → int | Async variant of commit. |
Table is a context manager: with ailake.open_table(...) as t: ...
In Jupyter, a Table renders as a styled HTML card showing its path and vector configuration.
SearchQuery
Lazy result set — no I/O until materialised.
| Method | Description |
|---|---|
| limit(n) → SearchQuery | Cap to n nearest neighbours (chainable). |
| to_list() → list[dict] | Always pointer-only: [{"row_id": int, "distance": float, "file": str}, ...] |
| to_arrow() → pyarrow.Table | Full row data (all columns + _distance) when fetch_data=True; pointer-only pyarrow.Table with columns row_id, distance, file otherwise. |
| to_pandas() → pd.DataFrame | Full row DataFrame when fetch_data=True; pointer-only otherwise. |
| to_polars() → pl.DataFrame | Full row DataFrame when fetch_data=True; pointer-only otherwise. |
| to_list_async() | Async variant. |
| to_arrow_async() | Async variant. |
| to_pandas_async() | Async variant. |
| to_polars_async() | Async variant. |
In Jupyter, a SearchQuery renders as an HTML table once it has been executed, and as a pending-state placeholder before that.
When fetch_data=True, the HTML table shows all Parquet columns.
Full-read mode
# Pointer-only (default — backward-compatible)
df = ailake.search("./my_table", query, top_k=10).to_pandas()
# columns: row_id, distance, file
# Full row data — all Parquet columns + _distance
df = ailake.search("./my_table", query, top_k=10, fetch_data=True).to_pandas()
# columns: text, embedding, ..., _distance
# Same via Table handle
df = table.search(query, top_k=10, fetch_data=True).to_pandas()
fetch_data=True reads each matching Parquet file once and uses arrow_select::take to extract only the matched rows — no full table scan.
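To illustrate the "each file once" idea, the sketch below groups pointer-only results by file so that every Parquet file would be opened a single time and only the matched rows taken from it. It is a pure-Python illustration, not the library's implementation: group_row_ids_by_file is a hypothetical helper, the file names are made up, and it assumes row_id identifies a row within its file.

```python
from collections import defaultdict

def group_row_ids_by_file(pointers):
    """Map each file to the sorted row_ids matched in it."""
    by_file = defaultdict(list)
    for hit in pointers:
        by_file[hit["file"]].append(hit["row_id"])
    return {path: sorted(ids) for path, ids in by_file.items()}

# Hypothetical pointer-only results in the documented to_list() shape.
pointers = [
    {"row_id": 42, "distance": 0.10, "file": "part-0001.parquet"},
    {"row_id": 7, "distance": 0.12, "file": "part-0002.parquet"},
    {"row_id": 3, "distance": 0.15, "file": "part-0001.parquet"},
]

# One entry per file: each file is read once, then only these rows are taken.
plan = group_row_ids_by_file(pointers)
```

With a plan like this, the number of file reads is bounded by the number of distinct files among the top_k hits rather than by the size of the table.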
search(path, query, top_k=10, fetch_data=False) → SearchQuery
Module-level search returning the same chainable SearchQuery.
TableWriter (legacy — still supported)
writer = ailake.TableWriter(path, vector_column="embedding", dim=1536, metric="cosine")
writer.write_batch(texts, embeddings)
snapshot_id = writer.commit()
assemble_context(chunks, max_tokens=4096, dedup_threshold=0.05) → str
Assembles chunk dicts into structured XML for LLM input. Deduplicates near-identical chunks within the token budget.
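As a rough mental model of the token budget, the sketch below applies the 4 characters ≈ 1 token heuristic and greedily keeps the closest chunks that still fit. Both helpers (estimate_tokens, select_within_budget) are hypothetical and only approximate the documented behaviour; the actual selection order and the deduplication step inside assemble_context are not specified here and are left out.

```python
def estimate_tokens(text):
    """Rough token count using the 4 characters ≈ 1 token heuristic (rounded up)."""
    return (len(text) + 3) // 4

def select_within_budget(chunks, max_tokens=4096):
    """Visit chunks from nearest to farthest, keeping each one that still fits."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["distance"]):
        cost = estimate_tokens(chunk["chunk_text"])
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; try the next chunk
        selected.append(chunk)
        used += cost
    return selected, used

# Illustrative chunks: about 10, 20, and 5 tokens respectively.
chunks = [
    {"chunk_text": "x" * 40, "distance": 0.10},
    {"chunk_text": "y" * 80, "distance": 0.20},
    {"chunk_text": "z" * 20, "distance": 0.30},
]
selected, used = select_within_budget(chunks, max_tokens=16)
```

Here the 20-token chunk is skipped because it would overflow the 16-token budget, while the smaller, more distant chunk still fits.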
HNSW tuning guide
| Goal | hnsw_m | hnsw_ef_construction |
|---|---|---|
| Low latency / high QPS | 8 | 100 |
| General purpose (default) | 16 | 150 |
| High recall (RAG) | 24 | 200 |
| Max recall (medical, legal) | 32 | 400 |
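The presets in the table can be kept in one place and unpacked into open_table as keyword arguments. The snippet below is one possible way to do that; HNSW_PRESETS, its goal names, and hnsw_kwargs are hypothetical names introduced here, while the numeric values are copied from the table above.

```python
# Values copied from the tuning table; the goal names are illustrative.
HNSW_PRESETS = {
    "low_latency": {"hnsw_m": 8, "hnsw_ef_construction": 100},
    "general": {"hnsw_m": 16, "hnsw_ef_construction": 150},
    "high_recall": {"hnsw_m": 24, "hnsw_ef_construction": 200},
    "max_recall": {"hnsw_m": 32, "hnsw_ef_construction": 400},
}

def hnsw_kwargs(goal="general"):
    """Return a copy of the HNSW keyword arguments for the given goal."""
    try:
        return dict(HNSW_PRESETS[goal])
    except KeyError:
        valid = ", ".join(sorted(HNSW_PRESETS))
        raise ValueError(f"unknown goal {goal!r}; expected one of: {valid}") from None

rag_settings = hnsw_kwargs("high_recall")
```

The result can then be passed along with the other documented parameters, for example ailake.open_table("./my_table", dim=1536, **rag_settings).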
Type checking
Ships py.typed (PEP 561) and ailake/_ailake.pyi stubs. mypy and pyright work out of the box with no configuration.
Iceberg compatibility
Tables are valid Apache Iceberg Spec v2. Spark, Trino, DuckDB, and PyIceberg read tabular columns normally; the HNSW index lives in an extension section that standard Parquet readers silently ignore.
License
MIT OR Apache-2.0