
Unified tabular + vector storage in a single Iceberg-compatible file


ailake — AI-Lake Format Python SDK

Unified storage for tabular data, embeddings, and an HNSW vector index in a single Parquet-compatible file. Tables are 100% compatible with Apache Iceberg Spec v2.

Install

pip install ailake

Requires Python ≥ 3.9. Dependencies: pyarrow >= 14.0, numpy >= 1.24.

Quickstart

Write

import ailake
import numpy as np

writer = ailake.TableWriter(
    path="./my_table",
    vector_column="embedding",  # name of the vector column (default: "embedding")
    dim=1536,                   # embedding dimension (default: 1536)
    metric="cosine",            # one of: cosine | euclidean | dot_product
)

texts = ["Document about AI", "Another document"]
embeddings = np.random.rand(2, 1536).astype(np.float32).tolist()

writer.write_batch(texts=texts, embeddings=embeddings)
snapshot_id = writer.commit()
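write_batch takes two parallel lists, so a quick shape check before staging a batch can catch mismatches early. The sketch below is a hypothetical helper (validate_batch is not part of ailake), and whether ailake performs the same checks internally is not stated here.

```python
from typing import Sequence


def validate_batch(
    texts: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    dim: int = 1536,
) -> None:
    """Raise ValueError if texts and embeddings do not line up or a vector has the wrong length.

    Hypothetical helper for illustration; not an ailake API.
    """
    if len(texts) != len(embeddings):
        raise ValueError(f"{len(texts)} texts but {len(embeddings)} embeddings")
    for i, vec in enumerate(embeddings):
        if len(vec) != dim:
            raise ValueError(f"embedding {i} has length {len(vec)}, expected {dim}")


# Small illustrative batch using dim=4 instead of 1536.
texts = ["Document about AI", "Another document"]
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
validate_batch(texts, embeddings, dim=4)  # passes silently
```

Calling validate_batch(texts, embeddings, dim=1536) right before writer.write_batch(...) would apply the same check to the quickstart data.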

Search

import ailake
import numpy as np

query = np.random.rand(1536).astype(np.float32).tolist()

results = ailake.search(
    path="./my_table",
    query=query,
    top_k=10,
)

for r in results:
    print(r["row_id"], r["distance"], r["file"])
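Because each result carries the file it came from, results can be post-processed with plain Python. The sketch below groups row IDs by file and optionally drops distant hits. The sample values and file names are made up, group_by_file is a hypothetical helper, and it assumes that a smaller distance means a closer match.

```python
from collections import defaultdict

# Illustrative results in the shape returned by ailake.search;
# the values and file names are invented for this example.
results = [
    {"row_id": 7, "distance": 0.08, "file": "data-0001.parquet"},
    {"row_id": 2, "distance": 0.31, "file": "data-0002.parquet"},
    {"row_id": 9, "distance": 0.12, "file": "data-0001.parquet"},
]


def group_by_file(results, max_distance=None):
    """Group row IDs by file, closest first, optionally dropping distant hits."""
    grouped = defaultdict(list)
    for r in sorted(results, key=lambda r: r["distance"]):
        if max_distance is not None and r["distance"] > max_distance:
            continue
        grouped[r["file"]].append(r["row_id"])
    return dict(grouped)


print(group_by_file(results, max_distance=0.2))
# {'data-0001.parquet': [7, 9]}
```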

Assemble context for LLMs

import ailake

chunks = [
    {
        "document_id": "doc-1",
        "chunk_index": 0,
        "chunk_text": "AI-Lake stores vectors and tabular data together.",
        "document_title": "AI-Lake Overview",
        "section_path": "Introduction",
        "source_uri": "s3://my-lake/docs/overview.pdf",
        "distance": 0.12,
    },
]

context_xml = ailake.assemble_context(
    chunks=chunks,
    max_tokens=4096,       # token budget (4 chars ≈ 1 token)
    dedup_threshold=0.05,  # drop near-duplicate chunks
)

# Pass context_xml directly to Claude / GPT-4 as a user message

API

TableWriter(path, vector_column="embedding", dim=1536, metric="cosine")

Opens or creates an AI-Lake table at path. Local filesystem only in this release.

| Method | Description |
| --- | --- |
| write_batch(texts, embeddings) | Stage a batch of rows. texts: list[str], embeddings: list[list[float]] |
| commit() -> int | Commit staged batches as a new Iceberg snapshot. Returns the snapshot ID. |
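For reference, the three metric names correspond to standard vector measures. The sketch below shows their textbook definitions in plain Python. How ailake maps each one onto the distance value it reports (for example, whether dot_product is negated) is not specified here, so treat the functions as illustrations rather than the library's implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the vectors point the same way."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 0.0]
b = [0.0, 2.0]
print(cosine_distance(a, b))     # 1.0, the vectors are orthogonal
print(euclidean_distance(a, b))  # sqrt(5)
print(dot(a, b))                 # 0.0
```

Cosine ignores vector length, while euclidean and dot product are sensitive to it, so the choice usually follows how the embedding model was trained.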

search(path, query, top_k=10) -> list[dict]

Returns up to top_k nearest neighbours. Each result is a dict of the form {"row_id": int, "distance": float, "file": str}.

assemble_context(chunks, max_tokens=4096, dedup_threshold=0.05) -> str

Assembles a list of chunk dicts into structured XML ready for LLM input. Deduplicates near-identical chunks and respects the token budget.
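To get a feel for how a token budget constrains the assembled context, the sketch below applies the 4-characters-per-token rule of thumb and greedily keeps the closest chunks that still fit. It is an approximation under stated assumptions, not the logic assemble_context actually uses; estimate_tokens and fit_to_budget are hypothetical names, and deduplication is left out.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: about 4 characters per token, rounded up, minimum 1."""
    return max(1, (len(text) + 3) // 4)


def fit_to_budget(chunks, max_tokens=4096):
    """Keep the closest chunks whose estimated tokens fit within max_tokens.

    Returns (kept_chunks, tokens_used). Chunks that would overflow the
    budget are skipped, so smaller ones further down may still fit.
    """
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["distance"]):
        cost = estimate_tokens(chunk["chunk_text"])
        if used + cost > max_tokens:
            continue
        kept.append(chunk)
        used += cost
    return kept, used


# Illustrative chunks with only the two keys this sketch needs.
sample = [
    {"chunk_text": "a" * 40, "distance": 0.3},
    {"chunk_text": "b" * 20, "distance": 0.1},
    {"chunk_text": "c" * 8, "distance": 0.2},
]
kept, used = fit_to_budget(sample, max_tokens=8)
print([c["distance"] for c in kept], used)  # [0.1, 0.2] 7
```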

Iceberg compatibility

Tables written by ailake are valid Apache Iceberg Spec v2 tables. Any Iceberg-compatible engine (Spark, Trino, DuckDB, PyIceberg) reads the tabular columns normally. The HNSW index lives in an AI-Lake extension section that standard Parquet readers silently ignore.
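One quick sanity check follows from the Parquet format itself: a standard Parquet file begins and ends with the 4-byte magic PAR1. The sketch below tests for that using only the standard library. has_parquet_magic is a hypothetical helper, and the location and naming of data files inside a table directory are not documented here, so the path to check is up to the reader.

```python
def has_parquet_magic(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)       # jump to the end to learn the file size
        if f.tell() < 12:  # leading magic + 4-byte footer length + trailing magic
            return False
        f.seek(-4, 2)      # last four bytes of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"


# Usage (the path is a placeholder, not a documented ailake layout):
# has_parquet_magic("./my_table/<some-data-file>.parquet")
```

This only inspects the magic bytes; it does not validate the footer metadata or the AI-Lake extension section.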

License

MIT OR Apache-2.0
