Skip to main content

Unified tabular + vector storage in a single Iceberg-compatible file

Project description

ailake — AI-Lake Format Python SDK

Unified storage for tabular data, embeddings, and HNSW vector index in a single Parquet-compatible file. 100% Apache Iceberg Spec v2 compatible.

Install

pip install ailake

Requires Python ≥ 3.9. Dependencies: pyarrow >= 14.0, numpy >= 1.24.

Quickstart

Write

import ailake
import numpy as np

writer = ailake.TableWriter(
    path="./my_table",
    vector_column="embedding",  # default
    dim=1536,                   # default
    metric="cosine",            # cosine | euclidean | dot_product
)

texts = ["Document about AI", "Another document"]
embeddings = np.random.rand(2, 1536).astype(np.float32).tolist()

writer.write_batch(texts=texts, embeddings=embeddings)
snapshot_id = writer.commit()

Search

import ailake
import numpy as np

query = np.random.rand(1536).astype(np.float32).tolist()

results = ailake.search(
    path="./my_table",
    query=query,
    top_k=10,
)

for r in results:
    print(r["row_id"], r["distance"], r["file"])

Assemble context for LLMs

import ailake

chunks = [
    {
        "document_id": "doc-1",
        "chunk_index": 0,
        "chunk_text": "AI-Lake stores vectors and tabular data together.",
        "document_title": "AI-Lake Overview",
        "section_path": "Introduction",
        "source_uri": "s3://my-lake/docs/overview.pdf",
        "distance": 0.12,
    },
]

context_xml = ailake.assemble_context(
    chunks=chunks,
    max_tokens=4096,       # token budget (4 chars ≈ 1 token)
    dedup_threshold=0.05,  # drop near-duplicate chunks
)

# Pass context_xml directly to Claude / GPT-4 as a user message

API

TableWriter(path, vector_column="embedding", dim=1536, metric="cosine")

Opens or creates an AI-Lake table at path. Local filesystem only in this release.

Method Description
write_batch(texts, embeddings) Stage a batch of rows. texts: list[str], embeddings: list[list[float]]
commit() -> int Commit staged batches as a new Iceberg snapshot. Returns snapshot ID.

search(path, query, top_k=10) -> list[dict]

Returns up to top_k nearest neighbours. Each result: {"row_id": int, "distance": float, "file": str}.

assemble_context(chunks, max_tokens=4096, dedup_threshold=0.05) -> str

Assembles a list of chunk dicts into structured XML ready for LLM input. Deduplicates near-identical chunks and respects the token budget.

Iceberg compatibility

Tables written by ailake are valid Apache Iceberg Spec v2 tables. Any Iceberg-compatible engine (Spark, Trino, DuckDB, PyIceberg) reads the tabular columns normally. The HNSW index lives in an AI-Lake extension section that standard Parquet readers silently ignore.

License

MIT OR Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ailake-0.0.8.tar.gz (154.9 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

ailake-0.0.8-cp39-abi3-win_amd64.whl (3.2 MB view details)

Uploaded CPython 3.9+Windows x86-64

ailake-0.0.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ x86-64

ailake-0.0.8-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.4 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ ARM64

File details

Details for the file ailake-0.0.8.tar.gz.

File metadata

  • Download URL: ailake-0.0.8.tar.gz
  • Upload date:
  • Size: 154.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.13.3

File hashes

Hashes for ailake-0.0.8.tar.gz
Algorithm Hash digest
SHA256 f00a7fa44ef5784700264b65d1c92201b12e6228d98fbd37f7e413bfcb044e16
MD5 c8a76b144cdfaa0167e6d735ab446b02
BLAKE2b-256 9d92cfb428c2f6ada1bdadfad94b49543f68f848e4ab5d8ced0a157e700354b4

See more details on using hashes here.

File details

Details for the file ailake-0.0.8-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: ailake-0.0.8-cp39-abi3-win_amd64.whl
  • Upload date:
  • Size: 3.2 MB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.13.3

File hashes

Hashes for ailake-0.0.8-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 bf789516eb66d8a96fc634437a6c7328c36c1bc53f34efb1514304718f22a502
MD5 bb985ad89aaf1d451a4e11bb0d43cca8
BLAKE2b-256 099c3dead53fbb050be777df459a04a4e5499a79fe9002b05d8f9eaffbe2e5c5

See more details on using hashes here.

File details

Details for the file ailake-0.0.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for ailake-0.0.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 970e7c270cde7cde690e7c26ac3ee65b33703f38376d69fd64adea117f18e789
MD5 b23df16aec82456535de4bae4b84aecf
BLAKE2b-256 4f79184b34755afe43df7bfc2b2ddfd3e7492cd8b12a4b36fdf1443550274958

See more details on using hashes here.

File details

Details for the file ailake-0.0.8-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for ailake-0.0.8-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 41472fcdb4e7d446a2dd0995dbf1c44e597da8d64f3ef35286eaeaace8dd3b60
MD5 3f5194e1e8705fb6e5fc82c330d27415
BLAKE2b-256 7a5cf2227a77c20b94014bff8c3d10d384988f13d24a7a0821bc80cf4e03cb86

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page