Skip to main content

LangChain integration for Apache Iceberg with native PyIceberg API support

Project description

LangChain Iceberg Toolkit

PyPI version Python 3.10+ License: Apache 2.0

A native LangChain integration for Apache Iceberg that enables AI-powered natural language queries over your data lakes. Built with PyIceberg for direct API access (not SQL strings) and featuring Iceberg-specific capabilities like time-travel, snapshots, and partition-aware queries.

Features

  • 🚀 Native PyIceberg Integration - Direct API access, not SQL strings
  • 🔍 Iceberg-Specific Tools - Snapshots, time-travel, partition-aware queries
  • 📊 Optional Semantic Layer - YAML-driven metrics and dimensions
  • 💬 Zero SQL Required - Natural language to Iceberg queries
  • 🏢 Enterprise-Ready - Query limits and timeout protection

Installation

Using pip (standard)

pip install langchain-iceberg

For semantic layer support:

pip install langchain-iceberg[semantic]

Using uv (recommended by LangChain)

uv is a fast Python package installer recommended by LangChain:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install package
uv pip install langchain-iceberg

# With semantic layer
uv pip install "langchain-iceberg[semantic]"

See INSTALL_WITH_UV.md for more details.

Quick Start

Basic Usage

from langchain_iceberg import IcebergToolkit
from langchain_openai import ChatOpenAI
from langchain.agents import create_react_agent

# Initialize toolkit
toolkit = IcebergToolkit(
    catalog_name="prod",
    catalog_config={
        "type": "rest",
        "uri": "http://localhost:8181",
        "warehouse": "s3://my-warehouse"
    }
)

# Get tools
tools = toolkit.get_tools()

# Create agent
llm = ChatOpenAI(model="gpt-4")
agent = create_react_agent(llm, tools)

# Query with natural language
result = agent.invoke({
    "input": "Show me the top 10 orders by amount from the sales.orders table"
})

print(result)

Direct Tool Usage

from langchain_iceberg import IcebergToolkit

toolkit = IcebergToolkit(
    catalog_name="rest",
    catalog_config={
        "type": "rest",
        "uri": "http://localhost:8181",
        "warehouse": "s3://warehouse/wh/"
    }
)

tools = toolkit.get_tools()

# Use tools directly
list_ns = next(t for t in tools if t.name == "iceberg_list_namespaces")
namespaces = list_ns.run({})
print(namespaces)

query = next(t for t in tools if t.name == "iceberg_query")
results = query.run({
    "table_id": "test.orders",
    "filters": "status = 'completed'",
    "limit": 10
})
print(results)

With Semantic Layer

# Load semantic YAML for business-friendly metrics
toolkit = IcebergToolkit(
    catalog_name="prod",
    catalog_config={...},
    semantic_yaml="s3://bucket/semantic.yaml"
)

tools = toolkit.get_tools()
# Now includes auto-generated metric tools like get_total_revenue, get_order_count, etc.

agent = create_react_agent(llm, tools)

# Business question (no SQL needed!)
result = agent.invoke({
    "input": "What was Q4 2024 revenue by customer segment?"
})

Time-Travel Queries

# Query historical data
result = agent.invoke({
    "input": "Compare this month's revenue to the same period last year using time-travel"
})

Available Tools

The toolkit provides the following tools:

Catalog Exploration

  • iceberg_list_namespaces - List all namespaces in the catalog
  • iceberg_list_tables - List tables in a namespace
  • iceberg_get_schema - Get table schema with sample data

Query Execution

  • iceberg_query - Execute queries with filters and column selection
  • iceberg_plan_query - LLM-assisted query planning

Time-Travel (Iceberg-Specific)

  • iceberg_snapshots - List table snapshots
  • iceberg_time_travel - Query data at a specific point in time

Semantic Layer (Auto-Generated)

  • get_{metric_name} - Auto-generated tools from YAML metrics

Documentation

Requirements

  • Python 3.10+
  • Apache Iceberg catalog (REST, Hive, Glue, or Nessie)
  • Cloud storage (S3, ADLS, or GCS)

Contributing

Contributions are welcome! Please see our Contributing Guide for details.

License

Apache 2.0 License - see LICENSE file for details.

Support

Roadmap

  • Core toolkit with catalog exploration
  • Query execution tools
  • Time-travel and snapshot tools
  • Semantic layer with YAML support
  • Governance features (access control, PII protection) - Planned for future release
  • Query planner tool

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

langchain_iceberg-0.1.1.tar.gz (27.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

langchain_iceberg-0.1.1-py3-none-any.whl (32.4 kB view details)

Uploaded Python 3

File details

Details for the file langchain_iceberg-0.1.1.tar.gz.

File metadata

  • Download URL: langchain_iceberg-0.1.1.tar.gz
  • Upload date:
  • Size: 27.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.20

File hashes

Hashes for langchain_iceberg-0.1.1.tar.gz
Algorithm Hash digest
SHA256 6e8cafab7c3c01b10d04238125ffbcb8c848ab10918ff732db247177deab253d
MD5 d4a356fdc5b76b508dbc5efb6ea03b73
BLAKE2b-256 dc98d1e5a3472a2491dfd141411c189287a09f4f41eee7178e1b2f9272c0272a

See more details on using hashes here.

File details

Details for the file langchain_iceberg-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for langchain_iceberg-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 aa19f90f4ddacbeeb78af7920eef5b50f873cc47914bfa1d53995ea78d6845a3
MD5 3d0bcb529e38d4b5a70ba28e6e067e13
BLAKE2b-256 ed73f14966156f6ce6f03c95623b59230b67d0fd1b663923caddf028114f83bd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page