LangChain integration for Apache Iceberg with native PyIceberg API support

LangChain Iceberg Toolkit

A native LangChain integration for Apache Iceberg that enables AI-powered natural language queries over your data lakes. It is built on PyIceberg for direct API access (not SQL strings) and provides Iceberg-specific capabilities such as time travel, snapshots, and partition-aware queries.

Features

  • 🚀 Native PyIceberg Integration - Direct API access, not SQL strings
  • 🔍 Iceberg-Specific Tools - Snapshots, time-travel, partition-aware queries
  • 📊 Optional Semantic Layer - YAML-driven metrics and dimensions
  • 💬 Zero SQL Required - Natural language to Iceberg queries
  • 🏢 Enterprise-Ready - Query limits and timeout protection

Installation

Using pip (standard)

pip install langchain-iceberg

For semantic layer support:

pip install "langchain-iceberg[semantic]"

Using uv (recommended by LangChain)

uv is a fast Python package installer recommended by LangChain:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install package
uv pip install langchain-iceberg

# With semantic layer
uv pip install "langchain-iceberg[semantic]"

See INSTALL_WITH_UV.md for more details.

Quick Start

Basic Usage

from langchain_iceberg import IcebergToolkit
from langchain_openai import ChatOpenAI
# create_react_agent(llm, tools) is the LangGraph prebuilt signature;
# langchain.agents.create_react_agent also requires a prompt argument.
from langgraph.prebuilt import create_react_agent

# Initialize toolkit
toolkit = IcebergToolkit(
    catalog_name="prod",
    catalog_config={
        "type": "rest",
        "uri": "http://localhost:8181",
        "warehouse": "s3://my-warehouse"
    }
)

# Get tools
tools = toolkit.get_tools()

# Create agent
llm = ChatOpenAI(model="gpt-4")
agent = create_react_agent(llm, tools)

# Query with natural language (LangGraph agents take a list of messages)
result = agent.invoke({
    "messages": [("user", "Show me the top 10 orders by amount from the sales.orders table")]
})

# The final answer is the last message in the returned state
print(result["messages"][-1].content)

Direct Tool Usage

from langchain_iceberg import IcebergToolkit

toolkit = IcebergToolkit(
    catalog_name="rest",
    catalog_config={
        "type": "rest",
        "uri": "http://localhost:8181",
        "warehouse": "s3://warehouse/wh/"
    }
)

tools = toolkit.get_tools()

# Use tools directly
list_ns = next(t for t in tools if t.name == "iceberg_list_namespaces")
namespaces = list_ns.run({})
print(namespaces)

query = next(t for t in tools if t.name == "iceberg_query")
results = query.run({
    "table_id": "test.orders",
    "filters": "status = 'completed'",
    "limit": 10
})
print(results)
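
When several tools are called directly, repeating the `next(...)` scan for each one gets noisy. The following is a minimal sketch of indexing tools by name instead. `FakeTool` and `index_tools` are hypothetical helpers introduced here so the snippet runs without a catalog; the only thing assumed about real tools is the `.name` attribute already used above.

```python
from dataclasses import dataclass


@dataclass
class FakeTool:
    """Hypothetical stand-in for a LangChain tool; only `.name` is used."""
    name: str


def index_tools(tools):
    """Map tool name -> tool, failing loudly on duplicate names."""
    by_name = {}
    for tool in tools:
        if tool.name in by_name:
            raise ValueError(f"duplicate tool name: {tool.name}")
        by_name[tool.name] = tool
    return by_name


# With a real toolkit this would be: tools = toolkit.get_tools()
tools = [FakeTool("iceberg_list_namespaces"), FakeTool("iceberg_query")]
tools_by_name = index_tools(tools)
query = tools_by_name["iceberg_query"]
```

A missing tool then surfaces as a `KeyError` naming the tool, rather than a bare `StopIteration` from `next(...)`.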

With Semantic Layer

from langgraph.prebuilt import create_react_agent

# Load semantic YAML for business-friendly metrics
toolkit = IcebergToolkit(
    catalog_name="prod",
    catalog_config={...},
    semantic_yaml="s3://bucket/semantic.yaml"
)

tools = toolkit.get_tools()
# Now includes auto-generated metric tools like get_total_revenue, get_order_count, etc.

# llm is the chat model created in Basic Usage
agent = create_react_agent(llm, tools)

# Business question (no SQL needed!)
result = agent.invoke({
    "messages": [("user", "What was Q4 2024 revenue by customer segment?")]
})

Time-Travel Queries

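# Query historical data (assumes agent is a LangGraph ReAct agent,
# i.e. built with langgraph.prebuilt.create_react_agent)
result = agent.invoke({
    "messages": [("user", "Compare this month's revenue to the same period last year using time-travel")]
})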

Available Tools

The toolkit provides the following tools:

Catalog Exploration

  • iceberg_list_namespaces - List all namespaces in the catalog
  • iceberg_list_tables - List tables in a namespace
  • iceberg_get_schema - Get table schema with sample data

Query Execution

  • iceberg_query - Execute queries with filters and column selection
  • iceberg_plan_query - LLM-assisted query planning
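
To illustrate what "filters and column selection" can translate to at the PyIceberg level, here is a rough sketch that builds keyword arguments for PyIceberg's `Table.scan()` and clamps the row limit. This is not the toolkit's actual implementation: `build_scan_kwargs`, `MAX_ROWS`, and the column names are hypothetical, and the cap of 1000 is an assumed value rather than a documented default.

```python
MAX_ROWS = 1000  # hypothetical safety cap, not a documented default


def build_scan_kwargs(filters=None, columns=None, limit=100, max_rows=MAX_ROWS):
    """Build keyword arguments for PyIceberg's Table.scan()."""
    if limit < 1:
        raise ValueError("limit must be positive")
    kwargs = {"limit": min(limit, max_rows)}  # never exceed the cap
    if filters:
        kwargs["row_filter"] = filters  # e.g. "status = 'completed'"
    if columns:
        kwargs["selected_fields"] = tuple(columns)
    return kwargs


kwargs = build_scan_kwargs(
    "status = 'completed'", ["order_id", "amount"], limit=5000
)
# With a catalog available (not run here):
#   from pyiceberg.catalog import load_catalog
#   table = load_catalog("prod", **catalog_config).load_table("sales.orders")
#   rows = table.scan(**kwargs).to_arrow()
```

Because the filter is passed as a `row_filter` expression, PyIceberg can prune partitions and data files before reading them, which is the practical difference from assembling a SQL string.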

Time-Travel (Iceberg-Specific)

  • iceberg_snapshots - List table snapshots
  • iceberg_time_travel - Query data at a specific point in time
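
Querying "at a specific point in time" comes down to picking the newest snapshot committed at or before that time and scanning it. The sketch below shows that selection step in plain Python. `snapshot_as_of` is a hypothetical helper, and the snapshot IDs are made-up stand-in data; the `snapshot_id` and `timestamp_ms` attributes mirror the fields on PyIceberg snapshot objects.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def snapshot_as_of(snapshots, when):
    """Return the newest snapshot committed at or before `when`, or None."""
    cutoff_ms = int(when.timestamp() * 1000)
    eligible = [s for s in snapshots if s.timestamp_ms <= cutoff_ms]
    return max(eligible, key=lambda s: s.timestamp_ms, default=None)


# Stand-in data; with PyIceberg this would come from table.snapshots()
snapshots = [
    SimpleNamespace(snapshot_id=101, timestamp_ms=1_704_067_200_000),  # 2024-01-01 UTC
    SimpleNamespace(snapshot_id=102, timestamp_ms=1_719_792_000_000),  # 2024-07-01 UTC
]
snap = snapshot_as_of(snapshots, datetime(2024, 3, 1, tzinfo=timezone.utc))
# snap.snapshot_id can then be passed to table.scan(snapshot_id=...)
```

Returning `None` when no snapshot is old enough lets the caller report that the table did not exist yet at the requested time, instead of silently reading current data.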

Semantic Layer (Auto-Generated)

  • get_{metric_name} - Auto-generated tools from YAML metrics
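
As an illustration of the `get_{metric_name}` naming pattern, the sketch below derives tool names from a list of metric definitions. The metric dictionaries and the normalization of names (lowercasing, replacing non-alphanumeric runs with underscores) are assumptions made for this example; the package's actual YAML schema and naming rules may differ.

```python
import re


def metric_tool_name(metric_name):
    """Derive a tool name following the get_{metric_name} pattern."""
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", metric_name.strip()).strip("_").lower()
    if not slug:
        raise ValueError("metric name is empty")
    return f"get_{slug}"


# Hypothetical metric definitions, as they might look after parsing the YAML
metrics = [{"name": "total_revenue"}, {"name": "Order Count"}]
tool_names = [metric_tool_name(m["name"]) for m in metrics]
```

Under these assumptions, the two metrics map to `get_total_revenue` and `get_order_count`, matching the tool names mentioned in the semantic layer example.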

Requirements

  • Python 3.10+
  • Apache Iceberg catalog (REST, Hive, Glue, or Nessie)
  • Cloud storage (S3, ADLS, or GCS)

Contributing

Contributions are welcome! Please see our Contributing Guide for details.

License

Apache 2.0 License - see LICENSE file for details.

Roadmap

  • Core toolkit with catalog exploration
  • Query execution tools
  • Time-travel and snapshot tools
  • Semantic layer with YAML support
  • Governance features (access control, PII protection) - Planned for future release
  • Query planner tool
