Skip to main content

An integration package connecting ClickZetta and LangChain

Project description

LangChain ClickZetta

Python 3.9+ License: MIT

🚀 Enterprise-grade LangChain integration for ClickZetta - Unlock the power of cloud-native lakehouse with AI-driven SQL queries, high-performance vector search, and intelligent full-text retrieval in a unified platform.

📖 Table of Contents

🚀 Why ClickZetta + LangChain?

🏆 Unique Advantages

1. Native Lakehouse Architecture

  • ClickZetta's cloud-native lakehouse provides 10x performance improvement over traditional Spark-based architectures
  • Unified storage and compute for all data types (structured, semi-structured, unstructured)
  • Real-time incremental processing capabilities

2. True Hybrid Search in Single Table

  • Industry-first single-table hybrid search combining vector and full-text indexes
  • No complex joins or multiple tables needed - everything in one place
  • Atomic MERGE operations for consistent data updates

3. Enterprise-Grade Storage Services

  • Complete LangChain BaseStore implementation with sync/async support
  • Native Volume integration for binary file storage (models, embeddings)
  • SQL-queryable document storage with JSON metadata
  • Atomic UPSERT operations using ClickZetta's MERGE INTO

4. Advanced Chinese Language Support

  • Built-in Chinese text analyzers (IK, standard, keyword)
  • Optimized for bilingual (Chinese/English) AI applications
  • DashScope integration for state-of-the-art Chinese embeddings

5. Production-Ready Features

  • Connection pooling and query optimization
  • Comprehensive error handling and logging
  • Full test coverage (unit + integration)
  • Type-safe operations throughout

🛠️ Core Features

🧠 AI-Powered Query Interface

  • Natural Language to SQL: Convert questions to optimized ClickZetta SQL
  • Context-Aware: Understands table schemas and relationships
  • Bilingual Support: Works seamlessly with Chinese and English queries

🔍 Advanced Search Capabilities

  • Vector Search: High-performance embedding-based similarity search
  • Full-Text Search: Enterprise-grade inverted index with multiple analyzers
  • True Hybrid Search: Single-table combined vector + text search (industry first)
  • Metadata Filtering: Complex filtering with JSON metadata support

💾 Enterprise Storage Solutions

  • ClickZettaStore: High-performance key-value storage using SQL tables
  • ClickZettaDocumentStore: Structured document storage with queryable metadata
  • ClickZettaFileStore: Binary file storage using native ClickZetta Volume
  • ClickZettaVolumeStore: Direct Volume integration for maximum performance

🔄 Production-Grade Operations

  • Atomic UPSERT: MERGE INTO operations for data consistency
  • Batch Processing: Efficient bulk operations for large datasets
  • Connection Management: Pooling and automatic reconnection
  • Type Safety: Full type annotations and runtime validation

🎯 LangChain Compatibility

  • BaseStore Interface: 100% compatible with LangChain storage standards
  • Async Support: Full async/await pattern implementation
  • Chain Integration: Seamless integration with LangChain chains and agents
  • Memory Systems: Persistent chat history and conversation memory

Installation

From PyPI (Recommended)

pip install langchain-clickzetta

Development Installation

git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta
pip install -e ".[dev]"

Local Installation from Source

git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta
pip install .

Quick Start

Basic Setup

from langchain_clickzetta import ClickZettaEngine, ClickZettaSQLChain, ClickZettaVectorStore
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_community.llms import Tongyi

# Initialize ClickZetta engine
# ClickZetta requires exactly 7 connection parameters
engine = ClickZettaEngine(
    service="your-service",
    instance="your-instance",
    workspace="your-workspace",
    schema="your-schema",
    username="your-username",
    password="your-password",
    vcluster="your-vcluster"  
)

# Initialize embeddings (DashScope recommended for Chinese/English support)
embeddings = DashScopeEmbeddings(
    dashscope_api_key="your-dashscope-api-key",
    model="text-embedding-v4"
)

# Initialize LLM
llm = Tongyi(dashscope_api_key="your-dashscope-api-key")

SQL Queries

# Create SQL chain
sql_chain = ClickZettaSQLChain.from_engine(
    engine=engine,
    llm=llm,
    return_sql=True
)

# Ask questions in natural language
result = sql_chain.invoke({
    "query": "How many users do we have in the database?"
})

print(result["result"])  # Natural language answer
print(result["sql_query"])  # Generated SQL query

Vector Storage

from langchain_core.documents import Document

# Create vector store
vector_store = ClickZettaVectorStore(
    engine=engine,
    embeddings=embeddings,
    table_name="my_vectors",
    vector_element_type="float"  # Options: float, int, tinyint
)

# Add documents
documents = [
    Document(
        page_content="ClickZetta is a high-performance analytics database.",
        metadata={"category": "database", "type": "analytics"}
    ),
    Document(
        page_content="LangChain enables building applications with LLMs.",
        metadata={"category": "framework", "type": "ai"}
    )
]

vector_store.add_documents(documents)

# Search for similar documents
results = vector_store.similarity_search(
    "What is ClickZetta?",
    k=2
)

for doc in results:
    print(doc.page_content)

Full-text Search

from langchain_clickzetta.retrievers import ClickZettaFullTextRetriever

# Create full-text retriever
retriever = ClickZettaFullTextRetriever(
    engine=engine,
    table_name="my_documents",
    search_type="phrase",
    k=5
)

# Add documents to search index
retriever.add_documents(documents)

# Search documents
results = retriever.get_relevant_documents("ClickZetta database")
for doc in results:
    print(f"Score: {doc.metadata.get('relevance_score', 'N/A')}")
    print(f"Content: {doc.page_content}")

True Hybrid Search (Single Table)

from langchain_clickzetta import ClickZettaHybridStore, ClickZettaUnifiedRetriever

# Create true hybrid store (single table with both vector + inverted indexes)
hybrid_store = ClickZettaHybridStore(
    engine=engine,
    embeddings=embeddings,
    table_name="hybrid_docs",
    text_analyzer="ik",  # Chinese text analyzer
    distance_metric="cosine"
)

# Add documents to hybrid store
documents = [
    Document(page_content="云器 Lakehouse 是由云器科技完全自主研发的新一代云湖仓。使用增量计算的数据计算引擎,性能可以提升至传统开源架构例如Spark的 10倍,实现了海量数据的全链路-低成本-实时化处理,为AI 创新提供了支持全类型数据整合、存储与计算的平台,帮助企业从传统的开源 Spark 体系升级到 AI 时代的数据基础设施。"),
    Document(page_content="LangChain enables building LLM applications")
]
hybrid_store.add_documents(documents)

# Create unified retriever for hybrid search
retriever = ClickZettaUnifiedRetriever(
    hybrid_store=hybrid_store,
    search_type="hybrid",  # "vector", "fulltext", or "hybrid"
    alpha=0.5,  # Balance between vector and full-text search
    k=5
)

# Search using hybrid approach
results = retriever.invoke("analytics database")
for doc in results:
    print(f"Content: {doc.page_content}")

Chat Message History

from langchain_clickzetta import ClickZettaChatMessageHistory
from langchain_core.messages import HumanMessage, AIMessage

# Create chat history
chat_history = ClickZettaChatMessageHistory(
    engine=engine,
    session_id="user_123",
    table_name="chat_sessions"
)

# Add messages
chat_history.add_message(HumanMessage(content="Hello!"))
chat_history.add_message(AIMessage(content="Hi there! How can I help you?"))

# Retrieve conversation history
messages = chat_history.messages
for message in messages:
    print(f"{message.__class__.__name__}: {message.content}")

Configuration

Environment Variables

You can configure ClickZetta connection using environment variables:

export CLICKZETTA_SERVICE="your-service"
export CLICKZETTA_INSTANCE="your-instance"
export CLICKZETTA_WORKSPACE="your-workspace"
export CLICKZETTA_SCHEMA="your-schema"
export CLICKZETTA_USERNAME="your-username"
export CLICKZETTA_PASSWORD="your-password"
export CLICKZETTA_VCLUSTER="your-vcluster"  # Required

Connection Options

engine = ClickZettaEngine(
    service="your-service",
    instance="your-instance",
    workspace="your-workspace",
    schema="your-schema",
    username="your-username",
    password="your-password",
    vcluster="your-vcluster",  # Required parameter
    connection_timeout=30,      # Connection timeout in seconds
    query_timeout=300,         # Query timeout in seconds
    hints={                    # Custom query hints
        "sdk.job.timeout": 600,
        "query_tag": "My Application"
    }
)

Advanced Usage

Custom SQL Prompts

from langchain_core.prompts import PromptTemplate

custom_prompt = PromptTemplate(
    input_variables=["input", "table_info", "dialect"],
    template="""
    You are a ClickZetta SQL expert. Given the input question and table information,
    write a syntactically correct {dialect} query.

    Tables: {table_info}
    Question: {input}

    SQL Query:"""
)

sql_chain = ClickZettaSQLChain(
    engine=engine,
    llm=llm,
    sql_prompt=custom_prompt
)

Vector Store with Custom Distance Metrics

vector_store = ClickZettaVectorStore(
    engine=engine,
    embeddings=embeddings,
    distance_metric="euclidean",  # or "cosine", "manhattan"
    vector_dimension=1536,
    vector_element_type="float"  # or "int", "tinyint"
)

Metadata Filtering

# Search with metadata filters
results = vector_store.similarity_search(
    "machine learning",
    k=5,
    filter={"category": "tech", "year": 2024}
)

# Full-text search with metadata
retriever = ClickZettaFullTextRetriever(
    engine=engine,
    table_name="research_docs"
)
results = retriever.get_relevant_documents(
    "artificial intelligence",
    filter={"type": "research"}
)

Testing

Run the test suite:

# Navigate to package directory
cd libs/clickzetta

# Install test dependencies
pip install -e ".[dev]"

# Run unit tests
make test-unit

# Run integration tests
make test-integration

# Run all tests
make test

Integration Tests

To run integration tests against a real ClickZetta instance:

  1. Configure your connection in ~/.clickzetta/connections.json with a UAT connection
  2. Add DashScope API key to the configuration
  3. Run integration tests:
cd libs/clickzetta
make integration
make integration-dashscope

Development

Setup Development Environment

# Clone the repository
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks (if configured)
pre-commit install

Code Quality

# Navigate to the package directory
cd libs/clickzetta

# Format code (auto-fixes many issues)
make format

# Linting (significantly improved)
make lint      # ✅ Reduced from 358 to 65 errors - 82% improvement!

# Core functionality testing
# Use project virtual environment for best results:
source .venv/bin/activate
make test-unit        # ✅ Core unit tests (LangChain compatibility verified)
make test-integration # Integration tests

# Type checking (in progress)
make typecheck # Some LangChain compatibility issues being resolved

Recent Improvements ✨:

  • Ruff configuration updated to modern format
  • 155 typing issues auto-fixed (Dict→dict, Optional→|None)
  • Method signatures fixed for LangChain BaseStore compatibility
  • Bare except clauses improved with proper exception handling
  • Code formatting standardized with black

Current Status: Core functionality fully working with significantly improved code quality (82% reduction in lint errors). All LangChain BaseStore compatibility tests pass.

📦 Storage Services

LangChain ClickZetta provides comprehensive storage services that implement the LangChain BaseStore interface with enterprise-grade features:

🔑 Key Advantages of ClickZetta Storage

🚀 Performance Benefits

  • 10x Faster: ClickZetta's optimized lakehouse architecture
  • Atomic Operations: MERGE INTO for consistent UPSERT operations
  • Batch Processing: Efficient handling of large datasets
  • Connection Pooling: Optimized database connections

🏗️ Architecture Benefits

  • Native Integration: Direct ClickZetta Volume support for binary data
  • SQL Queryability: Full SQL access to stored documents and metadata
  • Unified Storage: Single platform for all data types
  • Schema Evolution: Flexible metadata storage with JSON support

🔒 Enterprise Features

  • ACID Compliance: Full transaction support
  • Type Safety: Runtime validation and type checking
  • Error Handling: Comprehensive error recovery and logging
  • Monitoring: Built-in query performance tracking

Key-Value Store

from langchain_clickzetta import ClickZettaStore

# Basic key-value storage
store = ClickZettaStore(engine=engine, table_name="cache")
store.mset([("key1", b"value1"), ("key2", b"value2")])
values = store.mget(["key1", "key2"])

Document Store

from langchain_clickzetta import ClickZettaDocumentStore

# Document storage with metadata
doc_store = ClickZettaDocumentStore(engine=engine, table_name="documents")
doc_store.store_document("doc1", "content", {"author": "user"})
content, metadata = doc_store.get_document("doc1")

File Store

from langchain_clickzetta import ClickZettaFileStore

# Binary file storage using ClickZetta Volume
file_store = ClickZettaFileStore(
    engine=engine,
    volume_type="user",
    subdirectory="models"
)
file_store.store_file("model.bin", binary_data, "application/octet-stream")
content, mime_type = file_store.get_file("model.bin")

Volume Store (Native ClickZetta Volume)

from langchain_clickzetta import ClickZettaUserVolumeStore

# Native Volume integration
volume_store = ClickZettaUserVolumeStore(engine=engine, subdirectory="data")
volume_store.mset([("config.json", b'{"key": "value"}')])
config = volume_store.mget(["config.json"])[0]

📊 Comparison with Alternatives

ClickZetta vs. Traditional Vector Databases

Feature ClickZetta + LangChain Pinecone/Weaviate Chroma/FAISS
Hybrid Search ✅ Single table ❌ Multiple systems ❌ Separate tools
SQL Queryability ✅ Full SQL support ❌ Limited ❌ No SQL
Lakehouse Integration ✅ Native ❌ External ❌ External
Chinese Language ✅ Optimized ⚠️ Basic ⚠️ Basic
Enterprise Features ✅ ACID, Transactions ⚠️ Limited ❌ Basic
Storage Services ✅ Full LangChain API ❌ Custom ❌ Limited
Performance ✅ 10x improvement ⚠️ Variable ⚠️ Memory limited

ClickZetta vs. Other LangChain Integrations

Integration Vector Search Full-Text Hybrid Storage API SQL Queries
ClickZetta
Elasticsearch ⚠️
PostgreSQL/pgvector ⚠️ ⚠️
MongoDB ⚠️ ⚠️
Redis

Key Differentiators

🎯 Single Platform Solution

  • No need to manage multiple systems (vector DB + full-text + SQL + storage)
  • Unified data governance and security model
  • Simplified architecture and reduced operational complexity

🚀 Performance at Scale

  • ClickZetta's incremental computing engine
  • Optimized for both analytical and operational workloads
  • Native lakehouse storage with separation of compute and storage

🌏 Chinese Market Focus

  • Deep integration with Chinese AI ecosystem (DashScope, Tongyi)
  • Optimized text processing for Chinese language
  • Compliance with Chinese data regulations

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for your changes
  5. Ensure all tests pass (pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Acknowledgments

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

langchain_clickzetta-0.1.4.tar.gz (95.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

langchain_clickzetta-0.1.4-py3-none-any.whl (41.6 kB view details)

Uploaded Python 3

File details

Details for the file langchain_clickzetta-0.1.4.tar.gz.

File metadata

  • Download URL: langchain_clickzetta-0.1.4.tar.gz
  • Upload date:
  • Size: 95.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.12

File hashes

Hashes for langchain_clickzetta-0.1.4.tar.gz
Algorithm Hash digest
SHA256 0273a742cb647781c2bb74dece01ea4515aa7dd4cd82088284399170dbbb4f18
MD5 08c0563f24ece0c940b3bf36625d4379
BLAKE2b-256 38ad99834615f689b2afc5595429e82a166c72db627d8e8f3db2c9f72b1625d8

See more details on using hashes here.

File details

Details for the file langchain_clickzetta-0.1.4-py3-none-any.whl.

File metadata

File hashes

Hashes for langchain_clickzetta-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 0d46d660e3aa8b4e44a716da24db1cd6ad3433e9f0f6d3199784e0f10910c7d0
MD5 1d3ed88efb4955acd2b9705e78d007a8
BLAKE2b-256 5bc55804fed27673ece028b1355c0285ecb2adcc4cfa936e3258194f39a78b99

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page